Social psychologists are known for deception. First, they deceived their participants about the purpose of a study as in the famous Milgram experiment. Then, they deceived themselves that their studies produce robust and replicable results. After it became apparent that less than 25% of published results in social psychology can be replicated, they are now deceiving readers to maintain the illusion that they are a science.
The latest blatant attempt at deception is Fabrigar, Wegener, and Petty’s article “A Validity-Based Framework for Understanding Replication in Psychology” published in PSPB which is edited by Chris Crandall, who has been defending shoddy practices and questionable results on social media for the past decade.
The authors first deception is that they fail to mention the extent of the replication crisis in social psychology. A comprehensive replication attempt found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). There also has been no other representative samples of social psychology studies. Nevertheless, the authors imply that the result was only sometimes less than 50%. This dishonest presentation of the facts has been used by several prominent social psychologists to avoid stating the fact that only a quarter of published results is expected to replicate (cf. Schimmack, 2020a).
Next, the authors note that researchers make different attributions about the causes of replication failures. Some authors assume that the low replication rate shows that original results were produced with questionable research practices that inflate effect sizes and make it unlikely that a replication study will be successful (John et al., 2012). Other researchers defend original studies and blame replication failures on problems with the replication studies. However, the authors fail to mention that there is strong support for the first explanation and very little support for the second explanation (Schimmack, 2020a).
It is unscientific and deceptive to hide relevant data from an article on a topic that can be examined empirically. The argument whether we can trust original studies or not like an argument about ice cream flavors. Regarding the replication crisis, there is a correct answer and empirical data clearly show that the correct answer is that questionable research practices were abused to present everything as statistically significant, even time-reversed stimulation by erotic stimuli (Bem, 2011; cf. Schimmack, 2018). Why should anybody trust social psychologists if they are not able to admit to their mistakes and learn from them.
The deception does not end here. The authors claim that replication failures can be attributed to four potential problems: statistical conclusion validity, internal validity, construct validity, and external validity. This sounds super scientific, but is just bullshit.
Internal validity is about causality and if an experiment is replicated with an experiment both studies have internal validity. So, a replication failure cannot be interpreted to low internal validity in the replication study.
External validity is the question whether a laboratory experiment shows results that can be generalized to the real world. An independent criticism of experimental social psychology is that many experiments lack external validity, but this is true for original and replication studies. Based on concerns about external validity, social psychology should do less experiments, but this has nothing to do with the replication crisis.
Construct validity has to do with the ability of an experimental manipulation to manipulate the variable of interest (e.g. mood) and the amount of variance in a measure that reflects the construct that is supposed to be manipulated (e.g., prejudice). Once more, construct validity is a property of original and replication studies. So, construct validity also has nothing to do with the replication crisis. However, construct validity is a problem in social psychology because many measures have not been properly validated (Schimmack, 2020b), a problem that is not unique to social psychology (Schimmack, 2020c).
This leaves only statistical conclusion validity as a viable explanation for replication failures, but the term statistical conclusion validity is rare and its meaning is unclear. The authors explain:
In short, statistical conclusion validity boils down to not making a type-I error or a type-II error. However, it is problematic to talk about replication failures in terms of these two errors when the null-hypothesis is specified as an effect size of zero; the nil-hypothesis (Cohen, 1994′ Schimmack, 2020a). Let’s use a simple example. Let’s say that some experimental manipulation outside of participants behaviour has a very small effect on their behaviour, d = .05. As we are assuming a non-null effect size, we know that there is an effect and that the nil-hypothesis is false. Therefore, studies that test this hypothesis can only make a type-II error. Now assume that a researcher conducts a study with N = 30 participants. This study has a probability of 5.2% to produce a significant result with the classic criterion of p < .05 (two-tailed). So, we would expect a non-significant result. However, using a variety of statistical tricks, known as questionable research practices, a researcher can inflate the effect size and increase the chance of obtaining a significant results to 60% or more (Simmons et al., 2011). Thus, it does not require a lot of resources to produce “evidence” for the effect. Now let’s consider a researcher who does attempt to replicate these findings, but without statistical tests. This researcher is very unlikely (1 out of 40 times) to produce a significant result that matches the original result (effect size in the same direction & p < .05). So, this researcher will publish a replication failure.
Based on social psychologists logic, the replication failure is the wrong result because it fails to provide evidence against the nil-hypothesis when the nil-hypothesis is false, while the original study showed the correct result. The problem with this warped logic is that the original study used deception to produce evidence against the false nil-hypothesis. It is deceptive to claim that the probability of a type-I error is no more than 5%, when questionable research practices were used. This problem is ignored when we focus on the type-II error in the replication study.
What is fundamentally wrong with experimental social psychology is the idea that falsifying the nil-hypothesis is sufficient to make scientific advances. It is sad that social psychologists in 2020 can still publish an article that maintains this illusion. Using questionable research practices to produce p-values less than .05 in a test of nil-hypothesis is not a sound scientific method. As long as social psychologists deceive themselves that it is, it is not a science. Defund social psychology until they clean up their act.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Schimmack, U. (2018). Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem https://replicationindex.com/2018/01/05/bem-retraction/
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359 –1366. http://dx.doi.org/10.1177/0956797611417632
2.17.2020 [the blog post has been revised after I received reviews of the ms. The reference list has been expanded to include all major viewpoints and influential articles. If you find something important missing, please let me know.]
7.2.2020 [the blog post has been edited to match the print version behind the paywall]
You can email me to request a copy of the printed article (email@example.com)
Citation: Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication. https://doi.org/10.1037/cap0000246
Bem’s (2011) article triggered a string of replication failures in social psychology. A major replication project found that only 25% of results in social psychology could be replicated. I examine various explanations for this low replication rate and found most of them lacking in empirical support. I then provide evidence that the use of questionable research practices accounts for this result. Using z-curve and a representative sample of focal hypothesis tests, I find that the expected replication rate for social psychology is between 20% and 45%. I argue that quantifying replicability can provide an incentive to use good research practices and to invest more resources in studies that produce replicable results. The replication crisis in social psychology provides important lessons for other disciplines in psychology that have avoided to take a closer look at their research practices.
Keywords: Replication, Replicability, Replicability Crisis, Expected Replication Rate, Expected Discovery Rate, Questionable Research Practices, Power, Social Psychology
The 2010s started with a bang. Journal clubs were discussing the preprint of Bem’s (2011) article “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Psychologists were confronted with a choice. Either they had to believe in anomalous effects or they had to believe that psychology was an anomalous science. Ten years later, it is pos- sible to look back at Bem’s article with the hindsight of 2020. It is now clear that Bem used questionable practices to produce false evidence for his outlandish claims (Francis, 2012; Schim- mack, 2012, 2018b, 2020). Moreover, it has become apparent that these practices were the norm and that many other findings in social psychology cannot be replicated. This realisation has led to initiatives to change research practices that produce more credible and replicable results. The speed and the extent of these changes has been revolutionary. Akin to the cognitive revolution in the 1960s and the affective revolution in the 1980s, the 2010s have witnessed a method revolution. Two new journals were created that focus on methodological problems and improvements of research practices: Meta-Psychology and Advances in Methods and Practices in Psychological Science.
In my review of the method revolution, I focus on replication failures in experimental social psychology and the different explanations for these failures. I argue that the use of questionable research practices accounts for many replication failures, and I examine how social psychologists have responded to evidence that questionable research practices (QRPs) undermine the trustworthiness of social psychological results. Other disciplines may learn from these lessons and may need to reform their research practices in the coming decade.
Arguably, the most important development in psychology has been the publication of replication failures. When Bem (2011) published his abnormal results supporting paranormal phenomena, researchers quickly failed to replicate these sensational results. However, they had a hard time publishing these results. The editor of the journal that published Bem’s findings, the Journal of Personality and Social Psychology (JPSP), did not even send the article out for review. This attempt to suppress negative evidence failed for two reasons. First, online-only journals with unlimited journal space like PLoSOne or Frontiers were willing to publish null results (Ritchie, Wiseman, & French, 2012). Second, the decision to reject the replication studies was made public and created a lot of attention because Bem’s article had attracted so much attention (Aldhous, 2011). In response to social pressure, JPSP did publish a massive replication failure of Bem’s results (Galak, LeBoeuf, Nelson, & Simmons, 2012).
Over the past decade, new article formats have evolved that make it easier to publish results that fail to confirm theoretical predictions such as registered reports (Chambers, 2013) and registered replication reports (Association for Psychological Science, 2015). Registered reports are articles that are accepted for publication before the results are known, thus avoiding the problem of publishing only confirmatory findings. Scheel, Schijen, and Lakens (2020) found that this format reduced the rate of significant results from over 90% to about 50%. This difference suggests that the normal literature has a strong bias to publish significant results (Bakker, van Dijk, & Wicherts, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995).
Registered replication reports are registered reports that aim to replicate an original study in a high-powered study with many laboratories. Most registered replication reports have produced replication failures (Kvarven, Strømland, & Johannesson, 2020). These failures are especially stunning because registered replication reports have a much higher chance to produce a significant result than the original studies with much smaller samples. Thus, the failure to replicate ego depletion (Hagger et al., 2016) or facial feedback (Acosta et al., 2016) effects was shocking.
Replication failures of specific studies are important for specific theories, but they do not examine the crucial question of whether these failures are anomalies or symptomatic of a wider problem in psychological science. Answering this broader question requires a representative sample of studies from the population of results published in psychology journals. Given the diversity of psychology, this is a monumental task.
A first step toward this goal was the Reproducibility Project that focused on results published in three psychology journals in the year 2008. The journals represented social/personality psychology (JPSP), cognitive psychology (Journal of Experimental Psychology: Learning, Memory, and Cognition), and all areas of psychology (Psychological Science). Although all articles published in 2008 were eligible, not all studies were replicated, in part because some studies were very expensive or difficult to replicate. In the end, 97 studies with significant results were replicated. The headline finding was that only 37% of the replication studies replicated a statistically significant result.
This finding has been widely cited as evidence that psychology has a replication problem. However, headlines tend to blur over the fact that results varied as a function of discipline. While the success rate for cognitive psychology was 50% and even higher for within-subject designs with many observations per participant, the success rate was only 25% for social psychology and even lower for the typical between-subjects design that was employed to study ego depletion, facial feedback, or other prominent topics in social psychology.
These results do not warrant the broad claim that psychology has a replication crisis or that most results published in psychology are false. A more nuanced conclusion is that social psychology has a replication crisis and that methodological factors account for these differences. Disciplines that use designs with low statistical power are more likely to have a replication crisis.
To conclude, the 2010s have seen a rise in publications of nonsignificant results that fail to replicate original results and that contradict theoretical predictions. The replicability of published results is particularly low in social psychology.
Responses to the Replication Crisis in Social Psychology
There have been numerous responses to the replication crisis in social psychology. Broadly, they can be classified as arguments that support the notion of a crisis and arguments that claim that there is no crisis. I first discuss problems with no-crisis arguments. I then examine the pro-crisis arguments and discuss their implications for the future of psychology as a science.
No Crisis: Downplaying the Finding
Some social psychologists have argued that the term crisis is inappropriate and overly dramatic. “Every generation or so, social psychologists seem to enjoy experiencing a ‘crisis.’ While sympathetic to the underlying intentions underlying these episodes— first the field’s relevance, then the field’s methodological and statistical rigor—the term crisis seems to me overly dramatic. Placed in a positive light, social psychology’s presumed ‘crises’ actually marked advances in the discipline” (Pettigrew, 2018, p. 963). Others use euphemistic and vague descriptions of the low replication rate in social psychology. For example, Fiske (2017) notes that “like other sciences, not all our effects replicate” (p. 654). Crandall and Sherman (2016) note that the number of successful replications in social psychology was “at a lower rate than expected” (p. 94).
These comments downplay the stunning finding that only 25% of social psychology results could be replicated. Rather than admitting that there is a problem, these social psychologists find fault with critics of social psychology. “I have been proud of the professional stance of social psychology throughout my long career. But unrefereed blogs and social media attacks sent to thou- sands can undermine the professionalism of the discipline” (Pettigrew, 2018, p. 967). I would argue that lecturing thousands of students each year based on evidence that is not replicable is a bigger problem than talking openly about the low replicability of social psychology on social media.
No Crisis: Experts Can Reliably Produce Effects
After some influential priming results could not be replicated, Daniel Kahneman wrote a letter to John Bargh and suggested that leading priming researchers should conduct a series of replication studies to demonstrate that their original results are replicable (Yong, 2012). In response, Bargh and other prominent social psychologists conducted numerous studies that showed the effects are robust. At least, this is what might have happened in an alternate universe. In this universe, there have been few attempts to self-replicate original findings. Bartlett (2013) asked Bargh why he did not prove his critics wrong by doing the study again. “So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway” (Bartlett, 2013).
Bargh’s answer is not very convincing. “Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a ‘special touch’ when it comes to priming, a comment that sounds like a compliment but isn’t. ‘I don’t think anyone would believe me,’ he says” (Bartlett, 2013).
One self-replication ended with a replication failure (Elkins- Brown, Saunders, & Inzlicht, 2018). One notable successful self- replication was conducted by Petty and colleagues (Luttrell, Petty, & Xu, 2017), after a replication study by Ebersole et al. (2016) failed to replicate a seminal finding by Cacioppo, Petty, and Morris (1983) that need for cognition moderates the effect of argument strength on attitudes. Luttrell et al. (2017) were able to replicated the original finding by Cacioppo et al., and they repro- duced the nonsignificant result of Ebersole et al.’s replication study. In addition, they found a significant interaction with exper- imental design, indicating that procedural differences made the effect weaker in Ebersole et al.’s replication study. This study has been celebrated as an exemplary way to respond to replication failures. It also suggests that flaws in replication studies are some- times responsible for replication failures. However, it is impossible to generalise from this single instance to other replication failures. Thus, it remains unclear how many replication failures were caused by problems with the replication studies.
No Crisis: Decline Effect
The idea that replication failures occur because effects weaken over time was proposed by Johnathan Schooler and popularized in a New Yorker article (Lehrer, 2010). Schooler coined the term decline effect for the observation that effect sizes often decrease over time. Unfortunately, it does not work for more mundane behaviours like eating cheesecake. No matter how often you eat cheesecakes, they still add pounds to your weight. However, for effects in social psychology, it seems to be the case that it is easier to discover effects than to replicate them (Wegner, 1992). This is also true for Schooler and Engstler-Schooler’s (1990) verbal over- shadowing effect. A registered replication report replicated a statistically significant effect but with smaller effect sizes (Alogna et al., 2014). Schooler (2014) considered this finding a win-win because his original results had been replicated, and the reduced effect size supported the presence of a decline effect. However, the notion of a decline effect is misleading because it merely describes a phenomenon rather than providing an explanation for it. Schooler (2014) offered several possible explanations. One possible explanation was regression to the mean (see next paragraph). A second explanation was that slight changes in experimental procedures can reduce effect sizes (more detailed discussion below). More controversial, Schooler also eludes to the possibility that some paranormal processes may produce a decline effect. “Perhaps, there are some parallels between VO [verbal overshadowing] effects and parapsychology after all, but they reflect genuine unappreciated mechanisms of nature (Schooler, 2011) and not simply the product of publication bias or other artifact” (p. 582). Schooler, however, fails to acknowledge that a mundane explanation for the decline effect involves questionable research practices that inflate effect size estimates in original studies. Using statistical tools, Francis (2012) showed that Schooler’s original verbal over-shadowing studies showed signs of bias. Thus, there is no need to look for paranormal explanation of the decline effect in verbal overshadowing. The normal practices of selectively publishing only significant results are sufficient to explain it. In sum, the decline effect is descriptive rather than explanatory, and Schooler’s suggestion that it reflects some paranormal phenomena is not supported by scientific evidence.
No Crisis: Regression to the Mean Is Normal
Regression to the mean has been invoked as one possible explanation for the decline effect (Fiedler, 2015; Schooler, 2014). Fiedler’s argument is that random measurement error in psycho- logical measures is sufficient to produce replication failures. How- ever, random measurement error is neither necessary nor sufficient to produce replication failures. The outcome of a replication study is determined solely by a study’s statistical power, and if the replication study is an exact replication of an original study, both studies have the same amount of random measurement error and power (Brunner & Schimmack, 2020). Thus, if the Open Science Collaboration (OSC) project found 97 significant results in 100 published studies, the observed discovery rate of 97% suggests that the studies had 97% power to obtain a significant result. Random measurement error would have the same effect on power in the replication studies. Thus, random measurement error cannot ex- plain why the replication studies produced only 37% significant results. Therefore, Fiedler’s claim that random measurement error alone explains replication failures is based on a misunderstanding of the phenomenon of regression to the mean.
Moreover, regression to the mean requires that studies were selected for significance. Schooler (2014) ignores this aspect of regression to the mean when he suggests that regression to the mean is normal and expected. It is not. The effect sizes of eating cheesecake do not decrease over time because there is no selection process. In contrast, the effect sizes of social psychological experiments decrease when original articles selected significant results and replication studies do not select for significance. Thus, it is not normal for success rates to decrease from 97% to 25%, just like it would not be normal for a basketball players’ free-throw percent- age to drop from 97% to 25%. In conclusion, regression to the mean implies that original studies were selected for significance and would suggest that replication failures are produced by questionable research practices. Regression to the mean therefore be- comes an argument why there is a crisis once it is recognized that it requires selective reporting of significant results, which leads to illusory success rates in psychology journals.
No Crisis: Exact Replications Are Impossible
Heraclitus, an ancient Greek philosopher, observed that you can never step into the same river twice. Similarly, it is impossible to exactly re-create the conditions of a psychological experiment. This trivial observation has been used to argue that replication failures are neither surprising nor problematic but rather the norm. We should never expect to get the same result from the same paradigm because the actual experiments are never identical, just like a river is always changing (Stroebe & Strack, 2014). This argument has led to a heated debate about the distinction and value of direct versus conceptual replication studies (Crandall & Sherman, 2016; Pashler & Harris, 2012; Zwaan, Etz, Lucas, & Donnellan, 2018).
The purpose of direct replication studies is to replicate an original study as closely as possible so that replication failures can correct false results in the literature (Pashler & Harris, 2012). However, journals were reluctant to publish replication failures. Thus, a direct replication had little value. Either the results were not significant or they were not novel. In contrast, conceptual replication studies were publishable as long as they produced a significant result. Thus, publication bias provides an explanation for many seemingly robust findings (Bem, 2011) that suddenly cannot be replicated (Galak et al., 2012). After all, it is simply not plausible that conceptual replications that intentionally change features of a study are always successful, while direct replications that try to reproduce the original conditions as closely as possible fail in large numbers.
The argument that exact replications are impossible also ignores the difference between disciplines. Why is there no replication crisis in cognitive psychology if each experiment is like a new river? And why does eating cheesecake always lead to a weight gain, no matter whether it is chocolate cheesecake, raspberry white-truffle cheesecake, or caramel fudge cheesecake? The reason is that the main features of rivers remain the same. Even if the river is not identical, you still get wet every time you step into it. To explain the higher replicability of results in cognitive psychology than in social psychology, Van Bavel, Mende-Siedlecki, Brady, and Reinero (2016) proposed that social psychological studies are more difficult to replicate for a number of reasons. They called this property of studies contextual sensitivity. Coding studies for contextual sensitivity showed the predicted negative correlation between contextual sensitivity and replicability. However, Inbar (2016) found that this correlation was no longer significant when discipline was included as a predictor. Thus, the results suggested that social psychological studies are more contextually sensitive and less replicable but that contextual sensitivity did not explain the lower replicability of social psychology.
It is also not clear that contextual sensitivity implies that social psychology does not have a crisis. Replicability is not the only criterion of good science, especially if exact replications are impossible. Findings that can only be replicated when conditions are reproduced exactly lack generalizability, which makes them rather useless for applications and for construction of broader theories. Take verbal overshadowing as an example. Even a small change in experimental procedures reduced a practically significant effect size of 16% to a no longer meaningful effect size of 4% (Alogna et al., 2014), and neither of these experimental conditions were similar to real-world situations of eyewitness identification. Thus, the practical implications of this phenomenon remain unclear because it depends too much on the specific context.
In conclusion, empirical results are only meaningful if research- ers have a clear understanding of the conditions that can produce a statistically significant result most of the time (Fisher, 1926). Contextual sensitivity makes it harder to do so. Thus, it is one potential factor that may contribute to the replication crisis in social psychology because social psychologists do not know under which conditions their results can be reproduced. For example, I asked Roy F. Baumeister to specify optimal conditions to replicate ego depletion. He was unable or unwilling to do so (Baumeister, 2016).
No Crisis: The Replication Studies Are Flawed
The argument that replication studies are flawed comes in two flavors. One argument is that replication studies are often carried out by young researchers with less experience and expertise. They did their best, but they are just not very good experimenters (Gilbert, King, Pettigrew, & Wilson, 2016). Cunningham and Baumeister (2016) proclaim, “Anyone who has served on university thesis committees can attest to the variability in the competence and commitment of new researchers. Nonetheless, a graduate committee may decide to accept weak and unsuccessful replication studies to fulfill degree requirements if the student appears to have learned from the mistakes” (p. 4). There is little evidence to support this claim. In fact, a meta-analysis found no differences in effect sizes between studies carried out by Baumeister’s lab and other labs (Hagger, Wood, Stiff, & Chatzisarantis, 2010).
The other argument is that replication failures are sexier and more attention grabbing than successful replications. Thus, replication researchers sabotage their studies or data analyses to produce nonsignificant results (Bryan, Yeager, & O’Brien, 2019; Strack, 2016). The latter accusations have been made without empirical evidence to support this claim. For example, Strack (2016) used a positive correlation between sample size and effect size to claim that some labs were motivated to produce nonsignificant results, presumably by using a smaller sample size. However, a proper bias analysis showed no evidence that there were too few significant results (Schimmack, 2018a). Moreover, the overall effect size across all labs was also nonsignificant.
Inadvertent problems, however, may explain some replication failures. For example, some replication studies reduced statistical power by replicating a study with a smaller sample than the original study (OSC, 2015; Ritchie et al., 2012). In this case, a replication failure could be a false negative (Type II error). Thus, it is problematic to conduct replication studies with smaller samples. At the same time, registered replication reports with thou- sands of participants should be given more weight than original studies with fewer than 100 participants. Size matters.
However, size is not the only factor that matters, and researchers disagree about the implications of replication failures. Not surpris- ingly, authors of the original studies typically recognise some problems with the replication attempts (Baumeister & Vohs, 2016; Strack, 2016; cf. Skibba, 2016). Ideally, researchers would agree ahead of time on a research design that is acceptable to all parties involved. Kahneman called this model an adversarial collaboration (Kahneman, 2003). However, original researchers have either not participated in the planning of a study (Strack, 2016) or withdrawn their approval after the negative results were known (Baumeister & Vohs, 2016). No author of an original study that failed to replicate has openly admitted that questionable research practices contributed to replication failures.
In conclusion, replication failures can occur for a number of reasons, just like significant results in original studies can occur for a number of reasons. Inconsistent results are frustrating because they often require further research. This being said, there is no evidence that low quality of replication studies is the sole or the main cause of replication failures in social psychology.
No Crisis: Replication Failures Are Normal
In an opinion piece for the New York Times, Lisa Feldmann Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Barrett, 2015). On the surface, Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists would conduct studies with 80% power, one out of five studies would fail to replicate, even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hypotheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false-positive result and replication failures demonstrate that it was a false positive. Thus, honest reporting of replication failures plays an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science, especially when social psychology journals publish over 90% significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). This discrepancy suggests that the problem is not the low success rate in replication studies but the high success rate in psychology journals. If social psychologists tested risky hypotheses that have a high probability of being false, journals should report a lot of nonsignificant results, especially in articles that report multiple tests of the same hypothesis, but they do not (cf. Schimmack, 2012).
Crisis: Original Studies Are Not Credible Because They Used Null-Hypothesis Significance Testing
Bem’s anomalous results were published with a commentary by Wagenmakers, Wetzels, Borsboom, and van der Maas (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). Bem presented nine results with p values below .05 as evidence for ESP. Wagenmakers et al. object to the use of a significance criterion of .05 and argue that this criterion makes it too easy to publish false-positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes factors. When they used Bayes factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings, they argued that psychologists must change the way they analyse their data. Since then, Wagenmakers has worked tirelessly to promote Bayes factors as an alternative to NHST. However, Bayes factors have their own problems. The biggest problem is that they depend on the choice of a prior.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than one stan- dard deviation (Cohen’s d > 1). Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0
No Crisis: Replication Failures Are Normal
In an opinion piece for the New York Times, Lisa Feldmann Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Barrett, 2015). On the surface, Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists would conduct studies with 80% power, one out of five studies would fail to replicate, even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hy- potheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false-positive result and replication failures demonstrate that it was a false positive. Thus, honest reporting of replication failures plays an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science, especially when social psychology journals publish over 90% significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). This discrepancy suggests that the problem is not the low success rate in replication studies but the high success rate in psychology journals. If social psychologists tested risky hypothe- ses that have a high probability of being false, journals should report a lot of nonsignificant results, especially in articles that report multiple tests of the same hypothesis, but they do not (cf. Schimmack, 2012).
Crisis: Original Studies Are Not Credible Because They Used Null-Hypothesis Significance Testing
Bem’s anomalous results were published with a commentary by Wagenmakers, Wetzels, Borsboom, and van der Maas (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). Bem presented nine results with p values below .05 as evidence for extrasensory perception (ESP). Wagenmakers et al. object to the use of a significance criterion of .05 and argue that this criterion makes it too easy to publish false-positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes factors. When they used Bayes factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings, they argued that psychologists must change the way they analyse their data. Since then, Wagenmakers has worked tirelessly to promote Bayes factors as an alternative to NHST. However, Bayes factors have their own problems. The biggest problem is that they depend on the choice of a prior.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than one standard deviation (Cohen’s d > 1). Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0 and 1. This prior makes no sense for research on ESP processes that are expected to produce small effects.
When Bem et al. (2011) specified a more reasonable prior, Bayes factors actually showed more evidence for ESP than NHST. Moreover, the results of individual studies are less important than the combined evidence across studies. A meta-analysis of Bem’s studies shows that even with the default prior, Bayes factors reject the null hypothesis with an odds ratio of 1 billion to 1. Thus, if we trust Bem’s data, Bayes factors also suggest that Bem’s results are robust, and it remains unclear why Galak et al. (2012) were unable to replicate Bem’s results.
Another argument in favour of Bayes-Factors is that NHST is one-sided. Significant results are used to reject the null-hypothesis, but nonsignificant results cannot be used to affirm the null- hypothesis. This makes nonsignificant results difficult to publish, which leads to publication bias. The claim is that Bayes factors solve this problem because they can provide evidence for the null hypothesis. However, this claim is false (Tendeiro & Kiers, 2019). Bayes factors are odds ratios between two alternative hypotheses. Unlike in NHST, these two competing hypotheses are not mutually exclusive. That is, an infinite number of additional hypotheses are not tested. Thus, if the data favour the null hypothesis, they do not provide support for the null hypothesis. They merely provide evidence against one specified alternative hypothesis. There is always another possible alternative hypothesis that fits the data better than the null hypothesis. As a result, even Bayes factors that strongly favour H0 fail to provide evidence that the true effect size is exactly zero.
The solution to this problem is not new but unfamiliar to many psychologists. To demonstrate the absence of an effect, it is necessary to specify a region of effect sizes around zero and to demonstrate that the population effect size is likely to be within this region. This can be achieved using NHST (equivalence tests; Lakens, Scheel, & Isager, 2018) or Bayesian statistics (Kruschke & Liddell, 2018). The main reason why psychologists are not familiar with tests that demonstrate the absence of an effect may be that typical sample sizes in psychology are too small to produce precise estimates of effect sizes that could justify the conclusion that the population effect size is too close to zero to be meaningful.
An even more radical approach was taken by the editors of Basic and Applied Social Psychology (Trafimow & Marks, 2015), who claimed that NHST is logically invalid (Trafimow, 2003). Based on this argument, the editors banned p values from publications, which solves the problem of replication failures because there are no formal inferential tests. However, authors continue to draw causal inferences that are in line with NHST but simply omit statements about p values. It is not clear that this cosmetic change in the presentation of results is a solution to the replication crisis.
In conclusion, Wagenmakers et al. and others have blamed the use of NHST for the replication crisis, but this criticism ignores the fact that cognitive psychology also uses NHST and does not suffer a replication crisis. The problem with Bem’s results was not the use of NHST but the use of questionable research practices to produce illusory evidence (Francis, 2012; Schimmack, 2012, 2018b, 2020).
Crisis: Original Studies Report Many False Positives
An influential article by Ioannidis (2005) claimed that most published research findings are false. This eye-catching claim has been cited thousands of times. Few citing authors have bothered to point out that the claim is entirely based on hypothetical scenarios rather than empirical evidence. In psychology, fear that most published results are false positives was stoked by Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” ar- ticle that showed with simulation studies that the aggressive use of questionable research practices can dramatically increase the prob- ability that a study produces a significant result without a real effect. These articles shifted concerns about false negatives in the 1990s (e.g., Cohen, 1994) to concerns about false positives.
The problem with the current focus on false-positive results is that it implies that replication failures reveal false-positive results in original studies. This is not necessarily the case. There are two possible explanations for a replication failure. Either the original study had low power to show a true effect (the nil hypothesis is false) or the original study reported a false-positive result and the nil hypothesis is true. Replication failures do not distinguish be- tween true and false nil hypothesis, but they are often falsely interpreted as if replication failures reveal that the original hypothesis was wrong. For example, Nelson, Simmons, and Simonsohn (2018) write, “Experimental psychologists spent several decades relying on methods of data collection and analysis that make it too easy to publish false-positive, nonreplicable results. During that time, it was impossible to distinguish between findings that are true and replicable and those that are false and not replicable” (p. 512). This statement ignores that results can be true but difficult to replicate and that the nil hypothesis is often unlikely to be true.
The false assumption that replication failures reveal false- positive results has created a lot of confusion in the interpretation of replication failures (Maxwell, Lau, & Howard, 2015). For example, Gilbert et al. (2016) attribute the low replication rate in the reproducibility project to low power of the replication studies. This does not make sense when the replication studies had the same or sometimes even larger sample sizes than the original studies. As a result, the replication studies had as much or more power than the original studies. So, how could low power explain that discrepancy between the 97% success rate in original studies and the 25% success rate in replication studies? It cannot.
Gilbert et al.’s (2016) criticism only makes sense if replication failures in the replication studies are falsely interpreted as evidence that the original results were false positives. Now it makes sense to argue that both the original studies and the replication studies had low power to detect true effects and that replication failures are expected when true effects are tested in studies with low power. The only question that remains is why original studies all reported significant results when they had low power, but Gilbert et al. (2016) do not address this question.
Aside from Simmons et al.’s (2011) simulation studies, a few articles tried to examine the rate of false-positive results empirically. One approach is to examine sign changes in replication studies. If 100 true null hypotheses are tested, 50 studies are expected to show a positive sign and 50 studies are expected to show a negative sign due to random sampling error. If these 100 studies are replicated, this will happen again. Just like two coin flips, we would therefore expect 50 studies with the same outcome(both positive or both negative) and 50 studies with different outcomes (one positive, one negative).
Wilson and Wixted (2018) found that 25% of social psychological results in the OSC project showed a sign reversal. This would suggest that 50% of the studies tested a true null hypothesis. Of course, sign reversals are also possible when the effect size is not strictly zero. However, the probability of a sign reversal decreases as effect sizes increase. Thus, it is possible to say that about 50% of the replicated studies had an effect size close to zero. Unfortunately, this estimate is imprecise due to the small sample size.
Gronau, Duizer, Bakker, and Wagenmakers (2017) attempted to estimate the false discovery rate using a statistical model that is fitted to the exact p values of original studies. The applied this model to three data sets and found false discovery rates (FDRs) of 34-46% for cognitive psychology, 40 – 60% for social psychology in general, and 48-88% for social priming. However, Schimmack and Brunner (2019) discovered a statistical flaw in this model that leads to the overestimation of the FDR. They also pointed out that it is impossible to provide exact estimates of the FDR because the distinction between absolutely no effect and a very small effect is arbitrary.
Bartoš and Schimmack (2020) developed a statistical model, called z-curve.2.0, that makes it possible to estimate the maximum FDR. If this maximum is low, it suggests that most replication failures are due to low power. Applying z-curve2.0 to Gronau et al.’s (2017) data sets yields FDRs of 9% (95% CI [2%, 24%]) for cognitive psychology, 26% (95% CI [4%, 100%]) for social psychology, and 61% (95% CI [19%, 100%]) for social priming. The z-curve estimate that up to 61% of social priming results could be false positives justifies Kahneman’s letter to Bargh that called out social priming research as the “poster child for doubts about the integrity of psychological research” (cf. Yong, 2012). The difference between 9% for cognitive psychology and 61% for social priming makes it clear that it is not possible to generalize from the replication crisis in social psychology to other areas of psychology. In conclusion, it is impossible to specify exactly whether an original finding was a false-positive result or not. There have been several attempts to estimate the number of false-positive results in the literature, but there is no consensus about the proper method to do so. I believe that the distinction between false and true positives is not particularly helpful if the null hypothesis is specified as a value of zero. An effect size of d = .0001 is not any more meaningful than an effect size of d = 0000. To be meaningful, published results should be replicable given the same sample sizes as used in original research. Demonstrating a significant result in the same direction in a much larger sample with a much smaller effect size should not be considered a successful replication.
Crisis: Original Studies Are Selected for Significance
The most obvious explanation for the replication crisis is the well-known bias to publish only significant results that confirm theoretical predictions. As a result, it is not necessary to read the results section of a psychological article. It will inevitably report confirmatory evidence, p < .05. This practice is commonly known as publication bias. Concerns about publication bias are nearly as old as empirical psychology (Rosenthal, 1979; Sterling, 1959). Kerr (1998) published his famous “HARKing” (hypothesising after results are known) article to explain how social psychologists were able to report mostly significant results. Social psychology journals responded by demanding that researchers publish multiple replication studies within a single article (cf. Wegner, 1992). These multiple-study articles created a sense of rigor and made false- positive results extremely unlikely. With five significant results with p < .05, the risk of a false-positive result is smaller than the criterion used by particle physicists to claim a discovery (cf. Schimmack, 2012). Thus, Bem’s (2011) article that contained nine successful studies exceeded the stringent criterion that was used to claim the discovery of the Higgs-Boson particle, the most celebrated findings in physics in the 2010s. The key difference be- tween the discovery of the Higgs-Boson particle in 2012 and Bem’s discovery of mental time travel is that physicists conducted a single powerful experiment to test their predictions, while Bem conducted many studies and selectively published results that supported his claim (Schimmack, 2018b). Bem (2012) even admitted that he ran many small studies that were not included in the article. At the same time, he was willing to combine several small studies with promising trends into a single data set. For example, Study 6 was really four studies with Ns = 50, 41, 19, and 40 (cf. Schimmack, Schultz, Carlsson, & Schmukle, 2018). These questionable, to say the least, practices were so common in social psychology that leading social psychologists were unwilling to retract Bem’s article because this practice was considered acceptable (Kitayama, 2018).
There have been three independent approaches to examine the use of questionable research practices. All three approaches show converging evidence that questionable practices inflate the rate of significant results in social psychology journals. Cairo, Green, Forsyth, Behler, and Raldiris (2020) demonstrated that published articles report more significant results than dissertations. John et al. (2012) found evidence for the use of questionable practices with a survey of research practices. The most widely used QRPs were not reporting all dependent variables (65%), collecting more data after snooping (57%), and selectively reporting studies that worked (48%). Moreover, researchers found these QRPs acceptable with defensibility ratings (0 –2) of 1.84, 1.79, and 1.66, respectively. Thus, researchers are using questionable practices because they do not consider them to be problematic. It is unclear whether attitudes toward questionable research practices have changed in response to the replication crisis.
Social psychologists have responded to John et al.’s (2012) article in two ways. One response was to question the importance of the findings. Stroebe and Strack (2014) argued that these practices may not be questionable, but they do not counter Sterling’s argument that these practices invalidate the meaning of significance testing and p values. Fiedler and Schwarz (2016) argue that John et al.’s (2012) survey produced inflated estimates of the use of QRPs. However, they fail to provide an alternative explanation for the low replication rate of social psychological research.
Statistical methods that can reveal publication bias provide additional evidence about the use of QRPs. Although these tests often have low power in small sets of studies (Renkewitz & Keiner, 2019), they can provide clear evidence of publication bias when bias is large (Francis, 2012; Schimmack, 2012) or when the set of studies is large (Carter, Kofler, Forster, & McCullough, 2015; Carter & McCullough, 2013, 2014). One group of bias tests compares the success rate to estimates of mean power. The advantage of these tests is that they provide clear evidence of QRPs. Francis used this approach to demonstrate that 82% of articles with four or more studies that were published between 2009 and 2012 in Psychological Science showed evidence of bias. Given the small set of studies, this finding implies that selection for significance was severe (Schimmack, 2020).
Social psychologists have mainly ignored evidence that QRPs were used to produce significant results. John et al.’s article has been cited over 500 times, but it has not been cited by social psychologists who commented on the replication crisis like Fiske, Baumeister, Gilbert, Wilson, or Nisbett. This is symptomatic of the response by some eminent social psychologists to the replication crisis. Rather than engaging in a scientific debate about the causes of the crisis, they have remained silent or dismissed critics as unscientific. “Some critics go beyond scientific argument and counterargument to imply that the entire field is inept and misguided (e.g., Gelman, 2014; Schimmack, 2014)” (Fiske, 2017, p. 653). Yet, Fiske fails to explain why social psychological results cannot be replicated.
Others have argued that Francis’s work is unnecessary because the presence of publication bias is a well-known fact. Therefore, “one is guaranteed to eventually reject a null we already know is false” (Simonsohn, 2013, p. 599). This argument ignores that bias tests can help to show that social psychology is improving. For example, bias tests show no bias in registered replication reports, indicating that this new format produces more credible results (Schimmack, 2018a).
Murayama, Pekrun, and Fiedler (2014) noted that demonstrating the presence of bias does not justify the conclusion that there is no effect. This is true but not very relevant. Bias undermines the credibility of the evidence that is supposed to demonstrate an effect. Without credible evidence, it remains uncertain whether an effect is present or not. Moreover, Murayama et al. acknowledge that bias always inflates effect size estimates, which makes it more difficult to assess the practical relevance of published results.
A more valid criticism of Francis’s bias analyses is that they do not reveal the amount of bias (Simonsohn, 2013). That is, when we see 95% significant results in a journal and there is bias, it is not clear whether mean power was 75% or 25%. To be more useful, bias tests should also provide information about the amount of bias.
In conclusion, selective reporting of significant results inflates effect sizes, and the observed discovery rate in journals gives a false impression of the power and replicability of published results. Surveys and bias tests show that the use of QRPs in social psychology were widespread. However, bias tests merely show that QRPs were used. They do not show how much QRPs influenced reported results.
z-Curve: Quantifying the Crisis
Some psychologists developed statistical models that can quantify the influence of selection for significance on replicability. Brunner and Schimmack (2020) compared four methods to estimate the expected replication rate (ERR), including the popular p-curve method (Brunner, 2018; Simonsohn, Nelson, & Simmons, 2014; Ulrich & Miller, 2018). They found that p-curve overestimated replicability when effect sizes vary across studies. In contrast, a new method called z-curve performed well across many scenarios, especially when heterogeneity was present.
Bartoš and Schimmack (2020) validated an extended version of z-curve (z-curve2.0) that provides confidence intervals and pro- vides estimates of the expected discovery rate, that is, the percent- age of observed significant results for all tests that were conducted, even if they were not reported. To do so, z-curve estimates the size of the file drawer of unpublished studies with nonsignificant results. The z-curve has already been applied to various data sets of results in social psychology (see R-Index blog for numerous examples).
The most important data set was created by Motyl et al. (2017), who used representative sampling of social psychology journals to examine the credibility of social psychology. The data set was also much larger than the 100 studies of the actual replication project (OSC, 2015). The main drawback of Motyl et al.’s audit of social psychology was that they did not have a proper statistical tool to estimate replicability. I used this data set to estimate the replica- bility of social psychology based on a representative sample of studies. To be included in the z-curve analysis, a study had to use a t test or F test with no more than four numerator degrees of freedom. I excluded studies from the journal Psychological Science to focus on social psychology. This left 678 studies for analysis. The set included 450 between-subjects studies, 139 mixed designs, and 67 within-subject designs. The preponderance of between-subjects designs is typical of social psychology and one of the reasons for the low power of studies in social psychology.
Figure 1 was created with the R-package zcurve. The figure shows a histogram of test statistics converted into z-scores. The red line shows statistical significance at z = 1.96, which corresponds to p < .05 (two-tailed). The blue line shows the predicted values based on the best-fitting mixture model that is used to estimate the expected replication rate and the expected discovery rate. The dotted lines show 95% confidence intervals.
The results in Figure 1 show an expected replication rate of 43% (95% CI [36%, 52%]). This result is a bit better than the 25% estimate obtained in the OSC project. There are a number of possible explanations for the discrepancy between the OSC estimate and the z-curve estimate. First of all, the number of studies in the OSC project is very small and sampling error alone could explain some of the differences. Second, the set of studies in the OSC project was not representative and may have selected studies with lower replicability. Third, some actual replication studies may have modified procedures in ways that lowered the chance of obtaining a significant result. Finally, it is never possible to exactly replicate a study (Stroebe & Strack, 2014; Van Bavel et al., 2016). Thus, z-curve estimates are overly optimistic because they assume exact replications. If there is contextual sensitivity, selection for significance will produce additional regression to the mean, and a better estimate of the actual replication rate is the expected discovery rate, EDR (Bartoš & Schimmack, 2020). The estimated EDR of 21% is close to the 25% estimate based on actual replication studies. In combination, the existing evidence suggests that the replicability of social psychological research is somewhere be- tween 20% and 50%, which is clearly unsatisfactory and much lower than the observed discovery rate of 90% or more in social psychology journals.
Figure 1 also clearly shows that questionable research practices explain the gap between success rates in laboratories and success rates in journals. The z-curve estimate of nonsignificant results shows that a large proportion of nonsignificant results is expected, but hardly any of these expected studies ever get published. This is reflected in an observed discovery rate of 90% and an expected discovery rate of 21%. The confidence intervals do not overlap, indicating that this discrepancy is statistically significant. Given such extreme selection for significance, it is not surprising that published effect sizes are inflated and replication studies fail to reproduce significant results. In conclusion, out of all explanations for replication failures in psychology, the use of questionable research practices is the main factor.
The z-curve can also be used to examine the power of subgroups of studies. In the OSC project, studies with a z-score greater than 4 had an 80% chance to be replicated. To achieve an ERR of 80% with Motyl et al.’s (2017) data, z-scores have to be greater than 3.5. In contrast, studies with just significant results (p < .05 and p > .01) have an ERR of only 28%. This information can be used to reevaluate published results. Studies with p values between .05 and .01 should not be trusted unless other information suggests otherwise (e.g., a trustworthy meta-analysis). In contrast, results with z-scores greater than 4 can be used to plan new studies. Unfortunately, there are much more questionable results with p values greater than .01 (42%) than trustworthy results with z > 4 (17%), but at least there are some findings that are likely to replicate even in social psychology.
An Inconvenient Truth
Every crisis is an opportunity to learn from mistakes. Lending practices were changed after the financial crisis in the 2000s. Psychologists and other sciences can learn from the replication crisis in social psychology, but only if they are honest and upfront about the real cause of the replication crisis. Social psychologists did not use the scientific method properly. Neither Fisher nor Neyman and Pearson, who created NHST, proposed that nonsignificant results are irrelevant or that only significant results should be published. The problems of selection for significance is evident and has been well known (Rosenthal, 1979; Sterling, 1959). Cohen (1962) warned about low power, but the main concern was a large file drawer filled with Type II errors. Nobody could imagine that whole literatures with hundreds of studies are built on nothing but sampling error and selection for significance. Bem’s article and replication failures in the 2010s showed that the abuse of questionable research practices was much more excessive than any- body was willing to believe.
The key culprit were conceptual replication studies. Even social psychologists were aware that it is unethical to hide replication failures. For example, Bem advised researchers to use questionable research practices to find significant results in their data. “Go on a fishing expedition for something—anything—interesting, even if this meant to ‘err on the side of discovery’” (Bem, 2000). However, even Bem made it clear that “this is not advice to suppress negative results. If your study was genuinely designed to test hypotheses that derive from a formal theory or are of wide general interest for some other reason, then they should remain the focus of your article. The integrity of the scientific enterprise requires the reporting of disconfirming results.”
How did social psychologists justify to themselves that it is OK to omit nonsignificant results? One explanation is the distinction between direct and conceptual replications. Conceptual replications always vary at least a small detail of a study. Thus, a nonsignificant result is never a replication failure of a previous study. It is just a failure of a specific study to show a predicted effect. Graduate students were explicitly given the advice to “never do a direct replication; that way, if a conceptual replication doesn’t work, you maintain plausible deniability” (Anonymous, cited in Spellman, 2015). This is also how Morewedge, Gilbert, and Wilson (2014) explain why they omitted nonsignificant results from a publication:
Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know.
It was only in 2012 that psychologists realized that changing results in their studies were heavily influenced by sampling error and not by some minor changes in the experimental procedure. Only a few psychologists have been open about this. In a commendable editorial, Lindsay (2019) talks about his realization that his research practices were suboptimal:
Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules— but we sometimes eventually threw in the towel because results were maddeningly inconsistent.
Rather than invoking some supernatural decline effect, Lindsay realized that his research practices were suboptimal. A first step for social psychologists is to acknowledge their past mistakes and to learn from their mistakes. Making mistakes is a fact of life. What counts is the response to a mistake. So far, the response by social psychologists has been underwhelming. It is time for some leaders to step up or to step down and make room for a new generation of social psychologists who follow open and transparent practices.
The Way Out of the Crisis
A clear analysis of the replication crisis points toward a clear path out of the crisis. Given that “lax data collection, analysis, and reporting” standards (Carpenter, 2012, p. 1558) allowed for the use of QRPs that undermine the credibility of social psychology, the most obvious solution is to ban the use of questionable research practices and to treat them like other types of unethical behaviours (Engel, 2015). However, no scientific organisation has clearly
stated which practices are acceptable and which practices are not, and prominent social psychologists oppose clear rules of scientific misconduct (Fiske, 2016).
At present, the enforcement of good practices is left to editors of journals who can ask pertinent questions during the submission process (Lindsay, 2019). Another solution has been to ask re- searchers to preregister their studies, which limits researchers’ freedom to go on a fishing expedition (Nosek, Ebersole, DeHaven, & Mellor, 2018). Some journals reward preregistering with badges (JESP), but some social psychology journals do not (PSPB, SPPS). There has been a lot of debate about the value of preregistration and concerns that it may reduce creativity. However, preregistra- tion does not imply that all research has to be confirmatory. It merely makes it possible to distinguish clearly between explor- atory and confirmatory research.
It is unlikely that preregistration alone will solve all problems, especially because there are no clear standards about preregistra- tions and how much they constrain the actual analyses. For exam- ple, Noah, Schul, and Mayo (2018) preregistered the prediction of an interaction between being observed and a facial feedback ma- nipulation. Although the predicted interaction was not significant, they interpreted the nonsignificant pattern as confirming their prediction rather than stating that there was no support for their preregistered prediction of an interaction effect. A z-curve analysis of preregistered studies in JESP still found evidence of QRPs, although less so than for articles that were not preregistered (Schimmack, 2020). To improve the value of preregistration, so- cieties should provide clear norms for research ethics that can be used to hold researchers accountable when they try to game preregistration (Yamada, 2018).
Preregistration of studies alone will only produce more nonsig- nificant results and not increase the replicability of significant results because studies are underpowered. To increase replicabil- ity, social psychologists finally have to conduct power analysis to plan studies that can produce significant results without QRPs. This also means they need to publish less because more resources are needed for a single study (Schimmack, 2012).
To ensure that published results are credible and replicable, I argue that researchers should be rewarded for conducting high- powered studies. As a priori power analyses are based on estimates of effect sizes, they cannot provide information about the actual power of studies. However, z-curve can provide information about the typical power of studies that are conducted within a lab. This information provides quantitative information about the research practices of a lab.
This can be useful information to evaluate the contribution of a research to psychological science. Imagine an eminent scholar [I had to delete the name of this imaginary scholar in the published version, I used the R-Index of Roy F. Baumeister for this example] with an H-index of 100, but assume that this H-index was achieved by publishing many studies with low power that are difficult to replicate. A z-curve analysis might produce an estimate of 25%. This information can be integrated with the H-index to produce a replicability-weighted H-index of RH = 100 * .25 = 25. Another researcher may be less prolific and only have an H-index of 50. A z-curve analysis shows that these studies have a replicability of 80%. This yields an RH-index of 40, which is higher than the RH index of the prolific researcher. By quantifying replicability, we can reward researchers who make replicable contributions to psychological science.
By taking replicability into account, the incentive to publish as many discoveries as possible without concerns about their truth- value (i.e., “to err on the side of discovery”) is no longer the best strategy to achieve fame and recognition in a field. The RH-index could also motivate researchers to retract articles that they no longer believe in, which would lower the H-index but increase the R-index. For highly problematic studies, this could produce a net gain in the RH-index.
Social psychology is changing in response to a replication crisis. To (re)gain trust in social psychology as a science, social psychol- ogists need to change their research practices. The problem of low power has been known since Cohen (1962), but only in recent years, power of social psychological studies has increased (Schim- mack, 2020). Aside from larger samples, social psychologists are also starting to use within-subject designs that increase power (Lin, Saunders, Friese, Evans, & Inzlicht, 2020). Finally, social psychologists need to change the way they report their results. Most important, they need to stop reporting only results that confirm their predictions. Fiske (2016) recommended that scientists keep track of their questionable practices, and Wicherts et al. (2016) provided a checklist to do so. I think it would be better to ban these practices altogether. Most important, once a discovery has been made, failures to replicate this finding provide valuable, new information and need to be published (Galak et al., 2012), and theories that fail to provide consistent support need to be abandoned or revised (Ferguson & Heene, 2012).
My personal contribution to improving science has been the development of tools that make it possible to examine whether reported results are credible or not (Bartoš & Schimmack, 2020; Schimmack, 2012; Brunner & Schimmack, 2020). I agree with Fiske (2017) that science works better when we can trust scientists, but a science with a replication rate of 25% is not trustworthy. Ironically, the same tool that reveals shady practices in the past can also demonstrate that practices in social psychology are improving (Schimmack, 2020). Hopefully, z-curve analyses of social psychology will eventually show that social psychology has become a trustworthy science.
Acosta, A., Adams, R. B., Jr., Albohn, D. N., Allard, E. S., Beek, T., Benning, S. D., . . . Zwaan, R. A. (2016). Registered replication report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Sci- ence, 11, 917–928. http://dx.doi.org/10.1177/1745691616674458
Alogna, V. K., Attaya, M. K., Aucoin, P., Bahník, Š., Birch, S., Birt, A. R.,. . . Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556 –578. http://dx.doi.org/10.1177/1745691614545653
Bem, D. J. (2011). Feeling the future: Experimental evidence for anoma- lous retroactive influences on cognition and affect. Journal of Person- ality and Social Psychology, 100, 407– 425. http://dx.doi.org/10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101, 716 –719. http://dx.doi.org/10.1037/a0024777
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta- Psychology. MP.2018.874, https://doi.org/10.15626/MP.2018.874
Bryan, C. J., Yeager, D. S., & O’Brien, J. M. (2019). Replicator degrees of freedom allow publication of misleading failures to replicate. Proceed- ings of the National Academy of Sciences USA, 116, 25535–25545. http://dx.doi.org/10.1073/pnas.1910951116
Cacioppo, J. T., Petty, R. E., & Morris, K. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805– 818. http://dx.doi.org/10.1037/0022-3518.104.22.1685
Cairo, A. H., Green, J. D., Forsyth, D. R., Behler, A. M. C., & Raldiris, T. L. (2020). Gray (literature) mattes: Evidence of selective hypothesis reporting in social psychological research. Personality and Social Psy- chology Bulletin. Advance online publication. http://dx.doi.org/10.1177/ 0146167220903896
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psy- chology: General, 144, 796 – 815. http://dx.doi.org/10.1037/xge0000083 Carter, E. C., & McCullough, M. E. (2013). Is ego depletion too incredible? Evidence for the overestimation of the depletion effect. Behavioraland Brain Sciences, 36, 683– 684. http://dx.doi.org/10.1017/S0140525X13000952
Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, 823.http://dx.doi.org/10.3389/fpsyg.2014.00823
Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93–99. http://dx.doi.org/10.1016/j.jesp.2015.10.002
Cunningham, M. R., & Baumeister, R. F. (2016). How to make nothing out of something: Analyses of the impact of study sampling and statistical interpretation in misleading meta-analytic conclusions. Frontiers in Psy- chology, 7, 1639. http://dx.doi.org/10.3389/fpsyg.2016.01639
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., . . . Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68 – 82. http://dx.doi.org/10.1016/j.jesp.2015.10.012
Ferguson, C. J., & Heene, M. (2012). A vast graveyard of undead theories: Publication bias and psychological science’s aversion to the null. Per- spectives on Psychological Science, 7, 555–561. http://dx.doi.org/10.1177/1745691612459059
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate. Journal of Personality andSocial Psychology, 103, 933–948. http://dx.doi.org/10.1037/a0029709
Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from Ho. Journal of Experimental Psychology: General, 146, 1223–1233. http://dx.doi.org/10.1037/xge0000324
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., . Zwienenberg, M. (2016). A multilab preregistered replication of the ego-depletion effect. Perspectives on Psychological Science, 11, 546 –573. http://dx.doi.org/10.1177/1745691616652873
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136, 495–525. http://dx.doi.org/10.1037/a0019486
Inbar, Y. (2016). Association between contextual dependence and replicability in psychology may be spurious. Proceedings of the National Academy of Sciences, 113(34):E4933-9334, doi.org/10.1073/pnas.1608676113
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25, 178 –206. http://dx.doi.org/10.3758/s13423-016-1221-4
Kvarven, A., Strømland, E. & Johannesson, M. (2020). Comparing meta-analyses and preregistered multiple-laboratory replication projects. Nature Human Behaviour 4, 423–434 (2020). https://doi.org/10.1038/s41562-019-0787-z
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1, 259 –269. http://dx.doi.org/10.1177/2515245918770963
Luttrell, A., Petty, R. E., & Xu, M. (2017). Replicating and fixing failed replications: The case of need for cognition and argument quality. Journal of Experimental Social Psychology, 69, 178 –183. http://dx.doi.org/10.1016/j.jesp.2016.09.006
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70, 487– 498. http://dx.doi.org/10.1037/a0039400
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., . . . Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113, 34 –58. http://dx.doi.org/10.1037/pspa0000084
Murayama, K., Pekrun, R., & Fiedler, K. (2014). Research practices that can prevent an inflation of false-positive rates. Personality and Social Psychology Review, 18, 107–118. http://dx.doi.org/10.1177/1088868313496330
Noah, T., Schul, Y., & Mayo, R. (2018). When both the original study and its failed replication are correct: Feeling observed eliminates the facial- feedback effect. Journal of Personality and Social Psychology, 114, 657– 664. http://dx.doi.org/10.1037/pspa0000121
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences USA, 115, 2600 –2606. http://dx.doi.org/10.1073/pnas.1708274114
Renkewitz, F., & Keiner, M. (2019). How to detect publication bias in psychological research: A comparative evaluation of six statistical methods. Zeitschrift für Psychologie, 227(4), 261-279. http://dx.doi.org/10.1027/2151-2604/a000386
Ritchie, S. J., Wiseman, R., & French, C. C. (2012). Failing the future: Three unsuccessful attempts to replicate Bem’s ‘retroactive facilitation of recall’ effect. PLoS One, 7, e33423. http://dx.doi.org/10.1371/journal.pone.0033423
Schimmack, U. (2018b). Why the Journal of Personality and Social Psychology Should Retract Article DOI:10.1037/a0021524 “Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem. Retrieved January 6, 2020, from https://replicationindex.com/2018/01/05/bem-retraction
Schooler, J. W. (2014). Turning the lens of science on itself: Verbal overshadowing, replication, and metascience. Perspectives on Psycho- logical Science, 9, 579 –584. http://dx.doi.org/10.1177/1745691614547878
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359 –1366. http://dx.doi.org/10.1177/0956797611417632
Simonsohn, U. (2013). It does not follow: Evaluating the one-off publication bias critiques by Francis (2012a, 2012b, 2012c, 2012d, 2012e, in press). Perspective on Psychological Science, 7, 597–599. http://dx.doi.org/10.1177/1745691612463399
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666 – 681. http://dx.doi.org/10.1177/1745691614553988
Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54, 30 –34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108 –112.
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences USA, 113, 6454 – 6459. http://dx.doi.org/10.1073/pnas.1521897113
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100, 426 – 432. http://dx.doi.org/10.1037/a0022790
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. https://doi.org/10.1177/1745691616674458
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, 1832.http://dx.doi.org/10.3389/fpsyg.2016.01832
Wilson, B. M., & Wixted, J. T. (2018). The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1, 186 –197. http://dx.doi.org/10.1177/2515245918767122
Zwaan, R. A., Etz, A., Lucas, R. E., & Donnellan, M. B. (2018). Improving social and behavioral science by making replication mainstream: A response to commentaries. Behavioral and Brain Sciences, 41, e157. http://dx.doi.org/10.1017/S0140525X18000961
Social psychologists have responded differently to the replication crisis. Some eminent social psychologists were at the end of their careers when the crisis started in 2011. Their research output in the 2010s is too small for quantitative investigations. Thus, it makes sense to look at the younger generation of future leaders in the field.
A prominent social psychologist is Mickey Inzlicht. Not only is he on the path to becoming an eminent social psychologist (current H-Index in WebOfScience 40, over 1,000 citations in 2018), he is also a prominent commentator on the replication crisis. Most notable are Mickey’s blog posts that document his journey from believing in social psychology to becoming a skeptic, if not nihilist as more and more studies failed to replicate, including his areas of research (ego-depletion, stereotype threat; Inzlicht, 2016). Mickey is also one of the few researchers who has expressed doubts about his own findings that were obtained with methods that are now considered questionable and are difficult to replicate (Inzlicht, 2015).
He used some bias-detection tools on older and newer articles and found that the older articles showed clear evidence that questionable practices were used. His critical self-analysis was meant to stimulate more critical self-examinations, but it remains a rare example of honesty among social psychologists.
In 2016, Mickey did another self-examination that showed some positive trends in research practices. However, 2016 leaves little time for improvement and the tools were not the best tools. Here I use the most powerful method to examine questionable research practices and replicability, z-curve (Brunner & Schimmack, 2019). Following another case-study (Adam D. Galinsky), I divide the time periods into before and including 2012 and the years after 2012.
One notable difference between the two time periods is that the observed discovery rate decreased from 64% , 95%CI 59%-69%), to 49%, 95%CI = 44%-55%. This change shows that there is less selection for significance after 2012. There is also positive evidence that results before 2012 were selected for significance. The Observed Discovery Rate of 64% is higher than the expected discovery rate based on z-curve, EDR = 26%, 95%CI = 7% to 41%. However, the results after 2012 show no significant evidence that results are selected for significance because the ODR = 49% is within the 95%CI of the EDR, 7% to 64%. Visual inspection suggests a large file-drawer but that is caused by the blip of p-values just below .05 (z = 2 to z = 2.2). If these values are excluded and z-curve is fitted to z-values greater than 2.2, the model even suggests that there are more non-significant results than expected (Figure 2).
Overall, these results show that the reported results after 2013 are more trustworthy, in part because more non-significant results are reported.
Honest reporting of non-significant results is valuable, but these results are inconclusive. Thus, another important question is whether power has increased to produce more credible significant results. This is evaluated by examining the replicability of significant results. Replicability increased from 47%, 95%CI = 36% to 59%, to 68%, 95% 57% to 78%. This shows that significant results published after 2012 are more likely to replicate. However, an average replicability of 68% is still a bit short of the recommended level of 80%. Moreover, this estimate includes focal and non-focal tests and there is heterogeneity. For p-values in the range between .05 and .01, replicability is estimated to be only 30%. However, this estimate increases to 56% for the model in Figure 2. Thus, there is some uncertainty about the replicability of just significant p-values. For p-values beween .01 and .001 replicabilty is about 50%, which is acceptable but not ideal.
In conclusion, Mickey Inzlicht has been more self-critical about his past research practices than other social psychologists who have used the same questionable research practices to produce publishable significant results. Consistent with his own self-analysis, these results show that research practices changed mostly by reporting more non-significant results, but also by increasing power of studies.
I hope these positive results make Mickey revise his opinion about the value of z-curve results (Inzlicht, 2015). In 2015, Mickey argued that z-curve results are not ready for prime time. Meanwhile, z-curve has been vetted in simulation studies and is in press in Meta-Psychology. The present results show that z-curve is a valuable tool to reward the use of open science practices that lead to the publication of more credible results.
Social psychologists have responded differently to the replication crisis. Some eminent social psychologists were at the end of their careers when the crisis started in 2011. Their research output in the 2010s is too small for quantitative investigations. Thus, it makes sense to look at the younger generation of future leaders in the field.
By quantitative measures one of the leading social psychologists with an active lab in the 2010s is Adam D. Galinsky. Web of Science shows that he is on track to become a social psychologists with an H-Index of 100. He currently has 213 articles with 14,004 citations and an H-Index of 62.
Several of Adam D. Galinsky’s Psychological Science articles published between 2009-2012 were examined by Greg Francis and showed signs of questionable research practices. This is to be expected because the use of QRPs was the norm in social psychology. The more interesting question is how a productive and influential social psychologists like Adam D. Galinsky responded to the replication crisis. Given his large number of articles, it is possible to examine this quantiatively by z-curving the automatically extracted test-statistics of the articles. Although automatic extraction has the problem that it does not distinguish between focal and non-focal tests, it has the advantage that it is 100% objective and can reveal changes in research practices over time.
The good news is that results have become more replicable. The average replicability for all tests was 48% (95%CI = 42%-57%) before 2012 and 61% (95%CI = 54%-69%) since then. Zooming in on p-values between .05 and .01, replicability increased from 23% to 38%.
The observed discovery rate has not changed (71% vs. 69%). Thus, articles do not report more non-significant results, although it is not clear whether articles report more non-significant focal (and fewer non-significant non-focal tests). This observed discovery rates are significantly higher than the estimated discovery rates before 2012, 26% (9%-36%) and after 2012, 40% (18%-59%). Thus, there is evidence of selection bias; that is published results are selected for significance. The extend of selection bias can be seen visually by comparing the histogram of observed non-significant results to the predicted densities shown by the grey line. This ‘file-drawer’ has decreased but is still clearly visible after 2012.
Social psychology is a large field and the response to the replication crisis has been mixed. Whereas some social psychologists are leaders in open science practices and have changed their research practices considerably, others have not. At present, journals still reward significant results and researchers who continue to use questionable research practices continue to have an advantage. The good news is that it is now possible to examine and quantify the use of questionable research practices and to take this information into account. The 2020s will show whether the field will finally take information about replicability into account and reward slow and solid results more than fast and wobbly results.
“As was evident from my questions after the talk, I was less enthused by the idea of doing a large, replication of Darryl Bem’s studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn’t it be better to investigate a hypothesis about something that doesn’t contradict the laws of physics?”
I think she makes a valid and important point. Bem’s (2011) article highlighted everything that was wrong with the research practices in social psychology. Other articles in JPSP are equally incredible, but this was ignored because naive readers found the claims more plausible (e.g., blood glucose is the energy for will power). We know now that none of these published results provide empirical evidence because the results were obtained with questionable research practices (Schimmack, 2014; Schimmack, 2018). It is also clear that these were not isolated incidents, but that hiding results that do not support a theory was (and still is) a common practice in social psychology (John et al., 2012; Schimmack, 2019).
A large attempt at estimating the replicability of social psychology revealed that only 25% of published significant results could be replicated (OSC). The rate for between-subject experiments was even lower. Thus, the a-priori probability (base rate) that a randomly drawn study from social psychology will produce a significant result in a replication attempt is well below 50%. In other words, a replication failure is the more likely outcome.
The low success rate of these replication studies was a shock. However, it is sometimes falsely implied that the low replicability of results in social psychology was not recognized earlier because nobody conducted replication studies. This is simply wrong. In fact, social psychology is one of the disciplines in psychology that required researchers to conduct multiple studies that showed the same effect to ensure that a result was not a false positive result. Bem had to present 9 studies with significant results to publish his crazy claims about extrasensory perception (Schimmack, 2012). Most of the studies that failed to replicate in the OSC replication project were taken from multiple-study articles that reported several successful demonstrations of an effect. Thus, the problem in social psychology was not that nobody conducted replication studies. The problem was that social psychologists only reported replication studies that were successful.
The proper analyses of the problem also suggests a different solution to the problem. If we pretend that nobody did replication studies, it may seem useful to starting doing replication studies. However, if social psychologists conducted replication studies, but did not report replication failures, the solution is simply to demand that social psychologists report all of their results honestly. This demand is so obvious that undergraduate students are surprised when I tell them that this is not the way social psychologists conduct their research.
In sum, it has become apparent that questionable research practices undermine the credibility of the empirical results in social psychology journals, and that the majority of published results cannot be replicated. Thus, social psychology lacks a solid empirical foundation.
It is implied by information theory that little information is gained by conducting actual replication studies in social psychology because a failure to replicate the original result is likely and uninformative. In fact, social psychologists have responded to replication failures by claiming that these studies were poorly conducted and do not invalidate the original claims. Thus, replication studies are both costly and have not advanced theory development in social psychology. More replication studies are unlikely to change this.
A better solution to the replication crisis in social psychology is to characterize research in social psychology from Festinger’s classic small-sample, between-subject study in 1957 to research in 2017 as exploratory and hypotheses generating research. As Bem suggested to his colleagues, this was a period of adventure and exploration where it was ok to “err on the side of discovery” (i.e., publish false positive results, like Bem’s precognition for erotica). Lot’s of interesting discoveries were made during this period; it is just not clear which of these findings can be replicated and what they tell us about social behavior.
Thus, new studies in social psychology should not try to replicate old studies. For example, nobody should try to replicate Devine’s subliminal priming study with racial primes with computers and software from the 1980s (Devine, 1989). Instead, prominent theoretical predictions should be tested with the best research methods that are currently available. Thus, the way forward is not to do more replication studies, but rather to use open science (a.k.a. honest science) that uses experiments to subject theories to empirical tests that may also falsify a theory (e.g., subliminal racial stimuli have no influence on behavior). The main shift that is required is to get away from research that can only confirm theories and to allow for empirical data to falsify theories.
This was exactly the intent of Danny Kahneman’s letter, when he challenged social priming researchers to respond to criticism of their work by going into their labs and to demonstrate that these effects can be replicated across many labs.
Kahneman makes it clear that the onus of replication is on the original researchers who want others to believe their claims. The response to this letter speaks volumes. Not only did social psychologists fail to provide new and credible evidence that their results can be replicated, they also demonstrated defiant denial in the face of replication failures by others. The defiant denial by prominent social psychologists (e.g., Baumeister, 2019) make it clear that they will not be convinced by empirical evidence, while others who can look at the evidence objectively do not need more evidence to realize that the social psychological literature is a train-wreck (Schimmack, 2017; Kahneman, 2017). Thus, I suggest that young social psychologists search the train wreck for survivors, but do not waste their time and resources on replication studies that are likely to fail.
A simple guide through the wreckage of social psychology is to distrust any significant result with a p-value greater than .01 (Schimmack, 2019). Prediction markets also suggest that readers are able to distinguish credible and incredible results (Atlantic). Thus, I recommend to build on studies that are credible and to stay clear of sexy findings that are unlikely to replicate. As Danny Kahneman pointed out, young social psychologists who work in questionable areas face a dilemma. Either they have to replicate the questionable methods that were used to get the original results, which is increasingly considered unethical, or they end up with results that are not very informative. On the positive side, the replication crisis implies that there are many important topics in social psychology that need to be studied properly with the scientific method. Addressing these important questions may be the best way to rescue social psychology.
In 2008, the world was wondering whether there is a financial crisis. In an editorial Richard S. Fuld Jr concluded that there was no crisis. This is not really what happened. Richard S. Fuld is actually known as the CEO of Lehman Brothers, a bank that declared bankruptcy in the wake of the 2008 financial crisis, when it became apparent that banks had taken on a lot of bad debt that wasn’t worth the servers it was stored on.
A few years later, there were concerns that a crisis was looming in social psychology. Although this crisis was mostly harmless because the outcome of lab experiments with undergraduate students have very little to do with real world events, it was still disconcerting that the top journal of social psychology published false evidence that extraverts have the ability to foresee the location of erotic stimuli (but not a financial crisis) (Bem, 2011).
Although Bem’s fake claims have been debunked by means of statistical investigation of his data and by means of failed replications, the article created healthy skepticism about other findings published by social psychologists. An attempt to replicate findings in social psychology could only reproduce 25% of significant results and the percentage was even lower for between-subject experiments.
Eight years later, two prominent social psychologists, Wendy Wood and Timothy D. Wilson, take stock of the status of experimental social psychology. Given the well-established finding in social psychology that humans have a strong self-serving bias and that positive illusions are good for people, they come to the conclusion that “there is no crisis” (Wood & Wilson, 2018).
Timothy Wilson fails to mention that he has a conflict of interest because he is the writer of a textbook that would be less valuable if the content in the textbook were based on studies that cannot be replicated. Wendy Wood also is the author of a popular book in which she observes that ” we spend a shocking 43 percent of our day doing things without thinking about them” I am not sure she was thinking about the research social psychologists do, which also often appears to be a frantic activism rather than planned testing of theories.
So, what evidence do Wood and Wilson marshal for their claim that there is no crisis in social psychology?
To be clear, Wood and Wilson’s article is based on their involvement in an interdisciplinary committee across scientific disciplines. “No crisis” may be a reasonable verdict for all sciences. After all, the natural sciences are making tremendous progress and the only question is whether their advances will destroy or save the planet, but there is no doubt that advances in the natural sciences have made humans de facto rulers of this planets.
But we cannot generalize from the natural science to the social science or social psychology more specifically. So, the real question for social psychologists is whether there is a crisis in social psychology. Wood and Wilson do not have much to say about this issue, but they make the trivial and misleading observation that “the goal of science is not, and ought not to be, for all results to be replicable” (p. 28).
Why is this true and trivial? After all, all scientists acknowledge that we only do studies to test hypotheses that are not already known to be true. This means, we will sometimes test a false hypothesis (e.g., Extraverts can guess above chance which underwear I am going to pick for my first day of classes.) Sometimes, our data will give us the wrong answer, which is called a false positive or a type-I error. The whole point of statistical significance testing, which social psychologists routinely do in their journals, is to keep the rate of such false discoveries at an acceptable minimum.
However, social psychologists convinced themselves that doing proper science that keeps the false positive rate at a low rate is not interesting. Their cheerleader Bem told them “Let’s err on the side of discovery.” The more discoveries, the merrier social psychologists will be. Who cares whether they are true or not as long as they make for good stories in social psychology textbooks. And so they went on a rampage and erred on the side of discovery (social priming, ego-depletion, unconscious racism, stereotype-threat, terror-management, etc. etc.) and now their textbooks are filled with findings that cannot be replicated.
So how did Wood and Wilson mislead readers? They were right that the goal of science is not to replicate ALL results, but they fail to point out that a good science is build on findings that do replicate and that a good science aims to have a high percentage of findings that replicate. 25 percent or less is a failing grade and nowhere near the goal of a good science.
So, Wood and Wilson’s quote is a distraction. They state a trivial truth to imply that there is no crisis because failures are ok, but they avoid talking about the embarrassing frequency of replication failures in social psychology.
And the political nature of their article is clear when the authors conclude with their personal beliefs” “We were more convinced than ever in the fundamental soundness of our field” without pointing to a shred of evidence that would make this more than wishful thinking.
Their self-serving statement totally disregards the evidence that has accumulated over the past eight years that social psychologists were some of the most outrageous users of questionable research practices to produce significant results that do not replicate (search this blog for numerous demonstrations, but here is one (Social Psychology Audit), including an audit of Wilson’s work.
Social psychologists would be the first to warn you about the credibility of a messenger who wants you to buy something. I would say, don’t buy what social psychologists tell you about the credibility of social psychology.
Social psychology textbook like colorful laboratory experiments that illustrate a theoretical point. As famous social psychologist Daryl Bem stated, he considered his experiments more illustrations of what could happen than empirical tests of what actually happens. Unfortunately, social psychology textbooks make it less obvious that the results of highlighted studies should not be generalized to real life.
Myers and Twenge (2019) tell the story of fishy smells.
In a laboratory experiment, exposure to a fishy smell caused people to be suspicious of each other and cooperate less—priming notions of a shady deal as “fishy” (Lee & Schwarz, 2012). All these effects occurred without the participants’ conscious awareness of the scent and its influence.
They don’t even mention some other fun facts about this study. To make sure that the effect is not just a mood effect induced by bad odors in general, fishy smells were contrasted with fart smells, and the effect seemed to be limited to fishy smells.
The article was published in the top journal for experimental social psychology (JPSP:ASC) and is relatively highly cited.
However, the studies reported in this article smell a bit fishy and should be consumed with a grain of salt and a lot of lemon. The problem is that all of the results are significant, which is highly unlikely unless studies have very high statistical power (Schimmack, 2012).
And it even works the other way around.
And making people think about suspicion, also makes them think about fish, in theory.
Suspicion also makes you be more sensitive to fishy smells.
Undergraduate students may not realize what the problem with these studies is. After all, they all worked out; that is they produced a p-value less than .05, which is supposed to ensure that no more than 1 out of 20 studies are a false positive result. As all of these studies are significant, it is extremely unlikely that all of them are false positives. So, we would have to infer that suspicion is related to fishy smells in our minds.
However, since 2012 it is clear that we have to draw another conclusion. The reason is that results in social psychology articles like this one smell fishy and suggest that the authors are telling us a fun story, but they are not telling us what really happened in their lab. It is extremely unlikely that the authors reported all of their studies and data analyses that they conducted. Instead they may have used a variety of so-called questionable research practices that increase the chances of reporting a significant result. Questionable research practices are also known as fishing for significance. These questionable research practices have the undesirable effect that they increase the type-I error rate. Thus, while the reported p-values are below .05, the risk of a false positive result is not and could be as high as 100%.
To demonstrate that researchers used questionable research practices, we can conduct a bias test. The most powerful bias test for small sets of studies is the Test of Insufficient Variance. When most p-values are just significant , p < .05 and p > .005, but always significant the results are not trustworthy because sampling error should produce more variability than we see.
The table lists the test statistics, converts the two-tailed p-values into z-scores and computes the variance of the z-scores. The variance is expected to be 1, but the actual variance is only 0.14. A chi-square test shows that this deviation is significant with p = .01. Thus, we have scientific evidence to claim that these results smell a bit fishy.
Unfortunately, these results are not the only fishy results in social psychology textbooks. Thus, students of social psychology should read textbook claims with a healthy dose of skepticism. They should also ask their professors to provide information about the replicability of textbook findings. Has this study been replicated in a preregistered replication attempt? Would you think you could replicate this result in your own lab? It is time to get rid of the fishy smell and let the fresh wind of open science clean up social psychology.
We can only hope that sooner than later, articles like this will sleep with the fishes.
One of the most famous experiments in psychology is Schachter and Singer’s experiment that was used to support the two-factor theory of emotions: emotions is sympathetic arousal plus cognition about the cause of the arousal (see Dror, 2017, Reisenzein, 2017, for historic reviews).
The classic article “Cognitive, social, and physiological determinants of emotional state” has been cited 2,799 times in WebofScience, and is a textbook classic.
Schachter and Wheeler (1962) summarize the “take-home message” of Schachter and Singer (1962).
In their study of cognitive and physiological determinants of emotional states, Schachter and Singer (1962) have demonstrated that cognitive processes play a major role in the development of emotional states” (p. 121).
The “demonstration” was an experiment in which participants were injected with epinephrine to create a state of arousal or a placebo. This manipulation was crossed with a confederate who either displayed euphoric or angry behavior.
Schachter and Wheeler summarize the key findings.
In experimental situations designed to make subjects euphoric, those subjects who received injections of epinephrine were, on a variety of indices, somewhat more euphoric than subjects who received a placebo injection.
Similarly, in situations designed to make subjects angry and irritated, those who received epinephrine were somewhat angrier than subjects who received placebo.
[Note the discrepancy between the claim “play a major role” and “somewhat more”]
The proceed to make clear that this pattern, although expected, could also have been produced by chance alone.
In both sets of conditions, however, these differences between epinephrine and placebo subjects were significant, at best, at borderline levels of statistical significance.
[Not the discrepancy between “demonstrated” and “borderline significance”]
Schachter and Wheeler conducted another test of the two-factor theory. The study was essentially a conceptual replication and an extension of Schachter and Singer. The replication part of the study was that participants were again injected with a placebo or epinephrine. It is a conceptual replication because the target emotion was amusement, rather than anger or euphoria. Finally, the extension was a third condition in which participants were injected with Chlorpromazine; a sedative. This should suppress activation of sympathetic arousal and dampen amusement.
One dependent variable were observer ratings of amusement. As shown in Table 3, the means were in the predicted direction, but the difference between placebo and epinephrine conditions was not significant.
Ratings of the film were additional dependent variables. Means are again in the same direction, but p-values are not reported and the text mentions that some differences were significant only at borderline levels. The pattern makes clear that this would be the case for the contrasts of the Chlorpromazine condition with the other conditions, but not for the epinephrine – placebo contrast.
Based on these underwhelming and non-significant results, the authors concluded
The overall pattern of experimental results of this study and the Schachter and Singer (1962) experiment gives consistent support to a general formulation of emotion as a function of a state of physiological arousal and of an appropriate cognition (p. 127).
This claim is false. The replication study actually confirmed that an epinephrine injection seems to have no statistically reliable influence on the intensity of emotions.
Dorr (2017) made an interesting historical observation that Schachter was angry (presumably, without injection of epinephrine) that editors added non-significant to some of the results in the Schachter and Singer (1962) article.
“Since the paper has appeared students have tittered at me, my colleagues look down at their plates.” The most serious issue, among several, was that Tables 6–9 were totally misleading. The “notation ‘ns’ in the p column,” as Schachter explained, “is meaningless. Nothing was tested” (Schachter, S., 1962, Schachter to R. Solomon, May 3, 1962).” (Dorr, 2017)
Nothing was tested and nothing was proven, but a theory was born and it lives on in the imagination of hundreds of contemporary psychologists. The failure to provide evidence for it in Schachter and Wheeler was largely ignored. The article has been cited only 145 times compared to 2,799 for Schachter and Singer.
One reason for the impact of Schachter and Singer is that it was published in Psychological Review, while Schachter and Wheeler was published in Journal ol Abnormal and Social Psychology, which later became the Journal of Personality and Social Psychology.
Psychological Review is the journal where a select few psychologists can make sweeping claims with very little evidence, in the hope that other researchers will provide evidence for it. Given that psychology only publishes confirmatory evidence, every Psychological Review is a self-fulfilling prophecy, and every proposed theory will receive empirical support (even if only with marginal significance), and will live forever.
So, what are the take-home messages from this blog post.
The two-factor theory of emotions was never empirically supported.
Just because it was published in Psych Review, doesn’t mean it is true.
Psychology is not an evidence-based science, until it stops worshiping historically important articles as evidence for some eternal truth.
It is not bullying if the target of scientific criticism is deceased.
Social psychology has a replication problem. The reason is that social psychologists used questionable research practices to increase their chances of reporting significant results. The consequence is that the real risk of a false positive result is higher than the stated 5% level in publications. In other words, p < .05 no longer means that at most 5% of published results are false positives (Sterling, 1959). Another problem is that selection for significance with low power produces inflated effect sizes estimates. Estimates suggests that on average published effect sizes are inflated by 100% (OSC, 2015). These problems have persisted for decades (Sterling, 1959), but only now psychologists are recognizing that published results provide weak evidence and might not be replicable even if the same study were replicated exactly.
How should consumers of empirical social psychology (textbook writers, undergraduate students, policy planners) respond to the fact that published results cannot be trusted at face value? Jerry Brunner and I have been working on ways to correct published results for the inflation introduced by selection for significance and questionable practices. Z-curve estimates the mean power of studies selected for significance. Here I applied the method to automatically extracted test statistics from social psychology journals. I computed z-curves for 70+ eminent social psychologists (H-index > 35).
The results can be used to evaluate the published results reported by individual researchers. The main information provided in the table are (a) the replicability of all published p-values, (b) the replicability of just significant p-values (defined as p-values greater than pnorm(2.5) = .0124, and (c) the replicability of p-values with moderate evidence against the null-hypothesis (.0124 > p > .0027). More detailed information is provided in the z-curve plots (powergraphs) that are linked to researchers’ names. An index less than 50% would suggest that these p-values are no longer significant after adjusting for selection for significance. As can be seen in the table, most just significant results are no longer significant after correction for bias.
Caveat: Interpret with Care
The results should not be overinterpreted. They are estimates based on an objective statistical procedure, but no statistical method can compensate perfectly for the various practices that led to the observed distribution of p-values (transformed into z-scores). However, in the absence of any information which results can be trusted, these graphs provide some information. How this information is used by consumers depends ultimately on consumers’ subjective beliefs. Information about the average replicability of researchers’ published results may influence these beliefs.
It is also important to point out that a low replicability index does not mean researchers were committing scientific misconduct. There are no clear guidelines about acceptable and unacceptable statistical practices in psychology. Zcurve is not designed to detect scientific fraud. In fact, it assumes that researcher collect data, but conduct analyses in a way that increases the chances of producing a significant result. The bias introduced by selection for significance is well known and considered acceptable in psychological science.
There are also many factors that can bias results in favor of researchers’ hypotheses without researchers’ awareness. Thus, the bias evident in many graphs does not imply that researchers intentionally manipulated data to support their claims. Thus, I attribute the bias to unidentified researcher influences. It is not important to know how bias occurred. It is only important to detect biases and to correct for them.
It is necessary to do so for individual researchers because bias varies across researchers. For example, the R-Index for all results ranges from 22% to 81%. It would be unfair to treat all social psychologists alike when their research practices are a reliable moderator of replicability. Providing personalized information about replicability allows consumers of social psychological research to avoid stereotyping social psychologists and to take individual differences in research practices into account.
Finally, it should be said that producing replicabilty estimates is subject to biases and errors. Researchers may differ in their selection of hypotheses that they are reporting. A more informative analysis would require hand-coding of researchers’ focal hypothesis tests. At the moment, R-Index does not have the resources to code all published results in social psychology, let alone other areas of psychology. This is an important task for the future. At the moment, automatically extracted results have some heuristic value.
One unintended and unfortunate consequence of making this information available is that some researchers’ reputation might be negatively effected by a low replicability score. This cost has be be weighted against the benefit to the public and the scientific community of obtaining information about the robustness of published results. In this regard, the replicability rankings are no different from actual replication studies that fail to replicate an original finding. The only difference is that replicability rankings use all published results, whereas actual replication studies are often limited to a single or a few studies. While replication failures in a single study are ambiguous, replicability esitmates based on hundreds of published results are more diagnostic of researchers’ practices.
Nevertheless, statistical estimates provide no definitive answer about the reproducibility of a published result. Ideally, eminent researchers would conduct their own replication studies to demonstrate that their most important findings can be replicated under optimal conditions.
It is also important to point out that researchers have responded differently to the replication crisis that became apparent in 2011. It may be unfair to generalize from past practices to new findings for researchers who changed their practices. If researchers preregistered their studies and followed a well-designed registered research protocol, new results may be more robust than a researchers’ past record suggests.
Finally, the results show evidence of good replicability for some social psychologists. Thus, the rankings avoid the problem of selectively targeting researchers with low replicability, which can lead to a negative bias in evaluations of social psychology. The focus on researchers with a high H-index means that the results are representative of the field.
If you believe that you should not be listed as an eminent social psychologists, please contact me so that I can remove you from the list.
If you think you are an eminent social psychologists and you want to be included in the ranking, please contact me so that I can add you to the list.
If you have any suggestions or comments how I can make these rankings more informative, please let me know in the comments section.
*** *** *** *** ***
REPLICABILITY RANKING OF EMINENT SOCIAL PSYCHOLOGISTS
[sorted by R-Index for all tests from highest to lowest rank]
Richard Nisbett has been an influential experimental social psychologist. His co-authored book on faulty human information processing (Nisbett & Ross, 1980) provided the foundation of experimental studies of social cognition (Fiske & Taylor, 1984). Experiments became the dominant paradigm in social psychology with success stories like Daniel Kahneman’s Noble Price for Economics and embarrassments like Diederik Staple’s numerous retractions because he fabricated data for articles published in experimental social psychology (ESP) journals.
The Stapel Debacle raised questions about the scientific standards of experimental social psychology. The reputation of Experimental Social Psychology (ESP) also took a hit when the top journal of ESP research published an article by Daryl Bem that claimed to provide evidence for extra-sensory perceptions. For example, in one study extraverts seemed to be able to foresee the location of pornographic images before a computer program determined the location. Subsequent analyses of his results and data revealed that Daryl Bem did not use scientific methods properly and that the results provide no credible empirical evidence for his claims (Francis, 2012; Schimmack, 2012; Schimmack, 2018).
More detrimental for the field of experimental social psychology was that Bem’s carefree use of scientific methods is common in experimental social psychology; in part because Bem wrote a chapter that instructed generations of experimental social psychologists how they could produce seemingly perfect results. The use of these questionable research practices explains why over 90% of published results in social psychology journals support authors’ hypotheses (Sterling, 1959; Sterling et al., 1995).
Since 2011, some psychologists have started to put the practices and results of experimental social psychologists under the microscope. The most impressive evidence comes from a project that tried to replicate a representative sample of psychological studies (Open Science Collaboration, 2015). Only a quarter of social psychology experiments could be replicated successfully.
The response by eminent social psychologists to these findings has been a classic case of motivated reasoning and denial. For example, in an interview for the Chronicle of Higher Education, Nisbett dismissed these results by attributing them to problems of the replication studies.
Nisbett has been calculating effect sizes since before most of those in the replication movement were born. And he’s a skeptic of this new generation of skeptics. For starters, Nisbett doesn’t think direct replications are efficient or sensible; instead he favors so-called conceptual replication, which is more or less taking someone else’s interesting result and putting your own spin on it. Too much navel-gazing, according to Nisbett, hampers professional development. “I’m alarmed at younger people wasting time and their careers,” he says. He thinks that Nosek’s ballyhooed finding that most psychology experiments didn’t replicate did enormous damage to the reputation of the field, and that its leaders were themselves guilty of methodological problems. And he’s annoyed that it’s led to the belief that social psychology is riddled with errors. “How do they know that?”, Nisbett asks, dropping in an expletive for emphasis.
In contrast to Nisbett’s defensive response, Noble Laureate Daniel Kahneman has expressed concerns about the replicability of BS-ESP results that he reported in his popular book “Thinking: Fast and Slow” He also wrote a letter to experimental social psychologists suggesting that they should replicate their findings. It is telling that several years later, eminent experimental social psychologists have not published self-replications of their classic findings.
Nisbett also ignores that Nosek’s findings are consistent with statistical analyses that show clear evidence of questionable research practices and evidence that published results are too good to be true (Francis, 2014). Below I present new evidence about the credibility of experimental social psychology based on a representative sample of published studies in social psychology.
How Replicable are Between-Subject Social Psychology Experiments (BS-ESP)?
Motyl and colleagues (2017) coded hundreds of articles and over one-thousand published studies in social psychology journals. They recorded information about the type of study (experimental or correlational), the design of the study (within subject vs. between-subject) and about the strength of an effect (as reflected in test statistics or p-values). After crunching the numbers, their results showed clear evidence of publication bias, but also evidence that social psychologists published some robust and replicable results.
In a separate blog-post, I agreed with this general conclusion. However, Motyl et al.’s assessment was based on a broad range of studies, including correlational studies with large samples. Few people doubt that these results would replicate, but experimental social psychologist tend to dismiss these findings because they are correlational (Nisbett, 2016).
The replicability of BS-ESP results is more doubtful because these studies often used between-subject designs with small samples, which makes it difficult to obtain statistically significant results. For example, John Bargh used only 30 participants (15 per condition) for his famous elderly priming study that failed to replicate.
I conducted a replicability analysis of BS-ESP results based on a subset of studies in Motly’s dataset. I selected only studies with between-subject experiments where participants were randomly assigned to different conditions with one degree of freedom. The studies could be comparisons of two groups or a 2 x 2 design that is also often used by experimental social psychologists to demonstrate interaction (moderator) effects. I also excluded studies with fewer than 20 participants per condition because these studies should not have been published because parametric tests require a minimum of 20 participants to be robust (Cohen, 1994).
There were k = 314 studies that fulfilled these criteria. Two-hundred-seventy-eight of these studies (89%) were statistically significant at the standard criterion of p < .05. Including marginally significant and one-sided tests, 95% were statistically significant. This success rate is consistent with Sterling’s results in the 1950s and 1990s and the results of the OSC project. For the replicability analysis, I focused on the 278 results that met the standard criterion of statistical significance.
First, I compared the mean effect size without correcting for publication bias with the bias-corrected effect size estimate using the latest version of puniform (vanAert, 2018). The mean effect size of the k = 278 studies was d = .64, while the bias-corrected puniform estimate was more than 50% lower, d = .30. The finding that actual effect sizes are approximately half of the published effect sizes is consistent with the results based on actual replication studies (OSC, 2015).
Next, I used z-curve (Brunner & Schimmack, 2018) to estimate mean power of the published studies based on the test statistics reported in the original articles. Mean power predicts the success rate if the 278 studies were exactly replicated. The advantage of using a statistical approach is that it avoids problems of carrying out exact replication studies, which is often difficult and sometimes impossible.
Figure 1. Histogram of the strength of evidence against the null-hypothesis in a representative sample of (k = 314) between-subject experiments in social psychology.
The Nominal Test of Insufficient Variance (Schimmack, 2015) showed 61% of results within the range from z = 1.96 and z = 2.80, when only a maximum of 32% is expected from a representative sample of independent tests. The z-statistic of 10.34 makes it clear that this is not a chance finding. Visual inspection shows a sharp drop at a z-score of 1.96, which corresponds to the significance criterion, p < .05, two-tailed. Taken together, these results provide clear evidence that published results are not representative of all studies that are conducted by experimental psychologists. Rather, published results are selected to provide evidence in favor of authors’ hypotheses.
The mean power of statistically significant results is 32% with a 95%CI ranging from 23% to 39%. This means that many of the studies that were published with a significant result would not reproduce a significant result in an actual replication attempt. With an estimate of 32%, the success rate is not reliably higher than the success rate for actual replication studies in the Open Science Reproducibility Project (OSC, 2015). Thus, it is clear that the replication failures are the result of shoddy research practices in the original studies rather than problems of exact replication studies.
The estimate of 32% is also consistent with my analysis of social psychology experiments in Bargh’s book “Before you know it” that draws heavily on BS-ESP results. Thus, the present results replicate previous analyses based on a set of studies that were selected by an eminent experimental social psychologist. Thus, replication failures in experimental social psychology are highly replicable.
A new statistic under development is the maximum false discovery rate; that is, the percentage of significant results that could be false positives. It is based on the fit of z-curves with different proportions of false positives (z = 0). The maximum false discovery rate is 70% with a 95%CI ranging from 50% to 85%. This means, the data are so weak that it is impossible to rule out the possibility that most BS-ESP results are false.
Nisbett’s questioned how critics know that ESP is riddled with errors. I answered his call for evidence by presenting a z-curve analysis of a representative set of BS-ESP results. The results are consistent with findings from actual replication studies. There is clear evidence of selection bias and consistent evidence that the majority of published BS-ESP results cannot be replicated in exact replication studies. Nisbett dismisses this evidence and attributes replication failures to problems with the replication studies. This attribution is a classic example of a self-serving attribution error; that is, the tendency to blame others for negative outcomes.
The low replicability of BS-ESP results is not surprising, given several statements by experimental social psychologists about their research practices. For example, Bem’s (2001) advised students that it is better to “err on the side of discovery” (translation: a fun false finding is better than no finding). He also shrugged off replication failures of his ESP studies with a comment that he doesn’t care whether his results replicate or not.
“I used data as a point of persuasion, and I never really worried about, ‘Will this replicate or will this not?” (Daryl J. Bem, in Engber, 2017)
A similar attitude is revealed in Baumeister’s belief that personality psychology has lost appeal because it developed its scientific method and a “gain in rigor was accomplished by a loss in interest.” I agree that fiction can be interesting, but science without rigor is science fiction.
Another social psychologist (I forgot the name) once bragged openly that he was able to produce significant results in 30% of his studies and compared this to a high batting averages in baseball. In baseball it is indeed impressive to hit a fast small ball with a bat 1 out of 3 times. However, I prefer to compare success rates of BS-ESP researchers to the performance of my students on an exam, where a 30% success rate earns them a straight F. And why would anybody watch a movie that earned a 32% average rating on rottentomatoes.com, unless they are watching it because watching bad movies can be fun (e.g., “The Room”).
The problems of BS-ESP research are by no means new. Tversky and Kahneman (1971) tried to tell psychologists decades ago that studies with low power should not be conducted. Despite decades of warnings by methodologists (Cohen, 1962, 1994), social psychologists have blissfully ignored these warnings and continue to publish meaningless statistically significant results and hiding non-significant ones. In doing so, they conducted the ultimate attribution error. They attributed the results of their studies to the behavior of their participants, while the results actually depended on their own biases that determined which studies they selected for publication.
Many experimental social psychologists prefer to ignore evidence that their research practices are flawed and published results are not credible. For example, Bargh did not mention actual replication failures of his work in his book, nor did he mention that Noble Laureate Daniel Kahneman wrote him a letter in which he described Bargh’s work as “the poster child for doubts about the integrity of psychological research.” Several years later, it is fair to say that evidence is accumulating that experimental social psychology lacks scientific integrity. It is often said that science is self-correcting. Given the lack of self-correction by experimental social psychologists, it logically follows that it is not a science; at least it doesn’t behave like one.
I doubt that members of the Society for Experimental Social Psychology (SESP) will respond to this new information any differently from the way they responded to criticism of the field in the past seven years; that is, with denial, name calling (“Shameless Little Bullies”, “Method Terrorists”, “Human Scum”), or threats of legal actions. In my opinion, the biggest failure of SESP is not the way its members conducted research in the past, but their response to valid scientific criticism of their work. As Karl Popper pointed out “True ignorance is not the absence of knowledge, but the refusal to acquire it.” Ironically, the unwillingness of experimental social psychologists to acquire inconvenient self-knowledge provides some of the strongest evidence for biases and motivated reasoning in human information processing. If only these biases could be studied in BS experiments with experimental social psychologists as participants.
The abysmal results for experimental social psychology should not be generalized to all areas of psychology. The OSC (2015) report examined the replicability of psychology, and found that cognitive studies replicated much better than experimental social psychology results. Motyl et al. (2017) found evidence that correlational results in social and personality psychology are more replicable than BS-ESP results.
It is also not fair to treat all experimental social psychologists alike. Some experimental social psychologists may have used the scientific method correctly and published credible results. The problem is to know which results are credible and which results are not. Fortunately, studies with stronger evidence (lower p-values or higher z-score) are more likely to be true. In actual replication attempts, studies with z-scores greater than 4 had an 80% chance to be successfully replicated (OSC, 2015). I provided a brief description of results that met this criterion in Motyl et al.’s dataset in the Appendix. However, it is impossible to distinguish honest results with weak evidence from results that were manipulated to show significance. Thus, over 50 years of experimental social psychology have produced many interesting ideas without empirical evidence for most of them. Sadly, even today articles are published that are no more credible than those published 10 years ago. If there can be failed sciences, experimental social psychology is one of them. Maybe it is time to create a new society for social psychologists who respect the scientific method. I suggest calling it Society of Ethical Social Psychologists (SESP), and that it adopts the ethics code of the American Physical Society (APS).
Fabrication of data or selective reporting of data with the intent to mislead or deceive is an egregious departure from the expected norms of scientific conduct, as is the theft of data or research results from others.
Journal of Experimental Social Psychology
Klein, W. M. (2003). Effects of objective feedback and “single other” or “average other” social comparison feedback on performance judgments and helping behavior. Personality and Social Psychology Bulletin, 29(3), 418-429.
In this study, participants have a choice to give easy or difficult hints to a confederate after performing on a different task. The strong result shows an interaction effect between performance feedback and the way participants are rewarded for their performance. When their reward is contingent on the performance of the other students, participants gave easier hints after they received positive feedback and harder hints after they received negative feedback.
Phillips, K. W. (2003). The effects of categorically based expectations on minority influence: The importance of congruence. Personality and Social Psychology Bulletin, 29(1), 3-13.
This strong effect shows that participants were surprised when an in-group member disagreed with their opinion in a hypothetical scenario in which they made decisions with an in-group and an out-group member, z = 8.67.
Seta, J. J., Seta, C. E., & McElroy, T. (2003). Attributional biases in the service of stereotype maintenance: A schema-maintenance through compensation analysis. Personality and Social Psychology Bulletin, 29(2), 151-163.
The strong effect in Study 1 reflects different attributions of a minister’s willingness to volunteer for a charitable event. Participants assumed that the motives were more selfish and different from motives of other ministers if they were told that the minister molested a young boy and sold heroin to a teenager. These effects were qualified by a Target Identity × Inconsistency interaction, F(1, 101) = 39.80, p < .001. This interaction was interpreted via planned comparisons. As expected, participants who read about the aberrant behaviors of the minister attributed his generosity in volunteering to the dimension that was more inconsistent with the dispositional attribution of ministers—impressing others (M = 2.26)—in contrast to the same target control participants (M= 4.62), F(1, 101) = 34.06, p < .01.
Trope, Y., Gervey, B., & Bolger, N. (2003). The role of perceived control in overcoming defensive self-evaluation. Journal of Experimental Social Psychology, 39(5), 407-419.
Study 2 manipulated perceptions of changeability of attributes and valence of feedback. A third factor were self-reported abilities. The 2 way interaction showed that participants were more interested in feedback about weaknesses when attributes were perceived as changeable, z = 4.74. However, the critical test was the three-way interaction with self-perceived abilities, which was weaker and not a fully experimental design, F(1, 176) = 6.34, z = 2.24.
Brambilla, M., Sacchi, S., Pagliaro, S., & Ellemers, N. (2013). Morality and intergroup relations: Threats to safety and group image predict the desire to interact with outgroup and ingroup members. Journal of Experimental Social Psychology, 49(5), 811-821.
Three strong results come from this study of morality (zs > 5). In hypothetical scenarios, participants were presented with moral and immoral targets and asked about their behavioral intentions how they would interact with them. All studies showed that participants were less willing to engage with immoral targets. Other characteristics that were manipulated had no effect.
Mason, M. F., Lee, A. J., Wiley, E. A., & Ames, D. R. (2013). Precise offers are potent anchors: Conciliatory counteroffers and attributions of knowledge in negotiations. Journal of Experimental Social Psychology, 49(4), 759-763.
This study showed that recipients of a rounded offer make larger adjustments to the offer than recipients of more precise offers, z = 4.15. This effect was demonstrated in several studies. This is the strongest evidence, in part, because the sample size was the largest. So, if you put your house up for sale, you may suggest a sales price of $491,307 rather than $500,000 to get a higher counteroffer.
Pica, G., Pierro, A., Bélanger, J. J., & Kruglanski, A. W. (2013). The Motivational Dynamics of Retrieval-Induced Forgetting A Test of Cognitive Energetics Theory. Personality and Social Psychology Bulletin, 39(11), 1530-1541.
The strong effect for this analysis is a within-subject main effect. The critical effect was a mixed design three-way interaction. This effect was weaker. “.05). Of greatest importance, the three-way interaction between retrieval-practice repetition, need for closure, and OSPAN was significant, β = −.24, t = −2.25, p < .05.”
Preston, J. L., & Ritter, R. S. (2013). Different effects of religion and God on prosociality with the ingroup and outgroup. Personality and Social Psychology Bulletin, ###.
This strong effect, z = 4.59, showed that participants thought a religious leader would want them to help a family that belongs to their religious group, whereas God would want them to help a family that does not belong to the religious group (cf. These values were analyzed by one-way ANOVA on Condition (God/Leader), F(1, 113) = 23.22, p < .001, partial η2= .17. People expected the religious leader would want them to help the religious ingroup family (M = 6.71, SD = 2.67), whereas they expected God would want them to help the outgroup family (M = 4.39, SD = 2.48)). I find the dissociation between God and religious leaders interesting. The strength of the effect makes me belief that this is a replicable finding.
Sinaceur, M., Adam, H., Van Kleef, G. A., & Galinsky, A. D. (2013). The advantages of being unpredictable: How emotional inconsistency extracts concessions in negotiation. Journal of Experimental Social Psychology, 49(3), 498-508.
Study 2 produced a notable effect of manipulating emotional inconsistency on self-ratings of “sense of unpredictability” (z = 4.88). However, the key dependent variable was concession making. The effect on concession making was not as strong, F(1, 151) = 7.29, z = 2.66.
Newheiser, A. K., & Barreto, M. (2014). Hidden costs of hiding stigma: Ironic interpersonal consequences of concealing a stigmatized identity in social interactions. Journal of Experimental Social Psychology, 52, 58-70.
Participants in this study were either told to reveal their major of study or to falsely report that they are medical students. The strong effect shows that participants who were told to lie reported feeling less authentic, z = 4.51. The effect on a second dependent variable, “belonging” (I feel accepted) was weaker, t(54) = 2.54, z = 2.20.
PERSONALITY AND SOCIAL PSYCHOLOGY BULLETIN
Simon, B., & Stürmer, S. (2003). Respect for group members: Intragroup determinants of collective identification and group-serving behavior. Personality and Social Psychology Bulletin, 29(2), 183-193.
The strong effect in this study shows a main effect of respectful vs. disrespectful feedback from a group-member on collective self-esteem; that is, feeling good about being part of the group. As predicted, a 2 x2 ANOVA revealed that collective identification (averaged over all 12 items; Cronbach’s = .84) was stronger in the respectful-treatment condition than in the disrespectful-treatment condition,M(RESP) = 3.54,M(DISRESP) = 2.59, F(1, 159) = 48.75, p < .001.
Craig, M. A., & Richeson, J. A. (2014). More diverse yet less tolerant? How the increasingly diverse racial landscape affects white Americans’ racial attitudes. Personality and Social Psychology Bulletin, 40(6) 750–761.
Two strong effects are based on studies that aimed to manipulate responses to the race IAT with stories about shifting demographics in the United States. However, the test statistics are based on the comparison of IAT scores against a value of zero and not the comparison of the experimental group and the control group. The relevant results are, t(26) = 2.07, p = .048, d = 0.84 in Study 2a and t(23) = 2.80, p = .01, d = 1.13 in Study 2b. These results are highly questionable because it is unlikely to obtain just significant results in a pair of studies. In addition, the key finding in Study 1 is also just significant, t(84) = 2.29, p = .025, as is the finding in Study 3, F(1,366) = 5.94, p = .015.
Hung, I. W., & Wyer, R. S. (2014). Effects of self-relevant perspective-taking on the impact of persuasive appeals. Personality and Social Psychology Bulletin, 40(3), 402-414.
Participants viewed a donation appeal from a charity called Pangaea. The one-page appeal described the problem of child trafficking and was either self-referential or impersonal. The strong effect was that participants in the self-referential condition were more likely to imagine themselves in the situation of the child. Participants were more likely to imagine themselves being trafficked when the appeal was self-referential than when it was impersonal (M = 4.78, SD =2.95 vs. M = 3.26, SD = 2.96 respectively), F(1, 288) = 17.42, p < .01, ω2 = .041, and this difference did not depend on the victims’ ethnicity (F < 1). Thus, taking the victims’ perspective influenced participants’ tendency to imagine themselves being trafficked without thinking about their actual similarity to the victims that were portrayed. The effect on self-reported likelihood of helping was weaker. Participants reported greater urge to help when the appeal encouraged them to take the protagonists’ perspective than when it did not (M = 5.83, SD = 2.04 vs. M = 5.18, SD = 2.31), F(1, 288) = 5.68, p < .02, ω2 = .013
Lick, D. J., & Johnson, K. L. (2014).” You Can’tTell Just by Looking!” Beliefs in the Diagnosticity of Visual Cues Explain Response Biases in Social Categorization. Personality and Social Psychology Bulletin,
The main effect of social category dimension was significant, F(1, 164) = 47.30, p < .001, indicating that participants made more stigmatizing categorizations in the sex condition (M = 15.94, SD = 0.45) relative to the religion condition (M = 12.42, SD = 4.64). This result merely shows that participants were more likely to indicate that a woman is a woman than that an atheist is an atheist based on a photograph of a person. This finding would be expected based on the greater visibility of gender than religion.
Bastian, B., Jetten, J., Chen, H., Radke, H. R., Harding, J. F., & Fasoli, F. (2013). Losing our humanity the self-dehumanizing consequences of social ostracism. Personality and Social Psychology Bulletin, 39(2), 156-169.
The strong effect in this study reveals that participants rated ostracizing somebody more immoral than a typical everyday interaction. An ANOVA, with condition as the between-subjects variable, revealed that condition had an effect on perceived immorality, F(1, 51) = 39.77, p < .001, η2 = .44, indicating that participants felt the act of ostracizing another person was more immoral (M = 3.83, SD = 1.80) compared with having an everyday interaction (M = 1.37, SD = 0.81).
Kifer, Y., Heller, D., Perunovic, W. Q. E., & Galinsky, A. D. (2013). The good life of the powerful the experience of power and authenticity enhances subjective well-being. Psychological science, 24(3), 280-288.
This strong effect is a manipulation check. The focal test provides much weaker evidence for the claim that authenticity increases wellbeing. The manipulation was successful. Participants in the high-authenticity condition (M = 4.57, SD = 0.62) reported feeling more authentic than those in the low-authenticity condition (M = 2.70, SD = 0.74), t(130) = 15.67, p < .01, d = 2.73. As predicted, participants in the high-authenticity condition (M = 0.38, SD = 1.99) reported higher levels of state SWB than those in the low-authenticity condition (M = −0.46, SD = 2.12), t(130) = 2.35, p < .05, d = 0.40.
Lerner, J. S., Li, Y., & Weber, E. U. (2012). The financial costs of sadness. Psychological science, 24(1) 72–79.
Again, the strong effect is a manipulation check. The emotion-induction procedure was effective in both magnitude and specificity. Participants in the sad-state condition reported feeling more sadness (M = 3.72) than neutrality (M = 1.66), t(78) = 6.72, p < .0001. The critical test that sadness leads to financial losses produced a just significant result. Sad participants were more impatient (mean = .21, median = .04) than neutral participants (mean = .28, median = .19; Mann- Whitney z = 2.04, p = .04).
Tang, S., Shepherd, S., & Kay, A. C. (2014). Do Difficult Decisions Motivate Belief in Fate? A Test in the Context of the 2012 US Presidential Election. 25(4), 1046-1048.
A manipulation check confirmed that participants in the similar-candidates condition saw the candidates as more similar (M = 4.41, SD = 0.80) than did participants in the different-candidates condition (M = 3.24, SD = 0.76), t(180) = 10.14, p < .001. The critical test was not statistically significant. As predicted, participants in the similar-candidates condition reported greater belief in fate (M = 3.45, SD = 1.46) than did those in the different-candidates condition (M = 3.04, SD = 1.44), t(180) = 1.92, p = .057
Caruso, E. M., Van Boven, L., Chin, M., & Ward, A. (2013). The temporal Doppler effect when the future feels closer than the past. Psychological science, 24(4) 530–536.
The strong effect revealed that participants view an event (Valentine’s Day) in the future closer to the present than an event in the past. Valentine’s Day was perceived to be closer 1 week before it happened than 1 week after it happened, t(321) = 4.56, p < .0001, d = 0.51 (Table 1). The effect met the criterion of z > 4 because the sample size was large, N = 323, indicating that experimental social psychology could benefit from larger samples to produce more credible results.
Galinsky, A. D., Wang, C. S., Whitson, J. A., Anicich, E. M., Hugenberg, K., & Bodenhausen, G. V. (2013). The reappropriation of stigmatizing labels the reciprocal relationship between power and self-labeling. Psychological science, 24(10)
The strong effect showed that participants rated a stigmatized group as having more power of a stigmatized label if they used the label themselves than when it was used by others. The stigmatized out-group was seen as possessing greater power over the label in the self-label condition (M = 5.14, SD = 1.52) than in the other-label condition (M = 3.42, SD = 1.76), t(233) = 8.04, p < .001, d = 1.05. The effect on evaluations of the label was weaker. The label was also seen as less negative in the self-label condition (M = 5.61, SD = 1.37) than in the other-label condition (M = 6.03, SD = 1.19), t(233) = 2.46, p = .01, d = 0.33. And the weakest evidence was provided for a mediation effect. We tested whether perceptions of the stigmatized group’s power mediated the link between self-labeling and stigma attenuation. The bootstrap analysis was significant, 95% bias-corrected CI = [−0.41, −0.01]. A value of 0 rather than -0.01 would render this finding non-significant. The t-value for this analysis can be approximated by dividing the mean of the confidence boundaries (-.42/2 = -.21), by an estimate of sampling error (-.21- (-0.01) = -.20 / 2 = -.10). The ratio is an estimate of the signal to noise ratio (-.21 / -.10 = 2.1). With N = 235 this t-value is similar to the z-score and the effect can be considered just significant. This result is consistent with the weak evidence in the other 7 studies in this article.
JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY-Attitudes and Social Cognition
[This journal published Bem’s (2011) alleged evidence in support of extrasensory perceptions]
Ruder, M., & Bless, H. (2003). Mood and the reliance on the ease of retrieval heuristic. Journal of Personality and Social Psychology, 85(1), 20.
The strong effect in this study is based on a contrast analysis for the happy condition. Supporting the first central hypothesis, happy participants responded faster than sad participants (M = 9.81 s vs. M = 14.17 s), F(1, 59) = 23.46, p < .01. The results of the 2 x 2 ANOVA analysis are reported subsequently. The differential impact of number of arguments on happy versus sad participants is reflected in a marginally significant interaction, F(1, 59) = 3.59, p = .06.
Fazio, R. H., Eiser, J. R., & Shook, N. J. (2004). Attitude formation through exploration: valence asymmetries. Journal of personality and social psychology, 87(3), 293.
The strong effect in this study reveals that participants approached stimuli that they were told were rewarding. When they learned by experience that this was not the case, they stopped approaching them. However, when they were told that stimuli were bad, they avoided them and were not able to learn that the initial information was wrong. This resulted in an interaction effect for prior (true or false) and actual information. More important, the predicted interaction was highly significant as well, F(1, 71) =18.65, p< .001.
Pierro, A., Mannetti, L., Kruglanski, A. W., & Sleeth-Keppler, D. (2004). Relevance override: on the reduced impact of” cues” under high-motivation conditions of persuasion studies. Journal of Personality and Social Psychology, 86(2), 251.
The strong effect in this study is based on a contrast effect following an interaction effect. Consistent with our hypothesis, the first informational set exerted a stronger attitudinal impact in the low accountability condition, simple F(1, 42) = 19.85, p < .001, M = (positive) 1.79 versus M (negative) = -.15. The pertinent two-way interaction effect was not as strong. The interaction between accountability and valence of first informational set was significant, F(1, 84) = 5.79, p = .018. For study 2, an effect of F(1,43) = 18.66 was used, but the authors emphasized the importance of the four-way interaction. Of greater theoretical interest was the four-way interaction between our independent variables, F(1, 180) = 3.922, p = .049. For Study 3, an effect of F(1,48) = 18.55 was recorded, but the authors emphasize the importance of the four-way interaction, which was not statistically significant. Of greater theoretical interest is the four-way interaction between our independent variables, F(1, 176) = 3.261, p = .073.
Clark, C. J., Luguri, J. B., Ditto, P. H., Knobe, J., Shariff, A. F., & Baumeister, R. F. (2014). Free to punish: A motivated account of free will belief. Journal of personality and social psychology, 106(4), 501.
The strong effect in this study by my prolific friend Roy Baumeister is due to the finding that participants want to punish a robber more than an aluminum can forager. Participants also wanted to punish the robber (M _ 4.98, SD =1.07) more than the aluminum can forager (M = 1.96, SD = 1.05), t(93) = 13.87, p = .001. Less impressive is the evidence that beliefs about free will are influenced by reading about a robber or an aluminum can forager. Participants believed significantly more in free will after reading about the robber (M = 3.68, SD = 0.70) than the aluminum can forager (M _ 3.38, SD _ 0.62), t(90) = 2.23, p = .029, d = 0.47.
Yan, D. (2014). Future events are far away: Exploring the distance-on-distance effect. Journal of Personality and Social Psychology, 106(4), 514.
This strong effect reflects an effect of a construal level manipulation (thinking in abstract or concrete terms) on temporal distance judgments. The results of a 2 x 2 ANOVA on this index revealed a significant main effect of the construal level manipulation only, F(1, 118) = 23.70, p < .001. Consistent with the present prediction, participants in the superordinate condition (M = 1.81) indicated a higher temporal distance than those in the subordinate condition (M = 1.58).