The 2010s have seen a replication crisis in social psychology (Schimmack, 2020). The main reason why it is difficult to replicate results from social psychology is that researchers used questionable research practices (QRPs, John et al., 2012) to produce more significant results than their low-powered designs warranted. A catchy term for these practices is p-hacking (Simonsohn, 2014).
New statistical techniques made it possible to examine whether published results were obtained with QRPs. In 2012, I used the incredibility index to show that Bem (2011) used QRPs to provide evidence for extrasensory perception (Schimmack, 2012). In the same article, I also suggested that Gailliot, Baumeister, DeWall, Maner, Plant, Tice, and Schmeichel (2007) used QRPs to present evidence that suggested will-power relies on blood glucose levels. During the review process of my manuscript, Baumeister confirmed that QRPs were used (cf. Schimmack, 2014). Baumeister defended these practices by stating that they were the norm in social psychology and were not considered unethical.
The revelation that research practices were questionable casts a shadow on the history of social psychology. However, many also saw it as an opportunity to change and improve these practices (Świątkowski and Dompnier, 2017). Over the past decade, the evaluation of QRPs has changed. Many researchers now recognize that these practices inflate error rates, make published results difficult to replicate, and undermine the credibility of psychological science (Lindsay, 2019).
However, there are no general norms regarding these practices and some researchers continue to use them (e.g., Adam D. Galinsky, cf. Schimmack, 2019). This makes it difficult for readers of the social psychological literature to tell which research can be trusted, and the question has to be examined on a case-by-case basis. In this blog post, I examine the responses of Baumeister, Vohs, DeWall, and Schmeichel to the replication crisis and concerns that their results provide false evidence about the causes of will-power (Friese, Loschelder, Gieseler, Frankenbach, & Inzlicht, 2019; Inzlicht, 2016).
To examine this question scientifically, I use test-statistics that are automatically extracted from psychology journals. I divide the test-statistics into those that were obtained until 2012, when awareness about QRPs emerged, and those published after 2012. The test-statistics are examined using z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Results provide information about the expected replication rate and discovery rate. The use of QRPs is examined by comparing the observed discovery rate (how many published results are significant) to the expected discovery rate (how many tests that were conducted produced significant results).
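The comparison of discovery rates can be illustrated with a minimal Python sketch. It only covers the simple parts: the observed discovery rate is the share of reported test statistics that are significant, and an expected discovery rate implies how many non-significant tests should exist for every significant one. The expected discovery rate itself comes from fitting the z-curve model, which is not reproduced here. The function names and the example values are assumptions made for illustration and are not data from this post.

```python
from statistics import NormalDist

def observed_discovery_rate(z_scores, alpha=0.05):
    """Share of reported test statistics that are significant (two-sided)."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return sum(abs(z) > crit for z in z_scores) / len(z_scores)

def file_drawer_ratio(edr):
    """Non-significant tests expected per significant test, given an EDR.

    The EDR is taken as an input; estimating it requires the z-curve model.
    """
    return (1 - edr) / edr

# Hypothetical absolute z-scores, for illustration only.
example_z = [0.4, 1.2, 2.1, 2.3, 2.6, 3.0, 1.7, 2.0, 4.1, 2.2]
odr = observed_discovery_rate(example_z)

# Hypothetical EDR of 25%: three non-significant tests per significant test.
ratio = file_drawer_ratio(0.25)
```

An observed discovery rate well above the expected discovery rate indicates that more non-significant tests were conducted than were reported.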
Roy F. Baumeister’s replication rate was 60% (53% to 67%) before 2012 and 65% (57% to 74%) after 2012. The overlap of the 95% confidence intervals indicates that this small increase is not statistically reliable. Before 2012, the observed discovery rate was 70% and it dropped to 68% after 2012. Thus, there is no indication that non-significant results are reported more after 2012. The expected discovery rate was 32% before 2012 and 25% after 2012. Thus, there is also no improvement in the expected discovery rate, and the expected discovery rate is much lower than the observed discovery rate. This discrepancy shows that QRPs were used before 2012 and after 2012. The 95% confidence intervals of the observed and expected discovery rates do not overlap, either before or after 2012, indicating that this discrepancy is statistically significant. Figure 1 shows the influence of QRPs when the observed non-significant results (histogram of z-scores below 1.96 in blue) are compared to the model prediction (grey curve). The discrepancy suggests a large file drawer of unreported statistical tests.
An old saying is that you can’t teach an old dog new tricks. So, the more interesting question is whether the younger contributors to the glucose paper changed their research practices.
The results for C. Nathan DeWall show no notable response to the replication crisis (Figure 2). The expected replication rate increased slightly from 61% to 65%, but the difference is not significant and visual inspection of the plots suggests that it is mostly due to a decrease in reporting p-values just below .05. One reason for this might be a new goal to p-hack at least to the level of .025 to avoid detection of p-hacking by p-curve analysis. The observed discovery rate is practically unchanged from 68% to 69%. The expected discovery rate increased only slightly from 28% to 35%, but the difference is not significant. More important, the expected discovery rates are significantly lower than the observed discovery rates before and after 2012. Thus, there is evidence that DeWall used questionable research practices before and after 2012, and there is no evidence that he changed his research practices.
The results for Brandon J. Schmeichel are even more discouraging (Figure 3). Here the expected replication rate decreased from 70% to 56%, although this decrease is not statistically significant. The observed discovery rate decreased significantly from 74% to 63%, which shows that more non-significant results are reported. Visual inspection shows that this is particularly the case for test-statistics close to zero. Further inspection of the articles would be needed to see how these results are interpreted. More important, the expected discovery rates are significantly lower than the observed discovery rates before 2012 and after 2012. Thus, there is evidence that QRPs were used before and after 2012 to produce significant results. Overall, there is no evidence that research practices changed in response to the replication crisis.
The results for Kathleen D. Vohs also show no response to the replication crisis (Figure 4). The expected replication rate dropped slightly from 62% to 58%; the difference is not significant. The observed discovery rate dropped slightly from 69% to 66%, and the expected discovery rate decreased from 43% to 31%, although this difference is also not significant. Most important, the observed discovery rates are significantly higher than the expected discovery rates before 2012 and after 2012. Thus, there is clear evidence that questionable research practices were used before and after 2012 to inflate the discovery rate.
After concerns about research practices and replicability emerged in the 2010s, social psychologists have debated this issue. Some social psychologists changed their research practices to increase statistical power and replicability. However, other social psychologists have denied that there is a crisis and attributed replication failures to a number of other causes. Not surprisingly, some social psychologists also did not change their research practices. This blog post shows that Baumeister and his students have not changed research practices. They are able to publish questionable research because there has been no collective effort to define good research practices, to ban questionable practices, and to treat the hiding of non-significant results as a breach of research ethics. Thus, Baumeister and his students are simply exerting their right to use questionable research practices, whereas others voluntarily implemented good open-science practices. Given the freedom of social psychologists to decide which practices they use, social psychology as a field continues to have a credibility problem. Editors who accept questionable research in their journals are undermining the credibility of their journal. Authors are well advised to publish in journals that emphasize replicability and credibility with open science badges and with a high replicability ranking (Schimmack, 2019).
2.17.2020 [the blog post has been revised after I received reviews of the ms. The reference list has been expanded to include all major viewpoints and influential articles. If you find something important missing, please let me know.]
Bem’s (2011) article triggered a replication crisis in social psychology. A major replication project found that only 25% of results in social psychology could be replicated. I examine various explanations for this low replication rate and find most of them lacking in empirical support. I then provide evidence that the use of questionable research practices (QRPs) accounts for this result. Using z-curve (Brunner & Schimmack, 2019) and a representative sample of focal hypothesis tests (Motyl et al., 2017), I find that the expected replication rate for social psychology is between 20 and 45 percent. I argue that revealing QRPs and quantifying replicability can provide an incentive to use good research practices and to invest more resources in studies that produce replicable results. The replication crisis in social psychology provides important lessons for other disciplines in psychology that have avoided taking a closer look at their research practices. If psychology wants to be a science, it needs to improve research practices and ensure that published results can falsify theoretical predictions.
Keywords: Replication, Replicability, Replicability Crisis, Expected Replication Rate, Expected Discovery Rate, Questionable Research Practices, Power, Social Psychology
The Big Bang
The 2010s started with a bang. Journal clubs were discussing the preprint of Bem’s (2011) article “Feeling the future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Psychologists were confronted with a choice. Either they had to believe in anomalous effects or they had to believe that psychology was an anomalous science. Ten years later, it is possible to look back at Bem’s article with the hindsight of 2020.
It is now clear that Bem used questionable practices to produce false evidence for his outlandish claims (Francis, 2012; Schimmack, 2012, 2018b, 2020). Moreover, it has become apparent that these practices were the norm and that many other findings in social psychology cannot be replicated. This realization has led to initiatives to change research practices so that they produce more credible and replicable results. The speed and the extent of these changes have been revolutionary. Akin to the cognitive revolution in the 1960s and the affective revolution in the 1980s, the 2010s have witnessed a method revolution. Two new journals were created that focus on methodological problems and improvements of research practices: Meta-Psychology (MP) and Advances in Methods and Practices in Psychological Science (AMPPS).
In this review, I focus on replication failures in experimental social psychology and the different explanations for these failures. I argue that the use of questionable research practices accounts for many replication failures, and I examine how social psychologists have responded to evidence that QRPs undermine the trustworthiness of social psychological results. Other disciplines may learn from these lessons and may need to reform their research practices in the coming decade.
Arguably, the most important development in psychology has been the normalization of publishing replication failures. When Bem (2011) published his abnormal results supporting paranormal phenomena, researchers quickly failed to replicate these sensational results. However, they had a hard time publishing these results. The Editor of Journal of Personality and Social Psychology at that time, Eliot Smith, did not even send the manuscript out for review. This attempt to suppress negative evidence failed for two reasons. First, online-only journals with unlimited journal space like PLOS ONE or Frontiers were more than happy to publish these articles (Ritchie, Wiseman, & French, 2012). Second, the decision to reject the replication studies was made public and created a lot of attention because Bem’s article had attracted so much attention (Aldhous, 2011). This created social pressure and in 2012, JPSP did publish replication failures of Bem’s results (Galak, LeBoeuf, Nelson, & Simmons, 2012).
Over the past decade, new article formats, such as registered reports (Chambers, 2013) and registered replication reports (Association for Psychological Science, 2015), have evolved that make it easier to publish articles that fail to confirm theoretical predictions. Registered reports are articles that are accepted for publication before the results are known; this avoids the publication bias that arises when only confirmatory findings are published. Scheel, Schijen, and Lakens (2020) found that this format reduced the rate of significant results from over 90% to about 50%. This difference suggests that the normal literature has a strong bias to publish significant results (Bakker, van Dijk, & Wicherts, 2012; Sterling, 1959; Sterling et al., 1995).
Registered replication reports (RRRs) are registered reports that aim to replicate an original study in a high-powered study with many laboratories. Although registered replication reports can produce significant (p < alpha) and non-significant (p > alpha) results, they have mostly produced replication failures (Kvarven, Strømland, & Johannesson, 2019). These failures are especially stunning because RRRs had a much higher chance to produce a significant result than the original studies with much smaller samples. Thus, the fact that RRRs of ego-depletion (Hagger et al., 2016) or facial feedback (Wagenmakers et al., 2016) produced non-significant results with thousands of participants was surprising, to say the least.
Replication failures of specific studies are important for specific theories, but they do not examine the crucial question whether these failures are anomalies or symptomatic of a wider problem in psychological science. Answering this broader question requires a representative sample of studies from the population of results published in psychology journals. Given the diversity of psychology, this is a monumental task.
A first step towards this goal was the Reproducibility Project that focused on results published in three psychology journals in the year 2008. The journals represented social/personality psychology (Journal of Personality and Social Psychology, JPSP), cognitive psychology (Journal of Experimental Psychology: Learning, Memory, and Cognition, JEP:LMC), and all areas of psychology (Psychological Science). Although all articles published in 2008 were eligible, not all studies were replicated, in part because some studies were very expensive or difficult to replicate. In the end, 97 studies with significant results were replicated. The headline finding was that only 37% of the replication studies replicated a statistically significant result.
This finding has been widely cited as evidence that psychology has a replication problem. However, headlines tend to gloss over the fact that results varied as a function of discipline. While the success rate for cognitive psychology was 50% and even higher for typical within-subject designs with many observations per participant, the success rate was only 25% for social psychology, and even lower for the typical between-subject design that was employed to study ego-depletion, facial feedback, or other prominent effects in social psychology.
These results do not warrant the broad claim that psychology has a replication crisis or that most results published in psychology are false. A more nuanced conclusion is that social psychology has a replication crisis and that methodological factors account for these differences. Disciplines that rely on within-subject designs with many repeated measures or intervention studies with a pre-post design are likely to suffer less than disciplines that compare a single measure across participants.
To conclude, the 2010s have seen a rise in publications of non-significant results that fail to replicate original results and that contradict theoretical predictions. The evidence produced by these studies has demonstrated a replication crisis in social psychology, but not in cognitive psychology. Other areas have been slow to investigate the replicability of their published results.
Responses to the Replication Crisis in Social Psychology
There have been numerous responses to the replication crisis in social psychology. Broadly they can be classified as arguments that support the notion of a crisis and arguments that claim that there is no crisis. I first discuss problems with no-crisis arguments. I then examine the pro-crisis arguments and discuss their implications for the future of psychology as a science.
No Crisis: Downplaying the Finding
Some social psychologists have argued that the term crisis is inappropriate and overly dramatic. “Every generation or so, social psychologists seem to enjoy experiencing a “crisis.” While sympathetic to the underlying intentions underlying these episodes—first the field’s relevance, then the field’s methodological and statistical rigor – the term crisis seems to me overly dramatic. Placed in a positive light, social psychology’s presumed “crises” actually marked advances in the discipline.” (Pettigrew, 2018). Others use euphemistic and vague descriptions of the low replication rate in social psychology. For example, Fiske (2017) notes that “like other sciences, not all our effects replicate” (p. 654). Crandall and Sherman (2016) note that the number of successful replications in social psychology was “at a lower rate than expected” (p. 94).
These comments downplay the stunning finding that only 25% of social psychology results could be replicated. Rather than admitting that there is a problem, these social psychologists find fault with critics of social psychology. “I have been proud of the professional stance of social psychology throughout my long career. But unrefereed blogs and social media attacks sent to thousands can undermine the professionalism of the discipline.” (Pettigrew, 2018, p. 967). I would argue that lecturing thousands of students each year based on evidence that is not replicable is a bigger problem than talking openly about the low replicability of social psychology on social media.
No Crisis: Experts can Reliably Produce Effects
After some influential priming results could not be replicated, Daniel Kahneman wrote a letter to John Bargh (Yong, 2012) and suggested that leading priming researchers should conduct a series of replication studies to demonstrate that their original results are replicable. In response, John Bargh and other prominent social psychologists conducted numerous studies that showed the effects are robust. At least, this is what might have happened in an alternate universe. In this universe, there have been few attempts to self-replicate original findings. Bartlett (2013) asked Bargh why he did not prove his critics wrong by doing the study again. The answer is not particularly convincing.
“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says”
A few self-replications ended with a replication failure (Elkins-Brown, Saunders, & Inzlicht, 2018). One notable successful self-replication was conducted by Petty and colleagues (Luttrell, Petty, & Xu, 2017), after a replication study by Ebersole et al. (2016) failed to replicate a seminal finding by Cacioppo, Petty, and Morris (1983). Luttrell et al. were able to replicate the original finding by Cacioppo et al. and they reproduced the non-significant result of Ebersole et al.’s replication study. In addition, they found a significant interaction, indicating that procedural differences made the effect weaker in Ebersole et al.’s replication study. This study has been celebrated as an exemplary way to respond to replication failures. It also suggests that flaws in replication studies are sometimes responsible for replication failures. However, it is impossible to generalize from this single instance to other replication failures. Thus, it remains unclear how many replication failures were caused by problems with the replication studies.
No Crisis: Decline Effect
The idea that replication failures occur because effects weaken over time was proposed by Jonathan Schooler and popularized in a New Yorker article (Lehrer, 2010). Schooler coined the term decline effect for the observation that effect sizes often decrease over time. Unfortunately, it does not work for more mundane behaviors like eating cheesecake. No matter how often you eat cheesecake, it still adds calories and pounds to your weight. However, for more elusive effects like social priming or verbal overshadowing, it seems to be easier to discover effects than to replicate them (Wegner, 1992). This is also true for Schooler and Engstler-Schooler’s (1990) verbal overshadowing effect. A registered replication report replicated a statistically significant effect, but with smaller effect sizes (Alogna et al., 2014). Schooler (2014) considered this finding a win-win because his original results had been replicated and the reduced effect size supported the presence of a decline effect. However, the notion of a decline effect is misleading because it merely describes a phenomenon rather than providing an explanation for it. Schooler (2014) offered several possible explanations. One possible explanation was regression to the mean (see next paragraph). A second explanation was that slight changes in experimental procedures can reduce effect sizes (more detailed discussion below). More controversially, Schooler also alludes to the possibility that some paranormal processes may produce a decline effect. “Perhaps, there are some parallels between VO [verbal overshadowing] effects and parapsychology after all, but they reflect genuine unappreciated mechanisms of nature (Schooler, 2011) and not simply the product of publication bias or other artifact” (p. 582). Schooler, however, fails to acknowledge that a mundane explanation for the decline effect is the use of questionable research practices that inflate effect size estimates in original studies.
Using statistical tools, Francis (2012) showed that Schooler’s original verbal overshadowing studies showed signs of bias. Thus, there is no need to look for a paranormal explanation of the decline effect in verbal overshadowing. The normal practice of selectively publishing only significant results is sufficient to explain it. In sum, the decline effect is descriptive rather than explanatory and Schooler’s suggestion that it reflects some paranormal phenomena is not supported by scientific evidence.
No Crisis: Regression to the Mean is Normal
Regression to the mean has been invoked as one possible explanation for the decline effect (Fiedler, 2015; Schooler, 2014). Fiedler’s argument is that random measurement error in psychological measures is sufficient to produce replication failures. However, random measurement error is neither necessary nor sufficient to produce replication failures. The outcome of a replication study is determined solely by the study’s statistical power, and if the replication study is an exact replication of an original study, both studies have the same amount of random measurement error and power (Brunner & Schimmack, 2019). Thus, if the OSC project found 97 significant results in 100 published studies, the observed discovery rate of 97% suggests that the studies had 97% power to obtain a significant result. Random measurement error would have the same effect on power and therefore have the same effect on the outcome of original studies and replication studies. Therefore, Fiedler’s claim that random measurement error alone explains replication failures is simply wrong and based on a misunderstanding of statistics.
Moreover, regression to the mean requires that studies were selected for significance. Schooler (2014) ignores this aspect of regression to the mean when he suggests that regression to the mean is normal and expected. It is not. The effect sizes of eating cheesecake do not decrease over time because there is no selection process. In contrast, the effect sizes of social psychological experiments decrease when original articles selected significant results and replication studies do not select for significance. Thus, it is not normal for success rates to decrease from 97% to 25%, just like it would not be normal for a basketball player’s free-throw percentage to drop from 97% to 25%. In conclusion, regression to the mean implies that original studies were selected for significance and would suggest that replication failures are produced by questionable research practices. Regression to the mean therefore becomes an argument why there is a crisis once it is recognized that it requires selective reporting of significant results, which implies that the success rates of 90% or more in psychology journals are illusory.
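The role of selection can be illustrated with a small simulation. The sketch below is not an analysis of any real data set; the true effect size, the sample size, and the number of simulated studies are assumptions chosen for illustration. Every simulated original study has an exact replication with the same power. Effect size estimates only shrink from original to replication when the originals are first selected for significance.

```python
import random
from statistics import NormalDist, mean

def simulate(true_d=0.2, n_per_group=20, n_studies=20000, seed=1):
    """Simulate original studies and exact replications of a two-group design.

    All parameter values are hypothetical. Effect size estimates are drawn
    from a normal approximation with standard error sqrt(2 / n_per_group).
    """
    rng = random.Random(seed)
    se = (2 / n_per_group) ** 0.5
    crit = NormalDist().inv_cdf(0.975)  # about 1.96

    originals = [rng.gauss(true_d, se) for _ in range(n_studies)]
    replications = [rng.gauss(true_d, se) for _ in range(n_studies)]

    # Selection for significance: keep only originals with z > 1.96.
    selected = [(o, r) for o, r in zip(originals, replications) if o / se > crit]

    return {
        "all_original_mean": mean(originals),
        "all_replication_mean": mean(replications),
        "selected_original_mean": mean(o for o, _ in selected),
        "selected_replication_mean": mean(r for _, r in selected),
        "replication_success_rate": mean(r / se > crit for _, r in selected),
    }

results = simulate()
```

Without selection, the average original and the average replication estimate are both close to the true effect. After selection, the published originals overestimate the effect, while their replications return to the true value and succeed only at the rate implied by the actual power.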
No Crisis: Exact Replications are Impossible
Heraclitus, an ancient Greek philosopher, observed that you can never step into the same river twice. Similarly, it is impossible to exactly recreate the conditions of a psychological experiment. This trivial observation has been used to argue that replication failures are neither surprising nor problematic, but rather the norm. We should never expect to get the same result from the same paradigm because the actual experiments are never identical, just like a river is always changing (Stroebe & Strack, 2014). This argument has led to a heated debate about the distinction and value of direct versus conceptual replication studies (Crandall & Sherman, 2016; Zwaan, Etz, Lucas, & Donnellan, 2018; Pashler & Harris, 2012).
The purpose of direct replication studies is to replicate an original study as closely as possible so that replication failures can correct false results in the literature (Pashler & Harris, 2012). However, journals were reluctant to publish replication failures. Thus, a direct replication had little value. Either the results were not significant or they were not novel. In contrast, conceptual replication studies were publishable as long as they produced a significant result. Thus, publication bias provides an explanation for many seemingly robust findings (Bem, 2011) that suddenly cannot be replicated (Galak et al., 2012). After all, it is simply not plausible that conceptual replications that intentionally change features of a study are always successful, while direct replications that try to reproduce the original conditions as closely as possible fail in large numbers.
The argument that exact replications are impossible also ignores the difference between disciplines. Why is there no replication crisis in cognitive psychology, if each experiment is like a new river? And why does eating cheesecake always lead to a weight gain, no matter whether it is chocolate cheesecake, raspberry white-truffle cheesecake, or caramel fudge cheesecake? The reason is that the main features of rivers remain the same. Even if the river is not identical, you still get wet every time you step into it.
To explain the higher replicability of results in cognitive psychology than in social psychology, Van Bavel et al. (2016) proposed that social psychological studies are more difficult to replicate for a number of reasons. They called this property of studies contextual sensitivity. Coding studies for contextual sensitivity showed the predicted negative correlation between contextual sensitivity and replicability. However, Inbar (2016) found that this correlation was no longer significant when discipline was included as a predictor. Thus, the results suggested that social psychological studies are more contextually sensitive and less replicable, but that contextual sensitivity did not explain the lower replicability of social psychology.
It is also not clear that contextual sensitivity implies that social psychology does not have a crisis. Replicability is not the only criterion of good science, especially if exact replications are impossible. Findings that can only be replicated when conditions are reproduced exactly lack generalizability, which makes them rather useless for applications and for construction of broader theories. Take verbal overshadowing as an example. Even a small change in experimental procedures reduced a practically significant effect size of 16% to a no longer meaningful effect size of 4% (Alogna et al., 2014), and neither of these experimental conditions was similar to real-world situations of eye-witness identification. Thus, the practical implications of this phenomenon remain unclear because it depends too much on the specific context.
In conclusion, empirical results are only meaningful if researchers have a clear understanding of the conditions that can produce a statistically significant result most of the time (Fisher, 1926). Contextual sensitivity makes it harder to do so. Thus, it is one potential factor that may contribute to the replication crisis in social psychology because social psychologists do not know under which conditions their results can be reproduced. For example, I asked Roy F. Baumeister to specify optimal conditions to replicate ego-depletion. He was unable or unwilling to do so (Baumeister, 2016).
No Crisis: The Replication Studies are Flawed
The argument that replication studies are flawed comes in two flavors. One argument is that replication studies are often carried out by young researchers with less experience and expertise. They did their best, but they are just not very good experimenters (Gilbert, King, Pettigrew, & Wilson, 2016). Cunningham and Baumeister (2016) proclaim “Anyone who has served on university thesis committees can attest to the variability in the competence and commitment of new researchers. Nonetheless, a graduate committee may decide to accept weak and unsuccessful replication studies to fulfill degree requirements if the student appears to have learned from the mistakes” (p. 4). There is little evidence to support this claim. In fact, a meta-analysis found no differences in effect sizes between studies carried out by Baumeister’s lab and other labs (Hagger et al., 2010).
The other argument is that replication failures are sexier and more attention grabbing than successful replications. Thus, replication researchers sabotage their studies or data analyses to produce non-significant results (Bryan, Yeager, & O’Brien, 2019; Strack, 2016). These accusations have been made without empirical evidence to support them. For example, Strack (2016) used a positive correlation between sample size and effect size to claim that some labs were motivated to produce non-significant results, presumably by using a smaller sample size. However, a proper bias analysis showed no evidence that there were too few significant results (Schimmack, 2018a). Moreover, the overall effect size across all labs was also non-significant.
Inadvertent problems, however, may explain some replication failures. For example, some replication studies reduced statistical power by replicating a study with a smaller sample than the original study (OSC, 2015; Ritchie et al., 2011). In this case, a replication failure could be a false negative (type-II error). Thus, it is problematic to conduct replication studies with smaller samples. At the same time, registered replication reports with thousands of participants should be given more weight than original studies with fewer than 100 participants. Size matters.
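The link between sample size and the risk of a false negative can be made concrete with a short Python sketch. It uses a normal approximation for a two-group comparison. The effect size and the sample sizes are hypothetical values chosen for illustration; they are not taken from any of the studies cited above.

```python
from statistics import NormalDist

def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Normal approximation: the test statistic is treated as normal with
    mean d * sqrt(n_per_group / 2) and standard deviation 1.
    """
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5
    return (1 - nd.cdf(crit - ncp)) + nd.cdf(-crit - ncp)

# Hypothetical standardized effect size, for illustration only.
d = 0.4
power_original = power_two_groups(d, n_per_group=50)
power_smaller_replication = power_two_groups(d, n_per_group=25)
power_large_replication = power_two_groups(d, n_per_group=1000)
```

Under these assumptions, halving the sample size lowers power considerably, whereas a replication with a thousand participants per group has power close to 1. A non-significant result is therefore much more informative in the large study than in the small one.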
However, size is not the only factor that matters and researchers disagree about the implications of replication failures. Not surprisingly, authors of the original studies typically recognize some problems with the replication attempts (Baumeister & Vohs, 2016; Strack, 2016; cf. Skibba, 2016). Ideally, researchers would agree ahead of time on a research design that is acceptable to all parties involved. Kahneman called this model an adversarial collaboration (Kahneman, 2003). However, original researchers have either not participated in the planning of a study (Strack, 2016) or withdrawn their approval after the negative results were known (Baumeister & Vohs, 2016). None have acknowledged that their original results were obtained with questionable research practices that make it hard to replicate the results.
To make replication studies more meaningful, it would be important that leading researchers agree ahead of time on a research design. Failure to find agreement would itself undermine the value of published research because experts should be able to specify the optimal conditions for producing an effect.
In conclusion, replication failures can occur for a number of reasons, just like significant results in original studies can occur for a number of reasons. Inconsistent results are frustrating because they often require further research. This being said, there is no evidence that low quality of replication studies is the sole or the main cause of replication failures in social psychology.
No Crisis: Replication Failures are Normal
In an opinion piece for the New York Times, Lisa Feldman Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Feldman Barrett, 2015). On the surface, Feldman Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists conducted studies with 80% power, 1 out of 5 studies would fail to replicate, even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hypotheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false positive result and replication failures demonstrate that it was a false positive. Thus, honest reporting of replication failures plays an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science, especially when social psychology journals publish over 90% significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). This discrepancy suggests that the problem is not the low success rate in replication studies, but the high success rate in psychology journals. If social psychologists tested risky hypotheses that have a high probability of being false, journals should report a lot of non-significant results, especially in articles that report multiple tests of the same hypothesis like Bem’s (2011) incredible ESP article (cf. Schimmack, 2012).
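The arithmetic behind this argument can be sketched in a few lines. The function below is only an illustration: it assumes that every true hypothesis is tested with the same power and that alpha is .05, and the 10% share of true hypotheses in the second scenario is a made-up value for a risky research program, not an estimate from the literature.

```python
def expected_success_rate(share_true, power, alpha=0.05):
    """Expected share of significant results across a set of studies.

    share_true: proportion of tested hypotheses that are true (assumed value)
    power:      probability of a significant result when the hypothesis is true
    alpha:      probability of a significant result when the null-hypothesis is true
    """
    return share_true * power + (1 - share_true) * alpha


# Scenario from the text: all predictions are true and power is 80%.
all_true = expected_success_rate(share_true=1.0, power=0.80)

# Hypothetical risky research program: only 10% of hypotheses are true.
risky = expected_success_rate(share_true=0.10, power=0.80)

print(f"all hypotheses true:    {all_true:.1%} significant results")
print(f"10% of hypotheses true: {risky:.1%} significant results")
```

Under either scenario, honest reporting would leave a visible share of non-significant results in journals, which is hard to reconcile with published success rates above 90%.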
Crisis: Original Studies Are Not Credible Because They Used NHST
Bem’s anomalous results were published with a commentary by Wagenmakers et al. (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). Bem presented 9 results with p-values below .05 as evidence for ESP. Wagenmakers et al. object to the use of a significance criterion of .05 and argue that this criterion makes it too easy to publish false positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes-Factors. When they used Bayes-Factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings they argued that psychologists must change the way they analyze their data. Since then, Wagenmakers has worked tirelessly to promote Bayes-Factors as an alternative to NHST. However, Bayes-Factors have their own problems. The biggest problem is that they depend on the choice of a prior, and the same data can lead to different inferences.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than d = 1. Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0 and 1. This prior makes no sense for research on extrasensory perception processes that are expected to produce small effects.
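The percentages in this paragraph can be checked with a short calculation. The sketch assumes that the default prior is a Cauchy distribution centered at zero with a scale of 1 on the standardized effect size; this assumption is not stated above, but it reproduces the quoted 50% / 25% / 25% split.

```python
import math


def cauchy_cdf(x, scale=1.0):
    """Cumulative distribution function of a Cauchy(0, scale) distribution."""
    return 0.5 + math.atan(x / scale) / math.pi


# Prior mass for effects in the wrong direction (d < 0).
p_negative = cauchy_cdf(0.0)
# Prior mass for very large effects in the predicted direction (d > 1).
p_above_one = 1.0 - cauchy_cdf(1.0)
# Prior mass for effects in the predicted direction between 0 and 1.
p_zero_to_one = cauchy_cdf(1.0) - cauchy_cdf(0.0)

print(f"P(d < 0)     = {p_negative:.2f}")
print(f"P(d > 1)     = {p_above_one:.2f}")
print(f"P(0 < d < 1) = {p_zero_to_one:.2f}")
```

Passing a smaller `scale` to `cauchy_cdf` shows how a narrower prior shifts mass toward the small effects that ESP research would predict.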
When Bem et al. (2011) specified a more reasonable prior, Bayes-Factors actually showed more evidence for ESP than NHST. Moreover, the results of individual studies are less important than the combined evidence across studies. A meta-analysis of Bem’s studies shows that even with the default prior, Bayes-Factors reject the null-hypothesis with an odds-ratio of one-billion to 1. Thus, if we trust Bem’s data, Bayes-Factors also suggest that Bem’s results are robust and it remains unclear why Galak et al. (2012) were unable to replicate Bem’s results.
Another argument in favor of Bayes-Factors is that NHST is one-sided. Significant results are used to reject the null-hypothesis, but non-significant results cannot be used to affirm the null-hypothesis. This makes non-significant results difficult to publish, which leads to publication bias. The claim is that Bayes-Factors solve this problem because they can provide evidence for the null-hypothesis. However, this claim is false (Tendeiro & Kiers, 2019). Bayes-Factors are odds-ratios between two alternative hypotheses. Unlike in NHST, these two competing hypotheses are not exhaustive. That is, there is an infinite number of additional hypotheses that are not tested. Thus, if the data favor the null-hypothesis, they do not provide support for the null-hypothesis. They merely provide evidence against one specified alternative hypothesis. There is always another possible alternative hypothesis that fits the data better than the null-hypothesis. As a result, even Bayes-Factors that strongly favor H0 fail to provide evidence that the true effect size is exactly zero.
The solution to this problem is not new, but unfamiliar to many psychologists. To demonstrate the absence of an effect, it is necessary to specify a region of effect sizes around zero and to demonstrate that the population effect size is likely to be within this region. This can be achieved using NHST (equivalence tests, Lakens, Scheel, & Isager, 2018), or Bayesian statistics (Kruschke & Liddell, 2018). The main reason why psychologists are not familiar with tests that demonstrate the absence of an effect may be that typical sample sizes in psychology have too much sampling error to produce precise estimates of effect sizes that could justify the conclusion that the population effect size is too close to zero to be meaningful.
An even more radical approach was taken by the editors of Basic and Applied Social Psychology (Trafimow & Marks, 2015), who claimed that NHST is logically invalid (Trafimow, 2003). Based on this argument, the editors banned p-values from publications, which solves the problem of replication failures because there are no formal inferential tests. However, authors continue to draw causal inferences that are in line with NHST, but simply omit statements about p-values. It is not clear that this cosmetic change in the presentation of results is a solution to the replication crisis.
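A minimal sketch of an equivalence test based on two one-sided tests (TOST) shows why sample size matters here. It uses a normal approximation rather than the t-distribution, and the equivalence bound of d = ±.20, the observed effect size of d = .02, and the two standard errors are illustrative assumptions, not values from any of the cited studies. As a rough guide, a standard error of .05 corresponds to a two-group study with about 1,600 participants and a standard error of .20 to about 100 participants.

```python
from statistics import NormalDist


def tost_p_value(estimate, standard_error, bound):
    """Equivalence test via two one-sided z-tests (normal approximation).

    Tests H0: |effect| >= bound against H1: |effect| < bound.
    Returns the larger of the two one-sided p-values.
    """
    std_normal = NormalDist()
    # One-sided test against the lower bound (H0: effect <= -bound).
    p_lower = 1.0 - std_normal.cdf((estimate + bound) / standard_error)
    # One-sided test against the upper bound (H0: effect >= +bound).
    p_upper = std_normal.cdf((estimate - bound) / standard_error)
    return max(p_lower, p_upper)


# Same observed effect size, different precision (hypothetical values).
p_large_sample = tost_p_value(estimate=0.02, standard_error=0.05, bound=0.20)
p_small_sample = tost_p_value(estimate=0.02, standard_error=0.20, bound=0.20)

print(f"large sample: p = {p_large_sample:.4f}")
print(f"small sample: p = {p_small_sample:.4f}")
```

With the small standard error the test supports the claim that the effect lies within the equivalence region; with the large standard error the same observed effect size is uninformative.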
In conclusion, Wagenmakers et al. and others have blamed the use of NHST for the replication crisis, but this criticism ignores the fact that cognitive psychology also uses NHST and does not suffer a replication crisis. The problem with Bem’s results was not the use of NHST, but the use of questionable research practices to produce illusory evidence (Francis, 2012; Schimmack, 2012, 2016, 2020).
Crisis: Original Studies Report Many False Positives
An influential article by Ioannidis (2005) claimed that most published research findings are false. This eye-catching claim has been cited thousands of times. Few citing authors have bothered to point out that the claim is entirely based on hypothetical scenarios rather than empirical evidence. In psychology, fear that most published results are false positives was stoked by Simmons, Nelson, and Simonsohn’s (2011) “False Positive Psychology” article that showed with simulation studies that the aggressive use of questionable research practices can dramatically increase the probability that a study produces a significant result without a real effect. These articles shifted concerns about false negatives in the 1990s (e.g., Cohen, 1994) to concerns about false positives.
The problem with the current focus on false positive results is that it implies that replication failures reveal false positive results in original studies. This is not necessarily the case. There are two possible explanations for a replication failure. Either the original study had low power to show a true effect (the nil-hypothesis is false) or the original study reported a false positive result and the nil-hypothesis is true. Replication failures do not distinguish between true and false nil-hypotheses, but they are often falsely interpreted as if replication failures reveal that the original hypothesis was wrong. For example, Nelson, Simmons, and Simonsohn (2018) write “Experimental psychologists spent several decades relying on methods of data collection and analysis that make it too easy to publish false-positive, nonreplicable results. During that time, it was impossible to distinguish between findings that are true and replicable and those that are false and not replicable” (p. 512). This statement ignores that results can be true, but difficult to replicate, if studies have low power.
The false assumption that replication failures reveal false positive results has created a lot of confusion in the interpretation of replication failures (Maxwell, Lau, & Howard, 2015). For example, Gilbert et al. (2016) attribute the low replication rate in the reproducibility project to low power of the replication studies. This does not make sense, when the replication studies had the same or sometimes even larger sample sizes than the original studies. As a result, the replication studies had as much or more power than the original studies. So, how could low power explain the discrepancy between the 97% success rate in original studies and the 25% success rate in replication studies? It cannot.
Gilbert et al.’s (2016) criticism only makes sense if replication failures in the replication studies are falsely interpreted as evidence that the original results were false positives. Now it makes sense to argue that both the original studies and the replication studies had low power to detect true effects and that replication failures are expected when true effects are tested in studies with low power. The only question that remains is why original studies all reported significant results when they had low power, but Gilbert et al. (2016) do not address this question.
Aside from Simmons et al.’s (2011) simulation studies, a few articles tried to examine the rate of false positive results empirically. One approach is to examine sign changes in replication studies. If 100 true null-hypotheses are tested, 50 studies are expected to show a positive sign and 50 studies are expected to show a negative sign due to random sampling error. If these 100 studies are replicated, this will happen again. Just like two coin-flips, we would therefore expect 50 studies with the same outcome (both positive or both negative) and 50 studies with different outcomes (one positive, one negative).
Wilson and Wixted (2018) found that 25% of social psychological results in the OSC project showed a sign reversal. This would suggest that 50% of the studies tested a true null-hypothesis. Of course, sign reversals are also possible when the effect size is not strictly zero. However, the probability of a sign reversal decreases as effect sizes increase. Thus, it is possible to say that about 50% of the replicated studies had an effect size close to zero. Unfortunately, this estimate is imprecise due to the small sample size.
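The step from a 25% sign-reversal rate to a 50% share of true null-hypotheses can be written out explicitly. The sketch rests on a simplifying assumption that is not guaranteed to hold: studies with a real effect never show a sign reversal. The parameter `reversal_rate_nonnull` is a hypothetical knob for relaxing that assumption.

```python
def expected_reversal_rate(share_null, reversal_rate_nonnull=0.0):
    """Expected share of original/replication pairs with opposite signs.

    share_null:            proportion of studies testing a true null-hypothesis
    reversal_rate_nonnull: reversal rate among studies with a real effect
                           (0.0 is a simplifying assumption)
    """
    # Under a true null-hypothesis the two signs agree only by chance (50%).
    return share_null * 0.5 + (1 - share_null) * reversal_rate_nonnull


def implied_share_null(observed_reversal_rate):
    """Share of true null-hypotheses implied by an observed reversal rate,
    assuming studies with a real effect never reverse sign."""
    return min(1.0, observed_reversal_rate / 0.5)


print(f"implied share of true nulls: {implied_share_null(0.25):.0%}")
print(f"reversal rate if all nulls are true: {expected_reversal_rate(1.0):.0%}")
```

If studies with small real effects also produce some sign reversals, the implied share of exact nulls shrinks, which is why the text speaks of effect sizes close to zero rather than exactly zero.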
Gronau et al. (2017) attempted to estimate the false discovery rate using a statistical model that is fitted to the exact p-values of original studies. They applied this model to three datasets, and found false discovery rates (FDR) of 34%-46% for cognitive psychology, 40%-60% for social psychology in general, and 48%-88% for social priming. However, Schimmack and Brunner (2019) discovered a statistical flaw in this model that leads to the overestimation of the FDR. They also pointed out that it is impossible to provide exact estimates of the FDR because the distinction between absolutely no effect and a very, very small effect is arbitrary.
Bartoš and Schimmack (2020) developed a statistical model, called z-curve2.0, that makes it possible to estimate the maximum False Discovery Rate. If this maximum is low, it suggests that most replication failures are due to low power. Applying z-curve2.0 to Gronau et al.’s (2017) datasets yields FDRs of 9% (95%CI = 2% to 24%) for cognitive psychology, 26% (4% to 100%) for social psychology, and 61% (19% to 100%) for social priming. A drop from 40% for Gronau et al.’s model to 9% for z-curve2.0 shows that Gronau et al.’s model dramatically overestimates the rate of false positive results in cognitive journals. However, the z-curve estimate that up to 61% of social priming results could be false positives justifies Kahneman’s letter to Bargh that called out social priming research as the “poster child for doubts about the integrity of psychological research.” The difference between 9% for cognitive psychology and 61% for social priming makes it clear that it is not possible to generalize from the replication crisis in social psychology to other areas of psychology.
In conclusion, it is impossible to specify exactly whether an original finding was a false positive result or not. There have been several attempts to estimate the number of false positive results in the literature, but there is no consensus about the proper method to do so. I believe that the distinction between false and true positives is not particularly helpful, if the null-hypothesis is specified as a value of zero. An effect size of d = .0001 is not any more meaningful than an effect size of d = .0000. To be meaningful, published results should be replicable given the same sample sizes as used in original research. Demonstrating a significant result in the same direction in a much larger sample with a much smaller effect size should not be considered to be a successful replication of a result with a large effect size in a small sample; it is actually an original discovery that also provides a better estimate of the population effect size.
Crisis: Original Studies Are Selected for Significance
The most obvious explanation for the replication crisis is the well-known bias to publish only significant results that confirm theoretical predictions. As a result, it is not necessary to read the results section of a psychological article. It will inevitably report confirmatory evidence, p < .05. This practice is commonly known as publication bias. Concerns about publication bias are nearly as old as empirical psychology (Sterling, 1959; Rosenthal, 1979). Without publication bias, the success rate should match the mean power of studies (Brunner & Schimmack, 2019). However, the large discrepancy between the success rate in journals and the outcome of the OSC replication project, which produced only 25% significant results, suggests that mean power in social psychology is closer to 25% than to 95%.
The 1990s saw some concerns about false positives in social psychology. Kerr (1998) published his famous “HARKing” article and social psychology journals responded by demanding that researchers publish multiple replication studies within a single article (cf. Wegner, 1992). These multiple-study articles created a sense of rigor and made false positive results extremely unlikely. With five studies, the risk of a false positive result is smaller than the criterion used by particle physicists to claim a discovery (cf. Schimmack, 2012). Thus, Bem’s (2011) article that contained 9 successful studies exceeded the stringent criterion that was used to claim the discovery of the Higgs-Boson particle, one of the most celebrated findings in physics in the 2010s.
The key difference between the discovery of the Higgs-Boson particle in 2012 and Bem’s discovery of mental time-travel is that physicists conducted a single powerful experiment to test their predictions, while Bem conducted many studies and selectively published results that supported his claim (Schimmack, 2018b). Bem (2012) even admitted that he ran many small studies that were not included in the article. At the same time, he was willing to combine several small studies with promising trends into a single dataset. For example, Study 6 was really four studies with Ns = 50, 41, 19, and 40 (cf. Schimmack, Schultz, Carlsson, & Schmukle, 2018). These questionable, to say the least, practices were so common in social psychology that leading social psychologists were unwilling to retract Bem’s article because his practices were considered acceptable at that time (Kitayama, 2018).
Empirical evidence for the use and acceptance of questionable research practices comes from an anonymous survey (John et al., 2012). The most widely used QRPs were (a) not reporting all dependent variables (65%), (b) collecting more data after snooping (57%), and (c) selectively reporting studies that worked (48%). Moreover, researchers found these QRPs acceptable with defensibility ratings (0-2) of 1.84, 1.79, and 1.66, respectively. It is unclear whether opinions about the use of questionable research practices have changed in response to the replication crisis.
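The comparison with particle physics can be made concrete with a rough calculation. The sketch assumes that the studies are independent, that all of them are reported, and that a false positive only counts when it falls in the predicted direction (alpha = .025, one-tailed). These assumptions are mine, and the comparison is sensitive to them: with a two-tailed alpha of .05, five studies give a joint risk of about the same order of magnitude as the five-sigma criterion rather than a clearly smaller one.

```python
from statistics import NormalDist


def joint_false_positive_risk(alpha, n_studies):
    """Probability that n independent studies of a true null-hypothesis
    are all significant at the given alpha level."""
    return alpha ** n_studies


# One-tailed probability beyond five standard deviations ("five sigma").
five_sigma = NormalDist().cdf(-5.0)

five_studies = joint_false_positive_risk(alpha=0.025, n_studies=5)
nine_studies = joint_false_positive_risk(alpha=0.025, n_studies=9)

print(f"five-sigma criterion:       {five_sigma:.2e}")
print(f"5 studies, alpha = .025:    {five_studies:.2e}")
print(f"9 studies, alpha = .025:    {nine_studies:.2e}")
```

The calculation only holds if every study that was run is reported, which is exactly the assumption that selective publication violates.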
Social psychologists have had two responses to John et al.’s (2012) article. One response is to question the importance of the findings. Stroebe and Strack (2014) argued that these practices may not be questionable, but they do not counter Sterling’s argument that these practices invalidate the meaning of significance testing and p-values. Fiedler and Schwarz (2016) argue that John et al.’s (2012) survey produced inflated estimates of the use of QRPs. However, they fail to provide an alternative explanation for the low replication rate of social psychological research.
Statistical methods that can reveal publication bias provide additional evidence about the use of QRPs. Although these tests often have low power in small sets of studies (Renkewitz & Keiner, 2019), they can provide clear evidence of publication bias when bias is large (Francis, 2012; Schimmack, 2012) or when the set of studies is large (Carter & McCullough, 2013, 2014; Carter et al., 2015). One group of bias tests compares the success rate to estimates of mean power. The advantage of these tests is that they provide clear evidence of QRPs. Francis used this approach to demonstrate that 82% of articles with four or more studies that were published between 2009 and 2012 in Psychological Science showed evidence of bias. This is remarkable given the low power of the test to reveal QRPs in small sets of studies unless bias is severe (Renkewitz & Keiner, 2019; Schimmack, 2020).
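The logic of a bias test that compares the success rate to mean power can be sketched with a binomial calculation. The example treats all studies as independent and as having the same power, and the mean power of 60% is a hypothetical value chosen for illustration; actual tests estimate power from the reported results.

```python
from math import comb


def prob_at_least(successes, n_studies, power):
    """Probability of observing at least `successes` significant results
    in `n_studies` independent studies that each have the given power."""
    return sum(
        comb(n_studies, k) * power**k * (1 - power) ** (n_studies - k)
        for k in range(successes, n_studies + 1)
    )


# Hypothetical article: 9 out of 9 studies significant, assumed mean power 60%.
p_all_significant = prob_at_least(successes=9, n_studies=9, power=0.60)

print(f"P(9 of 9 significant | power = .60) = {p_all_significant:.4f}")
```

A small probability means that the reported string of successes is too good to be true given the estimated power, which points to missing non-significant results or other QRPs.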
Social psychologists have mainly ignored evidence that QRPs were used to produce significant results. John et al.’s article has been cited over 500 times, but it has not been cited by social psychologists who commented on the replication crisis like Fiske, Baumeister, Gilbert, Wilson, or Nisbett. This is symptomatic of the response by eminent social psychologists to the replication crisis. Rather than engaging in a scientific debate about the causes of the crisis, social psychologists have remained silent or dismissed critics as unscientific. “Some critics go beyond scientific argument and counterargument to imply that the entire field is inept and misguided (e.g., Gelman, 2014; Schimmack, 2014)” (Fiske, 2017). Yet, Fiske does not engage with the scientific evidence that is used to criticize social psychology such as Francis’s demonstration that most articles reported too many significant results that were obtained with the help of QRPs.
Others have argued that Francis’s work is unnecessary because the presence of publication bias is a well-known fact. Therefore “one is guaranteed to eventually reject a null we already know is false” (Simonsohn, 2013). This argument ignores that bias tests can help to show that social psychology is improving. For example, bias tests show no bias in registered replication reports, indicating that this new format produces more credible results (Schimmack, 2018a).
Murayama, Pekrun, and Fiedler (2013) noted that demonstrating the presence of bias does not justify the conclusion that there is no effect. This is true, but not very relevant. Bias undermines the credibility of the evidence that is supposed to demonstrate an effect. Without credible evidence it remains uncertain whether an effect is present or not. Moreover, Murayama et al. (2013) acknowledge that bias always inflates effect size estimates, which makes it more difficult to assess the practical relevance of published results.
A more valid criticism of Francis’s bias analyses is that they do not reveal the amount of bias (Simonsohn, 2013). That is, when we see 95% significant results in a journal and there is bias, it is not clear whether mean power was 75% or 25%. To be more useful, bias tests should also provide information about the effect size of bias.
In conclusion, selective reporting of significant results inflates effect sizes and the observed discovery rate in journals gives a false impression of the power and replicability of published results. Surveys and bias tests show that the use of QRPs in social psychology was widespread. However, bias tests merely show that QRPs were used. They do not show how much QRPs influenced reported results.
Z-Curve: Quantifying the Crisis
Some psychologists developed statistical models that can quantify the influence of selection for significance on replicability. Brunner and Schimmack (2019) compared four methods to estimate the expected replication rate (ERR), including the popular p-curve method (Brunner, 2019; Simonsohn, Nelson, & Simmons, 2014; Ulrich & Miller, 2018). They found that p-curve overestimated the ERR when studies varied in effect sizes. In contrast, a new method called z-curve performed well across many scenarios, especially when heterogeneity was present.
Bartoš and Schimmack (2020) validated an extended version of z-curve (z-curve2.0) that provides confidence intervals and provides estimates of the expected discovery rate, that is, the percentage of observed significant results for all tests that were conducted, even if they were not reported. To do so, z-curve estimates the size of the file-drawer of unpublished studies with non-significant results. Z-curve has already been applied to various datasets of results in social psychology (see R-Index blog for numerous examples).
The most important dataset was created by Motyl et al. (2017) who coded a representative sample of studies in social psychology journals. The main drawback of Motyl’s audit of social psychology was that they did not have a proper statistical tool to estimate replicability. I used this dataset to estimate the replicability of social psychology based on a representative sample of studies. To be included in the z-curve analysis, a study had to (a) use a t-test or F-test, (b) have a valid test-statistic, and (c) not be from the journal Psychological Science. The last criterion was used to focus on social psychology. I also excluded studies with more than 4 experimenter degrees of freedom (e.g., 177 df). This left 678 studies for analysis. The set included 450 between-subject studies, 139 mixed designs, and 67 within-subject designs. The preponderance of between-subject designs is typical of social psychology and one of the reasons for the low power of studies in social psychology.
The results in Figure 1 show an expected replication rate of 45%, 95%CI 40% to 52%. This result is a bit better than the 25% estimate obtained in the OSC project. There are a number of possible explanations for the discrepancy between the OSC estimate and the z-curve estimate. First of all, the number of studies in the OSC project is very small and sampling error alone could explain some of the differences. Second, the set of studies in the OSC project was not representative and may have selected studies with lower replicability. Third, some actual replication studies may have modified procedures in ways that lowered the chance of obtaining a significant result. Finally, it is never possible to exactly replicate a study (Stroebe & Strack, 2014; van Bavel et al., 2017). Thus, z-curve estimates are overly optimistic because they assume exact replications. If there is contextual sensitivity, selection for significance will produce additional regression to the mean and a better estimate of the actual replication rate is the expected discovery rate (Bartoš & Schimmack, 2020). The estimated EDR of 21% is close to the 25% estimate based on actual replication studies. In combination, the existing evidence suggests that the replicability of social psychological research is somewhere between 20% and 50%, which is clearly unsatisfactory and much lower than the observed discovery rate of 90% or more in social psychology journals.
Figure 1 also clearly shows that questionable research practices explain the gap between success rates in laboratories and success rates in journals. The z-curve estimate of non-significant results shows that a large proportion of non-significant results are expected, but hardly any of these expected studies ever get published. This is reflected in an observed discovery rate of 90% and an expected discovery rate of 21%. The confidence intervals do not overlap, indicating that this discrepancy is highly significant. Given such extreme selection for significance, it is not surprising that published effect sizes are inflated and replication studies fail to reproduce significant results. In conclusion, out of all explanations for replication failures in psychology, the use of questionable research practices is the main factor.
Z-curve can also be used to examine the power of subgroups of studies. In the OSC project, studies with a z-score greater than 4 had an 80% chance to be replicated. To achieve an ERR of 80% with Motyl’s data, z-scores have to be greater than 3.5. In contrast, studies with just significant results (p-values between .05 and .01) have only an ERR of 28%. This information can be used to reevaluate published results. Studies with p-values between .05 and .01 should not be trusted unless other information suggests otherwise (e.g., a trustworthy meta-analysis). In contrast, results with z-scores greater than 4 can be used to plan new studies. Unfortunately, there are many more questionable results with p-values greater than .01 (42%) than trustworthy results with z > 4 (17%), but at least there are some findings that are likely to replicate even in social psychology.
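The size of the implied file drawer follows directly from the expected discovery rate. The sketch below is simple arithmetic on the 21% estimate reported above, not a re-analysis of the data, and it assumes that the discrepancy is due entirely to unreported non-significant results rather than other QRPs.

```python
def file_drawer_ratio(edr):
    """Number of non-significant results expected for every significant
    result, given an expected discovery rate (EDR) between 0 and 1."""
    if not 0 < edr <= 1:
        raise ValueError("edr must be in the interval (0, 1]")
    return (1 - edr) / edr


edr = 0.21  # expected discovery rate reported in the text
ratio = file_drawer_ratio(edr)
tests_per_100_hits = 100 / edr

print(f"non-significant results per significant result: {ratio:.1f}")
print(f"tests needed for 100 significant results:       {tests_per_100_hits:.0f}")
```

With an observed discovery rate of 90%, only about one non-significant result is published for every nine significant ones, far fewer than this ratio implies.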
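For readers who want to apply these thresholds to published results, two-tailed p-values can be converted to absolute z-scores and back. This is the standard conversion based on the normal distribution; the function names are illustrative and do not refer to any particular package.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_to_z(p_two_tailed):
    """Absolute z-score that corresponds to a two-tailed p-value."""
    return _STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def z_to_p(z):
    """Two-tailed p-value that corresponds to a z-score."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))


for p in (0.05, 0.01):
    print(f"p = {p:.2f} corresponds to z = {p_to_z(p):.2f}")

for z in (3.5, 4.0):
    print(f"z = {z:.1f} corresponds to p = {z_to_p(z):.6f}")
```

Results with p-values between .05 and .01 therefore fall in the narrow band of z-scores between roughly 1.96 and 2.58, well below the z-scores of 3.5 to 4 that are associated with high replicability.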
An Inconvenient Truth
Every crisis is an opportunity to learn to avoid future mistakes. Lending practices were changed after the financial crisis in the 2000s. Psychologists and other sciences can learn from the replication crisis in social psychology, but only if they are honest and upfront about the real cause of the replication crisis. Social psychologists did not use the scientific method properly. Neither Fisher nor Neyman and Pearson, who created NHST, proposed that non-significant results are irrelevant or that only significant results should be published. The problem of selection for significance is evident and has been well-known (Rosenthal, 1979; Sterling, 1959). Cohen (1962) warned about low power, but the main concern was large file-drawers filled with type-II errors. Nobody could imagine that whole literatures with hundreds of studies are built on nothing but sampling error and selection for significance. Bem’s article and replication failures in the 2010s showed that the abuse of questionable research practices was much more excessive than anybody was willing to believe.
The key culprits were conceptual replication studies. Even social psychologists were aware that it is unethical not to report replication failures. For example, Bem advised researchers to use questionable research practices to find significant results in their data. “Go on a fishing expedition for something – anything – interesting” even if this meant to “err on the side of discovery” (Bem, 2010). However, even Bem made it clear that “this is not advice to suppress negative results. If your study was genuinely designed to test hypotheses that derive from a formal theory or are of wide general interest for some other reason, then they should remain the focus of your article. The integrity of the scientific enterprise requires the reporting of disconfirming results.”
How did social psychologists justify to themselves that it is OK to omit non-significant results (John et al., 2012), when it was made explicit that this undermines the integrity of the scientific enterprise (Bem, 2010)? One explanation is the distinction between direct and conceptual replications. Conceptual replications always vary at least a small detail of a study. Thus, a non-significant result is never a replication failure of a previous study. It is just a failure of a specific study to show the effect. Graduate students were explicitly given the advice to “never do a direct replication; that way, if a conceptual replication doesn’t work, you maintain plausible deniability” (Anonymous cited in Spellman, 2015). This is also how Morewedge, Gilbert, and Wilson explain why they only report significant results.
“Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know.”
It was only in 2012 that psychologists realized that changing results in their studies were heavily influenced by sampling error and not by some minor changes in the experimental procedure. Only a few psychologists have been open about this. In a commendable editorial, Lindsay (2019) talks about his realization that his research practices were suboptimal.
“Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent.”
Rather than invoking some supernatural decline effect like Schooler, Lindsay realized that his research practices were suboptimal. A first step for social psychologists is to acknowledge their past mistakes and to learn from their mistakes. Unfortunately, there has been no collective admission of wrongdoing. Instead we have seen public displays of denial and anger and maybe some private experiences of shame and depression. Maybe it is time for acceptance. Making mistakes is a fact of life. What counts is the response to a mistake. So far, the response by social psychologists has been underwhelming. It is time for some leaders to step up or to step down and make room for a new generation of social psychologists that follow open and transparent practices.
The Way out of the Crisis
A clear analysis of the replication crisis points towards a clear path out of the crisis. Given that “lax data collection, analysis, and reporting” standards (Carpenter, 2012) allowed for the use of QRPs that undermine the credibility of social psychology, the most obvious solution is to ban the use of questionable research practices, and to treat them like other types of unethical behaviors (Engel, 2015). However, no scientific organization has clearly stated which practices are acceptable and which practices are not, and prominent social psychologists oppose clear rules of scientific misconduct (Fiske, 2016).
At present, the enforcement of good practices is left to editors of journals who can ask pertinent questions during the submission process (Lindsay, 2019). Another solution has been to ask researchers to preregister their studies, which limits researchers’ freedom to go on a fishing expedition (Nosek et al., 2018). Some journals reward preregistering with badges (JESP), but some social psychology journals do not (PSPB, SPPS). There has been a lot of debate about the value of preregistration and concerns that it may reduce creativity. However, pre-registration does not imply that all research has to be confirmatory. It merely makes it possible to distinguish clearly between exploratory and confirmatory analyses.
It is unlikely that pre-registration alone will solve all problems, especially because there are no clear standards about pre-registrations and how much they constrain the actual analyses. For example, Noah, Schul, and Mayo (2018) preregistered the prediction of an interaction between being observed and a facial feedback manipulation. Although the predicted interaction was not significant, they interpreted the non-significant pattern as confirming their prediction rather than stating that there was no support for their preregistered prediction. A z-curve analysis of pre-registered studies in JESP still found evidence of QRPs, although less so than for articles that were not pre-registered (Schimmack, 2020). To improve the value of pre-registration, societies should provide clear norms for research ethics that can be used to hold researchers accountable when they try to game pre-registration (Yamada, 2018).
Preregistration alone will only produce more non-significant results; it will not increase the replicability of significant results as long as studies are underpowered. To increase replicability, social psychologists finally have to conduct power analyses to plan studies that can produce significant results without QRPs. This also means they need to publish less because more resources are needed for a single study (Schimmack, 2012).
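To make the resource argument concrete, here is a minimal sketch of an a priori power analysis for a simple two-group comparison. It uses the standard normal approximation rather than the exact t-distribution, and the function name and the effect sizes are illustrative assumptions, not values taken from any study discussed here.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate sample size per group for a two-sided, two-sample test.

    Normal approximation: n = 2 * ((z_alpha/2 + z_power) / d) ** 2,
    where d is the assumed standardized mean difference (Cohen's d).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for alpha = .05
    z_power = z.inv_cdf(power)          # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical effect sizes: small, moderate, large
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, a small effect requires several hundred participants per group, which illustrates why adequately powered studies force researchers to run fewer of them.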
To ensure that published results are credible and replicable, I argue that researchers should be rewarded for conducting high-powered studies. As a priori power analyses are based on estimates of effect sizes, they cannot provide information about the actual power of studies. However, z-curve can provide information about the typical power of studies that are conducted within a lab. This provides quantitative information about the research practices of a lab.
I illustrate the usefulness of taking replicability into account with Roy F. Baumeister’s z-curve. In terms of traditional metrics, like the H-Index, Roy F. Baumeister is one of the leading social psychologists with an H-Index of 105. However, Figure 2 shows that the studies that were used to attain eminence had very low power to produce significant results. The expected replication rate is only 22%, which is below the typical average of social psychology (Figure 1). One way to incorporate this information in an index is to weight the H-Index by the ERR, which produces a replication-weighted H-Index (RH-Index) of 105*.22 = 23. In contrast, Susan T. Fiske has an H-Index of 69, with an ERR of .59, which gives her an RH-Index of 41. The RH-Index suggests that Fiske has made a more positive contribution to social psychology than Baumeister because her work is more replicable.
By taking replicability into account, the incentive to publish as many discoveries as possible without caring about their truth-value (i.e., “to err on the side of discovery”) is no longer the best strategy to achieve fame and recognition in a field. The RH-Index could also motivate researchers to retract articles that they no longer believe in, which would lower the H-Index, but increase the R-Index. For highly problematic papers this could produce a net gain in the RH-Index.
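The arithmetic behind the RH-Index is simple enough to sketch in a few lines. The function name is my own label for illustration; the inputs are the H-Index and ERR values reported above, and rounding to a whole number is an assumption that matches the reported results.

```python
def rh_index(h_index, err):
    """Replication-weighted H-Index: the H-Index multiplied by the
    expected replication rate (ERR, a proportion between 0 and 1)."""
    if not 0.0 <= err <= 1.0:
        raise ValueError("err must be a proportion between 0 and 1")
    return round(h_index * err)


print(rh_index(105, 0.22))  # Baumeister: 105 * .22
print(rh_index(69, 0.59))   # Fiske: 69 * .59
```

Because the index is a product, a lower H-Index can be more than compensated by a higher ERR, which is the point of the comparison above.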
Social psychology is changing in response to a replication crisis. To (re)gain trust in social psychology as a science, social psychologists need to change the way studies are planned, conducted, and reported. The problem of low power has been known since Cohen (1962), but only in recent years has the power of social psychological studies increased (Schimmack, 2020). Aside from larger samples, social psychologists are also starting to use within-subject designs that increase power (Lin et al., 2020). Finally, social psychologists need to change the way they report their results. Most importantly, they need to stop reporting only results that confirm their predictions. This practice undermines the very purpose of empirical research and distorts the evidence in published journals. Fiske (2016) recommended that scientists keep track of their questionable practices, and Wicherts et al. (2016) provided a checklist to do so. Journals need to evaluate studies based on their ability to make a valuable contribution independent of the outcome of a study (Chambers, 2013). Most important, once a discovery has been made, failures to replicate this finding provide valuable new information and need to be published (Galak et al., 2012).
My personal contribution to improving science has been the development of tools that make it possible to examine whether reported results are credible or not (Schimmack, 2012; Schimmack & Brunner, 2019; Bartoš & Schimmack, 2020). My hope is that measurement of actual power and publication bias motivates researchers to increase power and decrease the use of QRPs. I agree with Fiske (2017) that science works better when we can trust scientists, but a science with a replication rate of 25% is not trustworthy. Ironically, z-curve analyses that show improvement may help to restore trust in social psychology.
Alogna, V. K., Attaya, M. K., Aucoin, P., Bahnik, S., Birch, S., Birt, A. R., . . . Zwaan, R. A. (2014). Registered replication report: Schooler & Engstler-Schooler (1990). Perspectives on Psychological Science, 9, 556–578.
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. https://doi.org/10.1037/a0024777
Brunner, J. & Schimmack, U. (2019). Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance. Meta-Psychology, In Press.
Bryan, C. J., Yeager, D. S., & O’Brien, J. M. (2019). Replicator degrees of freedom allow publication of misleading failures to replicate. Proceedings of the National Academy of Sciences, 116, 25535–25545. doi:10.1073/pnas.1910951116
Cacioppo, J. T., Petty, R. E., & Morris, K. (1983). Effects of need for cognition on message evaluation, argument recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818. http://dx.doi.org/10.1037/0022-3514.45.4.805
Carpenter, S. (2012). Psychology’s bold initiative: In an unusual attempt at scientific self-examination, psychology researchers are scrutinizing their field’s reproducibility. Science, 335, 1558–1560.
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144(4), 796–815. https://doi.org/10.1037/xge0000083
Carter, E. C., & McCullough, M. E. (2013). Is ego depletion too incredible? Evidence for the overestimation of the depletion effect. Behavioral and Brain Sciences, 36, 683–684. doi:10.1017/S0140525X13000952
Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, Article 823.
Chambers, C. D. (2013). Registered Reports: A new publishing initiative at Cortex [Editorial]. Cortex: A Journal Devoted to the Study of the Nervous System and Behavior, 49(3), 609–610. https://doi.org/10.1016/j.cortex.2012.12.016
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Crandall, C. S., & Sherman, J. W. (2016). On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology, 66, 93–99. https://doi.org/10.1016/j.jesp.2015.10.002
Cunningham, M. R., & Baumeister, R. F. (2016). How to make nothing out of something: Analyses of the impact of study sampling and statistical interpretation in misleading meta-analytic conclusions. Frontiers in Psychology, 7, Article 1639.
Ebersole, C. R., Atherton, O. E., Belanger, A. L., Skulborstad, H. M., Allen, J. M., Banks, J. B., …Nosek, B. A. (2016). Many Labs 3: Evaluating participant pool quality across the academic semester via replication. Journal of Experimental Social Psychology, 67, 68–82.
Elkins-Brown, N., Saunders, B., & Inzlicht, M. (2018). The misattribution of emotions and the error-related negativity: A registered report. Cortex: A Journal Devoted to the Study of the Nervous System and Behavior, 109, 124–140. https://doi.org/10.1016/j.cortex.2018.08.017
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103(6), 933–948. https://doi.org/10.1037/a0029709
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351, 1037–1103.
Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from H₀. Journal of Experimental Psychology: General, 146(9), 1223–1233. https://doi.org/10.1037/xge0000324
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136(4), 495–525. https://doi.org/10.1037/a0019486
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A Multilab Preregistered Replication of the Ego-Depletion Effect. Perspectives on Psychological Science, 11(4), 546–573. https://doi.org/10.1177/1745691616652873
Lin, H., Saunders, B., Friese, M., Evans, N. J., & Inzlicht, M. (2020). Strong effort manipulations reduce response caution: A preregistered reinvention of the ego depletion paradigm. Psychological Science, in press.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206. https://doi.org/10.3758/s13423-016-1221-4
Kvarven, E., Strømland, M., & Johannesson (2019). Comparing Meta-Analyses and Pre-Registered Multiple Labs Replication Projects. Preprint. (retrieved 1/6/2020)
Luttrell, A., Petty, R. E., & Xu, M. (2017). Replicating and fixing failed replications: The case of need for cognition and argument quality. Journal of Experimental Social Psychology, 69, 178–183. https://doi.org/10.1016/j.jesp.2016.09.006
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487–498. https://doi.org/10.1037/a0039400
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34–58. https://doi.org/10.1037/pspa0000084
Murayama, K., Pekrun, R., & Fiedler, K. (2014). Research practices that can prevent an inflation of false-positive rates. Personality and Social Psychology Review, 18(2), 107–118. https://doi.org/10.1177/1088868313496330
Noah, T., Schul, Y., & Mayo, R. (2018). When both the original study and its failed replication are correct: Feeling observed eliminates the facial-feedback effect. Journal of Personality and Social Psychology, 114(5), 657–664. https://doi.org/10.1037/pspa0000121
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. PNAS Proceedings of the National Academy of Sciences of the United States of America, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716–aac4716. doi:10.1126/science.aac4716
Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/journal.pone.0033423
Scheel, A. M., Schijen, M., & Lakens, D. (2020). An excess of positive results: Comparing the standard psychology literature with registered reports. Preprint. https://psyarxiv.com/p6e9c
Schimmack, U. (2018b). Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem. https://replicationindex.com/2018/01/05/bem-retraction/ (blog post retrieved 1/6/2020)
Schooler, J. W. (2014). Turning the lens of science on itself: Verbal overshadowing, replication, and metascience. Perspectives on Psychological Science, 9(5), 579–584. https://doi.org/10.1177/1745691614547878
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.
Tendeiro, J. N., & Kiers, H. A. L. (2019). A review of issues about null hypothesis Bayesian testing. Psychological Methods, 24(6), 774–795. https://doi.org/10.1037/met0000221
Trafimow, D. (2003). Hypothesis testing and theory evaluation at the boundaries: Surprising insights from Bayes’s theorem. Psychological Review, 110, 526–535. doi:10.1037/0033-295X.110.3.526
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2. doi:10.1080/01973533.2015.1012991
Ulrich, R., & Miller, J. (2018). Some properties of p-curves, with an application to gradual publication bias. Psychological Methods, 23(3), 546–560. https://doi.org/10.1037/met0000125
Van Bavel, J. J., Mende-Siedlecki, P., Brady, W. J., & Reinero, D. A. (2016). Contextual sensitivity in scientific reproducibility. Proceedings of the National Academy of Sciences of the United States of America, 113(23), 6454–6459. https://doi.org/10.1073/pnas.1521897113
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. https://doi.org/10.1177/1745691616674458
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. (2011). Why psychologists must change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. A. L. M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, Article 1832.
Wilson, B. M., & Wixted, J. T. (2018). The prior odds of testing a true effect in cognitive and social psychology. Advances in Methods and Practices in Psychological Science, 1(2), 186–197. https://doi.org/10.1177/2515245918767122
We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.
In the beginning of the 2010s it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes-to-get-significance approach to publish 9 significant demonstrations of a phenomenon that does not exist: mental time-travel. The cat was out of the bag. There were only two questions: how many other findings were unreal, and how would psychologists respond to the credibility crisis?
D. Steve Lindsay responded to the crisis by helping to implement tighter standards and to enforce these standards as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Lindsay published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.
Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.
Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates of the sizes of real but small effects.
This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secret that questionable research practices were widely used, and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay’s frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like a Nobel laureate who recently retracted a Science article (Frances Arnold).
1. Acknowledge your mistakes.
2. Learn from your mistakes.
3. Teach others from your mistakes.
4. Move beyond your mistakes.
Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.
So, I z-curved Lindsay’s published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model assumes more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is confirmed by a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap. So this test of bias is not significant.
The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.
Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is a full-time job. Thus, output has decreased. Maybe research also slowed down because studies were conducted with more care. I don’t know. I just know that there are very few statistics to examine.
Although the small number of tests makes results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88% and there is no longer a discrepancy between observed and expected discovery rate.
If psychology as a whole had responded like D.S. Lindsay it would be in a good position to start the new decade. The problem is that this response is an exception rather than the rule, and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially when undergraduates and the wider public learn more and more how untrustworthy psychological science has been and often still is. Hopefully, reforms will come sooner rather than later, or we may have to sing a swan song for psychological science.
Citation: Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180–1187. doi:10.3758/s13423-014-0601-x
The Open Science Collaboration article in Science has over 1,000 citations (OSC, 2015). It showed that attempting to replicate results published in 2008 in three journals, including Psychological Science, produced more failures than successes (37% success rate). It also showed that failures outnumbered successes 3:1 in social psychology. It did not show or explain why most social psychological studies failed to replicate.
Since 2015 numerous explanations have been offered for the discovery that most published results in social psychology cannot be replicated: decline effect (Schooler), regression to the mean (Fiedler), incompetent replicators (Gilbert), sabotaging replication studies (Strack), contextual sensitivity (Van Bavel). Although these explanations are different, they share two common elements: (a) they are not supported by evidence, and (b) they are false.
A number of articles have proposed that the low replicability of results in social psychology is caused by questionable research practices (John et al., 2012). Accordingly, social psychologists often investigate small effects in between-subject experiments with small samples that have large sampling error. A low signal to noise ratio (effect size/sampling error) implies that these studies have a low probability of producing a significant result (i.e., low power and high type-II error probability). To boost power, researchers use a number of questionable research practices that inflate effect sizes. Thus, the published results provide the false impression that effect sizes are large and results are replicated, but actual replication attempts show that the effect sizes were inflated. The replicability project suggested that effect sizes are inflated by 100% (OSC, 2015).
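A small simulation can illustrate how selection for significance alone inflates effect sizes when power is low. This is a sketch under stated assumptions: a hypothetical true effect of d = .2, 20 participants per group, a normal approximation to the sampling distribution of d, and selection of only those results that are significant in the predicted direction. None of these values come from the studies discussed here.

```python
import random
from math import sqrt
from statistics import mean


def simulate_selection(true_d=0.2, n_per_group=20, n_studies=20000, seed=1):
    """Simulate many small two-group studies and keep only significant ones.

    Returns the share of studies that reach significance in the predicted
    direction and the mean effect size estimate among those studies.
    """
    rng = random.Random(seed)
    se = sqrt(2 / n_per_group)  # approximate standard error of d
    crit = 1.96                 # two-sided alpha = .05
    estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    significant = [d for d in estimates if d / se > crit]
    return len(significant) / n_studies, mean(significant)


share, inflated_d = simulate_selection()
print(f"share significant: {share:.2f}")
print(f"true d = 0.20, mean published d = {inflated_d:.2f}")
```

In this setup only a small share of studies reaches significance, and those that do must have overestimated the effect, because an estimate near the true value of .2 cannot be significant with such a large standard error.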
In an important article, Francis (2014) provided clear evidence for the widespread use of questionable research practices for articles published from 2009-2012 (pre crisis) in the journal Psychological Science. However, because this evidence does not fit the narrative that social psychology was a normal and honest science, this article is often omitted from review articles, like Nelson et al.’s (2018) ‘Psychology’s Renaissance’ that claims social psychologists never omitted non-significant results from publications (cf. Schimmack, 2019). Omitting disconfirming evidence from literature reviews is just another sign of questionable research practices that prioritize self-interest over truth. Given the influence that Annual Review articles hold, many readers may be unfamiliar with Francis’s important article that shows why replication attempts of articles published in Psychological Science often fail.
Francis (2014) “The frequency of excess success for articles in Psychological Science”
Francis (2014) used a statistical test to examine whether researchers used questionable research practices (QRPs). The test relies on the observation that the success rate (percentage of significant results) should match the mean power of studies in the long run (Brunner & Schimmack, 2019; Ioannidis & Trikalinos, 2007; Schimmack, 2012; Sterling et al., 1995). These tests rely on the observed or post-hoc power as an estimate of true power. Thus, mean observed power is an estimate of the expected success rate that can be compared to the actual success rate in an article.
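The comparison of success rate and mean observed power can be sketched as follows. Observed power is computed here from a z-score for a two-sided test with alpha = .05, treating the observed z-score as if it were the true noncentrality. The four z-scores are hypothetical, and this sketch illustrates the general logic rather than Francis’s exact procedure.

```python
from statistics import NormalDist, mean


def observed_power(z, alpha=0.05):
    """Post-hoc power of a two-sided z-test, using the observed z-score
    as the estimate of the true signal-to-noise ratio."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)
    z = abs(z)
    return (1 - nd.cdf(crit - z)) + nd.cdf(-crit - z)


# Hypothetical article with four studies, all just significant
z_scores = [2.1, 2.3, 2.0, 2.5]
success_rate = mean(1.0 if abs(z) > 1.96 else 0.0 for z in z_scores)
mean_power = mean(observed_power(z) for z in z_scores)

print(f"success rate:        {success_rate:.2f}")
print(f"mean observed power: {mean_power:.2f}")
```

A success rate of 100% combined with a mean observed power of about 60% is the kind of discrepancy that these bias tests are designed to detect.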
It has been known for a long time that the actual success rate in psychology articles is surprisingly high (Sterling et al., 1995). The success rate for multiple-study articles is often 100%. That is, psychologists rarely report studies where they made a prediction and the study returned a non-significant result. Some social psychologists even explicitly stated that it is common practice not to report these ‘uninformative’ studies (cf. Schimmack, 2019).
A success rate of 100% implies that studies required 99.9999% power (power is never 100%) to produce this result. It is unlikely that many studies published in Psychological Science have the high signal-to-noise ratios to justify these success rates. Indeed, when Francis applied his bias detection method to 44 articles that had sufficient results to use it, he found that 82% (36 out of 44) of these articles showed positive signs that questionable research practices were used with a 10% error rate. That is, his method would be expected to flag only about 4 or 5 of the 44 articles by chance alone, but he found 36 significant results, indicating the use of questionable research practices. Moreover, this does not mean that the remaining 8 articles did not use questionable research practices. With only four studies, the test has modest power to detect questionable research practices when the bias is relatively small. Thus, the main conclusion is that most if not all multiple-study articles published in Psychological Science used questionable research practices to inflate effect sizes. As these inflated effect sizes cannot be reproduced, the effect sizes in replication studies will be lower and the signal-to-noise ratio will be smaller, producing non-significant results. It has been known since 1959 that this could happen (Sterling, 1959). However, the replicability project showed that it does happen (OSC, 2015) and Francis (2014) showed that excessive use of questionable research practices provides a plausible explanation for these replication failures. No review of the replication crisis is complete and honest without mentioning this fact.
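The numbers in this argument follow from two simple calculations: the probability that every study in an article is significant is the product of the studies’ power, and the number of articles flagged by chance is the number of articles times the error rate. The sketch below uses hypothetical power values; the function names are my own, and treating studies as independent is an assumption.

```python
from math import prod


def prob_all_significant(powers):
    """Probability that every study in a set is significant,
    assuming independent studies with the given power values."""
    return prod(powers)


def expected_false_alarms(n_articles, error_rate):
    """Expected number of articles flagged by chance alone."""
    return n_articles * error_rate


def power_needed(k, target=0.10):
    """Per-study power needed so that k out of k successes
    has at least the target probability."""
    return target ** (1 / k)


print(prob_all_significant([0.5] * 4))   # four studies with 50% power
print(expected_false_alarms(44, 0.10))   # 44 articles, 10% error rate
print(round(power_needed(4), 2))         # per-study power for a 4-study article
```

With four studies, even 50% power per study makes a perfect record improbable, which is why an article with four just-significant results falls below the 10% criterion.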
Limitations and Extension
One limitation of Francis’s approach and similar approaches like my incredibility index (Schimmack, 2012) is that p-values are based on two pieces of information, the effect size and sampling error (signal/noise ratio). This means that these tests can provide evidence for the use of questionable research practices when the number of studies is large and the effect size is small. It is well-known that p-values are more informative when they are accompanied by information about effect sizes. That is, it is not only important to know that questionable research practices were used, but also how much these questionable practices inflated effect sizes. Knowledge about the amount of inflation would also make it possible to estimate the true power of studies and use it as a predictor of the success rate in actual replication studies. Jerry Brunner and I have been working on a statistical method that is able to do this, called z-curve, and we validated the method with simulation studies (Brunner & Schimmack, 2019).
I coded the 195 studies in the 44 articles analyzed by Francis and subjected the results to a z-curve analysis. The results are shocking and much worse than the results for the studies in the replicability project that produced an expected replication rate of 61%. In contrast, the expected replication rate for multiple-study articles in Psychological Science is only 16%. Moreover, given the fairly large number of studies, the 95% confidence interval around this estimate is relatively narrow, ranging from 5% (chance level) to a maximum of 25%.
There is also clear evidence that QRPs were used in many, if not all, articles. Visual inspection shows a steep drop at the level of significance, and the only results that are not significant with p < .05 are results that are marginally significant with p < .10. Thus, the observed discovery rate of 93% is an underestimation and the articles claimed an amazing success rate of 100%.
Correcting for bias, the expected discovery rate is only 6%, which is barely above 5%, the value that would imply that all published results are false positives. The upper limit for the 95% confidence interval around this estimate is 14%, which would imply that for every published significant result there are 6 studies with non-significant results if file-drawering were the only QRP that was used. Thus, we see not only that most articles reported results that were obtained with QRPs, we also see that massive use of QRPs was needed because many studies had very low power to produce significant results without QRPs.
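The file-drawer ratio follows directly from the expected discovery rate (EDR): if a proportion EDR of all conducted studies is significant, there are (1 - EDR) / EDR non-significant studies for every significant one. A minimal sketch, assuming that file-drawering is the only QRP (the function name is my own):

```python
def file_drawer_ratio(edr):
    """Number of non-significant studies per significant study,
    given the expected discovery rate as a proportion."""
    if not 0.0 < edr <= 1.0:
        raise ValueError("edr must be a proportion in (0, 1]")
    return (1 - edr) / edr


for edr in (0.14, 0.06, 0.05):
    print(f"EDR = {edr:.0%}: {file_drawer_ratio(edr):.1f} unpublished per published result")
```

The upper limit of 14% gives the ratio of about 6 mentioned above; the point estimate of 6% implies a considerably larger file drawer.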
Social psychologists have used QRPs to produce impressive results that suggest all studies that tested a theory confirmed predictions. These results are not real. Like a magic show they give the impression that something amazing happened, when it is all smoke and mirrors. In reality, social psychologists never tested their theories because they simply failed to report results when the data did not support their predictions. This is not science. The 2010s have revealed that social psychological results in journals and textbooks cannot be trusted and that influential results cannot be replicated when the data are allowed to speak. Thus, for the most part, social psychology has not been an empirical science that used the scientific method to test and refine theories based on empirical evidence. The major discovery in the 2010s was to reveal this fact, and Francis’s analysis provided valuable evidence to reveal this fact. However, most social psychologists preferred to ignore this evidence. As Popper pointed out, this makes them truly ignorant, which he defined as “the unwillingness to acquire knowledge.” Unfortunately, even social psychologists who are trying to improve the field wilfully ignore Francis’s evidence that makes replication failures predictable and undermines the value of actual replication studies. Given the extent of QRPs, a more rational approach would be to dismiss all evidence that was published before 2012 and to invest resources in new research with open science practices. Actual replication failures were needed to confirm predictions made by bias tests that old studies cannot be trusted. The next decade should focus on using open science practices to produce robust and replicable findings that can provide the foundation for theories.
Wegner’s article “The Premature Demise of the Solo Experiment” in PSPB (1992) is an interesting document for meta-psychologists. It provides some insight into the thinking of leading social psychologists at the time; not only the author, but reviewers and the editor who found this article worthy of publishing, and numerous colleagues who emailed Wegner with approving comments.
The article starts with the observation that in the 1990s social psychology journals increasingly demanded that articles contain more than one study. Wegner thinks that the preference for multiple-study articles is a bias rather than a preference in favour of stronger evidence.
“it has become evident that a tremendous bias against the “solo” experiment exists that guides both editors and reviewers” (p. 504).
The idea of bias is based on the assumption that rejecting a null hypothesis with a long-run error probability of 5% is good enough to publish exciting new ideas and give birth to wonderful novel theories. Demanding even just one replication of this finding would create a lot more burden without any novel insights just to lower this probability to 0.25%.
“But let us just think a moment about the demise of the solo experiment. Here we have a case in which skepticism has so overcome the love of ideas that we seem to have squared the probability of error we are willing to allow. Once, p < .05 was enough. Now, however, we must prove things twice. The multiple experiment ethic has surreptitiously changed alpha to .0025 or below.”
That’s right. The move from solo-experiment to multiple-study articles shifted the type-I error probability. Even a pair of studies reduced the type-I error probability more than the highly cited and controversial call to move alpha from .05 to .005. A pair of studies with p < .05 reduces the .005 probability by 50%!
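The arithmetic behind this comparison is simple enough to sketch. The snippet assumes that the two studies are independent tests of a true null hypothesis, each evaluated at alpha = .05, and that both are reported regardless of outcome. The variable names are illustrative.

```python
alpha_single = 0.05     # conventional criterion for a single study
alpha_proposed = 0.005  # the stricter criterion proposed in the alpha debate

# Probability that two independent studies are BOTH significant
# when the null hypothesis is true in both.
alpha_pair = alpha_single ** 2

print(f"type-I error for a pair of studies: {alpha_pair:.4f}")
print(f"relative to alpha = .005: {alpha_pair / alpha_proposed:.0%}")
```

The product rule only holds if every study is reported. The rest of this post argues that this assumption was routinely violated.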
Wegner also explains why journals started demanding multiple studies.
“After all, the statistical reasons for multiple experiments are obvious - what better protection of the truth than that each article contain its own replication?” (p. 505)
Thus, concerns about replicability in social psychology were prominent in the early 1990s, twenty years before the replication crisis. And demanding replication studies was considered to be a solution to this problem. If researchers were able to replicate their findings, ideally with different methods, stimuli, and dependent variables, the results would be robust and generalizable. So much for the claim that psychologists did not value or conduct replication studies before the open science movement was born in the early 2010s.
Wegner also reports about his experience with attempting to replicate his perfectly good first study.
“Sometimes it works wonderfully….more often than not, however, we find the second experiment is harder to do than the first…Even if we do the exact same experiment again” (p. 506).
He even cheerfully acknowledges that the first results are difficult to replicate because the first results were obtained with some good fortune.
“Doing it again, we will be less likely to find the same thing even if it is true, because the error variance regresses our effects to the mean. So we must add more subjects right off the bat. The joy of discovery we felt on bumbling into the first study is soon replaced by the strain of collecting an all new and expanded set of data to fend off the pointers [pointers = method-terrorists]” (p. 506).
Wegner even thinks that publishing these replication studies is pointless because readers expect the replication study to work. Sure, if the first study worked, so will the second.
“This is something of a nuisance in light of the reception that our second experiment will likely get. Readers who see us replicate our own findings roll their eyes and say “Sure,” and we wonder why we’ve even gone to the trouble.”
However, he fails to examine more carefully why a successful replication study receives only a shoulder-shrug from readers. After all, his own experience was that it was quite difficult to get these replication studies to work. Doesn’t this mean readers should be on the edge of their seats and wonder whether the original result was a false positive or whether it can actually be replicated? Isn’t the second study the real confirmatory test where the rubber hits the road? Insiders of course know that this is not the case. The second study works because it would not have been included in the multiple-study article if it hadn’t worked. That is after all how the field operated. Everybody had the same problems getting studies to work that Wegner describes, but many found a way to get enough studies to work to meet the demands of the editor. The number of studies was just a test of the persistence of a researcher, not a test of a theory. And that is what Wegner rightfully criticized. What is the point of producing a set of studies with p < .05, if more studies do not strengthen the evidence for a claim? We might as well publish a single finding and then move on to find more interesting ideas and publish them with p-values less than .05. Even 9 studies with p < .05 don’t mean that people can foresee the future (Bem, 2011), but it is surely an interesting idea.
Wegner also comments on the nature of replication studies that are now known as conceptual replication studies. The justification for conceptual replication studies is that they address limitations that are unavoidable in a single study. For example, including a manipulation check may introduce biases, but without one, it is not clear whether a manipulation worked. So, ideally the effect could be demonstrated with and without a manipulation check. However, this is not how conceptual replication studies are conducted.
“We must engage in a very delicate “tuning” process to dial in a second experiment that is both sufficiently distant from and sufficiently similar to the original. This tuning requires a whole set of considerations and skills that have nothing to do with conducting an experiment. We are not trained in multi experiment design, only experimental design, and this enterprise is therefore largely one of imitation, inspiration, and luck.”
So, to replicate original results that were obtained with a healthy dose of luck, more luck is needed in finding a condition that works, or simply to try often enough until luck strikes again.
Given the negative attitude towards rigor, Wegner and colleagues also used a number of tricks to make replication studies work.
“Some of us use tricks to disguise our solos. We run “two experiments” in the same session with the same subjects and write them up separately. Or we run what should rightfully be one experiment as several parts, analyzing each separately and writing it up in bite-sized pieces as a multi experiment. Many times, we even hobble the first experiment as a way of making sure there will be something useful to do when we run another.” (p. 506).
If you think this sounds like some charlatans who enjoy pretending to be scientists, your impression is rather accurate because the past decade has shown that many of these internal replications in multiple study articles were obtained with tricks and provide no empirical test of empirical hypotheses; p-values are just for show so that it looks like science, but it isn’t.
My own view on this issue is that the multiple-study format was a bad fix for a real problem. The real problem was that it was all too easy to get p < .05 in a single study to make grand claims about the causes of human behavior. Multiple-study articles didn’t solve this problem because researchers found ways to get significant results again and again even when their claims were false.
The failure of multiple-study articles to fix psychology has some interesting lessons for the current attempts to improve psychology. Badges for data sharing and preregistration will not improve psychology, if they are being gamed like psychologists gamed the multiple-study format. Ultimately, science can only advance if results are reported honestly and if results are finally able to falsify theoretical predictions. Psychology will only become a science when brilliant novel ideas can be proven false and scientific rigor is prized as much as the creation of interesting ideas. Coming up with interesting ideas is philosophy. Psychology emerged as a distinct discipline in order to subject those theories to empirical tests. After a century of pretending to do so, it is high time to do so for real.
“As was evident from my questions after the talk, I was less enthused by the idea of doing a large, replication of Darryl Bem’s studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn’t it be better to investigate a hypothesis about something that doesn’t contradict the laws of physics?”
I think she makes a valid and important point. Bem’s (2011) article highlighted everything that was wrong with the research practices in social psychology. Other articles in JPSP are equally incredible, but this was ignored because naive readers found the claims more plausible (e.g., blood glucose is the energy for will power). We know now that none of these published results provide empirical evidence because the results were obtained with questionable research practices (Schimmack, 2014; Schimmack, 2018). It is also clear that these were not isolated incidents, but that hiding results that do not support a theory was (and still is) a common practice in social psychology (John et al., 2012; Schimmack, 2019).
A large attempt at estimating the replicability of social psychology revealed that only 25% of published significant results could be replicated (OSC). The rate for between-subject experiments was even lower. Thus, the a-priori probability (base rate) that a randomly drawn study from social psychology will produce a significant result in a replication attempt is well below 50%. In other words, a replication failure is the more likely outcome.
The low success rate of these replication studies was a shock. However, it is sometimes falsely implied that the low replicability of results in social psychology was not recognized earlier because nobody conducted replication studies. This is simply wrong. In fact, social psychology is one of the disciplines in psychology that required researchers to conduct multiple studies that showed the same effect to ensure that a result was not a false positive result. Bem had to present 9 studies with significant results to publish his crazy claims about extrasensory perception (Schimmack, 2012). Most of the studies that failed to replicate in the OSC replication project were taken from multiple-study articles that reported several successful demonstrations of an effect. Thus, the problem in social psychology was not that nobody conducted replication studies. The problem was that social psychologists only reported replication studies that were successful.
A proper analysis of the problem also suggests a different solution. If we pretend that nobody did replication studies, it may seem useful to start doing replication studies. However, if social psychologists conducted replication studies, but did not report replication failures, the solution is simply to demand that social psychologists report all of their results honestly. This demand is so obvious that undergraduate students are surprised when I tell them that this is not the way social psychologists conduct their research.
In sum, it has become apparent that questionable research practices undermine the credibility of the empirical results in social psychology journals, and that the majority of published results cannot be replicated. Thus, social psychology lacks a solid empirical foundation.
It is implied by information theory that little information is gained by conducting actual replication studies in social psychology because a failure to replicate the original result is likely and uninformative. In fact, social psychologists have responded to replication failures by claiming that these studies were poorly conducted and do not invalidate the original claims. Thus, replication studies are both costly and have not advanced theory development in social psychology. More replication studies are unlikely to change this.
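One way to make the information-theoretic point concrete is Shannon surprisal, -log2(p), the number of bits conveyed by an outcome that has probability p. The sketch below uses the 25% replication rate cited above as an assumed base rate. Treating a replication attempt as a simple binary event with a known base rate is a simplification for illustration, not an analysis taken from the replication studies themselves.

```python
import math


def surprisal_bits(p: float) -> float:
    """Shannon surprisal, in bits, of an outcome with probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


p_success = 0.25           # assumed base rate of a successful replication
p_failure = 1 - p_success  # the more likely outcome

print(f"replication failure: {surprisal_bits(p_failure):.2f} bits")
print(f"replication success: {surprisal_bits(p_success):.2f} bits")
```

Under this assumption, the expected outcome (a failure) carries well under half a bit, while the rare success carries two bits. The more predictable the failure, the less is learned from observing it.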
A better solution to the replication crisis in social psychology is to characterize research in social psychology from Festinger’s classic small-sample, between-subject study in 1957 to research in 2017 as exploratory and hypothesis-generating research. As Bem suggested to his colleagues, this was a period of adventure and exploration where it was ok to “err on the side of discovery” (i.e., publish false positive results, like Bem’s precognition for erotica). Lots of interesting discoveries were made during this period; it is just not clear which of these findings can be replicated and what they tell us about social behavior.
Thus, new studies in social psychology should not try to replicate old studies. For example, nobody should try to replicate Devine’s subliminal priming study with racial primes with computers and software from the 1980s (Devine, 1989). Instead, prominent theoretical predictions should be tested with the best research methods that are currently available. Thus, the way forward is not to do more replication studies, but rather to use open science (a.k.a. honest science) that uses experiments to subject theories to empirical tests that may also falsify a theory (e.g., subliminal racial stimuli have no influence on behavior). The main shift that is required is to get away from research that can only confirm theories and to allow for empirical data to falsify theories.
This was exactly the intent of Danny Kahneman’s letter, when he challenged social priming researchers to respond to criticism of their work by going into their labs and demonstrating that these effects can be replicated across many labs.
Kahneman makes it clear that the onus of replication is on the original researchers who want others to believe their claims. The response to this letter speaks volumes. Not only did social psychologists fail to provide new and credible evidence that their results can be replicated, they also demonstrated defiant denial in the face of replication failures by others. The defiant denial by prominent social psychologists (e.g., Baumeister, 2019) makes it clear that they will not be convinced by empirical evidence, while others who can look at the evidence objectively do not need more evidence to realize that the social psychological literature is a train wreck (Schimmack, 2017; Kahneman, 2017). Thus, I suggest that young social psychologists search the train wreck for survivors, but do not waste their time and resources on replication studies that are likely to fail.
A simple guide through the wreckage of social psychology is to distrust any significant result with a p-value greater than .01 (Schimmack, 2019). Prediction markets also suggest that readers are able to distinguish credible and incredible results (Atlantic). Thus, I recommend building on studies that are credible and steering clear of sexy findings that are unlikely to replicate. As Danny Kahneman pointed out, young social psychologists who work in questionable areas face a dilemma. Either they have to replicate the questionable methods that were used to get the original results, which is increasingly considered unethical, or they end up with results that are not very informative. On the positive side, the replication crisis implies that there are many important topics in social psychology that need to be studied properly with the scientific method. Addressing these important questions may be the best way to rescue social psychology.
Roy Baumeister wrote a book chapter with the title “Self-Control, Ego Depletion, and Social Psychology’s Replication Crisis” (preprint). I think this chapter will make a valuable contribution to the history of psychology and provides valuable insights into the minds of social psychologists.
I fact-checked the chapter and comment on 31 misleading or false statements.