“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).
This post presented the first replicability ranking and explains the methodology used to estimate the typical power of significant results published in a journal. The post explains the new method for estimating observed power based on the distribution of test statistics converted into absolute z-scores. The method has since been extended with a model that allows for heterogeneity in power across tests, so that power can be estimated for a wider range of z-scores. A description of the extended method will be published once extensive simulation studies are completed.
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
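To make the formula concrete, here is a minimal R sketch with hypothetical p-values. The conventions used below (observed power approximated as the probability of obtaining z > 1.96 given the observed z-score; inflation defined as the success rate minus the median observed power) are simplifying assumptions for illustration, not a full specification of the method.

```r
# A minimal sketch of the R-Index (hypothetical p-values)
p <- c(.001, .020, .035, .049, .004, .041)    # focal p-values reported in a set of studies

z <- qnorm(1 - p / 2)                         # convert two-sided p-values to absolute z-scores
obs_power <- pnorm(z - qnorm(.975))           # observed power of each test (prob. of z > 1.96)

median_power <- median(obs_power)             # Observed Median Power
success_rate <- mean(p < .05)                 # share of significant results (here 100%)
inflation    <- success_rate - median_power   # inflation of the success rate by selection
r_index      <- median_power - inflation      # R-Index = Observed Median Power - Inflation

round(c(median_power = median_power, inflation = inflation, r_index = r_index), 2)
```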
The Test of Insufficient Variance (TIVA) is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, corresponding to p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
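A minimal R sketch of the procedure, again with hypothetical p-values, looks like this:

```r
# A minimal sketch of TIVA (hypothetical, just-significant p-values from two-sided tests)
p <- c(.013, .028, .041, .019, .035)

z <- qnorm(1 - p / 2)                    # convert p-values to absolute z-scores
k <- length(z)
v <- var(z)                              # observed variance; about 1 is expected without selection

chi_sq <- v * (k - 1)                    # scaled variance is chi-square distributed with k - 1 df
p_tiva <- pchisq(chi_sq, df = k - 1)     # left-tailed test: a small p-value indicates insufficient variance

round(c(observed_variance = v, p_tiva = p_tiva), 3)
```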
5. MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
9. Hidden figures: Replication failures in the stereotype threat literature. A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.
10. My journey towards estimation of replicability. In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.
Stephan Lewandowsky and Klaus Oberauer (2020) published an article titled “Low replicability can support robust and efficient science.” A good example of low replicability is the cited finding that only 25% of results published in social psychology could be replicated (OSC, 2015). Thus, the title suggests that social psychology can be a robust science even if only one quarter of published findings are replicable. This seems to be a surprising conclusion that goes against initiatives to improve psychological science.
The authors note that low replication rates in social psychology have been used to call for more replication studies (Zwaan et al., 2018). The main point of the article is that this is not necessarily the best response to the replication crisis in social psychology.
“Highlighting the virtues of replications is, however, not particularly helpful without careful consideration of when, how, why, and by whom experiments should be replicated.”
Examining when replication studies are valuable or a waste of resources is an important and interesting question. However, this question is different from the replicability of the original studies that are being published in social psychology. That is, we can distinguish two questions: (a) how replicable should original studies be and (b) how many original studies should be replicated?
The first question about replicability is essentially a question about statistical power (Cohen, 1962; Brunner & Schimmack, 2019). Stating that low replicability can support a robust science implies that social psychology can be a robust science, even if the average power of published studies is only 25%. Although the title implies that this is the topic of the article, the article does not address this question.
The second question is not about replicability. Rather it is about the value of actual replication studies. There is a way to connect the two questions. It may seem obvious that the value of replication studies decreases (a) the more studies test true hypotheses and (b) the higher the power of original studies is. The reason is that most replication studies are likely to be successful and confirm that the original result was a true positive result. In contrast, if most studies test false hypotheses and power is low, a high percentage of significant results are false positives and true positives are published with inflated effect sizes. In this case, replication studies are likely to fail, and only a small number of studies that succeed in a replication attempt actually contribute robust evidence; the rest is wasted. Not surprisingly, calls for more direct replication studies have arisen in social psychology where false positive rates are relatively high and power is low.
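This trade-off can be illustrated with a back-of-the-envelope calculation. The proportions of true hypotheses and the power values below are made up for illustration; the formulas are the standard ones for the false discovery rate among significant results and the expected success of exact replications.

```r
# Illustrative calculation: how the base rate of true hypotheses and the power of original
# studies determine the false discovery rate and the value of exact replications
replication_value <- function(prior_true, power, alpha = .05) {
  sig_true  <- prior_true * power                   # true positives among all conducted studies
  sig_false <- (1 - prior_true) * alpha             # false positives among all conducted studies
  fdr <- sig_false / (sig_true + sig_false)         # false discovery rate among significant results
  err <- (1 - fdr) * power + fdr * alpha            # expected success rate of exact replications
  round(c(false_discovery_rate = fdr, expected_replication_rate = err), 2)
}

replication_value(prior_true = .9, power = .8)   # mostly true hypotheses, high power: replications add little
replication_value(prior_true = .2, power = .35)  # mostly false hypotheses, low power: replications are needed
```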
The need for replication studies also increases when researchers use questionable research practices to produce significant results. The use of these practices inflates the risk of false positives, which means only results that have been replicated in honest replication studies can be trusted. Stephan Lewandowsky and Klaus Oberauer (2020) recognize that QRPs are a problem, but they are not interested in addressing this issue. Instead, they “simulate an idealized and transparent scientific community that eschews p-hacking and other questionable research practices and conducts studies with adequate power (P = .8). ”
In philosophy it is well known that a conclusion can be logically valid and still be false if it rests on false assumptions. Thus, the article simply does not speak to the actual problems or the consequences of low replicability for psychology as a science. However, articles about the replication crisis attract a lot of attention and citations, so the authors decided to present their fantasy simulations as if they spoke to the replication crisis or to ways of making psychology more robust and credible. They do not.
So, is it possible to have a robust science with low replicability? If you ask me, I don’t think so. I think social psychology first needs to become the science that Lewandowsky and Oberauer simulate: ban the use of QRPs and conduct studies with 80% power. The open science movement is trying to make this happen. Lewandowsky and Oberauer seem to suggest that the open science movement is misguided in its emphasis on replication studies: “Perhaps ironically, waste is reduced by withholding replication until after publication.” The problem with this conclusion is that it rests on the assumption that original results are trustworthy (no QRPs, adequate power). These are exactly the conditions that make actual replication studies less important. Once original results have high replicability, it is less important to probe replicability with replication studies. Thus, we may all agree that a robust science cannot thrive with low replicability of original studies.
The 2010s have seen a replication crisis in social psychology (Schimmack, 2020). The main reason why it is difficult to replicate results from social psychology is that researchers used questionable research practices (QRPs, John et al., 2012) to produce more significant results than their low-powered designs warranted. A catchy term for these practices is p-hacking (Simonsohn, 2014).
New statistical techniques made it possible to examine whether published results were obtained with QRPs. In 2012, I used the incredibility index to show that Bem (2011) used QRPs to provide evidence for extrasensory perception (Schimmack, 2012). In the same article, I also suggested that Gailliot, Baumeister, DeWall, Maner, Plant, Tice, and Schmeichel (2007) used QRPs to present evidence suggesting that will-power relies on blood glucose levels. During the review process of my manuscript, Baumeister confirmed that QRPs were used (cf. Schimmack, 2014). He defended these practices with the statement that they were the norm in social psychology and were not considered unethical.
The revelation that research practices were questionable casts a shadow on the history of social psychology. However, many also saw it as an opportunity to change and improve these practices (Świątkowski and Dompnier, 2017). Over the past decades, the evaluation of QRPs has changed. Many researchers now recognize that these practices inflate error rates, make published results difficult to replicate, and undermine the credibility of psychological science (Lindsay, 2019).
However, there are no general norms regarding these practices and some researchers continue to use them (e.g., Adam D. Galinsky, cf. Schimmack, 2019). This makes it difficult for readers of the social psychological literature to distinguish research that can be trusted from research that cannot, and the answer to this question has to be examined on a case-by-case basis. In this blog post, I examine the responses of Baumeister, Vohs, DeWall, and Schmeichel to the replication crisis and to concerns that their results provide false evidence about the causes of will-power (Friese, Loschelder, Gieseler, Frankenbach, & Inzlicht, 2019; Inzlicht, 2016).
To examine this question scientifically, I use test statistics that are automatically extracted from psychology journals. I divide the test statistics into those that were obtained until 2012, when awareness about QRPs emerged, and those published after 2012. The test statistics are examined using z-curve (Brunner & Schimmack, 2019; Bartoš & Schimmack, 2020). Results provide information about the expected replication rate and the expected discovery rate. The use of QRPs is examined by comparing the observed discovery rate (the percentage of published results that are significant) to the expected discovery rate (the percentage of all conducted tests that produced significant results).
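The input side of such an analysis can be sketched in a few lines of R. The test statistics below are made up; the actual analyses use the automatically extracted values, and the model fitting itself is done with the zcurve package rather than shown here.

```r
# Sketch of the input side of a z-curve analysis (hypothetical test statistics)
tests <- data.frame(
  stat  = c("t", "t", "F", "F", "t"),
  value = c(2.30, 1.45, 5.10, 8.70, 2.95),
  df1   = c(1, 1, 1, 2, 1),
  df2   = c(38, 52, 120, 84, 29)
)

p <- ifelse(tests$stat == "t",
            2 * pt(-abs(tests$value), df = tests$df2),                             # two-sided p from t
            pf(tests$value, df1 = tests$df1, df2 = tests$df2, lower.tail = FALSE)) # p from F
z <- qnorm(1 - p / 2)                          # absolute z-scores that z-curve models

observed_discovery_rate <- mean(p < .05)       # share of reported tests that are significant
# estimating the expected discovery rate requires fitting the z-curve mixture model,
# e.g., with the zcurve R package
```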
Roy F. Baumeister’s expected replication rate was 60% (53% to 67%) before 2012 and 65% (57% to 74%) after 2012. The overlap of the 95% confidence intervals indicates that this small increase is not statistically reliable. Before 2012, the observed discovery rate was 70% and it dropped to 68% after 2012. Thus, there is no indication that non-significant results are reported more often after 2012. The expected discovery rate was 32% before 2012 and 25% after 2012. Thus, there is also no change in the expected discovery rate, and the expected discovery rate is much lower than the observed discovery rate. This discrepancy shows that QRPs were used before 2012 and after 2012. The 95% confidence intervals of the observed and expected discovery rates do not overlap in either period, indicating that the discrepancy is statistically significant. Figure 1 shows the influence of QRPs when the observed non-significant results (histogram of z-scores below 1.96 in blue) are compared to the model prediction (grey curve). The discrepancy suggests a large file drawer of unreported statistical tests.
An old saying is that you can’t teach an old dog new tricks. So, the more interesting question is whether the younger contributors to the glucose paper changed their research practices.
The results for C. Nathan DeWall show no notable response to the replication crisis (Figure 2). The expected replication rate increased slightly from 61% to 65%, but the difference is not significant and visual inspection of the plots suggests that it is mostly due to a decrease in reporting p-values just below .05. One reason for this might be a new goal to p-hack at least to the level of .025 to avoid detection of p-hacking by p-curve analysis. The observed discovery rate is practically unchanged from 68% to 69%. The expected discovery rate increased only slightly from 28% to 35%, but the difference is not significant. More important, the expected discovery rates are significantly lower than the observed discovery rates before and after 2012. Thus, there is evidence that DeWall used questionable research practices before and after 2012, and there is no evidence that he changed his research practices.
The results for Brandon J. Schmeichel are even more discouraging (Figure 3). Here the expected replication rate decreased from 70% to 56%, although this decrease is not statistically significant. The observed discovery rate decreased significantly from 74% to 63%, which shows that more non-significant results are reported. Visual inspection shows that this is particularly the case for test statistics close to zero. Further inspection of the articles would be needed to see how these results are interpreted. More important, the expected discovery rates are significantly lower than the observed discovery rates before 2012 and after 2012. Thus, there is evidence that QRPs were used before and after 2012 to produce significant results. Overall, there is no evidence that research practices changed in response to the replication crisis.
The results for Kathleen D. Vohs also show no response to the replication crisis (Figure 4). The expected replication rate dropped slightly from 62% to 58%; the difference is not significant. The observed discovery rate dropped slightly from 69% to 66%, and the expected discovery rate decreased from 43% to 31%, although this difference is also not significant. Most important, the observed discovery rates are significantly higher than the expected discovery rates before 2012 and after 2012. Thus, there is clear evidence that questionable research practices were used before and after 2012 to inflate the discovery rate.
After concerns about research practices and replicability emerged in the 2010s, social psychologists have debated this issue. Some social psychologists changed their research practices to increase statistical power and replicability. However, other social psychologists have denied that there is a crisis and attributed replication failures to a number of other causes. Not surprisingly, some social psychologists also did not change their research practices. This blog post shows that Baumeister and his students have not changed their research practices. They are able to publish questionable research because there has been no collective effort to define good research practices, to ban questionable practices, and to treat the hiding of non-significant results as a breach of research ethics. Thus, Baumeister and his students are simply exerting their right to use questionable research practices, whereas others have voluntarily implemented good, open science practices. Given the freedom of social psychologists to decide which practices they use, social psychology as a field continues to have a credibility problem. Editors who accept questionable research in their journals are undermining the credibility of their journals. Authors are well advised to publish in journals that emphasize replicability and credibility with open science badges and with a high replicability ranking (Schimmack, 2019).
The new year and decade have just started and I am excited to announce the publication of a pre-print that introduces Z-Curve.2.0 (Bartoš & Schimmack, 2020; Preprint). The ms. “Z-Curve.2.0: Estimating Replication Rates and Discovery Rates” is the product of a nearly year-long collaboration with František Bartoš.
Last year, František emailed me to introduce a new way of fitting z-curves: finite mixture models estimated with an EM algorithm. We started working together on evaluating the density approach from Brunner and Schimmack (2019) and the EM algorithm. In the end, the EM algorithm performs a bit better, although the density approach leaves some wiggle room to improve coverage of confidence intervals. Both methods produce useful estimates and confidence intervals with good coverage.
The collaboration with František was amazing and provides another example of the power of social media. Not only does it allow fast exchange of ideas, it also makes it possible to collaborate with people you might never meet otherwise. Just like I met Rickard Carlsson in person only several years after we became Facebook friends and started a journal together, I still have to meet František in person (hopefully this year).
František also created an R package for z-curve (Z-Curve Package). We are pleased to make this package publicly available. Please try it out and give feedback so that we can improve it before František submits it to the R-team as an official package that can be downloaded.
Here is the abstract of the ms. and a figure that was created with the zcurve package.
This article introduces z-curve.2.0 as a method that estimates the expected replication rate and the expected discovery rate based on the test statistics of studies selected for significance. Z-curve.2.0 extends the work by Brunner and Schimmack (2019) in several ways. First, we show that a new estimation method using expectation maximization outperforms the kernel-density approach of z-curve.1.0. Second, we examine the coverage of bootstrapped confidence intervals to provide information about the uncertainty in z-curve estimates. Third, we extend z-curve to estimate, solely on the basis of significant results, the number of all studies that were conducted, including studies with non-significant results that may not have been reported. This allows us to estimate the expected discovery rate (EDR), that is, the percentage of all conducted studies that produced a significant result. The EDR can be used to assess the size of the file drawer, to estimate the maximum number of false positive results, and may provide a better estimate of the success rate in actual replication studies than the expected replication rate, because exact replications are impossible.
Example: Figure created with the zcurve package.
Data are the original test statistics of 90 studies with good replication studies from the Open Science Collaboration (OSC) reproducibility project (OSC, 2015). Publication bias is indicated by the observed discovery rate (85/90 = 94% significant results), whereas the z-curve estimate of the expected discovery rate is only 39%. The expected replication rate of 62% successful replications is based on the assumption that studies can be replicated exactly. However, with contextual sensitivity, the expected discovery rate is a better estimate of the success rate in replication studies, and it is more in line with the actual success rate (well, failure rate, really) in the OSC project.
[this blog post is a draft for an article in a special issue. Comments are welcome.]
The Big Bem: The Universe Implodes
The 2010s started with a bang. Journal clubs were discussing the preprint of Bem’s (2011) article “Feeling the future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Psychologists were confronted with a choice. Either they had to believe in anomalous effects or they had to believe that psychology was an anomalous science. In a discussion group at the University of Toronto, one of my colleagues noted the danger of making the wrong choice only to end up on the wrong side of history. Ten years later, it is possible to look back at Bem’s article with the hindsight of 2020.
It is now clear that Bem used questionable practices to produce false evidence for his outlandish claims (Francis, 2012; Schimmack, 2012, 2018, 2020). Moreover, it has become apparent that these practices were the norm and that many other findings in social psychology cannot be replicated. This realization has led to major changes in the way social psychologists conduct and report their work. The speed and the extent of these changes have been revolutionary. Akin to the cognitive revolution in the 1960s and the affective revolution in the 1980s, the 2010s have witnessed a method revolution that has produced a new field called meta-psychology and two new journals that publish articles addressing methodological problems and improvements: Advances in Methods and Practices in Psychological Science and Meta-Psychology. For researchers who spend most of their time pursuing primary research, it can be confusing to keep up with the rapid developments in meta-psychology, which often emerge on blogs, pre-prints, Twitter, and Facebook before they appear in peer-reviewed journals.
In this review article, I present an overview of major developments in meta-psychology that are shaping the future of psychological science. Most of the review focuses on replication failures in experimental social psychology, and the different explanations for these failures. I argue that the use of questionable research practices accounts for many replication failures and point out how social psychologists have responded to this realization. Other disciplines may learn from these lessons and may need to reform their research practices in the coming decade.
Arguably, the most important development in psychology has been the normalization of publishing replication failures. When Bem (2011) published his abnormal results supporting paranormal phenomena, researchers quickly failed to replicate these sensational results. However, they had a hard time publishing these failures. The editor of JPSP at that time, Eliot Smith, did not even send the manuscript out for review. This was probably the last attempt to suppress negative evidence, and it failed for two reasons. First, online-only journals with unlimited journal space like PLoS ONE or Frontiers were more than happy to publish these articles (Ritchie, Wiseman, & French, 2012). Second, the decision to reject the replication studies was made public and created a lot of attention because Bem’s article had attracted so much attention (Aldhous, 2011). This created social pressure, and in 2012 JPSP did publish a major replication failure of Bem’s results (Galak, LeBoeuf, Nelson, & Simmons, 2012).
Over the past decade, new article formats have evolved that make it easier to publish articles that fail to confirm theoretical predictions, such as registered reports (Chambers, 2019) and registered replication reports (RRRs; APS, 2015). Registered reports are articles that are accepted for publication before the results are known, thus avoiding the problem of publication bias that only confirmatory findings are published. Registered replication reports are registered reports that aim to replicate an original study in a high-powered design with many laboratories. Although registered replication reports can produce significant and non-significant results, they have produced stunning replication failures. These failures are especially stunning because RRRs had a much higher chance of producing a significant result than the original studies with much smaller samples. Thus, the fact that RRRs of ego-depletion (Hagger et al., 2016) and facial feedback (Wagenmakers et al., 2016) produced non-significant results with thousands of participants was surprising, to say the least.
Replication failures of specific studies are important for specific research questions, but they do not address the crucial meta-psychological question of whether these failures are anomalies or symptomatic of a wider problem in psychological science. After all, Bem’s studies were replicated because researchers were skeptical that these results could be replicated. Such targeted replication failures cannot answer the question whether there is a replication crisis. Answering this broader question requires a representative sample of studies from the population of results published in psychology journals. Given the diversity of psychology, this is a monumental task.
A first step towards this goal was the Reproducibility Project that focused on results published in three psychology journals in the year 2008. The journals represented social/personality psychology (JPSP), cognitive psychology (JEP:LMC), and all areas of psychology (Psychological Science). Although all articles published in 2008 were eligible, not all studies were replicated, in part because some studies were very expensive or difficult to replicate. In the end, 97 studies with significant results were replicated as closely as possible. The headline finding was that only 37% of the replication studies replicated a statistically significant result.
This finding has been widely cited as evidence that psychology has a replication problem. However, headlines tend to blur over the fact that results varied as a function of discipline. While the success rate for cognitive psychology was 50% and even higher for typical within-subject designs with many observations per participant, the success rate was only 25% for social psychology, and even lower for the typical between-subject design that was employed to study ego-depletion, facial feedback or other prominent effects in social psychology.
These results do not warrant the broad claim that psychology has a replication crisis or that most results published in psychology are false. A more nuanced conclusion is that social psychology has a replication crisis and that methodological factors account for these differences. Disciplines that rely on within-subject designs with many repeated measures or intervention studies with a pre-post design are likely to suffer less than disciplines that compare a single measure across participants.
No Crisis: Experts can Reliably Produce Effects
After some influential priming results could not be replicated, Daniel Kahneman wrote a letter to John Bargh (Yong, 2012). He suggested that leading priming researchers should conduct a series of replication studies to demonstrate that their original results are replicable. In response, John Bargh and other prominent social psychologists conducted numerous studies showing that the effects are robust. At least, this is what might have happened in an alternative universe. In this universe, however, there have been few attempts to self-replicate original findings. Bartlett asked Bargh why he did not prove his critics wrong by doing the study again (Bartlett, 2013). The answer is not particularly convincing.
“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says” (Bartlett, 2013).
A few self-replications ended with a replication failure (Elkins-Brown, Saunders, & Inzlicht, 2018). One notable successful self-replication was conducted by Petty and colleagues (Luttrell, Petty, & Xu, 2017). The authors not only replicated the original finding, they also reproduced the non-significant result of the replication study. In addition, they found a significant interaction, indicating that procedural differences made the effect stronger or weaker. This study has been widely celebrated as an exemplary way to respond to replication failures. It also suggests that flaws in replication studies are sometimes responsible for replication failures. However, it is impossible to generalize from this single instance to other replication failures. Thus, it remains unclear how many replication failures were caused by problems with the replication studies.
To conclude, the 2010s have seen a rise in publications of non-significant results that fail to replicate original results and that contradict theoretical predictions. The evidence produced by these studies has demonstrated a replication crisis in social psychology, but not in cognitive psychology. Other areas have been slow to investigate the replicability of their published results.
No Crisis: Decline Effect
The idea that replication failures occur because effects weaken over time was proposed by Jonathan Schooler and popularized in a New Yorker article (Lehrer, 2010). Schooler coined the term decline effect for the observation that effect sizes often decrease over time. Unfortunately, it does not work for more mundane behaviors like eating cheesecake. No matter how often you eat cheesecake, it still adds calories and pounds to your weight. However, for more elusive effects like social priming or verbal overshadowing, it seems to be the case that it is easier to discover effects than to replicate them (Wegner, 1992), but it is not clear what causes decline effects for social psychology experiments. A team of researchers conducted a registered replication study of Schooler and Engstler-Schooler’s (1990) verbal overshadowing study (Alogna et al., 2014). The results replicated a statistically significant effect, but with smaller effect sizes. Schooler (2014) considered this finding a win-win because his original results had been replicated and the reduced effect size supported the presence of a decline effect. However, the notion of a decline effect is misleading because it merely describes a phenomenon rather than providing an explanation for it. Schooler (2014) offered several possible explanations. One possible explanation was regression to the mean (see next paragraph). A second explanation was that slight changes in experimental procedures can reduce effect sizes (more detailed discussion below). More controversially, Schooler also alludes to the possibility that some paranormal processes may produce a decline effect: “Perhaps, there are some parallels between VO [verbal overshadowing] effects and parapsychology after all, but they reflect genuine unappreciated mechanisms of nature (Schooler, 2011) and not simply the product of publication bias or other artifact” (p. 582). Schooler, however, fails to acknowledge that a mundane explanation for the decline effect is that questionable research practices inflate effect size estimates in original studies. Using statistical tools, Francis (2012) showed that Schooler’s original verbal overshadowing studies showed signs of bias. Thus, there is no need to look for paranormal explanations of the decline effect in verbal overshadowing. The normal practice of selectively publishing only significant results is sufficient to explain it. In sum, the decline effect is descriptive rather than explanatory, and Schooler’s suggestion that it reflects some paranormal phenomenon is not supported by scientific evidence.
No Crisis: Regression to the Mean is Normal
Regression to the mean has been invoked as one possible explanation for the decline effect (Schooler, 2014; Fiedler, 2015). Fiedler’s argument is that random measurement error in psychological measures is sufficient to produce replication failures. However, random measurement error is neither necessary nor sufficient to produce replication failures. The outcome of a replication study is determined solely by the study’s statistical power, and if the replication study is an exact replication of an original study, both studies have the same amount of random measurement error and power (Brunner & Schimmack, 2019). Thus, if the OSC project found 97 significant results in 100 published studies, the observed discovery rate of 97% suggests that the studies had 97% power to obtain a significant result. Random measurement error would have the same effect on power and therefore the same effect on the outcome of original studies and replication studies. Therefore, Fiedler’s claim that random measurement error alone explains replication failures is simply wrong and based on a misunderstanding of statistics. Moreover, regression to the mean requires that studies were selected for significance. Schooler (2014) ignores this aspect of regression to the mean when he suggests that regression to the mean is normal and expected. However, the effect sizes of eating cheesecake do not decrease over time because there is no selection process. In contrast, the effect sizes of social psychological experiments decrease when original articles select significant results and replication studies do not select for significance. Thus, it is not normal for success rates to decrease from 97% to 25%, just like it would not be normal for a basketball player’s free-throw percentage to drop from 97% to 25%. Thus, regression to the mean does not warrant the label of being normal, and this argument cannot be used to claim that there is no replication crisis.
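This point can be checked with a small simulation. In the sketch below (illustrative parameters only), original and replication studies are exact copies with the same power; the replication rate does not drop when the originals happen to be significant, and observed effects shrink only among originals that were selected for significance.

```r
# Illustrative simulation: random error alone does not produce a decline effect; selection does
set.seed(1)
n_sim <- 5000; n <- 50; d <- 0.3                        # per-group sample size and true effect

t_orig <- replicate(n_sim, t.test(rnorm(n, d), rnorm(n), var.equal = TRUE)$statistic)
t_rep  <- replicate(n_sim, t.test(rnorm(n, d), rnorm(n), var.equal = TRUE)$statistic)
crit   <- qt(.975, df = 2 * n - 2)

mean(t_rep > crit)                                      # success rate of exact replications (= power)
mean(t_rep[t_orig > crit] > crit)                       # about the same: a significant original does not
                                                        # change the replication probability of a true effect
mean(t_orig[t_orig > crit]) - mean(t_rep[t_orig > crit])
# but the t-values of originals selected for significance are inflated relative to their
# replications: regression to the mean appears only after selection
```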
No Crisis: Exact Replications are Impossible
Heraclitus, an ancient Greek philosopher, observed that you can never step into the same river twice. Similarly, it is impossible to exactly recreate the conditions of a psychological experiment. This trivial observation has been used to argue that replication failures are neither surprising nor problematic, but rather the norm. We should never expect to get the same result from the same paradigm because the actual experiments are never identical, just like a river is always changing (Stroebe & Strack, 2014). This argument has led to a heated debate about the distinction and value of direct versus conceptual replication studies (Zwaan, Etz, Lucas, & Donnellan, 2018).
The purpose of direct replication studies is to replicate an original study as closely as possible. Critics argue that direct replication studies are uninformative because there are only two possible outcomes. Either the replication study is successful and nothing new is learned, or the replication study fails and this only shows that the replication study differed from the original study.
This argument ignores the surprising finding that researchers are seemingly able to alter conditions at will and get the effect in their own laboratories (conceptual replication studies always work), but suddenly even close replications fail to show the effect when the research is registered or carried out by other researchers. It is simply not plausible that conceptual replications that intentionally change features of a study are always successful, while direct replication that try to reproduce the original conditions as closely as possible fail.
This argument also ignores the difference between disciplines. Why is there no replication crisis in cognitive psychology, if each experiment is like a new river? And why does eating cheesecake always lead to a weight gain, no matter whether it is chocolate cheesecake, raspberry white-truffle cheesecake, or caramel fudge cheesecake? The reason is that the main features of rivers remain the same. Even if the river is not identical, you still get wet every time you step into it.
To explain the higher replicability of results in cognitive psychology than in social psychology, Van Bavel et al. (2016) proposed that social psychological studies are more difficult to replicate for a number of reasons. They called this property of studies contextual sensitivity. Coding studies for contextual sensitivity showed the predicted negative correlation between contextual sensitivity and replicability. However, Inbar (2016) found that this correlation was no longer significant when discipline was included as a predictor. Thus, the results suggested that social psychological studies are more contextually sensitive and less replicable, but contextual sensitivity did not explain the lower replicability of social psychology.
It is also not clear that contextual sensitivity implies that social psychology does not have a crisis. Replicability is not the only criterion of good science, especially if exact replications are impossible. Findings that can only be replicated when conditions are reproduced exactly lack generalizability, which makes them rather useless for applications and for the construction of broader theories. Take verbal overshadowing as an example. Even a small change in experimental procedures reduced a practically significant effect size of 16% to a no longer meaningful effect size of 4% (Alogna et al., 2014), and neither of these experimental conditions was similar to real-world situations of eyewitness identification. Thus, the practical implications of this phenomenon remain unclear because it depends too much on the specific context. In conclusion, empirical results are only meaningful if researchers have a clear understanding of the conditions that can produce a statistically significant result most of the time (Fisher, 1926). Contextual sensitivity makes it harder to do so. Thus, it is one potential factor that may contribute to the replication crisis in social psychology because social psychologists do not know under which conditions their results can be reproduced. For example, I asked Roy F. Baumeister to specify optimal conditions to replicate ego-depletion. He was unable or unwilling to do so.
No Crisis: The Replication Studies are Flawed
The argument that replication studies are flawed comes in two flavors. One argument is that replication studies are often carried out by young researchers with less experience and expertise. They did their best, but they are just not very good experimenters (Gilbert, King, Pettigrew, & Wilson, 2016). Cunningham and Baumeister (2016) proclaim “Anyone who has served on university thesis committees can attest to the variability in the competence and commitment of new researchers. Nonetheless, a graduate committee may decide to accept weak and unsuccessful replication studies to fulfill degree requirements if the student appears to have learned from the mistakes” (p. 4). There is little evidence to support this claim. In fact, a meta-analysis found no differences in effect sizes between studies carried out by Baumeister’s lab and other labs (Hagger et al., 2010).
The other argument is that replication failures are sexier and more attention grabbing than successful replications; thus, the accusation goes, replication researchers sabotage their studies or data analyses to produce non-significant results (Bryan, Yeager, & O’Brien, 2019; Strack, 2016). These accusations have been made without empirical evidence to support them. For example, Strack (2016) used a positive correlation between sample size and effect size to claim that some labs were motivated to produce non-significant results, presumably by using a smaller sample size. However, a proper bias analysis showed no evidence that there were too few significant results (Schimmack, 2018). Moreover, the overall effect across all labs was also not significant.
Inadvertent problems, however, may explain some replication failures. For example, some replication studies reduced statistical power by replicating a study with a smaller sample than the original study (Open Science Collaboration, 2015; Ritchie et al., 2012). In this case, a replication failure could be a false negative (type-II error). Consistent with the logic of meta-analysis, studies with larger sample sizes should be given more weight. Thus, it is problematic to conduct replication studies with smaller samples. At the same time, registered replication reports with thousands of participants should be given more weight than original studies with fewer than 100 participants. Size matters.
However, size is not the only factor that matters and researchers disagree about the implications of replication failures. Not surprisingly, authors of the original studies typically recognize some problems with the replication attempts (Baumeister & Vohs, 2016; Strack, 2016). Ideally, researchers would agree ahead of time on a research design that is acceptable to all parties involved. Kahneman (2003) called this model an adversarial collaboration. However, original researchers have either not participated in the planning of a study (Strack, 2016) or withdrawn their approval after the negative results were known (Baumeister & Vohs, 2016). None have acknowledged that their original results were obtained with questionable research practices that make it hard to replicate the results. To make replication studies more meaningful, it would be important that leading researchers agree ahead of time on a research design. Failure to find agreement would itself undermine the value of published research because experts should be able to specify the optimal conditions for producing an effect.
In conclusion, replication failures can occur for a number of reasons, just like significant results in original studies can occur for a number of reasons. Inconsistent results are frustrating because they often require further research. This being said, there is no evidence that low quality of replication studies is the sole or the main cause of replication failures in social psychology.
No Crisis: Replication Failures are Normal
In an opinion piece for the New York Times, Lisa Feldman Barrett, current president of the Association for Psychological Science, commented on the OSC results and claimed that “the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works” (Feldman Barrett, 2015). On the surface, Feldman Barrett makes a valid point. It is true that replication failures are a normal part of science. First, if psychologists conducted studies with 80% power, 1 out of 5 studies would fail to replicate even if everything is going well and all predictions are true. Second, replication failures are expected when researchers test risky hypotheses (e.g., effects of candidate genes on personality) that have a high probability of being false. In this case, a significant result may be a false positive result, and replication failures demonstrate that it was a false positive. Thus, honestly reported replication failures play an integral part in normal science, and the success rate of replication studies provides valuable information about the empirical support for a hypothesis. However, a success rate of 25% or less for social psychology is not a sign of normal science. It is this stark discrepancy between the success rates in journals and in honest replication attempts that suggests social psychology is not a normal science. If social psychological theories made risky predictions that are often false, journals should be filled with non-significant results, but they are not (Sterling, 1959; Sterling et al., 1995). This suggests that the problem is not the low success rate in replication studies, but the high success rate in psychology journals.
Crisis: Original Studies Are Not Credible Because They Used NHST
Bem’s anomalous results were published with a commentary by Wagenmakers et al. (2011). This commentary made various points that are discussed in more detail below, but one unique and salient point of Wagenmakers et al.’s comment concerned the use of null-hypothesis significance testing (NHST). In numerous publications, Wagenmakers has argued that NHST is fundamentally flawed (Wagenmakers, 2007). Bem presented 9 results with p-values below .05 as evidence for ESP. Wagenmakers et al. objected to the use of a significance criterion of .05 and argued that this criterion makes it too easy to publish false positive results (see also Benjamin et al., 2016).
Wagenmakers et al. (2011) claimed that this problem can be avoided by using Bayes-Factors. When they used Bayes-Factors with default priors, several of Bem’s studies no longer showed evidence for ESP. Based on these findings they argued that psychologists must change the way they analyze their data. Since then, Wagenmakers has worked tirelessly to promote Bayes-Factors as an alternative to NHST. However, Bayes-Factors depend on the choice of a prior, and the same data can lead to different inferences with different priors.
Bem, Utts, and Johnson (2011) pointed out that Wagenmakers et al.’s (2011) default prior assumed that there is a 50% probability that ESP works in the opposite direction (below chance accuracy) and a 25% probability that effect sizes are greater than d = 1. Only 25% of the prior distribution was allocated to effect sizes in the predicted direction between 0 and 1. This prior makes no sense for research on extrasensory perception processes that are expected to produce small effects.
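These numbers are exactly what a standard Cauchy prior with a scale of 1 on the effect size implies, which is presumably the default prior at issue here; a quick check in R:

```r
# Probabilities implied by a standard Cauchy(0, 1) prior on the standardized effect size d
pcauchy(0, location = 0, scale = 1)        # P(d < 0)     = .50  (ESP works in the opposite direction)
1 - pcauchy(1, location = 0, scale = 1)    # P(d > 1)     = .25  (implausibly large effects)
pcauchy(1) - pcauchy(0)                    # P(0 < d < 1) = .25  (remaining plausible range)
```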
When Bem et al. (2011) specified a more reasonable prior, Bayes-Factors actually showed more evidence for ESP than NHST. Moreover, the results of individual studies are less important than the combined evidence across studies. A meta-analysis of Bem’s studies shows that even with the default prior, Bayes-Factors reject the null-hypothesis with odds of one billion to 1. Thus, if we trust Bem’s data, Bayes-Factors also suggest that Bem’s results are robust, and it remains unclear why Galak et al. (2012) were unable to replicate Bem’s results.
One argument in favor of Bayes-Factors is that NHST is one-sided. Significant results are used to reject the null-hypothesis, but non-significant results cannot be used to affirm the null-hypothesis. As a result, empirical results rarely falsify a theory unless the theory predicts effects in the opposite direction of the population effect. This means psychological theories are never subjected to real tests that they may fail (Popper). It also makes non-significant results difficult to publish, which leads to publication bias. The claim is that Bayes-Factors solve this problem because they can provide evidence for the null-hypothesis. However, this claim is false. Bayes-Factors are odds ratios between two alternative hypotheses. Unlike in NHST, these two competing hypotheses are not exhaustive; there is an infinite number of additional hypotheses that are not tested. Thus, if the data favor the null-hypothesis over one specified alternative, they do not establish the null-hypothesis. They merely provide evidence against that particular alternative hypothesis. There is always another possible alternative hypothesis that fits the data better than the null-hypothesis. As a result, even Bayes-Factors that strongly favor H0 fail to provide evidence that the true effect size is exactly zero.
The solution to this problem is not new, but unfamiliar to many psychologists. To demonstrate the absence of an effect, it is necessary to specify a region of effect sizes around zero and to demonstrate that the population effect size is likely to be within this region. This can be achieved using NHST (equivalence tests, Lakens, Scheel, & Isager, 2018), or Bayesian statistics (Kruschke, 2018). The main reason why psychologists are not familiar with tests that demonstrate the absence of an effect is that typical sample sizes in psychology have too much sampling error to produce precise estimates of effect sizes.
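For readers unfamiliar with equivalence testing, here is a minimal sketch of the two-one-sided-tests (TOST) logic in base R, with simulated data and arbitrary equivalence bounds of plus/minus 0.3 raw units; dedicated packages such as TOSTER implement this more fully.

```r
# Two one-sided tests (TOST) against equivalence bounds of -0.3 and +0.3 (simulated data)
set.seed(2)
x <- rnorm(80, mean = 0.05)
y <- rnorm(80, mean = 0)

p_lower <- t.test(x, y, mu = -0.3, alternative = "greater")$p.value  # H0: difference <= -0.3
p_upper <- t.test(x, y, mu =  0.3, alternative = "less")$p.value     # H0: difference >=  0.3

max(p_lower, p_upper) < .05   # equivalence is claimed only if both one-sided tests reject
```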
In conclusion, Wagenmakers et al. claimed that NHST contributed to the replication crisis, but there is no evidence that replication failures are caused by the use of the wrong statistical approach. The problem with Bem’s results was not the use of NHST, but the use of questionable research practices to produce illusory evidence (Francis, 2012; Schimmack, 2012, 2016, 2020).
Crisis: Original Studies Report Many False Positives
An influential article by Ioannidis (2005) claimed that most published research findings are false. This eye-catching claim has been cited thousands of times. Few citing authors have bothered to point out that the claim is entirely based on hypothetical scenarios rather than empirical evidence. In psychology, fear that most published results are false positives was stoked by Simmons, Nelson, and Simonsohn’s (2011) “False Positive Psychology” article, which showed with simulation studies that the aggressive use of questionable research practices can dramatically increase the probability that a study produces a significant result without a real effect. These articles shifted concerns about false negatives (Cohen, 1994) to concerns about false positives.
The problem with this focus on false positive results is that it implies that replication failures reveal false positive results. For example, Nelson, Simmons, and Simonsohn (2018) write, “Experimental psychologists spent several decades relying on methods of data collection and analysis that make it too easy to publish false-positive, nonreplicable results. During that time, it was impossible to distinguish between findings that are true and replicable and those that are false and not replicable” (p. 512). However, replication failures do not reveal that original findings were false positive results. A replication failure can itself be a false negative result; that is, the population effect size is not zero, but the replication study had insufficient power to correctly reject the null-hypothesis. The false assumption that replication failures reveal false positive results has created a lot of confusion in the interpretation of replication failures (Maxwell, Lau, & Howard, 2015).
For example, Gilbert et al. (2016) attribute the low replication rate in the reproducibility project to low power of the replication studies. This does not make sense when the replication studies had the same or sometimes even larger sample sizes than the original studies. As a result, the replication studies had as much or more power than the original studies. So, how could low power explain the discrepancy between the 97% success rate in original studies and the 25% success rate in replication studies? It cannot.
Gilbert et al.’s (2016) criticism only makes sense if replication failures in the replication studies are falsely interpreted as evidence that the original results were false positives. To test this, one could conduct a study with a much larger sample size that is able to detect much smaller effect sizes than the original studies. If these studies produce a statistically significant result, it is possible to conclude that the original study reported a true positive result and that the replication study reported a false negative result. While this is true, it is also true that the original studies had insufficient power to produce significant results with the small population effect sizes that the large replication study revealed. Thus, it remains a mystery how journals can report over 90% significant results with small sample sizes. Moreover, many of the effect sizes that are different from zero may lack practical significance. Thus, the real empirical evidence is provided by the large-scale replication studies, while the original results published in journals provide no credible evidence in themselves.
There have been attempts to estimate the false positive rate in social psychology. One approach is to examine sign changes in replication studies. If 100 true null-hypotheses are tested, 50 studies are expected to show a positive sign and 50 studies are expected to show a negative sign due to random sampling error. If these 100 studies are replicated, this will happen again. Just like two coin flips, we would therefore expect 50 replications with the same sign and 50 with a reversed sign by chance alone. A higher frequency of outcomes with the same sign suggests that sometimes the null-hypothesis was false. Wilson and Wixted found that 25% of social psychological results in the OSC project showed a sign reversal. This would suggest that 50% of the studies tested a true null-hypothesis. Of course, sign reversals are also possible when the effect size is not strictly zero. However, the probability of a sign reversal decreases as effect sizes increase. Thus, it is possible to say that about 50% of the replicated studies had an effect size close to zero. Unfortunately, this estimate is imprecise due to the small sample size.
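The arithmetic behind this estimate is simple enough to write down; the numbers below are illustrative, not the actual OSC counts.

```r
# Back-of-the-envelope version of the sign-reversal argument (illustrative numbers)
k <- 50                            # hypothetical number of replicated studies
reversals <- round(.25 * k)        # about 25% of replications showed a flipped sign
2 * reversals / k                  # under a true null, signs flip half the time, so ~50% near-zero effects
binom.test(reversals, k)$conf.int  # the confidence interval is wide: the estimate is imprecise
```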
Gronau et al. (2017) attempted to estimate the false discovery rate using a statistical model that is fitted to the exact p-values of original studies. They applied this model to three datasets and found FDRs of 34%-46% for cognitive psychology, 40%-60% for social psychology, and 48%-88% for social priming. The problem with these estimates is that they are obtained with a model that limits heterogeneity. Simulation studies show that this dogmatic prior inflates FDR estimates (Schimmack & Brunner, 2019). The 40% FDR for cognitive psychology is particularly implausible because 50% of studies actually replicated with a significant result and sign reversals were only observed in 10% of studies. It is implausible that cognitive psychologists test either false hypotheses or have nearly 100% power when they test a real effect. It is much more likely that many of the non-significant results are false negatives due to modest power.
Bartoš and Schimmack (2020) developed a statistical model, called z-curve.2.0, that makes it possible to estimate the discovery rate based on the test statistics in published articles. The model fits a finite mixture model to the significant p-values (converted into z-scores) and then projects the model into the range of non-significant results. This makes it possible to compute the expected discovery rate, that is, the percentage of all conducted tests that are significant. This estimate of the discovery rate can be used to compute the maximum FDR using a simple formula (Soric, 1989). Applying this model to Gronau et al.’s (2017) datasets yields FDRs of 9% (95%CI = 2% to 24%) for cognitive psychology, 26% (4% to 100%) for social psychology, and 61% (19% to 100%) for social priming. The results confirm the general rank ordering, with cognitive psychology being more replicable than social psychology, especially social priming research. Thus, Kahneman was right to direct a letter at Bargh and to describe this line of research as the “poster child for doubts about the integrity of psychological research.” However, the results also make clear that major problems with social priming research cannot be generalized to all areas of psychology.
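Soric's bound only requires the expected discovery rate and the significance criterion: the maximum FDR is (1/EDR - 1) * alpha/(1 - alpha). A one-line version, applied to illustrative EDR values (the first matches the 39% estimate for the OSC studies mentioned earlier):

```r
# Soric's (1989) upper bound on the false discovery rate, given an expected discovery rate (EDR)
max_fdr <- function(edr, alpha = .05) (1 / edr - 1) * alpha / (1 - alpha)

round(max_fdr(c(.39, .20, .10)), 2)   # EDRs of 39%, 20%, and 10% imply maximum FDRs of 8%, 21%, and 47%
```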
In conclusion, it is impossible to determine exactly whether an original finding was a false positive result or not. There have been several attempts to estimate the number of false positive results in the literature, but there is no consensus about the proper method to do so. I believe that the distinction between false and true positives is not particularly helpful if the null-hypothesis is specified as a value of zero. An effect size of d = .0001 is not any more meaningful than an effect size of d = .0000. To be meaningful, published results should be replicable given the same sample sizes as used in original research. Demonstrating a significant result in the same direction in a much larger sample with a much smaller effect size should not be considered a successful replication of a result with a large effect size in a small sample; it is actually an original discovery.
Z-Curve: Quantifying the Crisis
Some psychologists developed statistical models that can quantify the influence of selection for significance on replicability. Brunner and Schimmack (2019) demonstrated mathematically that mean power predicts the expected replication rate (ERR) if the original studies could be replicated exactly (including the same sample size). The tricky part is to estimate mean power based on published test statistics.
The first models were p-curve and p-uniform (Simonsohn et al., 2014; van Aert, Wicherts, & van Assen, 2016). However, the focus of these methods was on effect-size estimation. The p-curve app also produces estimates of power (Simonsohn, Nelson, & Simmons, 2014). Brunner and Schimmack (2019) compared four methods to estimate the ERR, including p-curve. They found that p-curve overestimated the expected replication rate (ERR) when studies varied in effect sizes (the p-curve app also overestimates when there is variability only in sample sizes; Brunner, 2019). In contrast, a new method called z-curve performed very well across many scenarios, especially when heterogeneity was present.
Bartoš and Schimmack (2020) validated an extended version of z-curve (z-curve.2.0) that provides confidence intervals as well as estimates of the expected discovery rate; that is, the percentage of observed significant results among all conducted tests, including the unobserved non-significant ones. Z-curve has already been applied to various datasets of results in social psychology (see the R-Index blog for numerous examples).
The most important dataset was created by Motyl et al. (2017), who coded a representative sample of studies in social psychology journals. The main drawback of this audit of social psychology was that the authors did not have a proper statistical tool to estimate replicability. The closest thing to an estimator of replicability was the R-Index, although the R-Index provides biased estimates, especially when power deviates in either direction from 50%. Fortunately, the estimates were close to 50% (62% for 2003-2004 and 52% for 2013-2014). The average estimate is slightly above 50%, suggesting that social psychology has a replication crisis, but not as severe a crisis as the 25% estimate from the OSC project suggested.
A better way to estimate replicability is to fit z-curve to Motyl et al.’s data. To be included in the z-curve analysis, a study had to (a) use a t-test or F-test, (b) have a valid test-statistic, and (c) not be from the journal Psychological Science. The last criterion was used to focus on social psychology. I also excluded studies with more than 4 experimenter degrees of freedom (e.g., 177 df). This left 678 studies for analysis. The set included 450 between-subject studies, 139 mixed designs, and 67 within-subject designs. The preponderance of between-subject designs is typical of social psychology and one of the reasons for the low power of studies in social psychology.
There are a number of explanations for the discrepancy between the OSC estimate and the z-curve estimate. First of all, the number of studies in the OSC project is very small, and sampling error alone could explain some of the differences. Second, the set of studies in the OSC project was not representative and may have included studies with lower replicability. Third, some actual replication studies may have modified procedures in ways that lowered the chance of obtaining a significant result (e.g., reduced sample sizes). Fourth, as Stroebe and Strack (2014) pointed out, it is never possible to replicate a study exactly. Thus, z-curve estimates are overly optimistic because they assume exact replications. If there is contextual sensitivity, selection for significance will produce additional regression to the mean, and a better estimate of the actual replication rate is the expected discovery rate (Bartoš & Schimmack, 2020). Z-curve estimated an EDR of 21% (an alternative fitting algorithm produced an even lower estimate of 15%), which is indeed more closely aligned with the success rate in actual replication studies. In combination, the existing evidence suggests that the replicability of social psychological research is somewhere between 20% and 50%, which is clearly unsatisfactory and much lower than the illusory success rates of 90% and more in social psychological journals. Even the 90% figure is an underestimation because most of the non-significant results fall in the range of marginally significant results (z = 1.65 to z = 1.96) that are often used to claim support for a prediction. Thus, the observed success rate is close to 100%.
Figure 1 also clearly shows that questionable research practices explain the gap between success rates in laboratories and success rates in journals. The z-curve estimate of non-significant results shows that a large proportion of non-significant results are expected, but hardly any of these expected studies ever get published. This is reflected in an observed discovery rate of 90% and an expected discovery rate of 21%. The confidence intervals do not overlap, indicating that this discrepancy is highly significant. Given such extreme selection for significance, it is not surprising that published effect sizes are inflated and replication studies fail to reproduce significant results. In conclusion, out of all explanations for replication failures in psychology, the use of questionable research practices is the main factor. It is the only explanation that is supported by empirical evidence.
Z-curve can also be used to examine the power of subgroups of studies. In the OSC project, studies with a z-score greater than 4 had an 80% chance of being replicated. To achieve an ERR of 80% with Motyl et al.'s data, z-scores have to be greater than 3.5. In contrast, just-significant results (.01 < p < .05) have an ERR of only 28%. This information can be used to reevaluate published results. Studies with p-values between .05 and .01 should not be trusted unless other information suggests otherwise (e.g., a trustworthy meta-analysis). In contrast, results with z-scores greater than 4 can be used to plan new studies. Unfortunately, there are many more questionable results with p-values greater than .01 (42%) than trustworthy results with z > 4 (17%), but at least there are some findings that are likely to replicate even in social psychology.
An Inconvenient Truth
Every crisis is an opportunity to learn and to avoid future mistakes. Lending practices were changed after the financial crisis in the 2000s. Psychologists and other sciences can learn from the replication crisis in social psychology, but only if they are honest and upfront about its real cause. Social psychologists did not use the scientific method properly. Neither Fisher nor Neyman and Pearson, who created NHST, proposed that non-significant results are irrelevant or that only significant results should be published. The problem of selection for significance is evident and has been well known (Rosenthal, 1979). Cohen (1962) warned about low power, but the main concern was large file-drawers filled with type-II errors. Nobody could imagine that whole literatures with hundreds of studies were built on nothing but sampling error and selection for significance. Bem's article and the replication failures of the 2010s showed that the abuse of questionable research practices was much more excessive than anybody was willing to believe.
The key culprit was the reliance on conceptual replication studies. Even social psychologists were aware that it is unethical not to report replication failures. Bem, for example, advised researchers to use questionable research practices to find significant results in their data: "Go on a fishing expedition for something – anything – interesting," even if this meant to "err on the side of discovery" (Bem, 2010). However, even Bem made it clear that "this is not advice to suppress negative results. If your study was genuinely designed to test hypotheses that derive from a formal theory or are of wide general interest for some other reason, then they should remain the focus of your article. The integrity of the scientific enterprise requires the reporting of disconfirming results."
How then is it possible that Bem himself and other social psychologists never reported disconfirming results? The solution to this problem was to never replicate a study exactly and to always vary some feature of the study. “Never do a direct replication; that way, if a conceptual replication doesn’t work, you maintain plausible deniability” (Anonymous cited in Spellman, 2015). This is how Morewedge, Gilbert, and Wilson describe their research process.
“Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know.”
It was only in 2012 that psychologists realized that the changing results in their studies were heavily influenced by sampling error and not by some minor changes in the experimental procedure (e.g., as graduate students we joked that the color of the experimenter's underwear might influence results). Only a few psychologists have been open about this. In a commendable editorial, Lindsay (2019) talks about his realization that his research practices were suboptimal.
“Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent.”
Rather than invoking some supernatural decline effect like Schooler, Lindsay realized that his research practices were suboptimal. A first step for social psychologists is to acknowledge their past mistakes and to learn from them. Unfortunately, there has been no collective admission of wrongdoing. Instead, we have seen public displays of denial and anger, and maybe some private experiences of shame and depression. Maybe it is time for acceptance. Mistakes are a fact of life. It is the response to error that counts (Nikki Giovanni). So far, the response by social psychologists has been underwhelming. It is time for some leaders to step up.
The Way out of the Crisis
The most obvious solution to the replication crisis is to ban the use of questionable research practices, and to consider their use a violation of research ethics. Kitayama claimed that collating promising small pilot studies into one dataset was an acceptable practice in the past, but no scientific organization has clearly stated that this practice is no longer acceptable. Why should stakeholders trust publications if this is still a tolerated practice?
Professional organizations have made no effort to discuss questionable research practices and to specify which practices are acceptable and which ones are not. Thus, researchers can still use the same practices that Bem used to produce false evidence for extrasensory perception to produce false evidence for their theories.
At present, the enforcement of good practices is left to editors of journals, who can ask pertinent questions during the submission process (Lindsay, 2019). Another solution has been to ask researchers to preregister their studies, which limits researchers' freedom to go on a fishing expedition. There has been a lot of debate about the value of preregistration and some resistance. Some journal editors introduced badges for preregistration (Roger Giner-Sorolla, JESP), but others did not (Chris Crandall, PSPB) (cf. Open Science Foundation, 2020).
There are also no clear standards about what needs to be preregistered or how much researchers are bound by their preregistration. For example, Noah, Schul, and Mayo (2018) preregistered the prediction of an interaction between being observed and a facial feedback manipulation. Although the predicted interaction was not significant, they interpreted the non-significant pattern as confirming their prediction rather than stating that there was no support for their preregistered prediction. Such lax standards impede the improvements that are necessary to make social psychological publications credible.
Finally, preregistration of studies alone will only produce more non-significant results with underpowered designs and will not increase the replicability of significant results. To increase replicability, social psychologists finally have to conduct power analyses to plan studies that can produce significant results without QRPs. Although higher power is essential to the improvement of research, there are no badges for good a priori power analyses.
To ensure that published results are credible and replicable, I argue that researchers should be rewarded for conducting high-powered studies. Because a priori power analyses depend on effect size estimates that may be incorrect, this evaluation should be based on the actual power that is achieved in studies. This can be estimated using z-curve.
Z-Curve can be used to quantify the expected replication rate of individual researchers. This information can then be used in combination with existing measures of research quality like number of publications, citation counts or the H-Index.
I illustrate the value of doing so with two eminent social psychologists. Roy F. Baumeister is one of the leading social psychologists in terms of traditional impact measures. Currently, Roy Baumeister has an H-Index of 105. During the 2010s, there have been concerns about the research practices used to provide evidence for his theory of glucose-fueled will-power (Carter et al., 2014, 2015; Schimmack, 2012), and there has been a major replication failure (Hagger et al., 2016). A z-curve analysis of Baumeister's research articles that contribute to his H-Index shows that the expected replication rate is only 22% (Figure 2), which is below the average for social psychology (cf. Figure 1).
Susan T. Fiske has an H-Index of 69, which is impressive, but notably lower than Baumeister’s H-Index. Thus, if we rely on productivity and impact without considering replicability, Baumeister is the more successful social psychologist. However, a z-curve analysis of Fiske’s work shows higher replicability, 59% (Figure 3).
To combine quantity and quality of impact, I propose to weight the H-Index by replicability. This HR-Index would be 23 for Baumeister and 41 for Fiske. This reflects more accurately that Fiske has made a more positive contribution to social psychology than Baumeister because her work is more replicable.
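The proposed index is simply the product of the H-Index and the z-curve replicability estimate. A toy sketch with the numbers cited above:

```python
# A toy sketch of the proposed HR-Index: weight the H-Index by the z-curve
# expected replication rate (ERR). The inputs are the values cited in the text.
def hr_index(h_index, err):
    return h_index * err

print(round(hr_index(105, 0.22)))  # Baumeister: about 23
print(round(hr_index(69, 0.59)))   # Fiske: about 41
```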
By taking replicability into account, publishing as many discoveries as possible without caring about their truth-value (i.e., "to err on the side of discovery") is no longer the best strategy to achieve fame and recognition in a field. The HR-Index could also motivate researchers to retract articles that they no longer believe in, which would lower the H-Index but increase the R-Index. For highly problematic articles, this could produce a net gain in the HR-Index.
In conclusion, to improve social psychology, and to make it an empirical science, research practices have to change. To do so, it is important to identify good practices and to reward researchers who use good practices. In addition to open-science badges, researchers should be rewarded for publishing studies with good power that can be replicated.
The 2010s have revealed major flaws in the way social psychologists conduct and report their research. Selective publishing of significant results based on studies with low statistical power produced results that are difficult to replicate because published effect sizes are inflated by sampling error. The chance that published results replicate, especially those obtained in between-subject designs with small samples, is estimated to be between 20% and 40%. Meta-analyses do not solve this problem because questionable research practices inflate effect size estimates in meta-analyses. Thus, many theories in social psychology lack empirical support.
A few social psychologists have acknowledged this painful truth. "I want a better tomorrow, I want social psychology to change. But, the only way we can really change is if we reckon with our past, coming clean that we erred; and erred badly" (Inzlicht, 2016). However, the vast majority of social psychologists have responded with defiant silence, denial, or attacks on critics. As a result, a whole decade has been wasted rather than used to confront the problems head on. Not a single senior social psychologist has responded to the replication crisis by calling for major reforms and holding researchers accountable for their research practices.
Fortunately, some younger social psychologists are pushing for reforms, but they lack the social power to implement these reforms. This means that progress is slow and uneven. While some social psychologists follow open science practices, others continue to do business as usual. As quick and dirty studies produce statistically significant results much faster, the incentive structure continues to reward bad practices.
It is therefore necessary to reveal and measure the use of good versus bad practices. The R-Index provides this valuable information and should be used to reward researchers who produce replicable results that provide credible scientific evidence and an empirical foundation for theories of human behavior. The R-Index can also be used to evaluate and compare other disciplines in psychology. Demonstrating that scientific results are replicable is of utmost importance to ensure that the general public and paying undergraduate students do not lose trust in psychology.
Bartoš, F. & Schimmack, U. (2020). Z-Curve.2.0: Estimating Replication and Discovery Rates. Manuscript Submitted for Publication.
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide to publishing in psychological journals (pp. 3–16). Cambridge, England: Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719. https://doi.org/10.1037/a0024777
Brunner, J. & Schimmack, U. (2019). Estimating Population Mean Power Under Conditions of Heterogeneity and Selection for Significance. Meta-Psychology, In Press.
Bryan, C. J., Yeager, D. S., & O'Brien, J. M. (2019). Replicator degrees of freedom allow publication of misleading failures to replicate. Proceedings of the National Academy of Sciences, 116, 25535–25545. https://doi.org/10.1073/pnas.1910951116
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144(4), 796–815. https://doi.org/10.1037/xge0000083
Carter, E. C., and McCullough, M. E. (2013). Is ego depletion too incredible? Evidence for the overestimation of the depletion effect. Behav. Brain Sci. 36, 683–684. doi: 10.1017/S0140525X13000952
Carter, E. C., & McCullough, M. E. (2014). Publication bias and the limited strength model of self-control: Has the evidence for ego depletion been overestimated? Frontiers in Psychology, 5, Article 823.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cunningham, M. R., & Baumeister, R. F. (2016). How to make nothing out of something: Analyses of the impact of study sampling and statistical interpretation in misleading meta-analytic conclusions. Frontiers in Psychology, 7, Article 1639.
Elkins-Brown, N., Saunders, B., & Inzlicht, M. (2018). The misattribution of emotions and the error-related negativity: A registered report. Cortex: A Journal Devoted to the Study of the Nervous System and Behavior, 109, 124–140. https://doi.org/10.1016/j.cortex.2018.08.017
Fisher, R. A. (1926). The arrangement of field experiments. Journal of the Ministry of Agriculture, 33, 503–513.
Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. doi:10.3758/s13423-012-0227-9
Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology, 103(6), 933–948. https://doi.org/10.1037/a0029709
Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on “Estimating the reproducibility of psychological science.” Science, 351, 1037–1103.
Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from H₀. Journal of Experimental Psychology: General, 146(9), 1223–1233. https://doi.org/10.1037/xge0000324
Hagger, M. S., Wood, C., Stiff, C., & Chatzisarantis, N. L. D. (2010). Ego depletion and the strength model of self-control: A meta-analysis. Psychological Bulletin, 136(4), 495–525. https://doi.org/10.1037/a0019486
Hagger, M. S., Chatzisarantis, N. L. D., Alberts, H., Anggono, C. O., Batailler, C., Birt, A. R., … Zwienenberg, M. (2016). A Multilab Preregistered Replication of the Ego-Depletion Effect. Perspectives on Psychological Science, 11(4), 546–573. https://doi.org/10.1177/1745691616652873
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124.
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
Kruschke, J. K., & Liddell, T. M. (2018). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review, 25(1), 178–206. https://doi.org/10.3758/s13423-016-1221-4
Kvarven, A., Strømland, E., & Johannesson, M. (2019). Comparing meta-analyses and pre-registered multiple-labs replication projects. Preprint (retrieved 1/6/2020).
Luttrell, A., Petty, R. E., & Xu, M. (2017). Replicating and fixing failed replications: The case of need for cognition and argument quality. Journal of Experimental Social Psychology, 69, 178–183. https://doi.org/10.1016/j.jesp.2016.09.006
Maxwell, S. E., Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist, 70(6), 487–498. https://doi.org/10.1037/a0039400
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34–58. https://doi.org/10.1037/pspa0000084
Noah, T., Schul, Y., & Mayo, R. (2018). When both the original study and its failed replication are correct: Feeling observed eliminates the facial-feedback effect. Journal of Personality and Social Psychology, 114(5), 657–664. https://doi.org/10.1037/pspa0000121
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Popper, K. R. (1959). The logic of scientific discovery. London, England: Hutchinson.
Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/journal.pone.0033423
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487
Schimmack, U. (2018). Why the Journal of Personality and Social Psychology Should Retract Article DOI: 10.1037/a0021524 “Feeling the Future: Experimental evidence for anomalous retroactive influences on cognition and affect” by Daryl J. Bem. https://replicationindex.com/2018/01/05/bem-retraction/ (blog post retrieved 1/6/2020)
Schooler, J. W. (2014). Turning the lens of science on itself: Verbal overshadowing, replication, and metascience. Perspectives on Psychological Science, 9(5), 579–584. https://doi.org/10.1177/1745691614547878
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666–681.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2017). Pcurve app 4.06. Retrieved May 30, 2019, from http://www.p-curve.com
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54, 30–34.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.
van Aert, R. C. M., Wicherts, J. M., & van Assen, M. A. L. M. (2016). Conducting meta-analyses based on p values: Reservations and recommendations for applying p-uniform and pcurve. Perspectives on Psychological Science, 11, 713–729.
Wagenmakers, E.-J., Beek, T., Dijkhoff, L., Gronau, Q. F., Acosta, A., Adams, R. B., … Zwaan, R. A. (2016). Registered Replication Report: Strack, Martin, & Stepper (1988). Perspectives on Psychological Science, 11(6), 917–928. https://doi.org/10.1177/1745691616674458
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: The case of psi. Comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.
We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.
In the beginning of the 2010s, it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes-to-get-significance approach to publish nine significant demonstrations of a phenomenon that does not exist: mental time-travel. The cat was out of the bag. There were only two questions: How many other findings were unreal, and how would psychologists respond to the credibility crisis?
D. Steve Lindsay responded to the crisis by helping to implement tighter standards and enforcing these standards as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Lindsay published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.
Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.
Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates of the sizes of real but small effects.
This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secret that questionable research practices were widely used, and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay's frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like the Nobel laureate who recently retracted a Science article (Frances Arnold).
1. Acknowledge your mistakes.
2. Learn from your mistakes.
3. Teach others from your mistakes.
4. Move beyond your mistakes.
Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.
So, I z-curved Lindsay's published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model expects more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is confirmed by a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap, so this test of bias is not significant.
The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.
Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is a full-time job, so output has decreased. Maybe research also slowed down because studies were conducted with more care. I don't know. I just know that there are very few statistics to examine.
Although the small number of tests makes the results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88%, and there is no longer a discrepancy between the observed and expected discovery rates.
If psychology as a whole had responded like D. S. Lindsay, it would be in a good position to start the new decade. The problem is that this response is an exception rather than the rule, and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially as undergraduates and the wider public learn more and more about how untrustworthy psychological science has been and often still is. Hopefully, reforms will come sooner rather than later, or we may have to sing a swan song for psychological science.
This blog post is heavily based on one of my first blog-posts in 2014 (Schimmack, 2014). The blog post reports a meta-analysis of ego-depletion studies that used the hand-grip paradigm. When I first heard about the hand-grip paradigm, I thought it was stupid because there is so much between-subject variance in physical strength. However, then I learned that it is the only paradigm that uses a pre-post design, which removes between-subject variance from the error term. This made the hand-grip paradigm the most interesting paradigm because it has the highest power to detect ego-depletion effects. I conducted a meta-analysis of the hand-grip studies and found clear evidence of publication bias. This finding is very damaging to the wider ego-depletion research because other studies used between-subject designs with small samples which have very low power to detect small effects.
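To illustrate why the pre-post design is so much more powerful, here is a rough sketch based on the normal approximation. The pre-post correlation of r = .8 and the effect size of d = .5 are assumptions for illustration, not values from the meta-analysis.

```python
# A rough sketch (normal approximation, my own illustration) of why a pre-post
# design has more power than a between-subject design for the same raw effect
# size d: removing stable between-person variance shrinks the standard error of
# the difference score by sqrt(2 * (1 - r)), where r is the pre-post correlation.
from scipy.stats import norm

def power_between(d, n_per_group, alpha=0.05):
    ncp = d * (n_per_group / 2) ** 0.5          # noncentrality for a two-group comparison
    return norm.sf(norm.isf(alpha / 2) - ncp)    # ignores the negligible opposite tail

def power_prepost(d, n, r, alpha=0.05):
    d_diff = d / (2 * (1 - r)) ** 0.5            # effect size of the pre-post difference score
    ncp = d_diff * n ** 0.5
    return norm.sf(norm.isf(alpha / 2) - ncp)

d, n = 0.5, 30                                   # assumed values for illustration
print(f"between-subject, n = {n} per group: power = {power_between(d, n):.2f}")
print(f"pre-post design, n = {n}, r = .8:    power = {power_prepost(d, n, 0.8):.2f}")
```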
This prediction was confirmed in a meta-analysis by Carter, Kofler, Forster, and McCullough (2015) that revealed publication bias in ego-depletion studies with other paradigms.
The results also explain why attempts to show ego-depletion effects with within-subject designs failed (Francis et al., 2018). Within-subject designs increase power by removing stable between-subject variance such as physical strength. However, given the lack of evidence with the hand-grip paradigm, it is not surprising that within-subject designs with other dependent variables also failed to show ego-depletion effects. Thus, these results further suggest that ego-depletion effects are too small to be used for experimental investigations of will-power.
Of course, Roy F. Baumeister doesn't like this conclusion because his reputation is to a large extent based on the resource model of will-power. His response to the evidence that most of the support for the model rests on questionable practices that produced illusory evidence has been to attack the critics (cf. Schimmack, 2019).
In 2016, he paid to publish a critique of Carter et al.'s (2015) meta-analysis in Frontiers in Psychology (Cunningham & Baumeister, 2016). In this article, the authors question the results of bias tests that revealed publication bias and suggested that there is no evidence for ego-depletion effects.
Unfortunately, Cunningham and Baumeister’s (2016) article is cited frequently as if it contained some valid scientific arguments.
For example, Christodoulou, Lac, and Moore (2017) cite the article to dismiss the results of a PEESE analysis that suggests publication bias is present and there is no evidence that infants can add and subtract. Thus, there is a real danger that meta-analysts will use Cunningham & Baumeister’s (2016) article to dismiss evidence of publication bias and to provide false evidence for claims that rest on questionable research practices.
Fact Checking Cunningham and Baumeister’s Criticisms
Cunningham and Baumeister (2016) claim that results from bias tests are difficult to interpret, but their criticism is based on false arguments and inaccurate claims.
Confusing Samples and Populations
This scientifically sounding paragraph is a load of bull. The authors claim that inferential tests require sampling from a population and raise a question about the adequacy of a sample. However, bias tests do not work this way. They are tests of the population, namely the population of all of the studies that could be retrieved that tested a common hypothesis (e.g., all handgrip studies of ego-depletion). Maybe more studies exist than are available. Maybe the results based on the available studies differ from the results if all studies were available, but that is irrelevant. The question is only whether the available studies are biased or not. So, why do we even test for significance? That is a good question. The test of significance only tells us whether a discrepancy is merely a product of random chance or whether it was introduced by questionable research practices. However, even random bias is bias. If a set of studies reports only significant results, and the observed power of the studies is only 70%, there is a discrepancy. If this discrepancy is not statistically significant, there is still a discrepancy. If it is statistically significant, we are allowed to attribute it to questionable research practices such as those that Baumeister and several others admitted using.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication) (Schimmack, 2014).
Given the widespread use of questionable research practices in experimental social psychology, it is not surprising that bias-tests reveal bias. It is actually more surprising when these tests fail to reveal bias, which is most likely a problem of low statistical power (Renkewitz & Keiner, 2019).
The claims about power are not based on clearly defined constructs in statistics. Statistical power is a function of the strength of a signal (the population effect size) and the amount of noise (sampling error). Researchers' skills are not part of statistical power. Results should be independent of a researcher. A researcher can, of course, pick procedures that maximize a signal (powerful interventions) or reduce sampling error (e.g., pre-post designs), but these factors play a role in the design of a study. Once a study is carried out, the population effect size is what it was and the sampling error is what it was. Thus, honestly reported test statistics tell us about the signal-to-noise ratio of the study that was conducted. Skillful researchers would produce stronger test statistics (higher t-values, F-values) than unskilled researchers. The problem for Baumeister and other ego-depletion researchers is that the t-values and F-values tend to be weak and suggest that questionable research practices rather than skill produced the significant results. In short, meta-analyses of test statistics reveal whether researchers used skill or questionable research practices to produce significant results.
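For readers who are unfamiliar with this step, here is a minimal sketch (not the z-curve package itself) of the standard conversion of reported t-values and F-values into absolute z-scores, which is the input for power-based bias tests. The test statistics shown are hypothetical.

```python
# A minimal sketch of the standard conversion of reported test statistics into
# absolute z-scores: compute the two-sided p-value of the test and transform it
# back to the standard-normal scale.
from scipy import stats

def t_to_z(t_value, df):
    p = 2 * stats.t.sf(abs(t_value), df)
    return stats.norm.isf(p / 2)

def f_to_z(f_value, df1, df2):
    p = stats.f.sf(f_value, df1, df2)
    return stats.norm.isf(p / 2)

print(t_to_z(2.1, 28))       # a weak, just-significant result gives z close to 2
print(f_to_z(12.5, 1, 120))  # a stronger result gives a larger z
```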
The reference to Morey (2013) suggests that there is a valid criticism of bias tests, but that is not the case. Power-based bias tests are based on sound statistical principles that were outlined by a statistician in the journal American Statistician (Sterling, Rosenbaum, & Weinkam, 1995). Building on this work, Jerry Brunner (professor of statistics) and I published theorems that provide the basis of bias tests like TES to reveal the use of questionable research practices (Brunner & Schimmack, 2019). The real challenge for bias tests is to estimate mean power without information about the population effect sizes. In this regard, TES is extremely conservative because it relies on a meta-analysis of observed effect sizes to estimate power. These effect sizes are inflated when questionable research practices were used, which makes the test conservative. However, there is a problem with TES when effect sizes are heterogeneous. This problem is avoided by alternative bias tests like the R-Index that I used to demonstrate publication bias in the handgrip studies of ego-depletion. In sum, bias tests like the R-Index and TES are based on solid mathematical foundations and simulation studies show that they work well in detecting the use of questionable research practices.
Confusing Absence of Evidence with Evidence of Absence
PET and PEESE are extensions of Egger's regression test of publication bias. All of these methods relate sample sizes (or sampling error) to effect size estimates. Questionable research practices tend to introduce a negative correlation between sample sizes and effect sizes, or, equivalently, a positive correlation between sampling error and effect sizes. The reason is that significance requires a signal-to-noise ratio of approximately 2:1 for t-tests (4:1 for F-tests) to produce a significant result. To achieve this ratio with more noise (smaller samples, more sampling error), the signal has to be inflated more.
The novel contribution of PET and PEESE was to use the intercept of the regression model as an effect size estimate that corrects for publication bias. This estimate needs to be interpreted in the context of the sampling error of the regression model, using a 95%CI around the point estimate.
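A minimal sketch of the PET and PEESE regressions on simulated data may help. This is not Carter et al.'s meta-analysis; the small-study effect is injected directly as a term proportional to the standard error, as a stand-in for selection bias.

```python
# A minimal PET/PEESE sketch on made-up data: regress observed effect sizes on
# their standard errors (PET) or variances (PEESE) with inverse-variance weights;
# the intercept estimates the effect size of a hypothetical study with SE = 0.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = rng.integers(20, 200, size=30)        # hypothetical per-group sample sizes
se = np.sqrt(2 / n)                       # approximate SE of d for two equal groups
d_obs = rng.normal(0.0, se) + 0.8 * se    # zero true effect plus a small-study effect

pet = sm.WLS(d_obs, sm.add_constant(se), weights=1 / se**2).fit()
peese = sm.WLS(d_obs, sm.add_constant(se**2), weights=1 / se**2).fit()

print(f"PET intercept:   {pet.params[0]:.3f}, "
      f"95% CI [{pet.conf_int()[0][0]:.3f}, {pet.conf_int()[0][1]:.3f}]")
print(f"PEESE intercept: {peese.params[0]:.3f}, "
      f"95% CI [{peese.conf_int()[0][0]:.3f}, {peese.conf_int()[0][1]:.3f}]")
```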
Carter et al. (2015) found that the 95%CI often included a value of zero, which implies that the data are too weak to reject the null-hypothesis. Such non-significant results are notoriously difficult to interpret because they neither support nor refute the null-hypothesis. The main conclusion that can be drawn from this finding is that the existing data are inconclusive.
This main conclusion does not change when the number of studies is less than 20. Stanley and Doucouliagos (2014) were commenting on the trustworthiness of point estimates and confidence intervals in smaller samples. Smaller samples introduce more uncertainty and we should be cautious in the interpretation of results that suggest there is an effect because the assumptions of the model are violated. However, if the results already show that there is no evidence, small samples merely further increase uncertainty and make the existing evidence even less conclusive.
Aside from the issues regarding the interpretation of the intercept, Cunningham and Baumeister also fail to address the finding that sample sizes and effect sizes were negatively correlated. If this negative correlation is not caused by questionable research practices, it must be caused by something else. Cunningham and Baumeister fail to provide an answer to this important question.
No Evidence of Flair and Skill
Earlier, Cunningham and Baumeister (2016) claimed that power depends on researchers' skills, and they argued that new investigators may be less skilled than the experts who developed the paradigms, such as Baumeister and colleagues.
However, they then point out that Carter et al. (2015) examined lab as a moderator and found no difference between studies conducted by Baumeister and colleagues and studies from other laboratories.
Thus, there is no evidence whatsoever that Baumeister and colleagues were more skillful and produced more credible evidence for ego-depletion than other laboratories. The fact that everybody got ego-depletion effects can be attributed to the widespread use of questionable research practices that made it possible to get significant results even for implausible phenomena like extrasensory perception (John et al., 2012; Schimmack, 2012). Thus, the large number of studies that support ego-depletion merely shows that everybody used questionable research practices like Baumeister did (Schimmack, 2014; Schimmack, 2016), which is also true for many other areas of research in experimental social psychology (Schimmack, 2019). Francis (2014) found that 80% of articles showed evidence that QRPs were used.
Handgrip Replicability Analysis
The meta-analysis included 18 effect sizes based on handgrip studies. Two unpublished studies (Ns = 24, 37) were not included in this analysis. Seeley and Gardner's (2003) study was excluded because it failed to use a pre-post design, which could explain its non-significant result. The meta-analysis reported two effect sizes for this study. Thus, 4 effects were excluded, and the analysis below is based on the remaining 14 studies.
All articles presented significant effects of will-power manipulations on handgrip performance. Bray et al. (2008) reported three tests; one was deemed not significant (p = .10), one marginally significant (p = .06), and one significant at the .05 level (p = .01). The result with the lowest p-value was used. As a result, the success rate was 100%.
Median observed power was 63%. With a success rate of 100%, the inflation rate is 37%, and the R-Index is 63% – 37% = 26%. This is close to the R-Index of 22% that is expected in a scenario in which the null-hypothesis is true and all reported findings are type-I errors (when the null-hypothesis is true, the significant p-values are uniformly distributed below .05, which yields a median observed power of about 61% and an R-Index of about 22% after subtracting the inflation). Thus, the R-Index supports Carter and McCullough's (2014) conclusion that the existing evidence does not provide empirical support for the hypothesis that will-power manipulations lower performance on a measure of will-power.
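For transparency, here is a sketch of the R-Index computation applied to hypothetical two-sided p-values; these are not the actual handgrip results, and the observed-power formula ignores the negligible contribution of the opposite tail.

```python
# A sketch of the R-Index computation on hypothetical two-sided p-values:
# convert p-values to z-scores, compute observed power against the alpha = .05
# criterion, take the median, and subtract the inflation (success rate minus
# median observed power).
import numpy as np
from scipy.stats import norm

p_values = np.array([.04, .03, .02, .01, .05, .02, .04, .03, .01, .02, .04, .05, .03, .01])
z = norm.isf(p_values / 2)                   # absolute z-scores
obs_power = norm.sf(norm.isf(0.025) - z)     # observed power for alpha = .05, two-tailed
success_rate = np.mean(p_values <= .05)      # here 100%, as in the handgrip set

median_power = np.median(obs_power)
inflation = success_rate - median_power
r_index = median_power - inflation
print(f"median observed power = {median_power:.2f}, inflation = {inflation:.2f}, R-Index = {r_index:.2f}")
```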
The R-Index can also be used to examine whether a subset of studies provides some evidence for the will-power hypothesis, but that this evidence is masked by the noise generated by underpowered studies with small samples. Only 7 studies had samples with more than 50 participants. The R-Index for these studies remained low (20%). Only two studies had samples with 80 or more participants. The R-Index for these studies increased to 40%, which is still insufficient to estimate an unbiased effect size.
One reason for the weak results is that several studies used weak manipulations of will-power (e.g., sniffing alcohol vs. sniffing water in the control condition). The R-Index of individual studies shows two studies with strong results (R-Index > 80). One study used a physical manipulation (standing on one leg). This manipulation may lower handgrip performance, but the effect may not reflect an influence on will-power. The other study used a mentally taxing (and boring) task that is not also physically taxing, namely crossing out "e"s. This task seems promising for a replication study.
Power analysis with an effect size of d = .2 suggests that a serious empirical test of the will-power hypothesis requires a sample size of N = 300 (150 per cell) to have 80% power in a pre-post study of will-power.
Baumeister has lost any credibility as a scientist. He is pretending to engage in a scientific dispute about the validity of ego-depletion research, but he is ignoring the most obvious evidence that has accumulated during the past decade. Social psychologists have misused the scientific method and engaged in a silly game of producing significant p-values that support their claims. Data were never used to test predictions and studies that failed to support hypotheses were not published.
“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)
As a result, the published record lacks credibility and cannot be used to provide empirical evidence for scientific claims. Ego-depletion is a glaring example of everything that went wrong in experimental social psychology. This is not surprising because Baumeister and his students used questionable research practices more than other social psychologists (Schimmack, 2018). Now he is trying to repress this truth, which should not surprise any psychologist familiar with motivated biases and repressive coping. However, scientific journals should not publish his pathetic attempts to dismiss criticism of his work. Cunningham and Baumeister's article does not provide a single valid scientific argument. Frontiers in Psychology should retract the article.
Carter, E. C., Kofler, L. M., Forster, D. E., & McCullough, M. E. (2015). A series of meta-analytic tests of the depletion effect: Self-control does not seem to rely on a limited resource. Journal of Experimental Psychology: General, 144, 796–815. https://doi.org/10.1037/xge0000083
Methods for the detection of publication bias in meta-analyses were first introduced in the 1980s (Light & Pillemer, 1984). However, existing methods tend to have low statistical power to detect bias, especially when population effect sizes are heterogeneous (Renkewitz & Keiner, 2019). Here I show that the Replicability Index (RI) is a powerful method to detect selection for significance while controlling the type-I error risk better than the Test of Excessive Significance (TES). Unlike funnel plots and other regression methods, the RI can be used without variation in sampling error across studies. Thus, it should be a default method to examine whether effect size estimates in a meta-analysis are inflated by selection for significance. However, the RI should not be used to correct effect size estimates. A significant result merely indicates that traditional effect size estimates are inflated by selection for significance or other questionable research practices that inflate the percentage of significant results.
Evaluating the Power and Type-I Error Rate of Bias Detection Methods
Just before the end of the year, and decade, Frank Renkewitz and Melanie Keiner published an important article that evaluated the performance of six bias detection methods in meta-analyses (Renkewitz & Keiner, 2019).
The article makes several important points.
1. Bias can distort effect size estimates in meta-analyses, but the amount of bias is sometimes trivial. Thus, bias detection is most important in conditions where effect sizes are inflated to a notable degree (say more than one-tenth of a standard deviation, e.g., from d = .2 to d = .3).
2. Several bias detection tools work well when studies are homogeneous (i.e., the population effect sizes are very similar). However, bias detection is more difficult when effect sizes are heterogeneous.
3. The most promising tool for heterogeneous data was the Test of Excessive Significance (Francis, 2013; Ioannidis & Trikalinos, 2007). However, simulations without bias showed that the higher power of TES was achieved with a false-positive rate that exceeded the nominal level. The reason is that TES relies on the assumption that all studies have the same population effect size, and this assumption is violated when population effect sizes are heterogeneous.
This blog post examines two new methods to detect publication bias and compares them to TES and the Test of Insufficient Variance (TIVA), which performed well when effect sizes were homogeneous (Renkewitz & Keiner, 2019). These methods are not entirely new. One method is the Incredibility Index, which is similar to TES (Schimmack, 2012). The second method is the Replicability Index, which corrects estimates of observed power for inflation when bias is present.
The Basic Logic of Power-Based Bias Tests
The mathematical foundations for bias tests based on statistical power were introduced by Sterling et al. (1995). Statistical power is defined as the conditional probability of obtaining a significant result when the null-hypothesis is false. When the null-hypothesis is true, the probability of obtaining a significant result is set by the criterion for a type-I error, alpha. To simplify, we can treat cases where the null-hypothesis is true as the boundary value for power (Brunner & Schimmack, 2019). I call this unconditional power. Sterling et al. (1995) pointed out that for studies with heterogeneity in sample sizes, effect sizes, or both, the discovery rate, that is, the percentage of significant results, is predicted by the mean unconditional power of the studies. This insight makes it possible to detect bias by comparing the observed discovery rate (the percentage of significant results) to the expected discovery rate based on the unconditional power of the studies. The empirical challenge is to obtain useful estimates of unconditional mean power, which depends on the unknown population effect sizes.
Ioannidis and Trikalinos (2007) were the first to propose a bias test that relied on a comparison of expected and observed discovery rates. The method is called the Test of Excessive Significance (TES). They proposed a conventional meta-analysis of effect sizes to obtain an estimate of the population effect size, and then to use this effect size and information about sample sizes to compute the power of the individual studies. The final step is to compare the expected discovery rate (e.g., 5 out of 10 studies) with the observed discovery rate (e.g., 8 out of 10 studies) with a chi-square test and to test the null-hypothesis of no bias with alpha = .10. They did point out that TES is biased when effect sizes are heterogeneous (see Renkewitz & Keiner, 2019, for a detailed discussion).
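The following sketch illustrates the TES logic on made-up summary data. It is not Ioannidis and Trikalinos' code; it uses a simple fixed-effect estimate and the normal approximation for power.

```python
# A minimal sketch of the TES logic on made-up summary data: estimate a common
# effect size, compute each study's power at that effect size, and compare the
# expected with the observed number of significant results.
import numpy as np
from scipy.stats import norm, chisquare

d_obs = np.array([0.55, 0.48, 0.62, 0.50, 0.58, 0.45, 0.60, 0.52])   # hypothetical effect sizes
n_per_group = np.array([20, 25, 22, 30, 24, 28, 21, 26])             # hypothetical sample sizes
k = len(d_obs)
observed = k                                # in this hypothetical set, all results are significant

se = np.sqrt(2 / n_per_group)
d_meta = np.sum(d_obs / se**2) / np.sum(1 / se**2)     # fixed-effect meta-analytic estimate
power = norm.sf(norm.isf(0.025) - d_meta / se)         # per-study power at d_meta (normal approx.)

expected = power.sum()
stat, p = chisquare([observed, k - observed], f_exp=[expected, k - expected])
print(f"expected {expected:.1f} significant results, observed {observed}; chi-square p = {p:.3f}")
```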
Schimmack (2012) proposed an alternative approach, called the Incredibility Index, that does not assume a fixed effect size across studies. The first step is to compute observed power for each study. The second step is to compute the average of these observed power estimates, which is then used as an estimate of the mean unconditional power. The final step is to compute the binomial probability of obtaining as many or more significant results as were observed, given the estimated unconditional power. Schimmack (2012) showed that this approach avoids some of the problems of TES when effect sizes are heterogeneous. Thus, it is likely that the Incredibility Index produces fewer false positives than TES.
Like TES, the incredibility index has low power to detect bias because bias inflates observed power. Thus, the expected discovery rate is inflated, which makes it a conservative test of bias. Schimmack (2016) proposed a solution to this problem. As the inflation in the expected discovery rate is correlated with the amount of bias, the discrepancy between the observed and expected discovery rate indexes inflation. Thus, it is possible to correct the estimated discovery rate by the amount of observed inflation. For example, if the expected discovery rate is 70% and the observed discovery rate is 90%, the inflation is 20 percentage points. This inflation can be deducted from the expected discovery rate to get a less biased estimate of the unconditional mean power. In this example, this would be 70% – 20% = 50%. This inflation-adjusted estimate is called the Replicability Index. Although the Replicability Index risks a higher type-I error rate than the Incredibility Index, it may be more powerful and have a better type-I error control than TES.
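A sketch of both indices, applied to hypothetical p-values rather than the simulated data, makes the computations concrete.

```python
# A sketch of the Incredibility Index and the Replicability Index as described
# above, applied to hypothetical two-sided p-values (not the simulation data).
import numpy as np
from scipy.stats import norm, binom

p_values = np.array([.041, .032, .018, .049, .009, .024, .038, .012, .045, .028])
alpha = 0.05

z = norm.isf(p_values / 2)
obs_power = norm.sf(norm.isf(alpha / 2) - z)   # observed power of each study

mean_power = obs_power.mean()                  # estimate of mean unconditional power
k = len(p_values)
n_sig = int(np.sum(p_values <= alpha))         # here all 10 results are significant

# Incredibility Index: probability of at least n_sig significant results
# if the mean power estimate were correct.
ici_p = binom.sf(n_sig - 1, k, mean_power)

# Replicability Index: deduct the inflation (observed minus expected discovery rate).
odr = n_sig / k
inflation = odr - mean_power
r_index = mean_power - inflation
print(f"mean observed power = {mean_power:.2f}, ICI p = {ici_p:.4f}, R-Index = {r_index:.2f}")
```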
To test these hypotheses, I conducted some simulation studies that compared the performance of four bias detection methods. The Test of Insufficient Variance (TIVA; Schimmack, 2015) was included because it has good power with homogeneous data (Renkewitz & Keiner, 2019). The other three tests were TES, ICI, and RI.
Selection bias was simulated with probabilities of 0, .1, .2, and 1. A selection probability of 0 implies that non-significant results are never published. A selection probability of .1 implies that there is a 10% chance that a non-significant result is published when it is observed. Finally, a selection probability of 1 implies that there is no bias and all non-significant results are published.
Effect sizes varied from 0 to .6. Heterogeneity was simulated with a normal distribution with SDs ranging from 0 to .6. Sample sizes were simulated by drawing from uniform distributions with a minimum of 20 and maximums of 40, 100, or 200. The number of studies in a meta-analysis was 5, 10, 20, or 30. The focus was on small sets of studies because the power to detect bias increases with the number of studies and was often close to 100% with k = 30.
Each condition was simulated 100 times and the percentage of significant results with alpha = .10 (one-tailed) was used to compute power and type-I error rates.
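For readers who want to reproduce the general pattern, here is a compact sketch of a single cell of this design. It is my own re-implementation for illustration, not the original simulation code.

```python
# A compact sketch of one cell of the simulation design described above:
# heterogeneous true effects, uniform sample sizes, and probabilistic
# suppression of non-significant results (one-sided selection).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2020)
k, mean_d, sd_d, n_min, n_max, p_select = 10_000, 0.2, 0.2, 20, 40, 0.1

d_true = rng.normal(mean_d, sd_d, size=k)
n = rng.integers(n_min, n_max + 1, size=k)

d_obs, published_sig = [], []
for d, n_i in zip(d_true, n):
    x = rng.normal(0, 1, n_i)
    y = rng.normal(d, 1, n_i)
    t, p = ttest_ind(y, x)
    d_hat = y.mean() - x.mean()              # population SDs are 1, so this approximates d
    significant = (p < .05) and (t > 0)      # selection respects the direction of the effect
    if significant or rng.random() < p_select:
        d_obs.append(d_hat)
        published_sig.append(significant)

d_obs = np.array(d_obs)
print(f"mean true d = {d_true.mean():.2f}, mean published d = {d_obs.mean():.2f}")
print(f"observed discovery rate among published studies = {np.mean(published_sig):.2f}")
```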
Figure 1 shows a plot of the mean observed d-scores as a function of the mean population d-scores. In situations without heterogeneity, mean population d-scores corresponded to the simulated values of d = 0 to d = .6. However, with heterogeneity, mean population d-scores varied due to sampling from the normal distribution of population effect sizes.
The figure shows that bias could be negative or positive, but that overestimation is much more common than underestimation. Underestimation was most likely when the population effect size was 0, there was no variability (SD = 0), and there was no selection for significance. With complete selection for significance, bias always overestimated population effect sizes because selection was simulated to be one-sided; meta-analyses rarely show many significant results in both directions.
An Analysis of Variance (ANOVA) with number of studies (k), mean population effect size (mpd), heterogeneity of population effect sizes (SD), range of sample sizes (Nmax), and selection bias (sel.bias) showed a four-way interaction, t = 3.70. This four-way interaction qualified main effects showing that bias decreased with effect sizes (d), heterogeneity (SD), and the range of sample sizes (N), and increased with the severity of selection bias (sel.bias).
The effect of selection bias is straightforward: effect size estimates are unbiased when there is no selection bias, and bias increases with the severity of selection. Figure 2 therefore illustrates the three-way interaction of the remaining factors under the most extreme selection bias, that is, when all non-significant results are suppressed.
The most dramatic inflation of effect sizes occurs when sample sizes are small (N = 20-40), the mean population effect size is zero, and there is no heterogeneity (light blue bars). This condition simulates a meta-analysis in which the null-hypothesis is true. Inflation is reduced, but still considerable (d = .42), when the population effect size is large (d = .6). Heterogeneity reduces bias because it increases the mean population effect size. However, even with d = .6 and heterogeneity, small samples still produce estimates that are inflated by d = .25 (dark red). Increasing sample sizes (N = 20 to 200) reduces inflation considerably. With d = 0 and SD = 0, inflation remains considerable, d = .52, but all other conditions show negligible inflation, d < .10.
As sample sizes are known, they provide some valuable information about the presence of bias in a meta-analysis. If studies with large samples are available, it is reasonable to limit a meta-analysis to the larger and more trustworthy studies (Stanley, Jarrell, & Doucouliagos, 2010).
If all results are published, there is no selection bias and effect size estimates are unbiased. When studies are selected for significance, the amount of bias is a function of the number of studies with non-significant results that are suppressed. When all non-significant results are suppressed, the amount of selection bias depends on the mean power of the studies before selection for significance, which is reflected in the discovery rate (i.e., the percentage of studies with significant results). Figure 3 shows the discovery rates for the same conditions that were used in Figure 2. The discovery rate is lowest when the null-hypothesis is true. In this case, only 2.5% of studies produce significant results that are published; the percentage is 2.5% rather than 5% because selection also takes the direction of the effect into account. Smaller samples (left side) have lower discovery rates than larger samples (right side) because larger samples have more power to produce significant results. Studies with larger effect sizes have higher discovery rates than studies with small effect sizes because larger effect sizes increase power. Finally, more variability in effect sizes also increases discovery rates because it raises the mean population effect size and thus power.
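These discovery rates follow directly from standard power calculations. As a rough sketch (my own normal-approximation illustration for a two-group design, not the simulation code used here):

```python
import numpy as np
from scipy import stats

def discovery_rate(d, n_per_group, alpha=0.05):
    """Approximate probability of a significant result in the predicted direction
    for a two-group comparison with population effect size d."""
    ncp = d * np.sqrt(n_per_group / 2)        # approximate noncentrality of the test statistic
    z_crit = stats.norm.ppf(1 - alpha / 2)    # two-tailed critical value
    return 1 - stats.norm.cdf(z_crit - ncp)   # only effects in the predicted direction count

print(round(discovery_rate(0.0, 30), 3))  # ~0.025 when the null hypothesis is true
print(round(discovery_rate(0.6, 30), 3))  # much higher with a larger population effect size
```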
In conclusion, the amount of selection bias and the amount of inflation of effect sizes vary across conditions as a function of effect sizes, sample sizes, heterogeneity, and the severity of selection bias. The factorial design covers a wide range of these conditions. A good bias detection method should have high power to detect bias across all conditions with selection bias and low type-I error rates across conditions without selection bias.
Overall Performance of Bias Detection Methods
Figure 4 shows the overall results for 235,200 simulations across a wide range of conditions. The results replicate Renkewitz and Keiner’s finding that TES produces more type-I errors than the other methods, although the average type-I error rate is below the nominal level of alpha = .10. The error rate of the Incredibility Index is practically zero, indicating that it is much more conservative than TES. This improvement in type-I error control does not come at the cost of lower power: TES and ICI have the same level of power. This finding shows that computing observed power for each individual study is superior to assuming a fixed effect size across studies. More important, the best-performing method is the Replicability Index (RI), which has considerably more power because it corrects for the inflation in observed power that is introduced by selection for significance. This is a promising result because one limitation of the bias tests examined by Renkewitz and Keiner was their low power to detect selection bias across a wide range of realistic scenarios.
Logistic regression analyses for power showed significant five-way interactions for TES, IC, and RI. For TIVA, two four-way interactions were significant. For type-I error rates, no four-way interactions were significant, but at least one three-way interaction was significant. These analyses show that performance varies systematically and in a rather complex manner across the simulated conditions. The following sections therefore examine the performance of the four methods in specific conditions.
Number of Studies (k)
Detection of bias is a function of the amount of bias and the number of studies. With small sets of studies (k = 5), it is difficult to detect bias. In addition, low power can suppress false-positive rates because a significant bias-test result is even less likely in the absence of selection bias than in its presence. Thus, it is important to examine the influence of the number of studies on power and false-positive rates.
Figure 5 shows the results for power. TIVA does not gain much power as the number of studies increases. The other three methods clearly become more powerful with more studies. However, only the R-Index shows good power with twenty studies and still acceptable power with just 10 studies; with 10 studies, the R-Index already reaches a level of power that TES and ICI only achieve with larger sets of studies.
Figure 6 shows the results for the type-I error rates. Most important, the high power of the R-Index is not achieved by inflating type-I error rates, which remain well below the nominal level of .10. A comparison of TES and ICI shows that ICI controls the type-I error rate much better than TES. TES even exceeds the nominal level of .10 with 30 studies, and this problem can be expected to worsen as the number of studies increases.
Selection Rate
Renkewitz and Keiner noted that power decreases when there is a small probability that non-significant results are published. To simplify the presentation of the effect of selection rates, I focused on the condition with k = 30 studies, which gives all methods the maximum power to detect selection bias. Figure 7 confirms that power to detect bias deteriorates when some non-significant results are published. However, the influence of the selection rate varies across methods. TIVA is only useful when exclusively significant results are published, and even TES and ICI have only modest power when the probability that a non-significant result is published is just 10%. Only the R-Index retains good power; its power with a 20% publication rate for non-significant results is still higher than the power of TES and ICI with a 10% rate.
Population Mean Effect Size
With complete selection bias (no non-significant results are published), power showed ceiling effects. Thus, I used k = 10 to illustrate the effect of population effect sizes on power and type-I error rates (Figure 8).
In general, power decreased as the mean population effect size increased because higher discovery rates leave less room for selection. Power quickly dropped to unacceptable levels (< 50%) for all methods except the R-Index, which maintained good power even with the maximum effect size of d = .6.
Figure 9 shows that the good power of the R-Index is not achieved by inflating type-I error rates. The type-I error rate is well below the nominal level of .10. In contrast, TES exceeds the nominal level with d = .6.
Variability in Population Effect Sizes
I next examined the influence of heterogeneity in population effect sizes on power and type-I error rates. The results in Figure 10 show that heterogeneity decreases power for all methods. However, the effect is much less severe for the RI than for the other methods; even with maximum heterogeneity, the RI has good power to detect publication bias.
Figure 11 shows that the high power of the RI is not achieved by inflating type-I error rates. The only method with a high error rate is TES under high heterogeneity.
Variability in Sample Sizes
With a wider range of sample sizes, average power increases. With higher power, the discovery rate increases and there is less selection for significance, which in turn reduces the power to detect it. This trend is visible in Figure 12. Even with sample sizes ranging from 20 to 100, TIVA, TES, and IC have only modest power to detect bias. In contrast, RI maintains good power even when sample sizes range from 20 to 200.
Once more, only TES shows problems with the type-I error rate when heterogeneity is high (Figure 13). Thus, the high power of RI is not achieved by inflating type-I error rates.
The following analyses examined the RI’s performance more closely. The effect of selection bias is self-evident: as more non-significant results are published, power to detect bias decreases, but so does the bias itself. I therefore focus on the unfortunately still realistic scenario in which only significant results are published, and on the condition with the widest range of sample sizes (N = 20 to 200) because it has the lowest power to detect bias. I picked the lowest and highest levels of population effect sizes and of variability to illustrate the effect of these factors on power and type-I error rates, and I present results for all four set sizes.
The results for power show that with only 5 studies, bias can be detected with good power only if the null-hypothesis is true; heterogeneity or large effect sizes produce unacceptably low power. This means that bias tests for small sets of studies are asymmetric: positive results strongly indicate severe bias, but negative results are inconclusive. With 10 studies, power is acceptable for homogeneous, large effect sizes as well as for heterogeneous, small effect sizes, but not for large effect sizes combined with high heterogeneity. With 20 or more studies, power is good for all scenarios.
The results for the type-I error rates reveal one scenario with dramatically inflated type-I error rates, namely meta-analyses of studies with a large population effect size and no heterogeneity in population effect sizes.
This high type-I error rate is limited to cases with high power, where the inflation correction over-corrects. A solution to this problem follows from the fact that inflation is a non-linear function of power. With unconditional power of .05, selection for significance inflates observed power to .50, a 10-fold increase; power of .50, however, is only inflated to .75, a 50% increase. Thus, I modified the R-Index formula and made the inflation adjustment contingent on the observed discovery rate.
RI2 = Mean Observed Power – (Observed Discovery Rate – Mean Observed Power) × (1 – Observed Discovery Rate). This version of the R-Index has less power than the original version, although its power is still superior to that of the IC.
It also fixed the type-I error problem, at least for sets of up to 30 studies.
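A direct transcription of both formulas (the function names and example values are mine, purely for illustration):

```python
def ri1(mean_observed_power, observed_discovery_rate):
    """Original Replicability Index: subtract the full inflation."""
    inflation = observed_discovery_rate - mean_observed_power
    return mean_observed_power - inflation

def ri2(mean_observed_power, observed_discovery_rate):
    """Modified index: the inflation adjustment is scaled by (1 - observed discovery rate)."""
    inflation = observed_discovery_rate - mean_observed_power
    return mean_observed_power - inflation * (1 - observed_discovery_rate)

# Hypothetical input: mean observed power = .70, observed discovery rate = .90
print(round(ri1(0.70, 0.90), 2))  # 0.5 (full correction)
print(round(ri2(0.70, 0.90), 2))  # 0.68 (reduced correction)
```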
Example 1: Bem (2011)
Bem’s (2011) sensational and deeply flawed article triggered the replication crisis and the search for bias-detection tools (Francis, 2012; Schimmack, 2012). Table 1 shows that all tests indicate that Bem used questionable research practices to produce significant results in 9 out of 10 tests. This is confirmed by examination of his original data (Schimmack, 2018). For example, for one study, Bem combined results from four smaller samples with non-significant results into one sample with a significant result. The results also show that both versions of the Replicability Index are more powerful than the other tests.
Example 2: Francis (2014) Audit of Psychological Science
Francis audited multiple-study articles in the journal Psychological Science from 2009 to 2012. The main problem with focusing on single articles is that they often contain relatively few studies, and the simulation studies showed that bias tests tend to have low power when 5 or fewer studies are available (Renkewitz & Keiner, 2019). Nevertheless, Francis found that 82% of the investigated articles showed signs of bias, p < .10. This rate seems very high given the low power of TES in the simulation studies. It would mean that selection bias in these articles was very high and that the power of the studies was extremely low and homogeneous, which provides the ideal conditions for detecting bias. However, the high type-I error rates of TES under some conditions may have produced more false-positive results than the nominal level of .10 suggests. Moreover, Francis (2014) modified TES in ways that may have further increased the risk of false positives. Thus, it is interesting to reexamine the 44 articles with other bias tests. Unlike Francis, I coded one focal hypothesis test per study.
I then applied the bias detection methods. Table 2 shows the p-values.
The audited articles were: Anderson, Kraus, Galinsky, & Keltner; Bauer, Wilkie, Kim, & Bodenhausen; Birtel & Crisp; Converse & Fishbach; Converse, Risen, & Carter Karmic; Keysar, Hayakawa, &; Leung et al.; Rounding, Lee, Jacobson, & Ji; Savani & Rattan; van Boxtel & Koch; Evans, Horowitz, & Wolfe; Inesi, Botti, Dubois, Rucker, & Galinsky; Nordgren, Morris McDonnell, & Loewenstein; Savani, Stephens, & Markus; Todd, Hanko, Galinsky, & Mussweiler; Tuk, Trampe, & Warlop; Balcetis & Dunning; Bowles & Gelfand; Damisch, Stoberock, & Mussweiler; de Hevia & Spelke; Ersner-Hershfield, Galinsky, Kray, & King; Gao, McCarthy, & Scholl; Lammers, Stapel, & Galinsky; Li, Wei, & Soman; Maddux et al.; McGraw & Warren; Sackett, Meyvis, Nelson, Converse, & Sackett; Savani, Markus, Naidu, Kumar, & Berlia; Senay, Albarracín, & Noguchi; West, Anderson, Bedwell, & Pratt; Alter & Oppenheimer; Ashton-James, Maddux, Galinsky, & Chartrand; Fast & Chen; Fast, Gruenfeld, Sivanathan, & Galinsky; Garcia & Tor; González & McLennan; Hahn, Close, & Graf; Hart & Albarracín; Janssen & Caramazza; Jostmann, Lakens, & Schubert; Labroo, Lambotte, & Zhang; Nordgren, van Harreveld, & van der Pligt; Wakslak & Trope; Zhou, Vohs, & Baumeister.
The Figure shows the percentage of significant results for the various methods. The results confirm that despite the small number of studies, the majority of multiple-study articles show significant evidence of bias. Although statistical significance does not speak directly to effect sizes, the fact that these tests were significant with a small set of studies implies that the amount of bias is large. This is also confirmed by a z-curve analysis that provides an estimate of the average bias across all studies (Schimmack, 2019).
A comparison of the methods with real data shows that the R-Index (RI1) is the most powerful method, even more powerful than Francis’s approach, which used multiple hypothesis tests per study. The good performance of TIVA suggests that the population effect sizes are rather homogeneous, as TIVA has low power with heterogeneous data. The Incredibility Index performs worst because of its ultra-conservative type-I error rate. The most important finding is that the R-Index can be used with small sets of studies to demonstrate moderate to large bias.
Discussion
In 2012, I introduced the Incredibility Index as a statistical tool to reveal selection bias, that is, the selection of published results for significance from a larger number of results. I compared the IC with TES and pointed out some advantages of averaging power rather than effect sizes, but I did not present extensive simulation studies comparing the performance of the two tests. In 2014, I introduced the Replicability Index to predict the outcome of replication studies; it corrects for the inflation of observed power when selection for significance is present. At the time, I did not think of the RI as a bias test.
Renkewitz and Keiner (2019) demonstrated that TES has low power and inflated type-I error rates. Here I examined whether the IC performs better than TES and found that it does. Most important, it has much more conservative type-I error rates, even with extreme heterogeneity, because selection for significance inflates the observed power that is used to compute the expected percentage of significant results. This led me to examine whether the bias correction used to compute the Replicability Index can boost power while maintaining acceptable type-I error rates. The present results show that this is the case for a wide range of scenarios. The only exception is meta-analyses of studies with a large population effect size and low heterogeneity in effect sizes. To avoid this problem, I created an alternative R-Index that reduces the inflation adjustment as a function of the percentage of non-significant results that are reported. I showed that the R-Index is a powerful tool that detects bias in Bem’s (2011) article and in a large number of multiple-study articles published in Psychological Science.
In conclusion, the Replicability Index is the most powerful test for the presence of selection bias, and it should be routinely used in meta-analyses to ensure that effect size estimates are not inflated by the selective publication of significant results. As the use of questionable research practices is no longer acceptable, the R-Index can be used by editors to triage manuscripts with questionable results or to ask for a new, pre-registered, well-powered additional study. The R-Index can also be used in tenure and promotion evaluations to reward researchers who publish credible results that are likely to replicate.
References
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials: Journal of the Society for Clinical Trials, 4, 245–253. https://doi.org/10.1177/1740774507079441
Renkewitz, F., & Keiner, M. (2019). How to detect publication bias in psychological research: A comparative evaluation of six statistical methods. Zeitschrift für Psychologie, 227, 261–279. https://doi.org/10.1027/2151-2604/a000386
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. https://doi.org/10.1037/a0029487
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.