Category Archives: Replication Crisis

Heterogeneity in the Replicability of Psychological and Social Sciences

Concerns about research credibility have stimulated the growth of meta-science, a field that examines the reproducibility, robustness, and replicability of scientific findings (Ioannidis, 2005; Munafò et al., 2017). This literature has documented publication bias, low statistical power, inflated effect size estimates, and disappointing replication rates in some areas of research (Button et al., 2013; Ioannidis, 2005; Open Science Collaboration, 2015; Tyner et al., 2026). Initial studies focused on psychology and neuroscience, but a recent article suggested that the problems are more general: Tyner et al. (2026) reported that only about 50% of originally significant claims were successfully replicated.

A replication rate of 50% invites different interpretations. An optimistic interpretation is that most original studies detected effects in the correct direction, but that the average probability of obtaining another significant result in a new sample was only about 50%. In this scenario, selective publication of significant results inflates observed effect sizes, so replication studies often fail even when the original studies were not false positives. Many of the failures are therefore false negatives. A pessimistic interpretation is that many original results were false positives, whereas the remaining studies examined true effects with high power. In that case, the same 50% replication rate could arise from a mixture of null effects and highly powered true effects. Thus, the average replication rate alone is consistent with very different underlying realities.
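The optimistic scenario can be made concrete with a small simulation. The numbers below are illustrative assumptions, not estimates from the replication projects: every study tests a true effect with exactly 50% power, so its z-statistic is drawn from N(1.96, 1).

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 1.96                 # true noncentrality: exactly 50% power at alpha = .05
z_orig = rng.normal(delta, 1, 100_000)

# Selection for significance: only z > 1.96 reaches print.
published = z_orig[z_orig > 1.96]

# Observed evidence among published studies is inflated relative to delta ...
print(published.mean())           # ~2.76, well above the true 1.96

# ... but an exact replication is a fresh draw, so it succeeds at the TRUE power.
z_rep = rng.normal(delta, 1, published.size)
print((z_rep > 1.96).mean())      # ~0.50: half the replications "fail"
```

In this toy world every original finding is a true positive, yet half of the replications come back non-significant: exactly the pattern the optimistic interpretation describes.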

To move beyond average replication rates, it is necessary to avoid reducing results to a dichotomy of significant versus non-significant. A cutoff at z = 1.96 is useful for decision making, but it discards quantitative information about the strength of evidence. A result with z = 6 provides much stronger evidence for a positive effect than a result with z = 2, just as z = -6 provides much stronger evidence for a negative effect than z = -2. This point is straightforward, but broad evaluations of replication outcomes have largely ignored differences in original evidential strength.

I used z-curve to examine heterogeneity in the strength of evidence across the original significant findings included in the two large replication projects (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022). Z-curve uses the distribution of significant z-values and corrects for the inflation in observed test statistics introduced by selection for significance. It provides two key estimates. The first is the Expected Replication Rate (ERR), which is the average probability that a significant result would be significant again in an exact replication with a new sample of the same size. The second is the Expected Discovery Rate (EDR), which is the estimated proportion of all studies, including unpublished non-significant ones, that would be expected to yield a significant result.

The EDR can be used to evaluate publication bias and to derive an upper bound on the false discovery rate using Sorić’s (1989) formula. Performance of z-curve has been examined in extensive simulation studies, which show that its 95% confidence intervals perform well when at least 100 significant results are available (Bartoš & Schimmack, 2022). Because z-curve is designed to accommodate heterogeneity in evidential strength, it is especially suitable for a diverse set of studies such as those included in the replication projects. Previous applications have shown substantial variation in ERR and EDR across research areas (Schimmack, 2020; Schimmack & Bartoš, 2023; Soto & Schimmack, 2024; Credé & Sotola, 2024; Sotola, 2022, 2024). One limitation of previous applications is that they sometimes relied on automatically extracted p-values or focused on specific literatures. The replication projects provide gold-standard test statistics from a representative sample of social science research, avoiding both concerns. This makes it possible to examine heterogeneity in replicability across a broad range of research areas.

All original studies in the two replication projects were eligible for inclusion. For articles with multiple claims, the focal claim was identified from the abstract using a large language model (see OSF for details and cross-validation). When exact p-values were not reported in the project materials, the original articles were consulted to recover the necessary information. Articles without exact p-values were excluded. Original studies that claimed an effect without meeting the conventional significance threshold of p < .05 were also excluded. A small number of studies were further excluded because the replication reports did not provide sufficient information to evaluate the replication outcome. This screening process yielded k = 222 significant results (k1 = 88, k2 = 134), including k = 130 from psychology and k = 92 from other social sciences. The replication rate in this subset was similar to that in the full set of studies: 43% overall (project 1: 33%, project 2: 49%; psychology: 37%; other social sciences: 51%; see OSF for details). Figure 1 shows the z-curve analysis of these 222 original significant results.

The most striking result is that the expected replication rate (ERR) is substantially higher than the observed replication rate in the replication studies (68% versus 42%). Even the lower bound of the 95% confidence interval for the ERR, 59%, exceeds the observed replication rate. This discrepancy is especially noteworthy because the replication studies often used larger sample sizes than the original studies, which should have increased, not decreased, the probability of obtaining a significant result. Thus, the lower effect sizes observed in the replication studies cannot be attributed to regression to the mean alone. An additional factor appears to be that population effect sizes in the replication studies were systematically smaller than in the original studies.

Z-curve also limits the range of scenarios that are compatible with the data. The estimated EDR of 48% implies that no more than 6% of the significant results can be false positives (Sorić, 1989). Even the lower limit of the EDR confidence interval, 17%, limits the false positive rate to no more than 26%. With roughly half of the replications failing, this implies that at most about half of the replication failures can be false positives. This finding shows the importance of distinguishing clearly between replication rates and false positive rates (Maxwell et al., 2015).
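Sorić’s bound follows directly from the discovery rate and the significance criterion; a minimal sketch that reproduces the figures above:

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate,
    given a discovery rate `edr` and significance criterion `alpha`."""
    return ((1 - edr) / edr) * (alpha / (1 - alpha))

print(round(soric_fdr(0.48), 2))   # 0.06: bound at the EDR point estimate
print(round(soric_fdr(0.17), 2))   # 0.26: bound at the EDR's lower CI limit
```

The alpha term also shows why tightening the significance criterion shrinks the bound faster than it shrinks the discovery rate: alpha/(1 - alpha) drops by a factor of about ten when moving from .05 to .005.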

The false positive risk also varies as a function of the significance criterion. Marginally significant results are more likely to be false positives than results with high z-values (Benjamin et al., 2018). Z-curve makes it possible to address Benjamini and Hechtlinger’s (2014) call to control, rather than merely estimate, the science-wise false discovery rate. A stricter alpha criterion reduces the discovery rate, but it reduces the false discovery rate more. Benjamin et al. (2018) suggested reducing the false positive risk by lowering the significance criterion to alpha = .005. A z-curve analysis with this criterion estimated the FDR at 2% and the upper limit of the 95% CI was 6%. This finding provides empirical support for Benjamin et al.’s (2018) suggestion. It also addresses Lakens et al.’s (2018) concern that alpha levels should be justified. Here the strength of evidence provides the justification. In other literatures, alpha = .01 is sufficient to keep the FDR below 5% (Schimmack & Bartoš, 2023; Soto & Schimmack, 2024), but sometimes even alpha = .001 is insufficient to control false positives (Chen et al., 2025; Schimmack, 2025).

Heterogeneity in strength of evidence also makes it possible to predict replication outcomes as a function of z-values. Figure 1 shows power estimates for z-value intervals below the x-axis. Expected replication rates increase from 54% for just-significant results to over 90% for z-values greater than 5. Another 36 results have z-values greater than 6 and are practically guaranteed to replicate in exact replication studies. Figure 2 shows the expected and observed replication rates for z-value ranges.
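Setting aside the selection-induced inflation that z-curve corrects for, the probability that an exact replication is significant again has a closed form under the standard normal model for z-statistics. The sketch below takes the true (not the observed) noncentrality as input; plugging in an observed z-value would overstate replicability, which is precisely why the z-curve correction is needed.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function (stdlib only)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def exact_replication_prob(delta, z_crit=1.96):
    """P(|Z| > z_crit) for a replication z-statistic Z ~ N(delta, 1),
    where delta is the TRUE noncentrality of the effect."""
    return (1.0 - phi(z_crit - delta)) + phi(-z_crit - delta)

print(round(exact_replication_prob(1.96), 3))  # 0.5: a just-significant true effect
print(round(exact_replication_prob(5.0), 3))   # 0.999: practically guaranteed
```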

Studies with modest evidence (z = 2 to 3.5) replicate at significantly lower rates than expected based on z-curve. As expected, replication rates increase with stronger evidence. Given the small number of observations per bin, it is not possible to test whether z-curve predictions remain too optimistic at moderate z-values. The most surprising finding is that observed replication rates for studies with strong evidence (z > 6) fall below the expected rate.

In exploratory analyses, I examined possible reasons for these surprising replication failures. I used two large language models (ChatGPT and Claude) to score the replication reports of studies with strong original evidence (z > 6). Studies were coded on five dimensions (match of populations, materials, design, time period, and implementation) with scores from 0 to 2 each, producing total scores ranging from 0 to 10. Inter-rater agreement for the total scores was high, ICC(A,1) = .85, 95% CI [.73, .92]. I averaged the two scores and used a total of 7 or higher as the criterion for a close match. Of the 24 close replications, 21 were successful (88%). Of the 12 studies that were not close replications, only 6 were successful (50%).
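The agreement statistic reported here, ICC(A,1) in McGraw and Wong’s notation (two-way model, absolute agreement, single rater), can be computed from the n × k matrix of total scores. A self-contained sketch; the toy ratings below are made up for illustration, not the actual LLM scores:

```python
import numpy as np

def icc_a1(x):
    """ICC(A,1): two-way model, absolute agreement, single rater.
    x is an (n subjects x k raters) array of ratings."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)                                   # per subject
    col_means = x.mean(axis=0)                                   # per rater
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)         # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)         # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))               # residual
    return (msr - mse) / (msr + (k - 1) * mse + (k / n) * (msc - mse))

# Hypothetical example: two raters who disagree by a constant one point.
ratings = [[1, 2], [2, 3], [3, 4], [4, 5]]
print(round(icc_a1(ratings), 3))  # 0.769: absolute agreement penalizes the offset
```

Because ICC(A,1) measures absolute agreement, a rater who is consistently one point more generous lowers the coefficient even though the rank order is identical.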

I further examined the three close replications that failed. Although Farris et al. (2008) was closely matched in many respects, the original participants were from the US and the replication was conducted in the UK. Subsequent studies have replicated the finding with US samples (Farris et al., 2009/2010; Treat et al., 2017), ruling out a simple false positive explanation. The replication failure of Hurst and Kavanagh (2017) likely reflects a sampling problem in the original study: participants from the general population and users of community mental health services were pooled in a single analysis, which can inflate effect sizes (Preacher et al., 2005). McDevitt examined whether plumbing businesses whose names start with a number or the letter A benefit from being listed first in the yellow pages. A replication in 2020 could not reproduce this effect because Google searches have replaced the yellow pages.

While these exploratory results are based on a small sample, they support the broader claim that original results with strong evidence (z > 6) are likely to replicate in close replications and that failures may stem from meaningful differences in study design.

Conclusion

Z-curve analysis of two major replication projects reveals that replicability in the social sciences is not a single number. The expected replication rate based on the strength of original evidence (68%) substantially exceeds the observed replication rate (42%), indicating that effect size shrinkage beyond statistical regression to the mean contributes to replication failures. The false discovery rate is low (at most 6%), indicating that most replication failures reflect reduced effect sizes rather than false positives. Adjusting the significance criterion to alpha = .005 reduces the estimated false discovery rate to 2%.

The most practically useful finding is that original results with strong evidence (z > 6) are highly replicable when the replication closely matches the original study design (88% success rate). Replication failures among these strong results were attributable to identifiable differences between the original and replication studies — different populations, changed market conditions, or heterogeneous samples. This suggests that the strength of statistical evidence, combined with methodological similarity, is a reliable predictor of replication success.

These findings argue against treating all significant results as equally credible and against interpreting average replication rates as informative about any particular study. Replicability is predictable from information already available in the original publication.

“Valid Replications Require Valid Methods—And Originals Don’t?”

Harmon-Jones, E., Harmon-Jones, C., Amodio, D. M., Gable, P. A., & Schmeichel, B. J. (2025). Valid replications require valid methods: Recommendations for best methodological practices with lab experiments. Motivation Science, 11(3), 235–245.

“Far from over.” (Frank Wang, tennis buddy when he is down 2:5)

The replication crisis shook social psychology in the 2010s. Heated debates—often on social media—divided critics, reformers, and defenders of the published record. The heat has cooled, but the crisis is far from over. The central empirical problems remain: unusually high rates of statistically significant results in journals, implausible success rates given typical power, and repeated failures to reproduce headline findings under rigorous conditions.

A striking pattern in parts of the methodological commentary that followed is explanatory asymmetry. Replication failures are readily attributed to contextual factors, subtle procedural differences, or “messy methods,” while the same standards are not applied with equal force to original studies. If minor contextual differences can wipe out an effect, then original results should also be unstable—yet the published record historically looks unnaturally successful. Any account that explains failure must also explain success.

There is also an ironic subtext: some of the strongest defenses of fragile effects come from researchers who study motivation and bias, yet methodological narratives can display their own motivated reasoning—favoring interpretations that protect prior conclusions. None of this requires imputing bad faith. It is enough to recognize that professional stakes and identity can shape what kinds of explanations feel plausible.

To avoid my own biases, I asked ChatGPT to evaluate bias in this article. More importantly, ChatGPT also provided an explanation for the rating.

Bias Evaluation

Harmon-Jones et al. (2025) argue that many replication failures in motivation and emotion research arise not from invalid theories or false positives, but from “messy methods.” They provide extensive practical recommendations regarding laboratory setup, experimenter behavior, manipulation strength, measurement sensitivity, replication design, data management, and statistical interpretation. The article is methodologically rich and offers useful guidance for improving internal validity in lab experiments.

However, when situated within the broader replication debate, the paper exhibits a consistent asymmetry in explanatory framing. On a scale from –10 (strongly defensive of existing literature) to +10 (strongly skeptical that most results are true), this article falls around –4 to –5: moderately biased in defense of established findings.

The basis for this rating is outlined below.


  1. Core Contribution: Internal Validity Matters

The article’s strongest contribution is its detailed emphasis on internal validity. The authors correctly note that laboratory experiments are sensitive systems in which:

  • Subtle environmental cues may influence participant motivation.
  • Experimenter demeanor and appearance can affect outcomes.
  • Manipulations must be strong and construct-valid.
  • Dependent variables must be sensitive and properly timed.
  • Multilab projects introduce coordination risk.
  • Data handling errors can contaminate results.

These are real methodological concerns. The paper provides concrete, experience-based guidance that would likely improve experimental rigor if widely adopted. It is especially valuable as a practical resource for researchers conducting lab-based motivation studies.


  2. Asymmetry in Causal Attribution

The principal concern is not methodological advice but explanatory direction.

Replication failures are repeatedly attributed to:

  • Context sensitivity
  • Weak or improperly implemented manipulations
  • Insensitive measures
  • Experimenter variability
  • Procedural deviations in multilab collaborations
  • Data management errors

These are legitimate explanations in some cases. However, the article does not apply equivalent scrutiny to original studies.

There is little engagement with:

  • Publication bias
  • Inflated effect sizes
  • Researcher degrees of freedom
  • Selective reporting
  • Power deficiencies in original work
  • Theory elasticity

The explanatory burden for null replications is placed largely on replication implementation rather than on possible inflation or fragility in the original literature.

This directional asymmetry is what produces the defensive tilt.


  3. Context Sensitivity as a Buffer

The authors cite contextual sensitivity as a key explanation for replication variability. Conceptually, psychological effects can depend on time, culture, and population. However, the article treats contextual sensitivity as supporting evidence for interpreting replication failures, without addressing debate over the empirical robustness of this claim.

More importantly, the paper does not quantify how strong contextual sensitivity would need to be to account for large-scale null findings in well-powered, preregistered, multilab studies. If minor environmental differences are sufficient to eliminate effects, then those effects are fragile by definition. That implication is not confronted directly.


  4. Treatment of Ego Depletion

The article references the large preregistered multisite ego-depletion test (Vohs et al., 2021), which included proponents of the theory. Rather than interpreting the null results as evidence about true effect size, the authors emphasize coordination errors and deviations across labs.

While procedural complications can occur, the study was high-powered and preregistered. A pattern of near-zero effects across many sites cannot be explained solely by minor procedural noise without implying extreme fragility.

The possibility that the true effect is very small or nonexistent is not seriously engaged. This reinforces the asymmetry in explanatory weighting.


  5. The “Psychological Sledgehammer” Standard

The recommendation that manipulations should function like a “psychological sledgehammer” raises an additional issue. If only very strong manipulations count as valid tests, then many real-world operationalizations will be deemed insufficient. This narrows the acceptable domain of theory testing and increases the probability that null findings are attributed to weakness of implementation rather than limitations of theory.

That standard shifts the evidentiary burden in a way that implicitly protects established effects.


  6. The Excess Success Gap

A major omission concerns the historically high rate of statistically significant findings in psychology journals—often described as exceeding 90%.

If effects are highly context-sensitive and fragile, then original studies should also frequently fail. Minor variations in lab setup, experimenter behavior, and measurement sensitivity would generate many null outcomes. Yet the published literature overwhelmingly reports positive results.

These two claims cannot comfortably coexist without additional explanation.

There are only a few ways to reconcile fragile effects with excess success:

  1. Many effects are actually robust and high-powered.
  2. Publication bias and selective reporting filter out null results.
  3. Researchers iteratively tune operationalizations and analyses until significance is obtained.
  4. Journals selectively publish successful implementations.

The article does not engage this macro-level constraint. It does not integrate publication bias or excess-success analysis into its explanatory framework. As a result, replication failures are treated as methodologically suspect, while the structural inflation of original literatures remains largely unaddressed.

This omission materially strengthens the case for a defensive bias rating.


  7. What Prevents a More Extreme Rating

Despite these concerns, the article does not:

  • Deny p-hacking or questionable practices.
  • Reject replication as essential.
  • Claim that all replication failures are invalid.
  • Dismiss preregistration.
  • Attack statistical reform movements.

It offers constructive methodological advice and acknowledges complexity in statistical inference. The tilt is moderate, not extreme.


Overall Assessment

Harmon-Jones et al. provide valuable, concrete guidance on improving internal validity in laboratory research. Their emphasis on methodological nuance is important and often neglected in replication debates.

However, the paper consistently places greater explanatory weight on replication imperfections than on possible inflation or fragility in original findings. It does not reconcile its fragility narrative with the excess success of published psychology, nor does it engage deeply with quantitative evidence regarding effect size shrinkage and false discovery rates.

For these reasons, the article can be fairly characterized as moderately biased in defense of existing literature — approximately –4 to –5 on a –10 to +10 scale.

P.S. Why my bias rating would be more extreme

The most important cue for bias is that success rates over 90% in psychology journals have been documented repeatedly since Sterling (1959). An article that avoids this implausible result, which undermines the meaning of statistical significance, is likely downplaying the amount of selection bias in psychology. Insiders know that only significant results can be published. This unscientific incentive structure is the root cause of the replication crisis, not contextual sensitivity. Failure to mention Sterling is a red flag.

The second red flag is the citation of Van Bavel as a reference for contextual sensitivity. Van Bavel claimed to have shown that contextual sensitivity explains the lower replication rate in social psychology. However, Inbar showed that they did not present the critical test of an interaction and that this test was not significant. There is no evidence that contextual sensitivity contributes to low success rates in social psychology. Rather, social psychologists never ran direct replications and used contextual sensitivity to protect their theories from disconfirming evidence. Change something trivial and get significance again: great, the effect is robust. If not, clearly the effect was real before, just not in this context. Publish only the significant results and claim that the theory holds across time, place, and populations. This is how it was done, and it was wrong. Sadly, some social psychologists cannot simply say, sorry, we messed up, now let’s move on.

Once a p-hacker, always a p-hacker?

The 2010s have seen a replication crisis in social psychology (Schimmack, 2020). The main reason why it is difficult to replicate results from social psychology is that researchers used questionable research practices (QRPs, John et al., 2012) to produce more significant results than their low-powered designs warranted. A catchy term for these practices is p-hacking (Simonsohn, 2014).

New statistical techniques made it possible to examine whether published results were obtained with QRPs. In 2012, I used the incredibility index to show that Bem (2011) used QRPs to provide evidence for extrasensory perception (Schimmack, 2012). In the same article, I also suggested that Gailliot, Baumeister, DeWall, Maner, Plant, Tice, and Schmeichel (2007) used QRPs to present evidence suggesting that will-power relies on blood glucose levels. During the review process of my manuscript, Baumeister confirmed that QRPs were used (cf. Schimmack, 2014). He defended these practices on the grounds that they were the norm in social psychology and were not considered unethical.

The revelation that research practices were questionable casts a shadow on the history of social psychology. However, many also saw it as an opportunity to change and improve these practices (Świątkowski and Dompnier, 2017). Over the past decades, the evaluation of QRPs has changed. Many researchers now recognize that these practices inflate error rates, make published results difficult to replicate, and undermine the credibility of psychological science (Lindsay, 2019).

However, there are no general norms regarding these practices, and some researchers continue to use them (e.g., Adam D. Galinsky, cf. Schimmack, 2019). This makes it difficult for readers of the social psychological literature to judge which research can be trusted; the question has to be examined on a case-by-case basis. In this blog post, I examine the responses of Baumeister, Vohs, DeWall, and Schmeichel to the replication crisis and to concerns that their results provide false evidence about the causes of will-power (Friese, Loschelder, Gieseler, Frankenbach & Inzlicht, 2019; Inzlicht, 2016).

To examine this question scientifically, I use test-statistics that are automatically extracted from psychology journals. I divide the test-statistics into those that were obtained until 2012, when awareness about QRPs emerged, and those published after 2012. The test-statistics are examined using z-curve (Brunner & Schimmack, 2019; Bartos & Schimmack, 2020). Results provide information about the expected replication rate and discovery rate. The use of QRPs is examined by comparing the observed discovery rate (how many published results are significant) to the expected discovery rate (how many tests that were conducted produced significant results).

Roy F. Baumeister’s expected replication rate was 60% (95% CI: 53% to 67%) before 2012 and 65% (57% to 74%) after 2012. The overlap of the 95% confidence intervals indicates that this small increase is not statistically reliable. Before 2012, the observed discovery rate was 70%, and it dropped to 68% after 2012. Thus, there is no indication that non-significant results are reported more often after 2012. The expected discovery rate was 32% before 2012 and 25% after 2012. Thus, there is also no significant change in the expected discovery rate, and the expected discovery rate is much lower than the observed discovery rate. This discrepancy shows that QRPs were used both before and after 2012; the 95% confidence intervals do not overlap in either period, indicating that the discrepancy is statistically significant. Figure 1 shows the influence of QRPs when the observed non-significant results (histogram of z-scores below 1.96 in blue) are compared to the model prediction (grey curve). The discrepancy suggests a large file drawer of unreported statistical tests.
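The size of the implied file drawer can be sketched from the two rates. The calculation below uses the pre-2012 estimates above (ODR = 70%, EDR = 32%) and the simplifying assumption that every significant test is reported:

```python
def file_drawer_per_100_sig(odr, edr):
    """Unreported tests implied per 100 reported significant results,
    assuming all significant tests make it into print."""
    reported = 100 / odr      # tests that appear in journals
    conducted = 100 / edr     # tests actually run, per z-curve's EDR
    return conducted - reported

# Pre-2012 estimates from the analysis above: ODR = .70, EDR = .32.
print(round(file_drawer_per_100_sig(0.70, 0.32)))  # 170 unreported tests
```

In other words, for every 100 published significant results, roughly 143 tests appear in print but about 313 were conducted, leaving around 170 in the file drawer.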

An old saying is that you can’t teach an old dog new tricks. So, the more interesting question is whether the younger contributors to the glucose paper changed their research practices.

The results for C. Nathan DeWall show no notable response to the replication crisis (Figure 2). The expected replication rate increased slightly from 61% to 65%, but the difference is not significant and visual inspection of the plots suggests that it is mostly due to a decrease in reporting p-values just below .05. One reason for this might be a new goal to p-hack at least to the level of .025 to avoid detection of p-hacking by p-curve analysis. The observed discovery rate is practically unchanged from 68% to 69%. The expected discovery rate increased only slightly from 28% to 35%, but the difference is not significant. More important, the expected discovery rates are significantly lower than the observed discovery rates before and after 2012. Thus, there is evidence that DeWall used questionable research practices before and after 2012, and there is no evidence that he changed his research practices.

The results for Brandon J. Schmeichel are even more discouraging (Figure 3). Here the expected replication rate decreased from 70% to 56%, although this decrease is not statistically significant. The observed discovery rate decreased significantly from 74% to 63%, which shows that more non-significant results are reported. Visual inspection shows that this is particularly the case for test statistics close to zero. Further inspection of the articles would be needed to see how these results are interpreted. More important, the expected discovery rates are significantly lower than the observed discovery rates before 2012 and after 2012. Thus, there is evidence that QRPs were used before and after 2012 to produce significant results. Overall, there is no evidence that research practices changed in response to the replication crisis.

The results for Kathleen D. Vohs also show no response to the replication crisis (Figure 4). The expected replication rate dropped slightly from 62% to 58%; the difference is not significant. The observed discovery rate dropped slightly from 69% to 66%, and the expected discovery rate decreased from 43% to 31%, although this difference is also not significant. Most important, the observed discovery rates are significantly higher than the expected discovery rates before 2012 and after 2012. Thus, there is clear evidence that questionable research practices were used before and after 2012 to inflate the discovery rate.

Conclusion

After concerns about research practices and replicability emerged in the 2010s, social psychologists have debated this issue. Some social psychologists changed their research practices to increase statistical power and replicability. However, others have denied that there is a crisis and attributed replication failures to a number of other causes. Not surprisingly, some social psychologists also did not change their research practices. This blog post shows that Baumeister and his students have not changed theirs. They are able to publish questionable research because there has been no collective effort to define good research practices, to ban questionable ones, and to treat the hiding of non-significant results as a breach of research ethics. Thus, Baumeister and his students are simply exercising their right to use questionable research practices, whereas others voluntarily implemented good, open science practices. Given the freedom of social psychologists to decide which practices they use, social psychology as a field continues to have a credibility problem. Editors who accept questionable research in their journals undermine the credibility of those journals. Authors are well advised to publish in journals that emphasize replicability and credibility, with open science badges and a high replicability ranking (Schimmack, 2019).

An Honorable Response to the Credibility Crisis by D.S. Lindsay: Fare Well

We all know what psychologists did before 2012. The name of the game was to get significant results that could be sold to a journal for publication. Some did it with more power and some did it with less power, but everybody did it.

In the beginning of the 2010s it became obvious that this was a flawed way to do science. Bem (2011) used this anything-goes approach to significance to publish nine significant demonstrations of a phenomenon that does not exist: mental time travel. The cat was out of the bag. There were only two questions: how many other findings were unreal, and how would psychologists respond to the credibility crisis?

D. Steve Lindsay responded to the crisis by helping to implement tighter standards and enforcing them as editor of Psychological Science. As a result, Psychological Science has published more credible results over the past five years. At the end of his editorial term, Lindsay published a gutsy and honest account of his journey towards a better and more open psychological science. It starts with his own realization that his research practices were suboptimal.

Early in 2012, Geoff Cumming blew my mind with a talk that led me to realize that I had been conducting underpowered experiments for decades. In some lines of research in my lab, a predicted effect would come booming through in one experiment but melt away in the next. My students and I kept trying to find conditions that yielded consistent statistical significance—tweaking items, instructions, exclusion rules—but we sometimes eventually threw in the towel because results were maddeningly inconsistent. For example, a chapter by Lindsay and Kantner (2011) reported 16 experiments with an on-again/off-again effect of feedback on recognition memory. Cumming’s talk explained that p values are very noisy. Moreover, when between-subjects designs are used to study small- to medium-sized effects, statistical tests often yield nonsignificant outcomes (sometimes with huge p values) unless samples are very large.

Hard on the heels of Cumming’s talk, I read Simmons, Nelson, and Simonsohn’s (2011) “False-Positive Psychology” article, published in Psychological Science. Then I gobbled up several articles and blog posts on misuses of null-hypothesis significance testing (NHST). The authors of these works make a convincing case that hypothesizing after the results are known (HARKing; Kerr, 1998) and other forms of “p hacking” (post hoc exclusions, transformations, addition of moderators, optional stopping, publication bias, etc.) are deeply problematic. Such practices are common in some areas of scientific psychology, as well as in some other life sciences. These practices sometimes give rise to mistaken beliefs in effects that really do not exist. Combined with publication bias, they often lead to exaggerated estimates of the sizes of real but small effects.

This quote is exceptional because few psychologists have openly talked about their research practices before (or after) 2012. It is an open secret that questionable research practices were widely used, and anonymous surveys support this (John et al., 2012), but nobody likes to talk about it. Lindsay’s frank account is an honorable exception in the spirit of true leaders who confront mistakes head on, just like a Nobel laureate who recently retracted a Science article (Frances Arnold).

1. Acknowledge your mistakes.

2. Learn from your mistakes.

3. Teach others from your mistakes.

4. Move beyond your mistakes.

Lindsay’s acknowledgement also makes it possible to examine what these research practices look like when we examine published results, and to see whether this pattern changes in response to awareness that certain practices were questionable.

So, I z-curved Lindsay’s published results from 1998 to 2012. The graph shows some evidence of QRPs, in that the model predicts more non-significant results (grey line from 0 to 1.96) than are actually observed (histogram of non-significant results). This is also reflected in a comparison of the observed discovery rate (70% of published results are significant) and the expected discovery rate (44%). However, the confidence intervals overlap, so this test of bias is not significant.

The replication rate is estimated to be 77%. This means that there is a 77% probability that repeating a test with a new sample (of equal size) would produce a significant result again. Even for just significant results (z = 2 to 2.5), the estimated replicability is still 45%. I have seen much worse results.
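The logic behind these replicability estimates can be illustrated with a simplified sketch (this is not the actual z-curve algorithm; the function name and the simplifying assumption that the true noncentrality equals the observed z are mine, for illustration only): the probability that a new, equally sized sample again reaches significance is approximately 1 − Φ(1.96 − z_true), ignoring the negative tail.

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(x / sqrt(2)))

def replication_probability(z_true):
    # Probability that a new, equally sized sample yields z > 1.96,
    # assuming the true noncentrality equals z_true (negative tail ignored)
    return 1 - phi(1.96 - z_true)

# A just-significant original result replicates at roughly chance level
print(replication_probability(2.0))
# A strong original result (z = 3) replicates far more often
print(replication_probability(3.0))
```

This makes the key point visible: a just-significant result (z ≈ 2) has only about a 50% chance of producing another significant result, which is why averaging over many just-significant findings yields low replication rates.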

Nevertheless, it is interesting to see whether things improved. First of all, being editor of Psychological Science is a full-time job, so output has decreased. Maybe research also slowed down because studies were conducted with more care. I don’t know. I just know that there are very few statistics to examine.

Although the small number of tests makes the results somewhat uncertain, the graph shows some changes in research practices. Replicability increased further to 88%, and there is no longer a discrepancy between the observed and expected discovery rates.

If psychology as a whole had responded like D.S. Lindsay, it would be in a good position to start the new decade. The problem is that this response is the exception rather than the rule, and some areas of psychology and some individual researchers have not changed at all since 2012. This is unfortunate because questionable research practices hurt psychology, especially as undergraduates and the wider public learn more and more how untrustworthy psychological science has been and often still is. Hopefully, reforms will come sooner rather than later, or we may have to sing a swan song for psychological science.

Francis’s Audit of Multiple-Study Articles in Psychological Science in 2009-2012

Citation: Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180–1187. https://doi.org/10.3758/s13423-014-0601-x

Introduction

The Open Science Collaboration article in Science has been cited over 1,000 times (OSC, 2015). It showed that attempting to replicate results published in 2008 in three journals, including Psychological Science, produced more failures than successes (37% success rate). It also showed that failures outnumbered successes 3:1 in social psychology. It did not show or explain why most social psychological studies failed to replicate.

Since 2015, numerous explanations have been offered for the discovery that most published results in social psychology cannot be replicated: the decline effect (Schooler), regression to the mean (Fiedler), incompetent replicators (Gilbert), sabotage of replication studies (Strack), and contextual sensitivity (Van Bavel). Although these explanations are different, they share two common elements: (a) they are not supported by evidence, and (b) they are false.

A number of articles have proposed that the low replicability of results in social psychology is caused by questionable research practices (John et al., 2012). Accordingly, social psychologists often investigate small effects in between-subjects experiments with small samples that have large sampling error. A low signal-to-noise ratio (effect size/sampling error) implies that these studies have a low probability of producing a significant result (i.e., low power and a high type-II error probability). To boost power, researchers use a number of questionable research practices that inflate effect sizes. Thus, the published results create the false impression that effect sizes are large and results are replicable, but actual replication attempts show that the effect sizes were inflated. The replicability project suggested that effect sizes are inflated by 100% (OSC, 2015).
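The signal-to-noise logic above can be made concrete with a rough normal-approximation sketch (the numbers are hypothetical and `power_two_group` is my own helper, not a standard function): for a two-group comparison with n participants per group, the expected z-score is approximately d·√(n/2), and power is 1 − Φ(1.96 − z).

```python
from math import erf, sqrt

def phi(x):
    # Standard normal cumulative distribution function
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_group(d, n_per_group):
    # Normal approximation to the power of a two-sample test:
    # expected z = d * sqrt(n/2); power = P(observed z > 1.96)
    z_true = d * sqrt(n_per_group / 2)
    return 1 - phi(1.96 - z_true)

# Hypothetical pre-crisis design: small effect, small cells
print(power_two_group(0.3, 20))   # well below 20% power
# The same effect with much larger samples
print(power_two_group(0.3, 200))  # above 80% power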

In an important article, Francis (2014) provided clear evidence for the widespread use of questionable research practices in articles published from 2009 to 2012 (pre-crisis) in the journal Psychological Science. However, because this evidence does not fit the narrative that social psychology was a normal and honest science, the article is often omitted from review articles, like Nelson et al.’s (2018) ‘Psychology’s Renaissance,’ which claims that social psychologists never omitted non-significant results from publications (cf. Schimmack, 2019). Omitting disconfirming evidence from literature reviews is just another questionable research practice that prioritizes self-interest over truth. Given the influence that Annual Review articles hold, many readers may be unfamiliar with Francis’s important article that shows why replication attempts of articles published in Psychological Science often fail.

Francis (2014) “The frequency of excess success for articles in Psychological Science”

Francis (2014) used a statistical test to examine whether researchers used questionable research practices (QRPs). The test relies on the observation that the success rate (percentage of significant results) should match the mean power of the studies in the long run (Brunner & Schimmack, 2019; Ioannidis & Trikalinos, 2007; Schimmack, 2012; Sterling et al., 1995). The test uses observed (post hoc) power as an estimate of true power. Thus, mean observed power provides an estimate of the expected number of successes, which can be compared to the actual success rate in an article.

It has been known for a long time that the actual success rate in psychology articles is surprisingly high (Sterling et al., 1995). The success rate for multiple-study articles is often 100%. That is, psychologists rarely report studies in which they made a prediction and the study returned a non-significant result. Some social psychologists have even explicitly stated that it is common practice not to report these ‘uninformative’ studies (cf. Schimmack, 2019).

A success rate of 100% implies that the studies had extremely high power (power is never 100%) to produce this result. It is unlikely that many studies published in Psychological Science have the high signal-to-noise ratios needed to justify these success rates. Indeed, when Francis applied his bias-detection method to the 44 articles that reported enough results to use it, he found that 82% (36 out of 44) showed positive signs that questionable research practices were used, with a 10% error rate. That is, chance alone would be expected to produce only about 4 significant results, but he found 36, indicating the use of questionable research practices. Moreover, this does not mean that the remaining 8 articles did not use questionable research practices. With only four studies, the test has modest power to detect questionable research practices when the bias is relatively small. Thus, the main conclusion is that most, if not all, multiple-study articles published in Psychological Science used questionable research practices to inflate effect sizes. As these inflated effect sizes cannot be reproduced, the effect sizes in replication studies will be lower and the signal-to-noise ratio will be smaller, producing non-significant results. It has been known since 1959 that this could happen (Sterling, 1959). However, the replicability project showed that it does happen (OSC, 2015), and Francis (2014) showed that excessive use of questionable research practices provides a plausible explanation for these replication failures. No review of the replication crisis is complete and honest without mentioning this fact.
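The arithmetic behind Francis’s test can be sketched as follows (a simplified illustration, not his exact procedure; the power values are made up): the probability that every study in an article is significant is the product of the studies’ estimated powers, and a product below the .10 criterion counts as evidence of excess success.

```python
def prob_all_significant(estimated_powers):
    # Probability that every study in an article is significant,
    # given each study's estimated post hoc power
    p = 1.0
    for power in estimated_powers:
        p *= power
    return p

# Even four moderately powered studies rarely all succeed
print(prob_all_significant([0.6, 0.6, 0.6, 0.6]))  # about 0.13 -- above .10
print(prob_all_significant([0.5, 0.5, 0.5, 0.5]))  # about 0.06 -- flags excess success
```

The design choice is deliberate: multiplying powers shows why a string of significant results from modestly powered studies is itself improbable, even when every effect is real.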

Limitations and Extension

One limitation of Francis’s approach, and of similar approaches like my Incredibility Index (Schimmack, 2012), is that p-values are based on two pieces of information: the effect size and sampling error (the signal-to-noise ratio). This means that these tests can provide evidence for the use of questionable research practices when the number of studies is large and the effect size is small. It is well known that p-values are more informative when they are accompanied by information about effect sizes. That is, it is not only important to know that questionable research practices were used, but also how much these practices inflated effect sizes. Knowledge about the amount of inflation would also make it possible to estimate the true power of studies and use it as a predictor of the success rate in actual replication studies. Jerry Brunner and I have been working on a statistical method that is able to do this, called z-curve, and we validated the method with simulation studies (Brunner & Schimmack, 2019).

I coded the 195 studies in the 44 articles analyzed by Francis and subjected the results to a z-curve analysis. The results are shocking and much worse than the results for the studies in the replicability project, which produced an expected replication rate of 61%. In contrast, the expected replication rate for multiple-study articles in Psychological Science is only 16%. Moreover, given the fairly large number of studies, the 95% confidence interval around this estimate is relatively narrow, ranging from 5% (chance level) to a maximum of 25%.

There is also clear evidence that QRPs were used in many, if not all, articles. Visual inspection shows a steep drop at the level of significance, and the only results that are not significant with p < .05 are results that are marginally significant with p < .10. Thus, the observed discovery rate of 93% is an underestimate, and the articles claimed an amazing success rate of 100%.

Correcting for bias, the expected discovery rate is only 6%, just above the 5% that would imply that all published results are false positives. The upper limit of the 95% confidence interval around this estimate is 14%, which would imply that for every published significant result there are about 6 studies with non-significant results, if file-drawering were the only QRP that was used. Thus, we see not only that most articles reported results that were obtained with QRPs, but also that massive use of QRPs was needed because many studies had very low power to produce significant results without them.
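The file-drawer arithmetic implied by these discovery rates is easy to check (a minimal sketch; the function name is mine): with a discovery rate D, selective publication alone would require (1 − D)/D hidden non-significant studies per published significant one.

```python
def file_drawer_ratio(discovery_rate):
    # Unreported non-significant studies per published significant result,
    # if file-drawering were the only QRP
    return (1 - discovery_rate) / discovery_rate

print(round(file_drawer_ratio(0.14), 1))  # upper CI limit: about 6 hidden studies
print(round(file_drawer_ratio(0.06), 1))  # point estimate: nearly 16 hidden studies
```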

Conclusion

Social psychologists have used QRPs to produce impressive results that suggest all studies that tested a theory confirmed predictions. These results are not real. Like a magic show, they give the impression that something amazing happened, when it is all smoke and mirrors. In reality, social psychologists never tested their theories because they simply failed to report results when the data did not support their predictions. This is not science. The 2010s have revealed that social psychological results in journals and textbooks cannot be trusted and that influential results cannot be replicated when the data are allowed to speak. Thus, for the most part, social psychology has not been an empirical science that used the scientific method to test and refine theories based on empirical evidence. The major discovery of the 2010s was to reveal this fact, and Francis’s analysis provided valuable evidence for it. However, most social psychologists preferred to ignore this evidence. As Popper pointed out, this makes them truly ignorant, which he defined as “the unwillingness to acquire knowledge.” Unfortunately, even social psychologists who are trying to improve the field willfully ignore Francis’s evidence, which makes replication failures predictable and undermines the value of actual replication studies. Given the extent of QRPs, a more rational approach would be to dismiss all evidence that was published before 2012 and to invest resources in new research with open science practices. Actual replication failures merely confirmed what bias tests had already predicted: that old studies cannot be trusted. The next decade should focus on using open science practices to produce robust and replicable findings that can provide the foundation for theories.

The Demise of the Solo Experiment

Wegner’s article “The Premature Demise of the Solo Experiment” in PSPB (1992) is an interesting document for meta-psychologists. It provides some insight into the thinking of leading social psychologists at the time: not only the author, but also the reviewers and editor who found the article worthy of publication, and the numerous colleagues who emailed Wegner with approving comments.

The article starts with the observation that in the 1990s social psychology journals increasingly demanded that articles contain more than one study. Wegner considers this preference for multiple-study articles a bias rather than a preference for stronger evidence.

“[I]t has become evident that a tremendous bias against the ‘solo’ experiment exists that guides both editors and reviewers” (p. 504).

The idea of bias is based on the assumption that rejecting a null hypothesis with a long-run error probability of 5% is good enough to publish exciting new ideas and give birth to wonderful novel theories. Demanding even just one replication of a finding would create a lot more burden without any novel insights, just to lower this probability to 0.25%.

But let us just think a moment about the demise of the solo experiment. Here we have a case in which skepticism has so overcome the love of ideas that we seem to have squared the probability of error we are willing to allow. Once, p < .05 was enough. Now, however, we must prove things twice. The multiple experiment ethic has surreptitiously changed alpha to .0025 or below.

That’s right. The move from solo experiments to multiple-study articles shifted the type-I error probability. Even a pair of studies reduces the type-I error probability more than the highly cited and controversial call to move alpha from .05 to .005. A pair of studies with p < .05 yields a joint error probability of .0025, half of the proposed .005!
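The arithmetic is easy to verify (a minimal sketch): if the null hypothesis is true, each independent study is significant with probability alpha, so requiring all of k studies to be significant multiplies the error probabilities.

```python
def joint_false_positive_rate(n_studies, alpha=0.05):
    # Under the null, each independent study is significant with
    # probability alpha, so all n succeed with probability alpha ** n
    return alpha ** n_studies

print(joint_false_positive_rate(1))  # .05 for a solo experiment
print(joint_false_positive_rate(2))  # .0025 for a pair -- half of the proposed .005
```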

Wegner also explains why journals started demanding multiple studies.

After all, the statistical reasons for multiple experiments are obvious-what better protection of the truth than that each article contain its own replication? (p. 505)

Thus, concerns about replicability in social psychology were prominent in the early 1990s, twenty years before the replication crisis, and demanding replication studies was considered a solution to this problem. If researchers were able to replicate their findings, ideally with different methods, stimuli, and dependent variables, the results were considered robust and generalizable. So much for the claim that psychologists did not value or conduct replication studies before the open science movement was born in the early 2010s.

Wegner also reports about his experience with attempting to replicate his perfectly good first study.

“Sometimes it works wonderfully…. more often than not, however, we find the second experiment is harder to do than the first. Even if we do the exact same experiment again” (p. 506).

He even cheerfully acknowledges that first results are difficult to replicate because they were obtained with some good fortune.

“Doing it again, we will be less likely to find the same thing even if it is true, because the error variance regresses our effects to the mean. So we must add more subjects right off the bat. The joy of discovery we felt on bumbling into the first study is soon replaced by the strain of collecting an all new and expanded set of data to fend off the pointers [pointers = method-terrorists]” (p. 506).

Wegner even thinks that publishing these replication studies is pointless because readers expect the replication study to work. Sure, if the first study worked, so will the second.

This is something of a nuisance in light of the reception that our second experiment will likely get. Readers who see us replicate our own findings roll their eyes and say “Sure,” and we wonder why we’ve even gone to the trouble.

However, he fails to examine more carefully why a successful replication study receives only a shoulder shrug from readers. After all, his own experience was that it was quite difficult to get these replication studies to work. Shouldn’t readers be on the edge of their seats, wondering whether the original result was a false positive or whether it can actually be replicated? Isn’t the second study the real confirmatory test where the rubber hits the road? Insiders, of course, know that this is not the case. The second study works because it would not have been included in the multiple-study article if it hadn’t worked. That is, after all, how the field operated. Everybody had the same problems getting studies to work that Wegner describes, but many found a way to get enough studies to work to meet the demands of the editor. The number of studies was just a test of the persistence of a researcher, not a test of a theory. And that is what Wegner rightfully criticized. What is the point of producing a set of studies with p < .05, if more studies do not strengthen the evidence for a claim? We might as well publish a single finding and then move on to find more interesting ideas and publish them with p-values less than .05. Even 9 studies with p < .05 don’t mean that people can foresee the future (Bem, 2011), but it is surely an interesting idea.

Wegner also comments on the nature of replication studies that are now known as conceptual replication studies. The justification for conceptual replication studies is that they address limitations that are unavoidable in a single study. For example, including a manipulation check may introduce biases, but without one, it is not clear whether a manipulation worked. So, ideally the effect could be demonstrated with and without a manipulation check. However, this is not how conceptual replication studies are conducted.

We must engage in a very delicate “tuning” process to dial in a second experiment that is both sufficiently distant from and sufficiently similar to the original. This tuning requires a whole set of considerations and skills that have nothing to do with conducting an experiment. We are not trained in multi experiment design, only experimental design, and this enterprise is therefore largely one of imitation, inspiration, and luck.

So, to replicate original results that were obtained with a healthy dose of luck, more luck is needed in finding a condition that works, or simply to try often enough until luck strikes again.

Given the negative attitude towards rigor, Wegner and colleagues also used a number of tricks to make replication studies work.

“Some of us use tricks to disguise our solos. We run ‘two experiments’ in the same session with the same subjects and write them up separately. Or we run what should rightfully be one experiment as several parts, analyzing each separately and writing it up in bite-sized pieces as a multi experiment. Many times, we even hobble the first experiment as a way of making sure there will be something useful to do when we run another” (p. 506).

If you think this sounds like charlatans who enjoy pretending to be scientists, your impression is rather accurate, because the past decade has shown that many of these internal replications in multiple-study articles were obtained with tricks and provide no empirical test of empirical hypotheses; the p-values are just for show so that it looks like science, but it isn’t.

My own view is that the multiple-study format was a bad fix for a real problem. The real problem was that it was all too easy to get p < .05 in a single study and make grand claims about the causes of human behavior. Multiple-study articles didn’t solve this problem because researchers found ways to get significant results again and again, even when their claims were false.

The failure of multiple-study articles to fix psychology holds some interesting lessons for current attempts to improve the field. Badges for data sharing and preregistration will not improve psychology if they are gamed the way psychologists gamed the multiple-study format. Ultimately, science can only advance if results are reported honestly and if results are actually able to falsify theoretical predictions. Psychology will only become a science when brilliant novel ideas can be proven false and scientific rigor is prized as much as the creation of interesting ideas. Coming up with interesting ideas is philosophy. Psychology emerged as a distinct discipline in order to subject those theories to empirical tests. After a century of pretending to do so, it is high time to do so for real.

The Diminishing Utility of Replication Studies In Social Psychology

Dorothy Bishop writes on her blog:

“As was evident from my questions after the talk, I was less enthused by the idea of doing a large replication of Darryl Bem’s studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn’t it be better to investigate a hypothesis about something that doesn’t contradict the laws of physics?”


I think she makes a valid and important point. Bem’s (2011) article highlighted everything that was wrong with the research practices in social psychology. Other articles in JPSP are equally incredible, but this was ignored because naive readers found the claims more plausible (e.g., blood glucose is the energy for will power). We know now that none of these published results provide empirical evidence because the results were obtained with questionable research practices (Schimmack, 2014; Schimmack, 2018). It is also clear that these were not isolated incidents, but that hiding results that do not support a theory was (and still is) a common practice in social psychology (John et al., 2012; Schimmack, 2019).

A large attempt at estimating the replicability of social psychology revealed that only 25% of published significant results could be replicated (OSC, 2015). The rate for between-subjects experiments was even lower. Thus, the a priori probability (base rate) that a randomly drawn study from social psychology will produce a significant result in a replication attempt is well below 50%. In other words, a replication failure is the more likely outcome.

The low success rate of these replication studies was a shock. However, it is sometimes falsely implied that the low replicability of results in social psychology was not recognized earlier because nobody conducted replication studies. This is simply wrong. In fact, social psychology is one of the disciplines in psychology that required researchers to conduct multiple studies showing the same effect to ensure that a result was not a false positive. Bem had to present 9 studies with significant results to publish his crazy claims about extrasensory perception (Schimmack, 2012). Most of the studies that failed to replicate in the OSC replication project came from multiple-study articles that reported several successful demonstrations of an effect. Thus, the problem in social psychology was not that nobody conducted replication studies. The problem was that social psychologists only reported the replication studies that were successful.

The proper analysis of the problem also suggests a different solution. If we pretend that nobody did replication studies, it may seem useful to start doing replication studies. However, if social psychologists conducted replication studies but did not report replication failures, the solution is simply to demand that social psychologists report all of their results honestly. This demand is so obvious that undergraduate students are surprised when I tell them that this is not the way social psychologists conduct their research.

In sum, it has become apparent that questionable research practices undermine the credibility of the empirical results in social psychology journals, and that the majority of published results cannot be replicated. Thus, social psychology lacks a solid empirical foundation.

What Next?

Information theory implies that little is gained by conducting actual replication studies in social psychology, because a failure to replicate the original result is both likely and uninformative. In fact, social psychologists have responded to replication failures by claiming that these studies were poorly conducted and do not invalidate the original claims. Thus, replication studies are costly and have not advanced theory development in social psychology. More replication studies are unlikely to change this.

A better solution to the replication crisis in social psychology is to characterize research in social psychology from Festinger’s classic small-sample, between-subjects study in 1957 to research in 2017 as exploratory, hypothesis-generating research. As Bem suggested to his colleagues, this was a period of adventure and exploration where it was ok to “err on the side of discovery” (i.e., publish false positive results, like Bem’s precognition for erotica). Lots of interesting discoveries were made during this period; it is just not clear which of these findings can be replicated and what they tell us about social behavior.

Thus, new studies in social psychology should not try to replicate old studies. For example, nobody should try to replicate Devine’s subliminal priming study with racial primes with computers and software from the 1980s (Devine, 1989). Instead, prominent theoretical predictions should be tested with the best research methods that are currently available. Thus, the way forward is not to do more replication studies, but to use open science (a.k.a. honest science) to subject theories to empirical tests that may also falsify them (e.g., subliminal racial stimuli have no influence on behavior). The main shift that is required is to get away from research that can only confirm theories and to allow empirical data to falsify theories.

This was exactly the intent of Danny Kahneman’s letter, when he challenged social priming researchers to respond to criticism of their work by going into their labs and to demonstrate that these effects can be replicated across many labs.

Kahneman makes it clear that the onus of replication is on the original researchers who want others to believe their claims. The response to this letter speaks volumes. Not only did social psychologists fail to provide new and credible evidence that their results can be replicated, they also demonstrated defiant denial in the face of replication failures by others. The defiant denial by prominent social psychologists (e.g., Baumeister, 2019) makes it clear that they will not be convinced by empirical evidence, while others who can look at the evidence objectively do not need more evidence to realize that the social psychological literature is a train wreck (Schimmack, 2017; Kahneman, 2017). Thus, I suggest that young social psychologists search the train wreck for survivors, but do not waste their time and resources on replication studies that are likely to fail.

A simple guide through the wreckage of social psychology is to distrust any significant result with a p-value greater than .01 (Schimmack, 2019). Prediction markets also suggest that readers are able to distinguish credible and incredible results (Atlantic). Thus, I recommend building on studies that are credible and steering clear of sexy findings that are unlikely to replicate. As Danny Kahneman pointed out, young social psychologists who work in questionable areas face a dilemma: either they replicate the questionable methods that were used to get the original results, which is increasingly considered unethical, or they end up with results that are not very informative. On the positive side, the replication crisis implies that there are many important topics in social psychology that still need to be studied properly with the scientific method. Addressing these important questions may be the best way to rescue social psychology.

Fact-Checking Roy Baumeister

Roy Baumeister wrote a book chapter with the title “Self-Control, Ego Depletion, and Social Psychology’s Replication Crisis” (preprint). I think this chapter will make a valuable contribution to the history of psychology and provides insight into the minds of social psychologists.

I fact-checked the chapter and comment on 31 misleading or false statements.

https://replicationindex.com/wp-content/uploads/2019/09/ego-depletion-and-replication-crisis.docx

Comments are welcome.