Review of “What Should Researchers Expect When They Replicate Studies?”

Patil, P., Peng, R. D., & Leek, J. T. (2016). What Should Researchers Expect When They Replicate Studies? A Statistical View of Replicability in Psychological Science. Perspectives on Psychological Science, 11(4), 539-544. https://doi.org/10.1177/1745691616646366


Review by ChatGPT (2/15/2025) [Rating 7/10]

Conclusion

Patil, Peng, and Leek (2016) provide a thought-provoking statistical defense of psychological replication efforts, arguing that expectations for replication need to align with statistical principles. Their prediction interval analysis offers an important counterpoint to the oversimplified “36% replication rate” narrative.

However, the paper does not fully address deeper issues in psychological science, including publication bias, p-hacking, and the overall credibility of original findings. While statistical consistency is valuable, it does not necessarily mean that the original findings are valid or practically meaningful.

Ultimately, their work highlights the need for a more nuanced conversation about replication—one that considers both statistical expectations and scientific integrity in equal measure.

Comment by Ulrich Schimmack (25/02/15)
Rating 3/10

I only saw this article today. I was surprised by the conclusions in this article, because the senior author Leek had conducted an empirical study of the false discovery risk in medicine and found a relatively low risk of 13%. Ten years later, we (Schimmack & Bartos, 2024) replicated this finding with a better statistical method. The authors of this article could have used their method with the original data from the Open Science Reproducibility Project that is the target of this article. Instead, they use the finding from medicine to suggest that things are also fine in psychology. We have shown with z-curve that medicine and psychology, especially social psychology, are dramatically different. Medicine is much more credible than experimental social psychology was before the replication crisis. Thus, their article is true and meaningless. Focusing on the wide confidence intervals in original studies due to small samples distracts from the finding that a replication rate of 37% with a discovery rate of 90% or more in original articles suggests massive publication bias and a high false positive risk. Disappointing, but it may explain why Leek never replied to my emails.

Introduction

The article by Patil, Peng, and Leek (2016) offers a statistical perspective on the replicability crisis in psychology, particularly in response to the Reproducibility Project: Psychology (Open Science Collaboration, 2015). The authors challenge the simplistic interpretation that only 36% of the studies successfully replicated and argue that a deeper statistical understanding—particularly using prediction intervals—paints a different picture of reproducibility.

Their core claim is that 77% of the replication effect sizes fall within the 95% prediction interval of the original study, suggesting that most replications are statistically consistent with expectations, even if effect sizes differ.
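
As a rough sketch of the underlying logic, such an interval can be computed from the original effect size and the two sample sizes. The code below uses the Fisher z-transformation for correlations, a standard construction for this kind of interval; the numbers are hypothetical and this is not necessarily the authors’ exact implementation.

```python
import math

def prediction_interval_r(r_orig, n_orig, n_rep):
    """Approximate 95% prediction interval for a replication's observed
    correlation, given the original correlation and both sample sizes.
    Assumes simple random sampling and no systematic bias in either study."""
    z_orig = math.atanh(r_orig)                              # Fisher z of the original r
    se = math.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))   # combined sampling error
    lo = math.tanh(z_orig - 1.96 * se)                       # back-transform to the r scale
    hi = math.tanh(z_orig + 1.96 * se)
    return lo, hi

# Hypothetical numbers: original r = .40 with n = 40, replication planned with n = 80.
print(prediction_interval_r(0.40, 40, 80))   # roughly (0.03, 0.67)
```

Even for a moderately sized original study, the interval that a replication is expected to fall into is strikingly wide, which is the crux of the authors’ argument.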


Strengths of the Paper

1. A More Nuanced Statistical Perspective

One of the paper’s biggest contributions is its emphasis on prediction intervals, rather than a binary success/failure measure of replication. The authors highlight how:

  • Statistical variation can naturally lead to different effect sizes in replication attempts.
  • Wide confidence and prediction intervals in original studies mean that a range of replication results may still be “consistent.”
  • The expectation of near-identical effect sizes across studies is unrealistic, given factors like sampling variability and measurement error.

This argument is a valuable contribution to the debate on replication, as it shifts the focus from absolute reproducibility to statistical consistency.
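
The first point can be made concrete with a small simulation: even when the original and replication studies estimate exactly the same true effect, sampling variability alone spreads the observed effect sizes over a wide range. The true correlation, sample size, and seed below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
true_r, n, n_sims = 0.30, 50, 10_000          # hypothetical true effect and sample size
cov = np.array([[1.0, true_r], [true_r, 1.0]])

# Draw many samples from the same population and record the observed correlation
obs_r = np.empty(n_sims)
for i in range(n_sims):
    sample = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    obs_r[i] = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

# With an identical true effect, observed effect sizes still vary widely
print(np.percentile(obs_r, [2.5, 50, 97.5]))  # roughly [0.02, 0.30, 0.53]
```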

2. Critique of Media and Public Interpretation

The authors provide a strong critique of how the 36% replication success rate was widely publicized without appropriate context. They argue that such a stark percentage misrepresents the complexity of replication, especially when original studies have high uncertainty.

This is an important point, as oversimplified media narratives can lead to public distrust in science and misunderstanding of how replication works in practice.

3. Focus on Imprecision in Original Studies

The paper draws attention to a crucial issue: many original psychological studies have wide confidence intervals and low statistical power. This means that:

  • Many reported effects may be overestimated (a common issue due to publication bias).
  • Even when a replication effect is statistically consistent, it may not provide strong evidence of a real effect.

By emphasizing the uncertainty in original studies, the authors shift the focus from whether studies “replicate” to whether they provide reliable and precise estimates.
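
The link between sample size and precision is easy to see with approximate Fisher-z confidence intervals: the same observed correlation is compatible with anything from a null to a strong effect at a small n, but is pinned down much more tightly at a large n. The sample sizes below are hypothetical.

```python
import math

def ci_r(r, n):
    """Approximate 95% confidence interval for an observed correlation (Fisher z)."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

# Hypothetical: the same observed r = .30 with a small and a large sample
print(ci_r(0.30, 30))    # roughly (-0.07, 0.60): includes zero, very imprecise
print(ci_r(0.30, 300))   # roughly ( 0.19, 0.40): much more informative
```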


Criticisms and Limitations

1. Over-Reliance on Prediction Intervals

While prediction intervals provide a broader statistical perspective, they do not necessarily solve the problem of false positives and publication bias. Some key concerns include:

  • A replication result being within the prediction interval does not mean the effect is “real.” If the original study had a biased or inflated effect size, then replications that cluster around it may still be misleading.
  • Prediction intervals assume no systematic bias in the original study. Many psychological studies suffer from issues like questionable research practices (e.g., p-hacking, HARKing), which may inflate effect sizes. Simply computing prediction intervals without accounting for this can be misleading.
  • Wide intervals can be uninformative. If the prediction interval is very large, a wide range of replication outcomes may appear “consistent,” even if they are practically meaningless.

2. Downplays the Importance of Failed Replications

While the authors argue that replications are often “statistically consistent” with the original studies, they do not fully address the practical significance of the findings. If a replication finds an effect size much smaller than the original, even if statistically within expectations, it may challenge the real-world impact of the original finding.

For example, if an original study found an effect size of 0.5 and a replication finds 0.1, the latter may technically fall within the prediction interval but still indicate a much weaker or negligible effect.
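
Treating these effect sizes as correlations for illustration, a quick calculation with the same Fisher-z prediction interval sketched earlier shows how a replication estimate of 0.1 can still count as “consistent” when the original study is small; the sample sizes here are hypothetical.

```python
import math

# Hypothetical numbers: original r = .50 with n = 20; replication r = .10 with n = 60.
r_orig, n_orig, r_rep, n_rep = 0.50, 20, 0.10, 60

z = math.atanh(r_orig)
se = math.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))
lo, hi = math.tanh(z - 1.96 * se), math.tanh(z + 1.96 * se)

print(f"95% prediction interval: ({lo:.2f}, {hi:.2f})")          # roughly (0.01, 0.80)
print("replication counts as consistent:", lo <= r_rep <= hi)    # True, despite the large drop
```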

3. Lack of Discussion on Systematic Biases

The paper treats the issue as primarily statistical, but psychological science’s replication crisis is also methodological:

  • Publication bias favors studies with significant results, leading to inflated effect sizes.
  • P-hacking and researcher degrees of freedom can produce misleading original findings.
  • Lack of pre-registration means that many “original” findings may be the result of selective reporting.

By focusing mainly on prediction intervals and statistical expectations, the paper does not fully engage with these deeper concerns about scientific rigor and transparency.


Implications for Psychological Science

The authors offer a valuable statistical critique of how replication is interpreted, but their conclusions also raise further questions:

  1. How should researchers interpret replications where the effect size is much smaller but still statistically “consistent”?
  2. Should psychology journals adopt stronger standards for original studies, such as larger sample sizes, pre-registration, and stricter statistical thresholds?
  3. How can the media better communicate nuanced statistical findings without oversimplification?

While this paper challenges the narrative that the Reproducibility Project showed a “crisis,” it does not fully exonerate psychological science from the replication challenges it faces.

