Source:
Bryan, C. J., Yeager, D. S., & O’Brien, J. M. (2019). Replicator degrees of freedom allow publication of misleading failures to replicate. Proceedings of the National Academy of Sciences USA, 116, 25535–25545. http://dx.doi.org/10.1073/pnas.1910951116
This review was created “with the help of” ChatGPT to minimize bias. My own comments are highlighted in bold italics with my initials US.
Critical Review of the Article
1. Summary of the Article
The authors argue that the current methods used to conduct and publish replication studies in psychology are flawed due to “replicator degrees of freedom”—choices made during experimental design and data analysis that can lead to misleading failures to replicate. They suggest that flexibility in replication studies allows researchers to inadvertently (or deliberately) produce false negatives while maintaining an appearance of methodological rigor. Using a case study of replication attempts on a voter turnout intervention, they demonstrate how variations in experimental design and statistical analysis can obscure an effect that was originally found to be significant.
The authors claim that:
- Replication studies are subject to biases, much like original studies.
- Flexibility in replication design (e.g., slight changes in timing, context, or participant selection) can result in misleading null findings.
- Even with strict methodological controls, standard analytical choices can be used to favor non-significant results.
- Statistical tools such as specification-curve analysis can be used to evaluate the robustness of replication findings and expose selective reporting of statistical results.
2. Strengths of the Article
a. Important Contribution to the Replication Debate
This paper addresses a key issue in psychological science: the replication crisis. Many scholars focus on p-hacking and questionable research practices in original studies, but this article shifts attention to the biases that can arise in replication attempts. The authors argue convincingly that failures to replicate should not be taken at face value without scrutiny.
b. Use of Statistical Rigor
The authors employ specification-curve analysis, a robust statistical technique that tests multiple model specifications to reveal how different analytical choices impact results. They also apply Bayesian Causal Forest (BCF), a machine-learning approach for detecting heterogeneity in treatment effects. These methods strengthen their argument that the replication study they critique was affected by data analysis choices.
US. Data analytic choices can influence results, but these analyses do not show that the replicators’ choices reduced the power of the replication studies to produce significant results. A better way to demonstrate this would be to estimate the statistical power of the replication studies and to examine bias (i.e., whether there are fewer significant results than the studies’ power allows).
c. Challenges Assumptions About the Objectivity of Replicators
A key assumption in replication research is that replicators are unbiased, but the authors argue that they, too, have incentives to produce certain results (e.g., publishing null findings to challenge existing theories). By pointing out that replication studies may be subject to their own version of p-hacking (“null hacking”), the paper encourages a more balanced perspective on the replication process.
US. Once more, a comparison of statistical power and success rates can reveal bias in both directions. If we get only 20% significant results with 50% power, we have evidence that non-significant results were selected. If we get 90% significant results with only 50% power, we have evidence for the typical selection bias in favor of significant results. Claims about bias should be supported by empirical evidence of bias.
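As a rough illustration of this bias test, the sketch below compares an observed success rate against the rate expected from the studies’ average power using a simple binomial test. The study counts and the 50% power figure are the illustrative numbers from the comment above, not estimates from any actual literature.

```python
# Minimal sketch of the bias test described above: compare the observed
# number of significant results with the number expected from the studies'
# average power. Illustrative numbers only.
from scipy.stats import binomtest

def bias_test(n_studies, n_significant, mean_power):
    """Two-sided binomial test of the observed success rate against the
    rate implied by the studies' average power."""
    observed_rate = n_significant / n_studies
    p_value = binomtest(n_significant, n_studies, mean_power).pvalue
    direction = ("selection of non-significant results"
                 if observed_rate < mean_power
                 else "selection of significant results")
    return observed_rate, p_value, direction

# 20 studies with ~50% power but only 4 significant results (20%):
print(bias_test(20, 4, 0.50))   # suggests selection of non-significant results
# 20 studies with ~50% power but 18 significant results (90%):
print(bias_test(20, 18, 0.50))  # suggests the usual selection for significance
```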
d. Calls for Improved Methodological Standards
The article provides practical recommendations for improving replication methodology, including:
- Considering contextual factors in replication attempts (e.g., timing and population differences).
- Avoiding overly rigid statistical criteria that make significant results difficult to obtain.
- Using pre-analysis plans and specification-curve analysis to ensure robustness in replication testing (a minimal illustration of a specification curve follows this list).
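For readers unfamiliar with specification-curve analysis, here is a minimal sketch of the basic idea using simulated data. All variable names, effect sizes, and analytic choices are hypothetical illustrations, not the specifications used in the paper.

```python
# Minimal specification-curve sketch: estimate the same treatment effect
# under every combination of analytic choices, then inspect how the
# estimate varies across specifications. Data and choices are simulated.
from itertools import product

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated stand-in for a voter-turnout-style data set.
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "age": rng.normal(50, 15, n),
    "past_voter": rng.integers(0, 2, n),
    "late_registrant": rng.integers(0, 2, n),
})
df["turnout"] = ((0.40 + 0.04 * df["treated"] + 0.002 * (df["age"] - 50)
                  + 0.10 * df["past_voter"]
                  + rng.normal(0, 0.45, n)) > 0.5).astype(int)

# Analytic choices a replicator (or original author) might plausibly vary.
covariate_sets = ["", " + age", " + age + past_voter"]
sample_filters = {
    "all respondents": lambda d: d,
    "exclude late registrants": lambda d: d[d["late_registrant"] == 0],
}

rows = []
for covs, (label, filt) in product(covariate_sets, sample_filters.items()):
    sub = filt(df)
    fit = smf.ols("turnout ~ treated" + covs, data=sub).fit()
    rows.append({
        "specification": f"turnout ~ treated{covs} | {label}",
        "estimate": fit.params["treated"],
        "p_value": fit.pvalues["treated"],
    })

# The 'curve': the full distribution of estimates across specifications.
curve = pd.DataFrame(rows).sort_values("estimate").reset_index(drop=True)
print(curve)
```

The point of the curve is not to pick the “right” specification but to show how much the estimate and its significance depend on defensible analytic choices.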
3. Weaknesses and Criticisms
a. Potential Confirmation Bias
Although the authors present strong statistical evidence, they are not neutral parties in the debate. They were involved in the original research on voter turnout, and their analysis is designed to defend the original findings. While this does not invalidate their arguments, it raises concerns about potential confirmation bias. They acknowledge this issue but do not fully address how it might have influenced their methodological choices.
b. Selective Case Study
The authors focus on one specific replication debate (voter turnout and linguistic framing). While they argue that their findings generalize to other replication studies, this claim is speculative. More evidence across different areas of psychology would strengthen their argument.
US. Not only is the claim speculative, it is also contradicted by empirical evidence. Bias tests have routinely demonstrated selection FOR significance in original studies. In contrast, there is no empirical evidence of selection for non-significant results in replication studies so far.
c. Overreliance on Specification-Curve Analysis
Specification-curve analysis is useful for identifying how analytical choices impact results, but it does not necessarily prove which analysis is correct. The authors suggest that replicators’ choices were misleading, but one could argue that their own analysis is simply another subjective selection of model specifications. While their statistical methods are sophisticated, they do not completely eliminate the possibility of bias.
d. Downplaying the Importance of Larger Sample Sizes
The authors critique the idea that larger replication samples should be given more weight, arguing that flawed design choices can still produce misleading results. While this is true, larger samples generally produce more reliable effect size estimates than small, underpowered studies. The original study they defend had a much smaller sample than the replication attempt, yet they do not fully acknowledge how that might have influenced the discrepancy in findings.
US. Agreed. The strongest evidence to counter replication failures would be for the original authors to replicate their own findings using a preregistered analysis plan with a sample size that ensures high power, demonstrating that the effect is real when the right methods and analyses are used. Interestingly, this option is rarely used by original researchers when they respond to replication failures.
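To make this concrete, the sketch below shows a minimal power calculation for such a preregistered self-replication. The assumed effect size (Cohen’s d = 0.2) is purely illustrative and not taken from the original study or the replication.

```python
# Minimal power-analysis sketch: how many participants per group are needed
# to detect a small effect (Cohen's d = 0.2, an illustrative assumption)
# with 80% or 95% power in a two-group comparison?
from statsmodels.stats.power import tt_ind_solve_power

for target_power in (0.80, 0.95):
    n_per_group = tt_ind_solve_power(effect_size=0.2, alpha=0.05,
                                     power=target_power,
                                     alternative="two-sided")
    print(f"power = {target_power:.0%}: "
          f"about {n_per_group:.0f} participants per group")
```

A self-replication with samples of roughly this size per group would have enough power to show whether the effect holds up when the right methods and analyses are used.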
e. Lack of Discussion on Preregistration’s Role
The article suggests that preregistration does not fully solve the problem of analytical flexibility. However, it does not discuss how preregistration can help mitigate some of these issues. A more nuanced discussion of how preregistration could be improved (rather than dismissed) would strengthen their argument.
4. Broader Implications
This article highlights a critical issue in the replication crisis: the field’s tendency to accept replication failures as definitive proof that an original effect was spurious. The authors argue that context matters, and replication attempts that fail to recreate key aspects of the original study may be invalid. This raises broader questions for psychology:
- Should replication studies always be considered the gold standard for scientific validity?
- How can the field create better methodological guidelines for replication attempts?
- Should psychology focus on understanding effect variability rather than labeling effects as either “real” or “false”?
The paper encourages a more nuanced view of replication, one that takes into account experimental context, researcher incentives, and statistical flexibility. While it does not fully resolve the debate, it provides a strong argument for revisiting past replication failures with greater scrutiny.
US. A common problem in discussions of replication failures is the misinterpretation of replication failures as evidence that an original effect was spurious. The problem is that a non-significant replication result does not demonstrate that the effect is zero; it may simply reflect insufficient power or an original effect size estimate that was inflated by selection for significance.
5. Conclusion
Overall, this article makes an important contribution to the ongoing replication debate in psychology. It highlights the role of replicator degrees of freedom in shaping replication outcomes and calls for more rigorous methodological standards in replication studies. However, its arguments are somewhat self-serving, and its critique of replication studies could be more balanced. Despite these limitations, it raises valuable points that should be considered as psychology continues to refine its approach to replication research.
Final Rating: 8/10
US. The rating is way too high, but maybe the AI is American. There is no evidence that replication failures are caused by shady practices of replication researchers to produce negative results.
✅ Strengths:
- Important contribution to the replication crisis discussion
- Strong statistical analysis (specification-curve & BCF)
- Highlights biases in replication studies
- Calls for better methodological standards
❌ Weaknesses:
- Potential confirmation bias
- Focuses on a single case study
- Overreliance on specification-curve analysis
- Underestimates the value of large sample sizes
This paper should spark productive discussions on how replication studies are designed, analyzed, and interpreted. While it is not the final word on the topic, it is a thought-provoking challenge to conventional wisdom in psychological science.
US. The problem is that there are now far more articles that discuss the replication crisis than empirical studies that actually explain it. This article adds no substantial evidence to help us understand replication failures beyond the specific studies examined here, and even for these there is no conclusive evidence that the original results were credible and that the replication failures were caused by unintentional or intentional biases to produce non-significant results.
Final Rating: 2/10