This post-publication peer review was created in collaboration with ChatGPT 5.0.
“On the Poor Statistical Properties of the P-Curve Meta-Analytic Procedure” by Richard D. Morey & Clintin P. Davis-Stober https://doi.org/10.1080/01621459.2025.2544397
Overview
Morey & Davis-Stober present a detailed critique of p-curve, a set of statistical tests proposed by Simonsohn et al. (2014, 2015) to assess evidential value in sets of published results. They identify conceptual and statistical issues, particularly with the formal tests in the p-curve app and its average power estimation feature. Their central conclusion is that p-curve, in its current form, should not be used for either hypothesis testing or power estimation.
The paper is well-written, technically competent, and thorough in its exploration of certain flaws. However, the review also has significant omissions and imbalances. It fails to situate p-curve within the broader trajectory of methodological development in forensic meta-analysis, particularly the work by Brunner and Schimmack on z-curve, which directly addresses and resolves the main statistical weakness Morey & Davis-Stober highlight. Moreover, in some places the critique overstates p-curve’s practical shortcomings and risks giving the impression that no viable alternative exists.
Strengths of the manuscript
- Detailed breakdown of p-curve’s formal tests – The authors provide an in-depth explanation of EV, LEV, and related tests, clarifying the exact null and alternative hypotheses, and showing the disconnect between these hypotheses and the informal “evidential value” narrative.
- Identification of statistical weaknesses – The analysis of inadmissibility, non-monotonicity, and sensitivity to p-values near the significance threshold is technically sound and important for understanding the limitations of the 2015 probit-based tests.
- Critique of the average-power estimator – The authors convincingly show that p-curve’s current estimator is inconsistent under heterogeneity in effect sizes or sample sizes, with potential for substantial upward bias.
Limitations and omissions
- Lack of historical context on heterogeneity problem
The authors treat the discovery that p-curve’s power estimation fails under heterogeneity as novel to their work. In fact, heterogeneity bias in p-curve was already documented in Van Aert et al. (2016) and more systematically addressed by Brunner & Schimmack when they introduced z-curve (Brunner & Schimmack, 2020, 2021).
- Z-curve models the full distribution of significant results as a mixture of components with different noncentrality parameters, eliminating the unrealistic homogeneity assumption that underlies p-curve’s bias.
- By omitting this work, the authors under-represent the state of the field and miss an opportunity to frame their critique as part of a solved problem.
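The bias mechanism can be illustrated with a small simulation (a hedged sketch, not the authors’ code or the actual p-curve implementation): when significant z-statistics really come from a 50/50 mix of null and high-powered studies, fitting a single truncated normal — the homogeneity assumption behind p-curve’s average-power estimator — overestimates the average power of the significant studies.

```python
import math
import random
from statistics import NormalDist

random.seed(1)
N = NormalDist()   # standard normal
CRIT = 1.645       # one-sided z criterion (p < .05)

def power(lam):
    """Power of a one-sided z-test with noncentrality parameter lam."""
    return 1 - N.cdf(CRIT - lam)

# A heterogeneous literature: half null effects (lam = 0), half strong (lam = 3).
studies = [random.choice([0.0, 3.0]) for _ in range(200_000)]
draws = [(lam + random.gauss(0, 1), lam) for lam in studies]

# Publication filter: only significant results are observed.
selected = [(zi, lam) for zi, lam in draws if zi > CRIT]
n = len(selected)
s1 = sum(zi for zi, _ in selected)
s2 = sum(zi * zi for zi, _ in selected)

# The estimand: mean true power behind the significant results.
true_avg_power = sum(power(lam) for _, lam in selected) / n

# Homogeneous ("single-lambda") model: MLE of one truncated normal, grid search.
def loglik(lam):
    # sum of log-densities of N(lam, 1) truncated at CRIT, up to a constant,
    # computed from the sufficient statistics s1 and s2
    return -0.5 * (s2 - 2 * lam * s1 + n * lam * lam) \
           - n * math.log(1 - N.cdf(CRIT - lam))

lam_hat = max((i / 100 for i in range(501)), key=loglik)
est_avg_power = power(lam_hat)

print(f"true average power of significant studies: {true_avg_power:.3f}")
print(f"single-lambda estimate:                    {est_avg_power:.3f}")
```

Under this setup the homogeneous fit lands a few percentage points above the truth; the gap grows as the mix of true powers becomes more extreme.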
- Overstatement of p-curve’s practical failings
While the identified statistical properties are real, they can sound more damaging to applied use than they are:
- For detection of evidential value, the original 2014 log-based EV test remains a valid way to reject the “all effects are zero” null in the absence of severe violations (e.g., extreme p-value dependence).
- The main unreliability lies in quantifying evidential value with the app’s average-power feature, not in detecting it.
- Readers may leave the paper believing that p-curve is useless in all forms, which is an overgeneralization.
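For concreteness, here is a minimal sketch of the 2014 log-based test (following the published description in Simonsohn et al., 2014; the p-values below are illustrative, not from any real study set). Conditional on significance, a p-value under the all-null hypothesis is uniform on (0, α), so pp = p/α is uniform on (0, 1) and Fisher’s statistic −2·Σ ln(pp) follows a χ² distribution with 2k degrees of freedom.

```python
import math

def fisher_ev_test(p_values, alpha=0.05):
    """Log-based (Fisher-style) p-curve test for evidential value.

    Under the null that every true effect is zero, a significant p-value
    is uniform on (0, alpha), so pp = p/alpha is uniform on (0, 1) and
    -2 * sum(ln(pp)) is chi-square distributed with 2k degrees of freedom.
    """
    pps = [p / alpha for p in p_values if p < alpha]
    chi2 = -2 * sum(math.log(pp) for pp in pps)
    k = len(pps)
    # Chi-square survival function for even df = 2k has a closed form.
    half = chi2 / 2
    tail = math.exp(-half) * sum(half**j / math.factorial(j) for j in range(k))
    return chi2, 2 * k, tail

# A right-skewed set of significant p-values, as expected under real effects.
chi2, df, pval = fisher_ev_test([0.001, 0.002, 0.005, 0.01])
print(f"chi2({df}) = {chi2:.2f}, p = {pval:.4f}")
```

A small combined p-value rejects the “all effects are zero” null, i.e., the set shows evidential value; the test makes no claim about how much.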
- No acknowledgment of improved alternatives
The authors conclude by recommending against the use of p-curve in any form, but do not inform readers that more robust successors exist:
- Z-curve resolves the inconsistency under heterogeneity, produces bias-corrected average-power estimates with confidence intervals, and retains evidential-value detection capability.
- Bayesian hierarchical selection models offer another path, explicitly modeling the distribution of λ and the selection process.
- Presenting these options would give readers a constructive way forward instead of ending with a methodological dead end.
- Unclear separation between coding issues and statistical flaws
For example, the Montoya et al. (2024) criticism about instability from multiple candidate p-values per study is cited without noting that Simonsohn et al.’s own guidelines already require choosing the most focal test, which largely avoids this issue. This blurs the line between problems inherent to the method and problems arising from poor implementation.
Balanced interpretation
- What the paper gets right:
- The p-curve app’s 2015 probit-based tests (*EV, *LEV) and its average-power feature have real, demonstrable weaknesses.
- These include inadmissibility, high sensitivity to near-threshold p-values, and inconsistency under heterogeneity.
- What the paper underplays or omits:
- The fact that detection of evidential value via the 2014 log-based EV test is less affected by these problems.
- That heterogeneity bias was recognized years earlier and has already been addressed in z-curve and related mixture-model approaches.
- That practical alternatives exist which preserve the core logic of p-curve but remove the statistical defects the authors highlight.
Conclusion
Morey & Davis-Stober’s paper is valuable as a technical audit of the p-curve app’s internal workings, but it is incomplete as a guide for applied researchers. The heterogeneity problem they identify is neither new nor insurmountable: it has been solved in the z-curve framework, which replaces the single-λ assumption with a mixture model of noncentrality parameters and yields reliable average-power estimates under realistic conditions.
A balanced takeaway is:
- P-curve is serviceable for detecting the presence of evidential value in relatively homogeneous sets of studies, but unreliable for quantifying it in the presence of heterogeneity.
- Z-curve (or similar mixture-model approaches) should be preferred when heterogeneity is suspected, as it retains the interpretive clarity of p-curve while avoiding its statistical weaknesses.
Table – Comparison of P-curve and Z-curve
| Feature | P-curve | Z-curve |
|---|---|---|
| Primary Goal | Test for “evidential value” (reject all-H₀ null) and estimate “average power” | Estimate average power / replicability while accounting for heterogeneity and selection |
| Key Assumption | Single underlying noncentrality parameter (homogeneous power) | Multiple underlying noncentrality parameters (heterogeneous power) via mixture model |
| Selection Model | Implicit truncation at α (e.g., p < .05) | Explicitly models truncation and selection process |
| Heterogeneity Handling | Not modeled; leads to biased power estimates | Modeled via mixture of truncated distributions |
| Evidential Value Detection | Works reasonably when assumptions hold; EV test valid under independence | Also provides detection, via model fit and comparison to null |
| Average Power Estimation | Biased under heterogeneity; CIs unreliable | Unbiased under heterogeneity (within sampling error); CIs valid |
| Outputs | EV test p-value; average power estimate; CI (invalid if heterogeneity present) | Expected replication rate (ERR), expected discovery rate (EDR), average power, valid CIs |
| Best Use Case | Homogeneous, independent, significant p-values | Realistic, heterogeneous, possibly dependent significant p-values |
| Key References | Simonsohn, Nelson, & Simmons (2014, 2015) | Brunner & Schimmack (2020, 2021) |
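The “mixture of truncated distributions” entries in the table can be made concrete with a toy EM fit (a simplified sketch of the idea only, not the actual zcurve implementation, which uses a denser component grid plus bootstrapped confidence intervals and EDR/ERR estimates): fix a small grid of noncentrality components, estimate the mixture weights of the truncated densities by expectation-maximization, and read off average power as the weight-averaged power of the components.

```python
import random
from statistics import NormalDist

random.seed(2)
N = NormalDist()
CRIT = 1.645  # one-sided z criterion (p < .05)

def power(lam):
    """Power of a one-sided z-test with noncentrality parameter lam."""
    return 1 - N.cdf(CRIT - lam)

# Simulate a heterogeneous literature and keep only significant z-values.
studies = [random.choice([0.0, 3.0]) for _ in range(10_000)]
z_sig = [zi for zi in (lam + random.gauss(0, 1) for lam in studies) if zi > CRIT]

# Fixed component grid (a real z-curve grid is denser); truncated densities.
grid = [0.0, 1.5, 3.0]
dens = [[N.pdf(zi - lam) / (1 - N.cdf(CRIT - lam)) for lam in grid]
        for zi in z_sig]

# EM for the mixture weights of the fixed components.
w = [1 / len(grid)] * len(grid)
for _ in range(200):
    new_w = [0.0] * len(grid)
    for row in dens:
        total = sum(wk * fk for wk, fk in zip(w, row))   # E-step
        for k, (wk, fk) in enumerate(zip(w, row)):
            new_w[k] += wk * fk / total
    w = [s / len(z_sig) for s in new_w]                  # M-step

# Average power of the significant studies under the fitted mixture.
est_avg_power = sum(wk * power(lam) for wk, lam in zip(w, grid))
print(f"estimated average power: {est_avg_power:.3f}")
```

Because the components are allowed to differ, the weighted estimate stays close to the true average power of the selected studies (about 0.87 in this setup), where the single-λ fit shown to be biased above would not.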
Thanks, Uli, very interesting. I was wondering how you prompt ChatGPT? Is it a one-off or is it iterative? And what are your instructions?
Kind regards Jesper Schneider
I basically engage in a discussion. During the discussion I ask critical questions that rely on my expertise. I often know things that a quick (or even long) search of the web does not find. You can also ask ChatGPT for sources that may be relevant (search Gelman blog, search DataColada blog, search R-Index blog, etc.)
Thanks, very helpful