Summary (Meta-Description)
P-curve can detect evidential value, but its power estimates become biased under heterogeneity. Z-curve is built to estimate replicability when power varies across studies.
Keywords:
- p-curve
- z-curve
- p-curve vs z-curve
- heterogeneity in meta-analysis
P-Curve vs. Z-Curve: Why Meta-Analysts Are Moving On
If you’ve been following developments in meta-analysis, you’ve probably heard of p-curve, a method introduced by Simonsohn, Nelson, and Simmons (2014) to assess whether a set of statistically significant results shows evidential value (i.e., that not all results are false positives). P-curve became popular because it was simple: take the significant p-values from a set of studies, plot their distribution, and see whether there are more very small p-values (e.g., < .01) than you’d expect if every effect were null, in which case significant p-values would be spread evenly between 0 and .05.
But here’s the catch: p-curve makes a strong assumption that all studies have the same underlying statistical power (homogeneity). In the real world—where effect sizes and sample sizes vary—this assumption rarely holds. And when it doesn’t, p-curve’s “average power” estimate can be severely biased, often overestimating the replicability of a research area.
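The basic p-curve logic (are significant p-values piled up near zero?) can be illustrated with a simplified binomial check. This is only a sketch, not the full p-curve procedure: the function name and the example p-values are made up for illustration, and it only compares how many significant p-values fall below .025 against the 50% expected when every effect is null.

```python
from math import comb

def right_skew_binomial_test(p_values, alpha=0.05):
    """One-sided binomial test for right skew among significant p-values.

    If every studied effect were null, significant p-values would be uniform
    on (0, alpha), so about half of them should fall below alpha / 2.
    """
    significant = [p for p in p_values if p < alpha]
    n = len(significant)
    if n == 0:
        raise ValueError("no significant p-values to analyze")
    k = sum(p < alpha / 2 for p in significant)  # count of "very small" p-values
    # P(X >= k) for X ~ Binomial(n, 0.5)
    p_binom = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return k, n, p_binom

# Made-up p-values, for illustration only (the last one is not significant)
example = [0.001, 0.003, 0.004, 0.008, 0.012, 0.020, 0.030, 0.041, 0.20]
k, n, p_binom = right_skew_binomial_test(example)
print(f"{k} of {n} significant p-values are below .025; binomial p = {p_binom:.3f}")
```

A small binomial p-value suggests more very small p-values than a set of pure false positives would produce. Note that this answers only the yes/no question about evidential value; it says nothing about how much power the studies had.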
The Heterogeneity Problem
In practice, studies differ in:
- Sample size
- Effect size
- Measurement reliability
- Design quality
These differences produce heterogeneity in statistical power. Under heterogeneity, p-curve’s single-parameter model is misspecified—it treats all studies as if they were equally powered, leading to misleadingly high and overconfident estimates.
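To see how quickly power diverges across studies, here is a minimal sketch using the normal approximation for a two-group comparison. The function names and the five example studies are hypothetical; a real power analysis would typically use the noncentral t distribution rather than this z approximation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sided_z(ncp, alpha=0.05):
    """Power of a two-sided z-test when the test statistic is N(ncp, 1)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(crit - ncp)) + STD_NORMAL.cdf(-crit - ncp)

def power_two_group(d, n_per_group, alpha=0.05):
    """Normal-approximation power for comparing two group means.

    d is the standardized mean difference; the noncentrality is d * sqrt(n / 2).
    """
    return power_two_sided_z(d * (n_per_group / 2) ** 0.5, alpha)

# Hypothetical studies: (effect size d, sample size per group)
studies = [(0.2, 20), (0.2, 100), (0.5, 20), (0.5, 100), (0.8, 50)]
for d, n in studies:
    print(f"d = {d:.1f}, n = {n:>3} per group -> power = {power_two_group(d, n):.2f}")
```

Even this small set of plausible designs spans a wide range of power, which is exactly the situation a single-power model cannot represent.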
This problem has been demonstrated in multiple simulation studies (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) and in Chapter 5 of the Z-Curve 3.0 Tutorial (2025). The conclusion is consistent: p-curve works for detecting evidential value, but it’s unreliable for quantifying it when power varies.
Enter Z-Curve
Z-curve was developed by Ulrich Schimmack and Jerry Brunner to directly address the heterogeneity issue. Instead of assuming one “true” power level, z-curve models the distribution of significant z-values as a mixture of components, each with its own noncentrality parameter.
This approach:
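- Recovers the expected replication rate (ERR, the mean power of the significant results) and the expected discovery rate (EDR, the mean power of all studies conducted, significant or not) accurately in simulations with heterogeneous power.
- Produces confidence intervals whose coverage reflects the actual uncertainty.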
- Works with the same type of input data as p-curve (significant test statistics), so researchers can easily compare both.
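To make the mixture idea concrete, the sketch below shows how ERR and EDR follow from a set of mixture components once their weights are known. It is not the z-curve fitting algorithm: the actual method estimates the weights from the observed z-values, whereas here the noncentralities and weights are made-up inputs, the function names are hypothetical, and significance in either direction counts as a success.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
CRIT = STD_NORMAL.inv_cdf(0.975)  # two-sided alpha = .05

def power(ncp):
    """Probability that |z| exceeds CRIT when z ~ N(ncp, 1)."""
    return (1 - STD_NORMAL.cdf(CRIT - ncp)) + STD_NORMAL.cdf(-CRIT - ncp)

def err_and_edr(ncps, weights):
    """ERR and EDR implied by mixture weights defined among significant results.

    ERR is the weighted mean power of the significant results.
    EDR undoes the selection: a component with power p is over-represented
    among significant results by a factor proportional to p, so the
    pre-selection mean power is the weighted harmonic mean of the powers.
    """
    if abs(sum(weights) - 1) > 1e-9:
        raise ValueError("weights must sum to 1")
    powers = [power(m) for m in ncps]
    err = sum(w * p for w, p in zip(weights, powers))
    edr = 1 / sum(w / p for w, p in zip(weights, powers))
    return err, edr

# Made-up mixture: five components with different noncentralities
ncps = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = [0.10, 0.20, 0.30, 0.25, 0.15]
err, edr = err_and_edr(ncps, weights)
print(f"ERR = {err:.2f}, EDR = {edr:.2f}")
```

With a single component the two quantities coincide. Once power varies, ERR exceeds EDR, because high-powered studies are more likely to end up among the significant results.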
Simulation Comparisons
Chapter 5 of the Z-Curve 3.0 Tutorial reports head-to-head simulations. The main contrasts:
| Feature | p-curve | z-curve |
|---|---|---|
| Model of power | Single (homogeneous) | Mixture (heterogeneous) |
| Accuracy under heterogeneity | Overestimates power | Accurate estimates |
| Confidence intervals | Too narrow, misleading | Proper coverage |
| Practical use | OK for “is there any evidential value?” | Best for “how much?” and “how replicable?” |
Results were striking: in realistic mixed-power scenarios, plots of estimated against true values showed z-curve’s estimates clustering along the 45° line (accurate estimates), while p-curve’s often overshot it, giving an illusion of higher replicability than was actually present.
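For readers who want to see what the "true" value in such a comparison looks like, here is a rough sketch of how a mixed-power scenario and its ground truth could be generated. It does not reproduce the tutorial's simulations and does not fit p-curve or z-curve; the uniform spread of noncentralities, the function names, and the seed are all assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()
CRIT = STD_NORMAL.inv_cdf(0.975)  # two-sided alpha = .05

def power(ncp):
    """Probability that |z| exceeds CRIT when z ~ N(ncp, 1)."""
    return (1 - STD_NORMAL.cdf(CRIT - ncp)) + STD_NORMAL.cdf(-CRIT - ncp)

def simulate_literature(n_studies=20000, seed=1):
    """Simulate heterogeneous studies and keep only the significant ones."""
    rng = random.Random(seed)
    # Assumed mixed-power scenario: noncentralities spread evenly over [0, 4]
    ncps = [rng.uniform(0.0, 4.0) for _ in range(n_studies)]
    z_obs = [rng.gauss(m, 1.0) for m in ncps]
    published = [m for m, z in zip(ncps, z_obs) if abs(z) > CRIT]
    true_edr = fmean([power(m) for m in ncps])       # mean power before selection
    true_err = fmean([power(m) for m in published])  # mean power after selection
    observed_rate = len(published) / n_studies       # share of significant results
    return true_edr, true_err, observed_rate

true_edr, true_err, observed_rate = simulate_literature()
print(f"true EDR = {true_edr:.2f}, true ERR = {true_err:.2f}, "
      f"observed discovery rate = {observed_rate:.2f}")
```

An estimator would see only the significant z-values; its estimate is then compared against `true_err` (and, for z-curve, `true_edr`). Points on the 45° line are runs where the estimate matches these true values.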
When to Use Each Method
- If your only question is “Are all these studies just false positives?”, p-curve can still be useful for a quick test of evidential value.
- If you care about how much evidential value there is, or how replicable the findings are, use z-curve, especially when power is likely to vary across studies.
Bottom Line
P-curve was an important step forward, but it’s been surpassed by methods that reflect the complexity of real research literatures. Z-curve is the better choice for quantifying evidential value and replicability, particularly under realistic heterogeneity.
Further Reading & Resources
- Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4.
- Schimmack, U. (2025). Z-Curve 3.0 Tutorial, Chapter 5: P-Curve vs Z-Curve. replicationindex.com
- Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547.