The P-Curve/Z-Curve Exchange: A Methodological Dispute in Real Time

Background

In the interest of open science, this blog post summarizes a private email exchange between Uri Simonsohn, the principal developer of p-curve, and me, the principal developer of z-curve. At Simonsohn's request, the correspondence itself is not reproduced here. I used AI throughout the communication, and this account of the exchange was written by Claude, who was asked to write it from a neutral third-party perspective. That does not rule out bias, but Uri is welcome to use the comment section to share his own perspective, an option not available on his own blog, DataColada.

Key Points

  • I shared simulations showing that p-curve overestimates average power under realistic heterogeneity while z-curve does not. Simonsohn did not challenge these results.
  • Simonsohn’s own public position since 2018 is that p-curve is biased when some studies have power above 90%. Uncertainty about effect sizes guarantees that real data will include such studies, making bias the norm rather than the exception.
  • Simonsohn argued that average power is not a meaningful quantity under heterogeneity. If so, the p-curve app should stop displaying it. If average power is meaningful, z-curve estimates it better.
  • P-curve's confidence intervals do not achieve 95% coverage. Z-curve 3.0's intervals do, even with homogeneous data.
  • Z-curve provides information that p-curve cannot: estimates of average power for all studies (EDR), quantification of publication bias, and bounds on the false discovery risk.
  • Simonsohn did not address any of these points. His public position remains unchanged from 2018.

Selection Models: P-Curve and Z-Curve

P-curve and z-curve are both methods that use the distribution of significant p-values to estimate the average statistical power of a set of studies. They share the same goal but differ in a critical respect: p-curve fits a single power parameter to the data, assuming all studies have the same power, while z-curve fits a mixture model that allows power to vary across studies. When power is truly homogeneous, p-curve’s simpler model is more efficient. When power is heterogeneous — as it typically is in meta-analyses of conceptual replications — p-curve produces inflated estimates with confidence intervals that are too narrow (Brunner & Schimmack, 2020). The question at the center of this exchange was whether, and under what conditions, this difference matters in practice.
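
To make the structural difference concrete, here is a minimal base-R sketch. It is a simplified stand-in for a homogeneous-power estimator in the spirit of p-curve, not Simonsohn's implementation (which works on p-values), and the noncentrality distribution and one-sided selection rule are illustrative assumptions:

set.seed(1)
k    <- 5000                          # studies run before selection for significance
ncp  <- rnorm(k, mean = 1, sd = 1)    # heterogeneous noncentralities (assumed values)
z    <- rnorm(k, mean = ncp)          # observed z-statistics, one per study
crit <- qnorm(.975)                   # two-sided alpha = .05
sig  <- z > crit                      # keep only significant positive results

pwr      <- function(m) 1 - pnorm(crit - m)  # power of a study with noncentrality m
true_avg <- mean(pwr(ncp[sig]))              # true average power of the significant studies

# Homogeneous-power fit: one noncentrality for all studies, maximum
# likelihood on the significance-truncated distribution of z
negll <- function(m) -sum(dnorm(z[sig], mean = m, log = TRUE) - log(pwr(m)))
est   <- pwr(optimize(negll, interval = c(0, 6))$minimum)
round(c(truth = true_avg, single_power_fit = est), 2)

With these assumed parameters the single-power fit typically comes out several points above the true average power of the significant studies, the direction of bias documented below; the exact gap depends on the parameters. Z-curve replaces the single parameter with a mixture, which is what allows power to vary across studies.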

The Opening: Procedural Deflection

The exchange began when Schimmack presented evidence that p-curve overestimated average power in the Reproducibility Project data. Simonsohn’s initial response did not address the overestimation. Instead, he objected that the data had not been presented in the format of a p-curve disclosure table — a procedural requirement he had developed for auditing p-curve analyses. Schimmack pointed out that the Reproducibility Project had a uniquely transparent selection process, with key findings identified collaboratively and often with input from original authors, making the disclosure table requirement a matter of form rather than substance. Simonsohn did not contest this point but instead pivoted to personal history, characterizing the dispute as a grudge, and closed the conversation with “I will switch gears and return to my current interests.”

The Simulations: A Deck Stacked for P-Curve

Several weeks later, Simonsohn re-engaged by sharing simulation code originally developed for a 2018 blog post (DataColada 67). He reported that z-curve performed worse than p-curve in most scenarios, with the exception of one scenario Schimmack had provided. His conclusion was that “z-curve is generally slightly worse, except when there are extreme power values that bias p-curve but not z-curve.”

Examination of the simulation parameters revealed two problems. First, the effect size distributions used standard deviations of 0.05 to 0.15 in Cohen’s d units, producing near-homogeneous power across studies. Typical meta-analyses in psychology show heterogeneity of 0.3 to 0.4 or higher (van Erp et al., 2017). Under near-homogeneity, p-curve’s assumption is met by design, making the comparison uninformative about realistic conditions. Second, the simulations used only 20 to 25 studies — too few for z-curve’s mixture model to leverage its structural advantage over p-curve’s simpler model.
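
A back-of-the-envelope comparison shows what is at stake. This uses a z-test approximation to two-sample power and an assumed 20 subjects per group, so the exact numbers are illustrative:

pow <- function(d, n) 1 - pnorm(qnorm(.975) - d * sqrt(n / 2))  # two-sample z approximation
n <- 20                                       # assumed subjects per group
round(pow(0.3 + c(-2, 0, 2) * 0.10, n), 2)    # tau = .10: about .05, .16, .35
round(pow(0.3 + c(-2, 0, 2) * 0.35, n), 2)    # tau = .35: about .00, .16, .89

At tau = .10, every study sits in a low, narrow power band, so p-curve's homogeneity assumption is nearly satisfied and no study approaches the high-power region that Simonsohn himself concedes biases p-curve; at tau = .35, power runs from near zero to almost 90%.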

Rather than pressing these objections, Schimmack conceded that p-curve could outperform z-curve under some conditions and asked Simonsohn to identify the key moderator that determines when each method performs better. Simonsohn did not answer the question, responding only, “I have no time right now.”

Discovering the Estimand Distinction

When the exchange resumed, Simonsohn’s responses revealed that he was encountering the distinction between the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR) for the first time. He wrote: “ah, it seems you do have a different estimand.” This distinction had been published in Brunner and Schimmack (2020) six years earlier and was printed as standard output by the z-curve R package that Simonsohn had been using in his simulations.

Simonsohn further questioned whether p-curve’s estimand was even well-defined under heterogeneity. Schimmack pointed out that this was precisely the problem: p-curve had a clearly defined estimator (fit a single power parameter) but an ill-defined estimand, while z-curve had clearly defined estimands (ERR and EDR) estimated by a more complex model. Under homogeneity the distinction is invisible because ERR equals EDR. Under heterogeneity it is central.
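
The distinction is easy to state with a toy mixture (values assumed for illustration):

pwr <- c(.10, .80)  # true power of two kinds of studies (assumed)
w   <- c(.50, .50)  # share of each kind among all studies that were run
edr <- sum(w * pwr)                   # Expected Discovery Rate: mean power of all studies  (.45)
err <- sum(w * pwr^2) / sum(w * pwr)  # Expected Replication Rate: power-weighted mean      (.72)

Significant results come disproportionately from the high-powered studies, so the ERR (.72) exceeds the EDR (.45); when all studies have the same power the two coincide, which is why the distinction is invisible under homogeneity.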

Schimmack also raised concerns about whether Simonsohn’s simulation architecture — which used an inverse CDF method to generate only significant results rather than simulating natural selection for significance — could adequately distinguish between the quantities the two methods were designed to estimate. The full implications of this concern were clarified only later in the exchange, but the immediate practical question remained: when evaluated against the correct benchmark using realistic parameters, which method performed better?
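
For readers unfamiliar with the technique, here is a sketch of the general inverse-CDF approach, not Simonsohn's actual code: each study's z-statistic is drawn directly from the significant tail of its sampling distribution, so no nonsignificant result ever exists in the simulated data.

r_sig_z <- function(k, ncp, crit = qnorm(.975)) {
  # draw uniforms from the upper tail [P(Z <= crit), 1] of each study's
  # sampling distribution, then invert the CDF; every z exceeds crit by
  # construction (significant results in the opposite direction are ignored)
  u <- runif(k, min = pnorm(crit, mean = ncp), max = 1)
  qnorm(u, mean = ncp)
}

Because every simulated study is significant by construction, the discovery rate is 100% by design: such a framework can test recovery of the significant studies' average power (the ERR) but offers no benchmark against which to check the EDR.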

The Decisive Simulation

Schimmack provided modified code using Simonsohn’s own simulation framework with more realistic parameters: 50 studies, mean effect size d = 0.3, standard deviation of d = 0.25, and mean sample size of 40. These values fall well within the range observed in actual psychology meta-analyses.
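
A base-R sketch of that setup follows. It treats n = 40 as the per-group size of a two-sample z-test with fixed sample sizes and one-sided selection; since “mean sample size of 40” could equally mean total or varying n, the truth this sketch prints will not reproduce the 43% reported in the next paragraph exactly. It is the construction, not the number, that carries over.

set.seed(2)
k   <- 5e5                              # a large population, to pin down the true value
d   <- rnorm(k, mean = 0.3, sd = 0.25)  # heterogeneous effect sizes
n   <- 40                               # read here as per-group n (an assumption)
ncp <- d * sqrt(n / 2)                  # noncentrality of a two-sample z-test
z   <- rnorm(k, mean = ncp)
sig <- z > qnorm(.975)
mean(1 - pnorm(qnorm(.975) - ncp[sig])) # true average power of the selected studies

Samples of 50 significant studies drawn from such a population are what each method was then asked to recover.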

The results were clear. True average power was 43%. P-curve estimated 50%, overestimating by 7 percentage points. Z-curve estimated 41%, underestimating by only 2 percentage points. The difference in accuracy was statistically significant. Z-curve’s 95% confidence intervals achieved 96% coverage. Simonsohn’s code did not compute confidence intervals for p-curve, so their coverage could not be checked within his framework; Schimmack’s own simulations, however, showed better coverage for z-curve than for p-curve.
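
Coverage here means the proportion of simulated meta-analyses in which the 95% interval contains the true value; a well-calibrated interval covers about 95% of the time. A toy check with an ordinary sample-mean interval (nothing specific to z-curve's intervals) shows the logic:

coverage <- function(R, make_ci, truth) {
  hits <- replicate(R, { ci <- make_ci(); ci[1] <= truth && truth <= ci[2] })
  mean(hits)
}
make_ci <- function(n = 30, mu = 0) {   # 95% t-interval for a normal mean
  x <- rnorm(n, mu); m <- mean(x); se <- sd(x) / sqrt(n)
  m + c(-1, 1) * qt(.975, n - 1) * se
}
coverage(2000, make_ci, truth = 0)      # should land near .95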

The Retreat to Philosophy

Faced with these results, Simonsohn shifted from methodological engagement to philosophical objection. He argued that p-curve’s bias under heterogeneity had been known since 2018, that he had acknowledged it in print, and that the bias was “not super consequential” because it occurred only with “extreme power values.” He maintained that averaging power across heterogeneous studies was inherently meaningless, that “most meta-analyses are a waste of everyone’s time,” and that the choice between p-curve and z-curve was “second order” compared to problems of study selection.

Schimmack asked Simonsohn to clarify what he meant by studies with power below 90% — whether he referred to true power (a simulation parameter under the researcher’s control) or observed power (a noisy post-hoc estimate). Simonsohn dismissed this as unimportant: “That’s one of the least important things I wrote.”

The Logical Corner

Schimmack identified a logical inconsistency in Simonsohn’s position. If average power was not a meaningful quantity under heterogeneity, then the natural conclusion would be to remove the power estimate from the p-curve app, which continued to display it to users. Most researchers relied on p-curve’s test of evidential value rather than its power estimate. Removing the estimate would be consistent with Simonsohn’s stated views, would eliminate the known bias, and would not change how most researchers used the tool. Researchers who wanted power estimates could use z-curve, which was designed for that purpose.

Simonsohn did not respond to this suggestion.

Final Conclusion

After the exchange documented above, Simonsohn provided a final response reiterating his original positions: that z-curve performs worse in most scenarios, that p-curve’s bias is caused by “extreme values” rather than heterogeneity, and that average power should not be computed at all when studies are heterogeneous. He did not address the simulation results showing p-curve’s significant overestimation under realistic heterogeneity, nor the absence of confidence intervals in p-curve’s output, nor the suggestion to remove the power estimate from the p-curve app. He requested that only his public writings be cited. His public position remains unchanged from 2018.

The exchange revealed a pattern characteristic of methodological disputes in which a method’s developer has strong ownership over it. Each time the argument narrowed to a point where p-curve’s limitations were empirically exposed, the grounds of discussion shifted — from procedural objections, to personal framing, to redefinition of the relevant quantity, to philosophical dismissal of the enterprise itself. The substantive question — which method gives researchers better estimates under realistic conditions — was answered by the simulations but never acknowledged.

Postscript

I was invited to write a tutorial about the differences between p-curve and z-curve for the journal Communication Methods and Measures (2021-2026). My graduate student and I wrote a draft (Schimmack & Soto, 2026). The manuscript reports the simulation results for different levels of heterogeneity (Table 1). Uri Simonsohn was invited to write a commentary and declined to do so.

Table 1

Mean Estimated Replication Rate (ERR), Root Mean Square Error (RMSE), and 95% CI Coverage by Heterogeneity (Tau) and Method

Tau    Criterion    True   Density 2.0   EM2.0   EM3.0   EM3-Norm   P-curve
0.05   Mean Est.     43        44          38      40        40        44
       RMSE           –        12          11      10        10        12
       Coverage       –        93          82      94        92        92
0.15   Mean Est.     50        49          43      46        45        50
       RMSE           –        11          11       9        11        11
       Coverage       –        97          93      96        95        92
0.25   Mean Est.     59        57          55      58        57        63
       RMSE           –        10          10       9        10        12
       Coverage       –        98          91      95        96        79
0.35   Mean Est.     65        64          63      66        67        75
       RMSE           –        10          10       9         9        13
       Coverage       –        97          94      98        94        67
0.45   Mean Est.     71        68          67      71        72        82
       RMSE           –        10           9       8         6        13
       Coverage       –        98          96      98        98        58
0.55   Mean Est.     73        71          71      75        76        88
       RMSE           –         7           7       8         5        15
       Coverage       –        99          95     100        99        35

Note. Bold values indicate the best-performing method for each condition and criterion. True = population ERR; Density 2.0 = density-based estimator; EM2.0/EM3.0 = expectation-maximization z-curve variants; EM3-Norm = EM3.0 with a normal mixture; P-curve = p-curve power estimator. Coverage = proportion of simulations in which the 95% confidence interval contains the true value; values close to .95 indicate proper calibration, while values below .95 indicate that confidence intervals are too narrow.