Z-Curve is Doing Just Fine: The Critics are Wrong

Closed peer review has strict rules about communication. The main reason may be that open display of peer review would destroy the illusion that peer review equals quality control. I am not allowed to share the flawed manuscript to shame the authors, but I am allowed to share my rebuttal, which highlights the lack of competence of some critics of z-curve.

Based on a long list of false and misleading claims, the authors conclude, “we recommend against the use of the z-curve to estimate average power.” The problem for the authors is that they started with the conclusion and then tried to find arguments to justify it. This led to major mistakes, such as confusing the expected discovery rate with the expected replication rate, which estimate different quantities.

Here is my review, which was drafted and fact-checked with two neutral experts on z-curve (ChatGPT & Claude).

Overview

The authors evaluate the sampling distribution and accuracy of z-curve point estimates and recommend against use of z-curve. The manuscript is seriously flawed. The central problem is an estimand error: the manuscript does not distinguish the Expected Replication Rate (ERR) from the Expected Discovery Rate (EDR), and in the key heterogeneous simulation it evaluates the wrong quantity against the wrong benchmark. This error creates the appearance of large bias where the relevant z-curve estimates are in fact close to their proper targets.

The simulation design is also too narrow for the paper’s broad conclusion. It emphasizes homogeneous scenarios that favor a homogeneous selection model, while ignoring the existing z-curve validation literature and the broader simulation designs already used to evaluate z-curve 2.0. A scientifically appropriate comparison would apply Hedges’ selection model and z-curve 2.0 to the same broad factorial simulation design, including both homogeneous and heterogeneous conditions, rather than drawing strong conclusions from a small set of favorable cases for Hedges.

These issues are not peripheral. They reverse the interpretation of the central results.

Major Issue 1: Z-curve Estimates ERR and EDR, Not a Single Undifferentiated “Average Power”

The manuscript treats “average power” as a single quantity. That is not how z-curve 2.0 is defined. Z-curve estimates two distinct quantities:

ERR (Expected Replication Rate): the average power of significant studies, that is, mean power conditional on selection for significance. In a mixture model, ERR is computed from the mixture weights in the selected distribution.

EDR (Expected Discovery Rate): the expected proportion of all conducted tests that become statistically significant. EDR is computed from the population distribution of studies before selection. When selection for significance and heterogeneity are present, the selected distribution overrepresents higher-power studies, so ERR and EDR differ.

Any evaluation of z-curve must compare the ERR estimate with the true ERR and the EDR estimate with the true EDR. The manuscript repeatedly refers to “average power” without maintaining this distinction. This is the source of the most consequential error in the paper.
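The distinction is easy to verify numerically. The following Python sketch uses a hypothetical two-component mixture (not a scenario from the manuscript) to compute EDR as mean power over all conducted studies and ERR as mean power after reweighting by selection for significance:

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(mu, z_crit=1.959964):
    """Two-sided power of a z-test with noncentrality parameter mu."""
    return (1.0 - phi(z_crit - mu)) + phi(-z_crit - mu)

# Hypothetical population: half the studies have mu = 0.5, half have mu = 3.
mus, w_pop = [0.5, 3.0], [0.5, 0.5]
pw = [power(m) for m in mus]

# EDR: mean power over all conducted studies (population weights).
edr = sum(w * p for w, p in zip(w_pop, pw))

# Selection for significance reweights each component by its power.
w_sel = [w * p / edr for w, p in zip(w_pop, pw)]

# ERR: mean power over the selected (significant) studies.
err = sum(w * p for w, p in zip(w_sel, pw))

print(round(edr, 3), round(err, 3))  # EDR is well below ERR
```

Under homogeneity the reweighting does nothing and ERR equals EDR, which is why homogeneous scenarios cannot expose a conflation of the two.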

Major Issue 2: The Benchmark Error in Scenario 2

Scenario 2 is described as an equally weighted mixture over μ = 0, 1, 2, 3, 4, 5. The submitted code samples the noncentrality parameter for each retained significant study uniformly:

z0 <- sample(zz, kk, replace = TRUE)

Because kk is the number of significant studies, this creates equal mixture weights in the selected distribution, not in the population distribution. Thus w_j^sel = 1/6 for all j.

The component powers for μ = 0, 1, 2, 3, 4, 5 are approximately .050, .170, .516, .851, .979, and .999. Therefore the true ERR is the simple average of these values:

true ERR = (1/6)(.050 + .170 + .516 + .851 + .979 + .999) = .594

This is the value reported by the authors. However, .594 is the ERR, not the EDR. The corresponding population weights are obtained by inverse probability weighting:

w_j^pop ∝ w_j^sel / P(sig | μ_j)

With equal selected weights, this yields the following population weights and EDR:

Component μ   P(sig | μ)   Unnormalized w^pop   Normalized w^pop
    0            .050           20.000                .645
    1            .170            5.882                .190
    2            .516            1.938                .062
    3            .851            1.175                .038
    4            .979            1.021                .033
    5            .999            1.001                .032

The true EDR is therefore approximately:

true EDR = Σ w_j^pop P(sig | μ_j) = 6 / Σ_j [1 / P(sig | μ_j)] ≈ .193

The reported RMSE of approximately .37 to .38 for z-curve in Scenario 2 is therefore not evidence of poor estimation. It is approximately the difference between the true EDR (.19) and the incorrect ERR benchmark (.59). The claim that accuracy can deteriorate as k increases is also an artifact of this error. As k increases, z-curve converges more precisely to the correct EDR, which is farther from the incorrect ERR benchmark.
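The artifact is easy to reproduce with a toy model. Assume, purely for illustration, an estimator that is unbiased for the true EDR of .193 with sampling SD shrinking as 0.25/sqrt(k). Evaluated against the wrong ERR benchmark of .594, its RMSE can never fall below the estimand gap of about .40, no matter how large k gets:

```python
import math
import random

TRUE_EDR, WRONG_ERR_BENCHMARK = 0.193, 0.594

def rmse(estimates, target):
    return math.sqrt(sum((e - target) ** 2 for e in estimates) / len(estimates))

rng = random.Random(1)
results = {}
for k in (30, 100, 1000):
    sd = 0.25 / math.sqrt(k)  # illustrative sampling noise, shrinking with k
    est = [rng.gauss(TRUE_EDR, sd) for _ in range(5000)]
    results[k] = (rmse(est, TRUE_EDR), rmse(est, WRONG_ERR_BENCHMARK))

for k, (right, wrong) in results.items():
    print(k, round(right, 3), round(wrong, 3))
```

Against the correct benchmark, RMSE shrinks toward zero as k grows; against the wrong one it plateaus near |.594 − .193| = .401.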

The relevant evaluation is straightforward: compare the z-curve ERR estimate to ERR = .594 and the z-curve EDR estimate to EDR = .193. Under this comparison, the central Scenario 2 failure disappears.

The R code for the correct benchmarks is:

zz <- 0:5                                                # component noncentrality parameters
alpha <- 0.05
z.alpha <- qnorm(1 - alpha/2)                            # two-sided critical value, 1.96
power <- 1 - pnorm(z.alpha - zz) + pnorm(-z.alpha - zz)  # power of each component
w.sel <- rep(1/6, 6)                                     # equal weights in the selected distribution
true.ERR <- sum(w.sel * power)                           # .594
w.pop <- w.sel / power                                   # inverse probability weighting
w.pop <- w.pop / sum(w.pop)                              # normalized population weights
true.EDR <- sum(w.pop * power)                           # .193

Major Issue 3: Why Hedges Appears to Perform Well

The manuscript reports that Hedges (1984) achieves lower RMSE than z-curve in Scenario 2 and concludes that z-curve performs worse. This comparison is invalid because Hedges is also being evaluated against the ERR benchmark.

Hedges (1984) is a homogeneous selection model. When fitted to heterogeneous selected data, it estimates a single μ*. The associated power can fall near the selected-sample mean power, that is, near ERR. This does not mean that Hedges correctly estimates EDR or any appropriate population quantity under heterogeneity. It means that the model is being rewarded for estimating a quantity close to the benchmark that the manuscript mistakenly uses.

Thus, Hedges’ apparent superiority is produced by two errors canceling: the benchmark is ERR, and the homogeneous selection model estimates something close to ERR. Z-curve is penalized because its EDR estimate is closer to the correct EDR.

This point is consistent with the broader comparison of homogeneous selection models and z-curve. Hedges is not identical to p-curve, but it shares the key limitation relevant here: it is a homogeneous selection model. Homogeneous selection models can look good when homogeneity holds, but they are not designed for heterogeneous literatures. Prior z-curve validation work comparing z-curve with p-curve and p-uniform shows that p-curve-type homogeneous methods perform well when all studies have the same power, but can overestimate power when power varies across studies. Z-curve was introduced precisely to address this heterogeneity problem.

In additional simulations with continuous heterogeneity in noncentrality parameters, NCP ~ Normal(μ, 2²), the performance ranking reverses: Hedges substantially overestimates EDR, whereas z-curve recovers ERR and EDR with much smaller error. These simulations should be reported explicitly, with code and Monte Carlo uncertainty, but their implication is already clear: the manuscript’s favorable comparison for Hedges is an artifact of simulation design and estimand mismatch, not evidence that Hedges is generally superior.

Major Issue 4: The Simulation Design Ignores Existing Best Practices and Existing z-curve Validation Designs

The problem is not absence of preregistration. The problem is that the simulation design does not follow best practices for evaluating statistical methods. A defensible comparative simulation should examine a broad and theoretically justified design space, include conditions where each method is expected to succeed and fail, report Monte Carlo uncertainty, and compare methods under the same data-generating conditions.

This is especially important here because z-curve 2.0 has already been evaluated in a large simulation design with hundreds of conditions, including homogeneous and heterogeneous power distributions. The manuscript ignores this existing design and instead focuses mainly on homogeneous settings and a single heterogeneous setting that is misinterpreted. A more appropriate comparison would be straightforward: apply both Hedges and z-curve 2.0 to the simulation conditions used to validate z-curve 2.0, including the homogeneous cells already present in that design. There is no need to invent a narrow new design that mainly favors the homogeneous model.

The manuscript’s omission is also unscientific because more recent R-Index simulations extend the coverage evidence to smaller samples, including k = 50 significant results, and report near-nominal EDR confidence-interval coverage across many plausible conditions. A contrary conclusion based on a few narrow scenarios should be interpreted against this existing evidence, not in isolation from it.

Major Issue 5: Homogeneity Tests Efficiency, Not Validity

In homogeneous conditions, a correctly specified homogeneous selection model such as Hedges (1984) is expected to be more efficient than z-curve because it exploits the true homogeneity assumption. Z-curve allows heterogeneity in power, and this flexibility can reduce precision in homogeneous finite samples. That is a robustness-efficiency tradeoff, not evidence of invalidity.

The manuscript treats z-curve’s point-estimate variability under homogeneity as a methodological failure. That interpretation is wrong. The relevant criterion is whether z-curve estimates ERR and EDR with acceptable bias and whether its confidence intervals have appropriate coverage. Existing simulation studies show that z-curve confidence intervals have good coverage under the intended range of conditions. Less precision than a model that “knows” the data are homogeneous is the price of being valid when the data are heterogeneous, which is the typical case in real meta-analytic applications.

Major Issue 6: Application Section Lacks Meaningful Method Comparison

The application section applies z-curve to 38 meta-analyses from Cognition and Emotion and emphasizes bootstrap variability of point estimates. However, no comparison method is applied to the same data. Demonstrating that z-curve estimates are variable in small or weakly informative datasets does not show that z-curve is inferior to alternatives. It may simply show that the data contain little information.

Many of these meta-analyses have small numbers of significant z-statistics after subsampling. In such cases, any method that honestly reflects uncertainty should produce wide intervals. A fixed-effect or homogeneous selection model may produce narrower intervals by imposing stronger assumptions, but this is not evidence that those assumptions are justified.

At minimum, Hedges (1984) and relevant heterogeneous or clustered selection models should be fit to the same datasets, and uncertainty should be compared under the same dependence structure. Otherwise the application section cannot support a method-level recommendation against z-curve.

Minor and Additional Issues

1. “No distributional assumptions” quotation

The authors correctly note that the claim that z-curve “does not make any distributional assumptions about the data” is imprecise. However, the relevant distinction should be stated accurately. Z-curve does not assume a parametric distribution of effect sizes across studies, such as a normal random-effects distribution. It does assume a mixture distribution for significant z-statistics. The manuscript uses the imprecision rhetorically rather than clarifying this distinction.

2. “Moving target” characterization

The manuscript describes z-curve as a moving target because methods have been revised. This is not a substantive statistical criticism. Methodological improvement is normal scientific practice. The relevant question is whether z-curve 2.0, the version under evaluation, is evaluated against its correct estimands.

3. Scenario 1 interpretation

In the homogeneous Scenario 1, ERR and EDR coincide, so the Scenario 2 benchmark error does not apply. However, this scenario is trivially favorable to Hedges because the data-generating mechanism matches Hedges’ assumptions. Z-curve’s lower efficiency in this setting is not evidence of invalidity; it reflects the cost of allowing heterogeneity.

4. Sample sizes in applications

Many application analyses contain fewer than 20 significant z-statistics after subsampling. Wide bootstrap distributions in such cases are expected. The manuscript repeatedly treats lack of information in small datasets as a defect of z-curve.

5. Subsampling sensitivity

The manuscript presents sensitivity to subsampling strategy as a weakness of z-curve. But different subsampling strategies change the effective sample size and the information retained. Sensitivity to these choices reflects the dependence structure in the data. A clustered bootstrap or multilevel approach is the appropriate solution, not abandonment of z-curve.
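The principle is simple: resample whole clusters, not individual z-statistics. The sketch below is generic Python, not the zcurve package, and uses a plain mean as the statistic for illustration; it keeps each study's z-statistics together in every bootstrap draw so that within-study dependence is preserved:

```python
import random

def clustered_bootstrap(z_by_study, stat, n_boot=2000, seed=1):
    """Bootstrap a statistic by resampling studies (clusters) with
    replacement, keeping each study's z-statistics together."""
    rng = random.Random(seed)
    studies = list(z_by_study.values())
    draws = []
    for _ in range(n_boot):
        # One bootstrap sample: as many cluster draws as there are studies.
        sample = [z for _ in range(len(studies)) for z in rng.choice(studies)]
        draws.append(stat(sample))
    return draws

# Hypothetical data: significant z-statistics grouped by source study.
z_by_study = {"s1": [2.1, 2.3], "s2": [3.0], "s3": [1.98, 2.6, 2.2]}
draws = clustered_bootstrap(z_by_study, lambda zs: sum(zs) / len(zs))
```

A percentile interval computed from `draws` then reflects the between-study dependence structure instead of treating every z-statistic as independent.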

6. Worsening RMSE with k

The claimed “deeply undesirable” property that RMSE worsens as k increases in Scenario 2 is an artifact of comparing an EDR estimate to an ERR benchmark. With the correct EDR benchmark, the interpretation reverses.

7. Standard deviation comparison

The manuscript compares z-curve variability to a 0.25/sqrt(k) rule. The scope conditions of that rule need to be stated. If the rule concerns ERR under heterogeneous conditions, then applying it to homogeneous single-μ settings is inappropriate.

8. Post-hoc power critique

The manuscript’s appeal to the literature criticizing post-hoc power is irrelevant and misleading. Those critiques concern single-study observed-power calculations, typically using the observed effect size from one completed study, often to interpret a nonsignificant result. Z-curve does not estimate observed power for a single study and does not estimate conditional power. It estimates ERR and EDR from sets of significant test statistics. The manuscript should not use the conventional post-hoc power critique as evidence against z-curve.

9. Code reproducibility issue

The submitted simulation code reportedly contains k <- c(30, 100) followed by k <- 10000, overwriting the intended sample sizes. If this is the code supplied for review, it does not reproduce the reported k = 30 and k = 100 results and should be corrected.

10. Lack of Monte Carlo uncertainty

The manuscript reports RMSEs and other simulation summaries without Monte Carlo standard errors. This is inconsistent with best practice for simulation studies and makes it difficult to judge the numerical stability of reported differences.
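Reporting these is cheap. A minimal sketch of the approach in Morris, White, and Crowther (2019), here in Python with the delta method used to carry the MCSE of the MSE through the square root (illustrative data, not the manuscript's simulations):

```python
import math
import random

def rmse_with_mcse(estimates, theta):
    """RMSE and its Monte Carlo standard error: MCSE of the MSE
    (Morris et al., 2019) propagated through the square root."""
    n = len(estimates)
    sq_errors = [(e - theta) ** 2 for e in estimates]
    mse = sum(sq_errors) / n
    var_sq = sum((s - mse) ** 2 for s in sq_errors) / (n - 1)
    mcse_mse = math.sqrt(var_sq / n)
    rmse = math.sqrt(mse)
    return rmse, mcse_mse / (2.0 * rmse)  # delta method for sqrt

rng = random.Random(1)
est = [rng.gauss(0.2, 0.05) for _ in range(1000)]
r, se = rmse_with_mcse(est, 0.2)
print(round(r, 4), round(se, 5))
```

With 1,000 replications, an RMSE of roughly .05 carries an MCSE near .001, which is the kind of number readers need to judge whether reported RMSE differences between methods are numerically stable.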

11. Arbitrary accuracy threshold

The manuscript states that RMSE = .20 is the worst tolerable accuracy for estimating average power. This threshold is subjective and should not be treated as a general criterion without justification. In any case, the Scenario 2 RMSE is not interpretable because it uses the wrong benchmark.

12. Dependence claim is overstated

The manuscript states that z-curve “strictly speaking should not be used” for the application meta-analyses because z-statistics are dependent. Dependence affects uncertainty estimation and motivates clustered bootstrap procedures or careful sampling, but it does not by itself justify the conclusion that z-curve should not be used.

13. Exclusion of z-values above 6

The manuscript states that zcurve() returns an error when there are fewer than 10 z-statistics larger than 1.960 and smaller than 6. This is not an error. It is a safeguard that prevents users from applying z-curve to small meta-analyses that should not be analyzed with z-curve.

14. Typographical error on p. 11

The manuscript states “when the the assumptions of the z-curve hold exactly.” The duplicate “the” should be corrected.

15. Grammatical error on p. 24

The manuscript states “The remaining thirteen meta-analysis are…” This should be “The remaining thirteen meta-analyses are…”

16. Page-count inconsistency

The manuscript footer shows pages such as “Page 38 of 37.” This appears to be a formatting or compilation error and should be corrected before publication.

17. Ambiguous treatment of application data availability

For Pillaud and Ric (2023), the manuscript reconstructs z-statistics from figures using WebPlotDigitizer but cannot identify groups, studies, and papers. These analyses assume independence despite acknowledged dependence. The resulting estimates should be labeled more cautiously and should not be used as strong evidence about z-curve performance.

18. Scope of conclusion exceeds evidence

The manuscript recommends against use of z-curve generally. The evidence presented, even if accepted, would support at most a narrower claim about point-estimate instability in small homogeneous or weakly informative datasets. It does not justify a general recommendation against z-curve, especially in light of existing validation studies and the Scenario 2 estimand error.

Suggested Corrected Simulation Design

A minimally adequate comparison would use the existing z-curve 2.0 validation design and add Hedges (1984) as a comparator. The design should include homogeneous and heterogeneous conditions, evaluate ERR and EDR separately, and report point-estimate accuracy and confidence-interval coverage.

A simple extension of the manuscript’s Scenario 1 would draw noncentrality parameters from a continuous distribution, for example NCP ~ Normal(μ, 2²). An NCP SD of 2 is a plausible stress-test condition. For a two-group standardized mean difference with equal group sizes and total N = 80, n1 = n2 = 40 and the noncentrality parameter is λ = d / sqrt(1/n1 + 1/n2) = d sqrt(20) ≈ 4.47d. Thus heterogeneity of τ_d = .30 to .40 corresponds to SD(λ) ≈ 1.34 to 1.79, and SD(λ) = 2 corresponds to τ_d ≈ .45. This is close to the upper end of empirically observed heterogeneity and is a reasonable condition for testing robustness.
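These conversions are just arithmetic and can be checked in a few lines (Python, mirroring the formulas above):

```python
import math

n1 = n2 = 40                                  # equal groups, total N = 80
scale = 1.0 / math.sqrt(1.0 / n1 + 1.0 / n2)  # lambda = d * scale = d * sqrt(20)

sd_lambda_low = 0.30 * scale   # tau_d = .30 -> SD(lambda) ~ 1.34
sd_lambda_high = 0.40 * scale  # tau_d = .40 -> SD(lambda) ~ 1.79
tau_d_for_sd2 = 2.0 / scale    # SD(lambda) = 2 -> tau_d ~ .45

print(round(scale, 2), round(sd_lambda_low, 2),
      round(sd_lambda_high, 2), round(tau_d_for_sd2, 2))
```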

Summary

The manuscript’s central conclusion is not supported. Scenario 1 favors the Hedges model by construction because it assumes homogeneity. Scenario 2 introduces heterogeneity but evaluates z-curve against the wrong benchmark by conflating ERR and EDR. The reported large RMSE for z-curve in Scenario 2 is therefore not evidence of poor performance; it is an estimand mismatch.

The manuscript also ignores extensive existing z-curve validation evidence, including large simulation designs and R-Index simulations showing good confidence-interval coverage for EDR under many conditions, including k = 50 significant results. A scientifically appropriate comparison would evaluate Hedges and z-curve 2.0 across the same broad simulation design and compare ERR and EDR separately.

The recommendation against z-curve should therefore be rejected. At most, the manuscript shows that homogeneous selection models can be more efficient when homogeneity is true and known, and that z-curve estimates are uncertain in small or weakly informative datasets. These are not new findings and do not undermine z-curve’s intended use for heterogeneous literatures selected for significance.

References to Add or Emphasize

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4.

Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38, 2074–2102.

Pawel, S., Kook, L., & Reeve, K. (2024). Pitfalls and potentials in simulation studies. Biometrical Journal, 66, 2200091.

Siepe, B. S., Bartoš, F., Morris, T. P., Boulesteix, A.-L., Heck, D. W., & Pawel, S. (2024). Simulation studies for methodological research in psychology. Psychological Methods.

Schimmack, U. (2026, February 3). Concerns About Z-Curve: Evidence From New Simulations With Few Studies. Replicability-Index.

Schimmack, U. (2026, April 8). The P-Curve/Z-Curve Exchange: A Methodological Dispute in Real Time. Replicability-Index.

Schimmack, U., & Soto, M. D. (2026). Draft/tutorial comparing p-curve and z-curve under heterogeneity, as discussed in the R-Index post “The P-Curve/Z-Curve Exchange.”
