Scientific progress depends on criticism, especially when it is used to identify limitations of statistical methods and to improve them. z-curve is no exception. Over the past year, several critiques have raised questions about the robustness of z-curve estimates, particularly with respect to the expected discovery rate (EDR). These critiques deserve careful examination, but they also require accurate characterization of what z-curve assumes, what it estimates, and under which conditions its estimates are informative.
Two recent lines of criticism are worth distinguishing. First, Pek et al. (2025) show that z-curve estimates can be biased when the publication process deviates from the assumed selection model. The default z-curve model assumes that selection operates primarily on statistical significance at the conventional α = .05 threshold (z = 1.96). Pek et al. demonstrate that if researchers also suppress statistically significant results with small effect sizes—for example, not publishing a result with p = .04 because the standardized mean difference is only d = .40—then z-curve estimates can become optimistic. This result is correct: z-curve cannot diagnose selective reporting based on effect size rather than statistical significance.
However, there is currently little empirical evidence that routine effect-size–based suppression is a dominant driver of publication bias in psychology. More importantly, the presence of such bias does not justify assuming that the published literature is unbiased. All methods that evaluate evidence from the published record must make assumptions about the selection process, and z-curve’s assumptions are at least aligned with well-documented incentives surrounding statistical significance. The fact that z-curve is imperfect does not imply that ignoring selection bias altogether is preferable.
Moreover, the selection mechanism examined by Pek et al. has a clear directional implication: when statistically significant results are additionally filtered by effect size, z-curve’s estimates of the EDR and the expected replication rate (ERR) can be biased upward. This matters for interpretation. If z-curve already yields low EDR or ERR estimates, then the type of misspecification studied by Pek et al. would, if present in practice, imply that the underlying parameters could be even lower. For example, an estimated EDR of 20% under the default selection model could correspond to a substantially lower true discovery rate if significant-but-small effects are systematically suppressed. Whether such effect-size-based suppression is common enough to materially affect typical applications remains an empirical question.
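To make this directional effect concrete, here is a small illustration in Python (my own sketch, not z-curve itself). It simulates two-group studies with n = 100 per group and a true effect of d = .30 (both arbitrary choices) and compares selection on statistical significance alone with selection that additionally suppresses significant results whose observed effect falls below a hypothetical cutoff of d = .40. The additional filter strips out exactly the just-significant z-values that z-curve uses to infer how many nonsignificant results are missing.

```python
# Illustrative sketch (not z-curve): compare selection on significance only with
# selection that also suppresses significant results with small observed effects.
# All numbers (n = 100 per group, true d = .30, cutoff d >= .40) are assumptions
# chosen for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_per_group, true_d, n_studies = 100, 0.30, 200_000

se = np.sqrt(2 / n_per_group)              # approximate SE of the standardized mean difference
d_obs = rng.normal(true_d, se, n_studies)  # observed effect size of each study
z = d_obs / se                             # corresponding z-statistic

sig_only = z > 1.96                        # selection on statistical significance only
sig_and_big = sig_only & (d_obs >= 0.40)   # additionally suppress small significant effects

for label, keep in [("significance only", sig_only),
                    ("significance and d >= .40", sig_and_big)]:
    z_sel = z[keep]
    near_threshold = np.mean((z_sel > 1.96) & (z_sel < 2.96))
    print(f"{label:26s}  mean z = {z_sel.mean():.2f}  "
          f"share of z in (1.96, 2.96) = {near_threshold:.2f}")
```

Under these assumptions, the surviving z-values shift away from the significance threshold, which is precisely the pattern that leads z-curve to overestimate the average power of the underlying studies.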
A second critique has been advanced by Erik van Zwet, a biostatistician who has applied models of z-value distributions developed in genomics to meta-analyses of medical trials. These models were designed for settings in which the full set of test statistics is observed, and they therefore do not incorporate selection bias. When they are applied to literatures in which selection bias is present, their estimates can be biased. In contrast, z-curve is explicitly designed to assess the presence of selection bias and to correct for it when it is present. When no selection bias is present, z-curve can also be fitted to the full distribution of z-values, including nonsignificant results.
van Zwet has published a few blog posts arguing that z-curve performs poorly when estimating the expected discovery rate (EDR). Importantly, his simulations do not show problems for the expected replication rate (ERR). Thus, z-curve’s ability to estimate the average true power of published significant results is not in question. The disputed issue concerns inference about the broader population of studies, including unpublished nonsignificant results.
Some aspects of this critique require clarification. van Zwet has suggested that z-curve was evaluated only in a small number of simulations. This is incorrect. Prior work includes two large simulation studies—one conducted by František Bartoš and one conducted by me—that examined EDR confidence-interval coverage across a wide range of conditions. Based on these results, the width of the nominal 95% confidence intervals was conservatively expanded by ±5 percentage points to achieve near-nominal coverage across a wide range of realistic scenarios (see details below). Thus, EDR interval estimation was already empirically validated across many conditions with 100 or more significant results.
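As a side note for readers unfamiliar with this adjustment, the sketch below shows what the ±5 percentage point widening amounts to; the function name and the example bounds are illustrative and are not part of the z-curve package.

```python
# Minimal sketch of the conservative interval adjustment described above:
# each bootstrap confidence limit for the EDR (a proportion) is pushed
# outward by 5 percentage points and clipped to the unit interval.
# The function name and example values are illustrative, not the z-curve API.
def widen_edr_ci(lower, upper, pad=0.05):
    return max(0.0, lower - pad), min(1.0, upper + pad)

lo, hi = widen_edr_ci(0.18, 0.31)
print(f"adjusted EDR interval: [{lo:.2f}, {hi:.2f}]")  # [0.13, 0.36]
```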
However, these earlier simulations did not examine the performance of z-curve with small sets of significant results. Because z-curve can technically be fit with as few as 10 significant results, it is reasonable to ask whether EDR confidence-interval coverage remains adequate when the number of significant studies is substantially smaller than 100. To address this question directly, I conducted a new simulation study focusing on the case of 50 significant results.
In addition, I introduced two diagnostics designed to assess when EDR estimation is likely to be weakly identified. Estimation of the EDR relies disproportionately on significant results from low-powered studies or false positives, because these observations provide information about the number of missing nonsignificant results. When nearly all significant results come from highly powered studies, the observed z-value distribution contains little information about what is missing. The first diagnostic therefore counts how many significant z-values fall in the interval from 1.96 to 2.96. Very small counts in this range signal that EDR estimates are driven by limited information. The second diagnostic examines the slope of the z-value density in this interval. A decreasing slope indicates information consistent with a mixture that includes low-powered studies, whereas an increasing slope reflects dominance of high-powered studies and weak identification of the EDR.
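The sketch below shows one way to compute both diagnostics from a vector of significant z-values. It is my own implementation of the verbal description above, not a function from the z-curve package, and the kernel-density estimate and least-squares slope are simplifying assumptions chosen for illustration.

```python
# Sketch of the two EDR diagnostics described above (own implementation, not
# part of the z-curve package): (1) the number of significant z-values in the
# window (1.96, 2.96) and (2) whether the density of significant z-values is
# decreasing or increasing over that window.
import numpy as np
from scipy import stats

def edr_diagnostics(z_sig, lo=1.96, hi=2.96):
    z_sig = np.asarray(z_sig, dtype=float)
    count = int(np.sum((z_sig > lo) & (z_sig < hi)))

    # Estimate the density on a grid inside the window and fit a straight line;
    # a negative slope indicates information from low-powered studies, a
    # positive slope indicates weak identification of the EDR.
    kde = stats.gaussian_kde(z_sig)
    grid = np.linspace(lo, hi, 50)
    slope = np.polyfit(grid, kde(grid), 1)[0]
    return count, ("decreasing" if slope < 0 else "increasing")

# Usage with hypothetical data: a low-powered literature piles up z-values
# just above 1.96, so the slope comes out decreasing.
rng = np.random.default_rng(7)
z_all = rng.normal(1.5, 1.0, 5000)
z_sig = z_all[z_all > 1.96]
print(edr_diagnostics(z_sig))
```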
Reproducible Results of Simulation Study with 50 Significant Results
The simulation used a fully crossed factorial design with 192 conditions, varying population effect size, heterogeneity, the proportion of false discoveries, and sample size. Population-level standardized mean differences were set to 0, .2, .4, or .6. Heterogeneity was modeled with normally distributed effect sizes with standard deviations (τ) of 0, .2, .4, or .6. In addition, a separate population of true null studies was included, with the proportion of false discoveries among significant results set to 0, .2, .4, or .6. Sample sizes varied across conditions, starting at n = 50 (25 observations per group). For each condition, simulations were run with exactly 50 significant results.
The simulation code is available here. The results are available here.
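For readers who want a compact view without downloading the linked code, the sketch below is my own reconstruction of the data-generating process described above, not the actual simulation code. It simplifies one detail: exact nulls are mixed in at the study level, whereas the design above fixes the proportion of false discoveries among significant results.

```python
# Reconstruction of the data-generating process described above (illustrative,
# not the linked simulation code): effect sizes are drawn from N(mu, tau), a
# share of studies are exact nulls (a simplification of the false-discovery
# factor), each study is a two-group comparison with n observations per group,
# and sampling continues until exactly 50 significant z-values are collected.
import numpy as np

def simulate_significant_z(mu=0.4, tau=0.2, prop_null=0.2,
                           n_per_group=25, k_sig=50, seed=0):
    rng = np.random.default_rng(seed)
    se = np.sqrt(2 / n_per_group)      # approximate SE of the standardized mean difference
    z_sig = []
    while len(z_sig) < k_sig:
        d = 0.0 if rng.random() < prop_null else rng.normal(mu, tau)
        z = rng.normal(d / se, 1.0)    # observed z-statistic of a single study
        if z > 1.96:                   # only significant results enter the data set
            z_sig.append(z)
    return np.array(z_sig)

print(simulate_significant_z().round(2))
```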
Across all scenarios, coverage is 96%. This is slightly above the nominal 95% because the conservative widening of the confidence intervals produces higher coverage in less challenging scenarios.
The slope diagnostic works as expected. When the slope is decreasing, coverage is 97%; when the slope is increasing, it drops to 83%. Increasing slopes are more likely to lead to overestimation than to underestimation of the EDR (75%). Increasing slopes occurred in only 5% of all simulations, because these scenarios require that the majority of studies have more than 50% power, which in turn requires large samples and moderate to large effect sizes.
The number of z-values in the range between 1.96 and 2.96 also matters. At least 12 z-values in this range were needed to achieve 95% coverage. However, the slope criterion is more diagnostic than the simple count of z-values in this range.
Conclusion
Pek et al. and van Zwet have raised broad concerns about z-curve’s estimates of the expected discovery rate (EDR), which is used to assess publication bias and to quantify the extent of missing nonsignificant results. Their arguments rely heavily on a small set of stylized scenarios. These scenarios do not show that z-curve generally produces untrustworthy results. In contrast, prior large-scale simulation studies, together with the present extension to datasets with only 50 significant results, indicate that z-curve’s EDR confidence intervals achieve near-nominal coverage across a wide range of conditions, including many that are plausible for applied research.
Importantly, the new simulations also validate simple diagnostics that indicate when EDR estimation is likely to be less reliable. In particular, the shape of the significant z-value distribution in the critical range from 1.96 to 2.96—especially whether the density decreases or increases just above the significance threshold—helps identify weak-information regimes in which EDR may be overestimated and confidence-interval coverage may be reduced. Users can therefore compare the near-threshold shape of their observed z-value distribution to the patterns observed in simulation to assess whether EDR should be interpreted cautiously in a given application.
Overall, these results support the conclusion that z-curve provides credible estimates of the expected replication rate (ERR) and, under diagnostically identifiable conditions, the expected discovery rate (EDR), and that these quantities remain useful for evaluating the credibility of literatures in which selection on statistical significance is present.