Category Archives: Z-Curve

Power Failure Revisited: A Z-Curve Analysis of Button et al.

Power Failure, False Positives, and The Replication Crisis

Scientists have become increasingly skeptical about the credibility of published results (Baker, 2016). The main concern is that scientists present results as objective facts, even though the results are often influenced by undisclosed subjective decisions that increase the chances of obtaining a desirable result. These degrees of freedom in data analysis are now called questionable research practices or p-hacking.

Ioannidis (2005) showed with hypothetical scenarios that questionable research practices combined with low statistical power and testing of many false hypotheses could lead to more false than true discoveries of statistical regularities (i.e., a statistically significant result).

Awareness of this problem has produced thousands of new articles that discuss it. It has even created a new science called meta-science: the scientific study of science. Some articles have gained prominent status and are foundational to meta-science.

For example, the Reproducibility Project in psychology replicated 100 studies. While 97 of these studies reported a statistically significant result, only 36% of the replication studies showed a significant result. The drop in the success rate can be attributed to questionable research practices that inflated effect size estimates to achieve significance. Honest replications did not have this advantage, and the true population effect sizes were often too small to produce significant results.

The true probability of obtaining a statistically significant result is called statistical power (Cohen, 1988; Neyman & Pearson, 1933). In the long run, a set of studies with an average true power of 50% is expected to produce 50% significant results, even if all studies test different hypotheses (Brunner & Schimmack, 2020). Thus, the success rate of the Reproducibility Project implies that the replication studies had roughly 40% average power. As these studies replicated original studies as closely as possible, with similar sample sizes, this suggests that the average power of the original studies was also around 40%.

This estimate is in line with Cohen’s (1962) seminal estimate of power. Average power around 40% has two implications. First, many attempts to demonstrate an effect in a single study will fail to reject a false null hypothesis that there is no relationship; a false negative result (Cohen, 1988). Concerns about false negatives were the focus of meta-scientific discussions about significance testing in the 1990s (Cohen, 1994).

This shifted when meta-scientists pointed out the consequences of selection for significance and low power (Ioannidis, 2005; Rosenthal, 1979; Sterling et al., 1995). Low statistical power combined with questionable research practices could result in many false discoveries (i.e., statistically significant results without a real effect). In some scenarios, literatures could consist entirely of false discoveries (Rosenthal, 1979) or at least contain more false than true discoveries (Ioannidis, 2005).

Theoretical articles and simulation studies suggested that false positive rates might be uncomfortably high, and replication failures seemed to support this suspicion, although replication failures could also just be false negative results (Maxwell, 2016). Thus, actual replication studies often do not settle conflicting interpretations of the evidence. While some researchers see replication failures as evidence that original results cannot be trusted, others point towards the difficulty of replicating actual studies and false negatives as reasons why original results could not be replicated (Gilbert et al., 2016).

An alternative approach examines false positives for sets of studies rather than a single study. The statistical results of original articles are used to estimate the average power of studies, and these power estimates are used to evaluate the risk of false positive results. One of the first attempts to do so was Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson, and Munafò's (2013) article "Power failure: why small sample size undermines the reliability of neuroscience." The key empirical finding was that the median power of 730 studies from 49 meta-analyses was 21%. The article did not provide an empirical estimate of the false positive rate, but it did illustrate the implications of the power estimate for false positive rates in various scenarios. The authors suggested that "a major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small" (p. 371). This claim has contributed to concerns that many published significant results are unreliable.

Reexamining The Power Failure

More than ten years later, it is possible to revisit the seminal article with the benefit of hindsight. Advances in the estimation of true power have revealed important conceptual problems: estimating the true power of completed studies is different from computing hypothetical power for the purpose of sample size planning (Brunner & Schimmack, 2020; Soto & Schimmack, 2026).

Cohen (1988) defined statistical power as the probability of obtaining a significant result. In the context of sample size planning, however, power is defined as the probability of obtaining a significant result given a hypothetical population effect size greater than zero. This conditional definition of power, given a true hypothesis, is widely used in the power literature and was also used by Ioannidis (2005) in his calculations of false positive rates.

Assuming that only true hypotheses are tested is reasonable for hypothetical scenarios, but not for the estimation of the true power of completed studies. Because the population effect size remains unknown after a study has produced an effect size estimate, it is not possible to assume an effect size greater than zero. Thus, the true probability that a completed study produces a significant result is unconditional and independent of the distinction between H0 and H1. Any estimate of average true power is therefore an estimate of the unconditional probability of producing a significant result. This average can include tests of true null hypotheses.

The distinction between conditional and unconditional probabilities has important implications for Button et al.'s calculations of false positive rates. The median power of 21% is unconditional, but the false positive calculations assume conditional power. This can lead to inflated estimates of false positive rates. For example, mean power of 20% could be made up of 50% tests of true H0 with a 5% probability of producing a (false) significant result and 50% tests of H1 with 35% power. In this scenario, the false positive rate is 2.5% / (2.5% + 17.5%) = 12.5%. Increasing the ratio of true null hypotheses to true alternatives to 4:1 would require conditional power of 80% for tests of H1 to maintain 20% average power. The false positive rate would increase to .04 / (.04 + .16) = 20%. As noted by Soric (1989), we can even compute the maximum false positive rate that is consistent with a given unconditional mean power by assuming conditional power of 1. With mean power of 20%, the maximum ratio of H0 to H1 is 5.25:1 and the maximum false discovery rate is 21% (Table 1); with Button et al.'s median power of 21%, the maximum false discovery rate is 20%.

Table 1

Maximum False Discovery Rate for 20% Unconditional Power (Soric, 1989)

                        Not Significant   Significant    Total
H₁ True                      .000             .160        .160
H₀ True                      .798             .042        .840
Total                        .798             .202       1.000
H₀ : H₁ Ratio               5.25 : 1
False Discovery Rate                          .208

Note. The table shows the maximum false discovery rate when average unconditional power equals 20%. This maximum occurs when conditional power for true hypotheses (H₁) equals 100%. The false discovery rate equals the proportion of significant results that are false positives: .042 / .202 = .208. Any lower conditional power with the same unconditional power of 20% produces a lower false discovery rate.

Soric's formula: max.FDR = (1 / Mean.Power - 1) * (alpha / (1 - alpha))
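The numbers above are easy to verify. Here is a minimal R implementation of Soric's formula (an illustration, not code from the original article):

    # Soric's (1989) maximum false discovery rate for a given unconditional
    # mean power (i.e., discovery rate) and significance level alpha
    max_fdr <- function(mean_power, alpha = .05) {
      (1 / mean_power - 1) * (alpha / (1 - alpha))
    }

    max_fdr(.20)   # ~ .21, the Table 1 scenario
    max_fdr(.21)   # ~ .20, Button et al.'s median power
    max_fdr(.35)   # ~ .10, Button et al.'s mean power (discussed below)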

The 20% maximum false discovery rate implied by 21% median power overestimates the true false positive rate for two reasons. First, Soric's formula assumes that all true hypotheses (H1) are tested with 100% power. Assuming that many tests of small true effect sizes in small samples have low conditional power, the true false positive rate is below 20%. Second, unconditional power has a skewed distribution with many low-powered studies and a few high-powered studies. As a result, mean power is higher than median power. Button et al. provide information about mean power in their analysis of publication bias, which relies on mean power. This analysis suggested that 254 of the 730 studies were expected to produce a significant result, and the expected percentage of significant results is equivalent to mean power (Brunner & Schimmack, 2020). Thus, mean power was estimated to be 254 / 730 = 35%. Based on Soric's formula, the maximum false discovery rate with 35% mean power is 10%.

In conclusion, Button et al.'s estimate of unconditional mean power can be used to draw inferences about false positives in the meta-analyses that they examined, without relying on unknown ratios of true and false hypotheses being tested in neuroscience. Applying Soric's formula to their data suggests that the false positive risk is fairly small.

A Z-Curve Analysis of Button et al.'s Data

Button et al.'s article contributed to a culture of open data sharing, but sharing data was not yet the norm when the article was published. Fortunately, Nord et al. (2017) conducted further analyses of the data and shared power estimates for the 730 studies in an Open Science Framework (OSF) project. The power estimates do not use the effect sizes of individual studies, which are likely to be inflated by publication bias. Rather, they use the sample sizes of individual studies and the meta-analytic effect size to estimate power. This approach corrects for effect size inflation in smaller studies and reduces bias in power estimates. The following analyses used these data.

Based on these data, 28% of the studies were statistically significant. Mean power was 35%, matching Button et al.'s estimate, which suggests that Nord et al.'s power values are indeed based on meta-analytic effect sizes.

I converted power values into z-values and analyzed the z-values with z-curve.3.0 using the default model (Figure 1).
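As an illustration of this step, here is a minimal R sketch, assuming a two-sided test with alpha = .05 and the zcurve R package's documented interface; `nord_power` is a placeholder for the vector of 730 power estimates, and this is not the actual analysis script:

    library(zcurve)

    # Convert true power (two-sided alpha = .05) into the implied mean of a
    # study's z-statistic: power = P(Z > 1.96 | mean = ncp), ignoring the
    # negligible opposite tail.
    power_to_ncp <- function(power, alpha = .05) {
      qnorm(power, mean = qnorm(1 - alpha / 2))
    }

    ncp <- power_to_ncp(nord_power)         # nord_power: 730 power estimates
    z   <- rnorm(length(ncp), mean = ncp)   # one simulated observed z per study
    fit <- zcurve(z, method = "EM", bootstrap = 500)
    summary(fit)
    plot(fit)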

The observed discovery rate (ODR) is simply the percentage of significant results. More important is the bias-corrected estimate of unconditional mean power for all 730 z-values. Z-curve uses the observed distribution of significant z-values and projects the fitted model into the range of non-significant results. As shown in Figure 1, the model predicts the actual distribution of non-significant results fairly well. This suggests that the use of meta-analytic effect sizes corrected inflated effect size estimates and removed publication bias. The estimated mean power for all studies is called the expected discovery rate (EDR). The EDR estimate is close to the ODR, further suggesting that the data are unbiased.

A key problem of estimating the EDR from the significant results alone is that the confidence interval around the point estimate is very wide. When the data show no major bias, more precise estimates can be obtained by fitting the model to all 730 data points (Figure 2).

The key finding is that the point estimate of the false positive risk, FDR = 13%, is in line with the calculations based on Button et al.'s estimate of mean power. The confidence interval caps the FDR at 20%. This is an upper limit because the conditional power of studies with significant results is likely to be less than 100%.

In fact, z-curve makes it possible to estimate the conditional power of significant studies. First, z-curve estimates the unconditional average power of significant studies. This parameter is called the expected replication rate (ERR) because it predicts how many studies would produce a significant result again in a hypothetical replication project that reproduces the original studies exactly with new samples. The ERR is 54% with an upper limit of 60% for the 95% confidence interval. We also know that no more than 20% of these studies are false positives. Assuming 80% true hypotheses, the average conditional power cannot be higher than (.60 - .20 * .05) / .80 = 74%. Thus, Soric's assumption of 100% power is conservative, and the false positive rate is likely to be lower.
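The arithmetic behind this bound can be checked in a few lines of R (values taken from the text above):

    err_upper <- .60   # upper limit of the ERR confidence interval
    fdr_max   <- .20   # maximum share of false positives among significant results
    alpha     <- .05
    # ERR = fdr_max * alpha + (1 - fdr_max) * conditional_power; solve for power:
    (err_upper - fdr_max * alpha) / (1 - fdr_max)   # ~ .74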

In conclusion, a z-curve analysis of Nord et al.’s power estimates for Button et al.’s meta-analyses confirms estimates that could have been obtained by applying Soric’s formula to Button et al.’s estimate of mean power. The true rate of false positive results remains unknown, but it is unlikely to be more than 20%.

Heterogeneity Across Research Areas

Nord et al. (2017) demonstrated that power varies across the different research areas that were included in Button et al.'s sample of meta-analyses. Some of these areas had enough studies to conduct z-curve analyses for these specific areas. The most interesting area is candidate-gene studies, which relate genotypic variation in single genes to phenotypes across participants. With the benefit of hindsight, it is known that variation in a single gene has trivial effects on complex traits and that many of the significant results in these studies were, for all practical purposes, false positive results (Duncan & Keller, 2011). 234 of the 730 studies came from this research area. Figure 3 shows the results. Interestingly, only 11% of the results were statistically significant. Thus, the low average power can be explained by the many studies that reported non-significant results. There is no evidence of publication bias in these meta-analyses.

Using Soric's formula, the low EDR translates into a high false positive risk of 42%, and the upper limit of the 95% confidence interval includes 100%. Thus, z-curve confirms that the rare significant results in this literature could be false positives. Moreover, most significant results are just barely significant. There are hardly any results that provide strong evidence (z > 4) against the null hypothesis.

In short, a large portion of the 730 studies came from a research area that is known to have produced few significant results. This finding implies that other research areas are producing more credible significant results (Nord et al., 2017).

A second set of meta-analyses consisted of clinical trials. Clinical trials have received considerable attention in analyses of Cochrane meta-analyses and of abstracts in original articles that report the key statistical result (Jager & Leek, 2013; Schimmack & Bartos, 2023; van Zwet et al., 2024). The results suggest that unconditional mean power is around 30% and the false positive risk is between 10% and 20%. These results serve as benchmarks for the z-curve analysis of the 145 clinical trials in Button et al.'s study (Figure 4).

The EDR is somewhat lower, 21%, but the 95% confidence interval includes 30%. The FDR is 19%, but the confidence interval includes 13%. Thus, the results are a bit weaker, but mostly consistent with evidence from estimates based on thousands of results. These FDR estimates are notably lower than the false positive rates predicted by Ioannidis's scenarios, which assumed high rates of true null hypotheses.

The third domain consisted of studies from psychology. Psychological scientists have examined the credibility of their research in the wake of replication failures (Open Science Collaboration, 2015). Suddenly, significant results in multiple studies within a single article were no longer attributed to reliable effects, but seen as signs of selection for significance (Schimmack, 2012). Francis (2014) found that over 80% of these multi-study articles showed statistically significant evidence of bias. Large-scale multi-lab replication studies also showed that effect size estimates in the original studies could be inflated by a factor of 10, shrinking effect sizes from d = .6 to d = .06 (Vohs et al., 2019). A z-curve analysis of a representative sample of studies in social psychology estimated the average unconditional power before selection for significance to be EDR = 19%, with a corresponding FDR = 22%. Cohen (1962) already based his power estimates on all reported tests, and estimates are similar for focal and non-focal results. This was also the case in a survey of emotion research (Soto & Schimmack, 2024). Soto and Schimmack (2024) reported an EDR of 30% and a corresponding FDR = 12% (k sig = 21,628) for all automatically extracted tests, and an EDR of 27%, FDR = 14%, for hand-coded focal tests (k sig = 227). These results serve as a comparison standard for the z-curve analysis of the 145 studies classified as psychological research by Nord et al. (2017). The EDR is 49%, FDR = 5%. Even the lower limit of the EDR confidence interval, 39%, implies only 8% false positives among the significant results.


There are several reasons why these results differ from other findings. First, the focus on meta-analyses leads to an unrepresentative sample of the entire literature. Meta-analyses often include many more non-significant results and show less bias than original articles. Second, the specific set of meta-analyses was not representative of the broader literature in psychology. Thus, the results cannot be generalized from the specific studies in Button et al.'s sample to psychology or neuroscience in general. That would require representative sampling or collecting data from all studies using automatic extraction of test statistics.

Discussion

Button et al.'s (2013) article was a first attempt to assess the credibility of empirical results with empirical estimates of power based on meta-analytic effect sizes and sample sizes. The median power was low (21%). The key implication of this finding was that researchers often fail to reject null hypotheses and may use questionable research practices to report significant results in published articles. Low power and bias could lead to many false positive results. The article added to other concerns about the reliability of findings in neuroscience (Vul et al., 2009).

Most citations took Button et al.’s findings and implications at face value. Nord et al. (2017) pointed out that power and false positive rates varied across research areas. Most notably, candidate gene studies have lower power and a much higher false positive risk. Including these studies in the calculation of median power may have led to false perceptions of other research areas.

Here I presented the first serious critical examination of Button et al.'s methodology and inferences and found several problems that undermine their pessimistic assessment of neuroscience. First, they estimated unconditional power, but their false positive calculations require estimates of conditional power. Second, false positive rates depend on mean power, not median power. Mean power was 35%, which is close to the estimate for psychology based on actual replication studies (OSC, 2015). Third, they made unnecessary assumptions about the ratio of true and false hypotheses being tested, when unconditional power alone is sufficient to estimate false positive rates (Soric, 1989). Fourth, they relied on meta-analyses to correct for publication bias, but meta-analyses are not representative of the broader literature.

Meta-science is like other sciences. Ideally, critical analyses reveal problems and new innovations address these problems. Power estimation started in the 1960s with Cohen's seminal article. Cohen (1962) worked with plausible effect sizes, but did not aim to estimate the true power of studies. Moreover, his work and statistical power in general were largely ignored (Cohen, 1990; Sedlmeier & Gigerenzer, 1989).

Conclusion

The replication crisis stimulated renewed interest in methods that use observed results to draw inferences about the power of actual studies (Ioannidis & Trikalinos, 2007; Francis, 2014; Schimmack, 2012; Simonsohn, Nelson, & Simmons, 2014). This work shifted attention from prospective power calculations to the retrospective assessment of evidential strength in published literatures. Two challenges emerged as central. First, selection bias inflates the observed rate of significant results, requiring methods that correct for selection. Second, power varies across studies, requiring models that allow for heterogeneity rather than assuming a single common effect size or power level. Early approaches addressed selection under simplifying assumptions, typically treating power as homogeneous across studies. As a result, their inferences become unreliable when studies differ in sample size, effect size, or both (Brunner & Schimmack, 2020; Schimmack, 2026).

Z-curve extends this line of work by explicitly modeling both selection and heterogeneity, estimating a distribution of power across studies rather than a single average. This provides a framework for quantifying key properties of the literature, including expected discovery and replication rates, and for linking these quantities to false discovery risk (Sorić, 1989). In this sense, z-curve represents a substantive advance in the empirical assessment of the credibility of published findings. Like earlier contributions such as Button et al., it is unlikely to be the final word, but it is currently the most advanced method to estimate true power for sets of studies with heterogeneity in power and selection bias.

Is Z-Curve Just Another P-Curve?

P-curve is a statistical tool that was designed to evaluate the statistical credibility of significant results. When only significant results are published, it is unclear how much selection for significance contributed to the results. In the worst case scenario, all published results are false positives. P-curve uses a variety of approaches to test this worst case scenario. If the null-hypothesis can be rejected, the data are said to have evidential value; that is, at least some of the studies rejected a false null-hypothesis.

P-curve was published without extensive validation research. Critical examination of the method has focused on the estimate of average power (Brunner, 2018; Brunner & Schimmack, 2020). Average power can quantify the strength of evidence against the null hypothesis rather than merely rejecting the null hypothesis of no evidential value. For example, a set of studies could have 18% average power, suggesting that some significant results were true positives, but also showing that the literature contains many studies with low power.

The problem with p-curve is that, contrary to claims by its developers, it produces inflated estimates of power when studies vary in power. For example, it predicts that 91% of replications in the reproducibility project should have been successful (Open Science Collaboration, 2015), when only 36% of the actual replications were. This bias is expected given the large heterogeneity in power across these studies (Schimmack & Soto, 2026). A solution to this problem is to use z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve is explicitly designed for heterogeneous data and performs well with low and high heterogeneity (Schimmack & Soto, 2026).

Morey and Davis-Stober (2025) raised further concerns about the statistical properties of p-curve. Given the similar aims of p-curve and z-curve, it is reasonable to wonder whether z-curve suffers from some of the same problems as p-curve, despite its ability to handle heterogeneity well. I asked Claude AI to examine this question and it concluded that z-curve is built on a fundamentally different approach than p-curve that avoids many of p-curve’s pitfalls. Here is a summary of the evaluation.

Full table

Criticism | Heterogeneity-dependent? | Affects power estimation? | Generalizes to z-curve?
EV* inadmissibility (probit/concave acceptance region) | No | Yes (same transform used) | No
Nonmonotonicity (compound half p-curve) | No | No | No
Boundary sensitivity (probit maps boundary to ∞) | No | Yes | No (EM is smooth)
LEV/LEV* large-value blindness | No | Indirectly | No
Power estimation inconsistency | Yes (core mechanism) | Yes (the main finding) | No
Conceptual: not tests of skew | No | Partly | No (z-curve doesn't claim this)
Conceptual: noncentrality ≠ effect size | No | Partly (p-curve conflates them in its framing) | Not applicable (z-curve targets power, not effect size)

P-curve’s problems go beyond heterogeneity

The most fundamental problem is inadmissibility of the core test of evidential value (EV). The core test — the version currently in the p-curve app — uses a probit transformation that produces a concave acceptance region in the test statistic space. By results from Birnbaum (1954) and Marden (1982), this makes the test inadmissible: its power is dominated by other tests for every possible alternative, including the homogeneous case. The 2015 switch from the log to the probit transformation was motivated by wanting robustness to extreme values, but admissibility requires exactly the property that was engineered out — sensitivity to large individual test statistics.

The compound half p-curve rule introduces nonmonotonicity: increasing the evidence in a single study can flip the procedure from rejection to acceptance and back, multiple times, along a monotonically increasing path. This is a purely structural consequence of the hard boundary at αpc/2 combined with the probit transform, and has nothing to do with whether effect sizes are heterogeneous.

Test LEV, which is supposed to detect "lack of evidential value," has an additional pathology: arbitrarily large test statistics contribute zero weight to the sum, because they map to log(1) = 0. A single study with a p-value just below 0.05 can dominate the test and force rejection regardless of how large every other test statistic is. Six studies with Z = ∞ plus one study at Z = 1.97 yield the same test statistic as that single study at Z = 1.97 on its own.

None of these problems affect z-curve. Z-curve uses EM estimation on a mixture of truncated normal distributions, fitting the full shape of the observed z-score distribution above the significance threshold. Large z-scores contribute information proportional to their posterior weight on high-NCP components. The EM likelihood surface is smooth and does not blow up near the truncation boundary. There is no compound decision rule. And because z-curve’s target quantities are replicability (ERR) and discovery rate (EDR) — both functions of noncentrality parameters — there is no conflation of power with effect size.

The Morey and Davis-Stober paper does not mention z-curve. It does not need to. Their formal results simply confirm, from a different direction and with different tools, what simulation studies have shown for years: p-curve’s statistical machinery is not up to the job it advertises. Z-curve was designed from the start to avoid exactly these pitfalls.

In short, z-curve is not just another p-curve. While the aims are similar, the statistical approach and the ability to handle realistic amounts of heterogeneity are very different. Morey and Davis-Stober’s critique is limited to p-curve and does not generalize to z-curve.

A Z-Curve Analysis of Emotion Journals: Soto & Schimmack 2024

For the full article see:

Full citation: Soto, M. D., & Schimmack, U. (2024). Credibility of results in emotion science: A Z-curve analysis of results in the journals Cognition & Emotion and Emotion. Cognition and Emotion. https://doi.org/10.1080/02699931.2024.2443016

OSF repository: https://osf.io/42vxd/

Purpose of this document: This is a detailed analytical summary written entirely in the summarizer’s own words. It is intended to make the paper’s methods, results, and arguments accessible for discussion and analysis without reproducing copyrighted text. Readers should consult the original article for exact language and figures.


Structured Summary

1. Motivation and Research Question

The paper addresses whether the replication crisis — documented most prominently by the Open Science Collaboration (2015), which found only 36% of psychology results replicated — extends to the emotion research literature specifically. The authors note that the OSC findings were limited to articles from 2008 and may not generalize to emotion research, which has its own dedicated journals and traditions.

The two journals examined are Cognition & Emotion (established 1987) and Emotion (established 2001 by APA). The authors aimed to assess: (a) how much selection bias exists in these journals, (b) what proportion of published results might be false positives, (c) what the expected replication rate is, and (d) whether these indicators have improved over time in response to the replication crisis.


2. Z-Curve Method: How It Works

The paper uses Z-curve 2.0 (Bartoš & Schimmack, 2022), which takes a set of test statistics, converts them to absolute z-scores, and fits a finite mixture model to the distribution of statistically significant z-values (those exceeding 1.96). The method produces four key estimates:

Expected Discovery Rate (EDR): An estimate of the average true power of studies before selection for significance. This represents what proportion of all conducted tests (including unpublished ones) would be expected to reach significance. It is conceptually the mean power across the full population of tests.

Expected Replication Rate (ERR): An estimate of mean power after selection for significance — that is, among published significant results. Because significance selection favors higher-powered studies, ERR is always higher than EDR. The authors frame ERR as an optimistic upper bound on expected replication success.

Observed Discovery Rate (ODR): Simply the proportion of extracted test statistics that were statistically significant at p < .05. Comparing ODR to EDR quantifies selection bias: a large gap indicates that many non-significant results went unreported.

False Discovery Risk (FDR): Computed from the EDR using Soric’s (1989) formula, which gives the maximum proportion of significant results that could be false positives given a particular discovery rate.

The authors explicitly note that ERR overestimates actual replication success (comparing z-curve’s ERR for the OSC dataset to the actual 36% rate), and they recommend interpreting the true replication rate as falling somewhere between EDR and ERR, citing Sotola (2023) for empirical support.
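To connect these four estimates, here is a hedged R sketch, assuming the zcurve package's documented interface; `z` is a placeholder for a vector of absolute z-scores like those the authors extracted:

    library(zcurve)

    odr <- mean(z > qnorm(.975))             # observed discovery rate (p < .05)
    fit <- zcurve(z, method = "EM", bootstrap = 500)
    summary(fit)                             # reports EDR and ERR with CIs

    # Soric's (1989) upper bound on the false discovery risk, given the EDR
    soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * (alpha / (1 - alpha))
    soric_fdr(.30)                           # EDR of 30% -> FDR of at most ~12%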


3. Methods

3.1 Test Statistic Extraction

The authors collected the complete set of published articles from both journals (3,831 from C&E covering 1987–2023; 2,323 from Emotion covering 2001–2023). Using custom R code built on the pdftools package (Ooms, 2024), they automatically extracted reported test statistics: F-tests, t-tests, chi-square tests (with df between 1 and 6 only, to exclude SEM model-fit tests), z-tests, and 95% confidence intervals of odds ratios and regression coefficients.

Chi-square tests with df > 6 were excluded because these typically come from structural equation modeling, where rejecting the null indicates poor model fit rather than a substantive finding. Confidence intervals were excluded when reported alongside test statistics to avoid double-counting. Meta-analysis articles were excluded entirely.

The extraction code was designed to handle various notation formats across journals and was iteratively refined. However, the authors acknowledge that the automated process cannot extract statistics from tables or figures, and cannot distinguish between focal and non-focal hypothesis tests.
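To give a flavor of what such extraction code looks like, here is a minimal sketch using the pdftools package; the regular expression handles only one notation format for t-tests and is purely illustrative, not the authors' code:

    library(pdftools)

    # Extract t-tests reported as "t(df) = value" from one article's PDF and
    # convert them to absolute z-scores via the two-tailed p-value.
    extract_t_tests <- function(pdf_path) {
      text <- paste(pdf_text(pdf_path), collapse = " ")
      m    <- gregexpr("t\\s*\\((\\d+)\\)\\s*=\\s*-?\\d+\\.?\\d*", text)
      hits <- regmatches(text, m)[[1]]
      df   <- as.numeric(sub("t\\s*\\((\\d+)\\).*", "\\1", hits))
      tval <- as.numeric(sub(".*=\\s*", "", hits))
      p    <- 2 * pt(abs(tval), df, lower.tail = FALSE)
      data.frame(df = df, t = tval, z = qnorm(1 - p / 2))
    }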

After exclusions (including test statistics with N < 30, since t-to-z conversion is unreliable at very low df), the final samples were 30,513 z-scores from 1,902 C&E articles and 35,457 z-scores from 1,953 Emotion articles. The majority were F-tests (62% C&E, 53% Emotion) and t-tests (26% C&E, 28% Emotion).

3.2 Statistical Analysis — The Clustering Approach

This is a critical methodological detail. The authors used the zcurve_clustered function with the “b” method. This method works by sampling a single test statistic from each article during model fitting, thereby addressing within-article dependence. This directly addresses concerns about independence violations that arise when multiple test statistics are extracted from the same paper.

The EM algorithm was applied to significant z-values between 1.96 and 6 (values above 6 are treated as having essentially 100% power). The fitted mixture model uses seven discrete components (z = 0 through 6), and the estimated weights are used to compute EDR and ERR. The model then extrapolates the full distribution to estimate what the non-significant portion would look like without selection.
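A hedged sketch of this analysis step; zcurve_clustered and method = "b" are named in the paper, while the data-preparation call and remaining argument names are assumptions based on the zcurve package documentation:

    library(zcurve)

    # stat_strings: reported tests as text, e.g. "t(28) = 2.31" (placeholder)
    # article_id:   one identifier per article, marking dependent tests
    dat <- zcurve_data(data = stat_strings, id = article_id)
    fit <- zcurve_clustered(dat, method = "b", bootstrap = 500)
    summary(fit)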

3.3 Time Trend Analysis

Annual z-curve estimates were computed for each publication year and regressed on linear and quadratic predictors of year. The quadratic term tested whether improvements accelerated after 2011 (when the replication crisis became prominent).
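In R, such a trend analysis might look like the following sketch (assumed data frame `annual` with one row per publication year and a column `edr`; not the authors' script):

    annual$year_c <- annual$year - 2011            # center at the crisis onset
    trend <- lm(edr ~ year_c + I(year_c^2), data = annual)
    summary(trend)                                 # quadratic term: acceleration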

3.4 Hand-Coded Focal Tests

To address the limitation that automatic extraction conflates focal and non-focal tests, the authors also present results from 241 hand-coded articles from 2010 and 2020, drawn from an ongoing project covering 30+ journals and 4,000+ studies (Schimmack, 2020). This sample contained 227 significant tests out of 241 total.


4. Results

4.1 Main Z-Curve Estimates

The two journals produced remarkably similar results:

Parameter | Cognition & Emotion | Emotion
ODR | 71% [70%, 71%] | 70% [70%, 70%]
EDR | 30% [14%, 53%] | 31% [15%, 53%]
ERR | 66% [59%, 73%] | 65% [59%, 71%]
FDR | 12% [5%, 32%] | 12% [5%, 30%]

The ODR-EDR gap (approximately 40 percentage points) provides clear evidence of selection bias in both journals, confirmed visually by a sharp drop in observed z-scores just below the significance threshold of 1.96.

The ERR of approximately 65% suggests that the majority of published significant results should replicate with the same sample size, though the authors stress this is an optimistic estimate. The FDR point estimate of 12% is comparable to medical clinical trial journals (14% per Schimmack & Bartoš, 2023) and substantially lower than the most pessimistic predictions (Ioannidis, 2005). However, the upper bound of the FDR confidence interval (~30%) is high enough to warrant concern.

4.2 Time Trends

Sample sizes (degrees of freedom): Both journals showed significant linear increases over time, with some acceleration (significant quadratic trends). Median within-group df increased from roughly 50 in the early years to over 100 in recent years for Emotion, and showed a particularly sharp increase in C&E’s most recent years.

ODR: Both journals showed significant linear decreases in ODR over time (approximately 0.45 percentage points per year), suggesting that non-significant results are being reported more frequently. However, the quadratic terms were non-significant, meaning this trend preceded the replication crisis rather than being a response to it.

EDR: Both journals showed significant increases in EDR over time, consistent with increasing sample sizes leading to higher power. The combination of decreasing ODR and increasing EDR indicates that selection bias has diminished, though it remains present.

ERR: Increased over time for both journals, with C&E showing a significant acceleration (quadratic trend) suggesting the replication crisis may have prompted improvements.

FDR: Decreased over time as a direct consequence of the increasing EDR.

4.3 Hand-Coded Focal Test Results

The 241 hand-coded focal tests from 2010 and 2020 yielded:

Parameter | Estimate | 95% CI
ODR | 94% | [91%, 97%]
EDR | 27% | [10%, 67%]
ERR | 65% | [53%, 75%]
FDR | 14% | [3%, 50%]

The ODR for focal tests (94%) is substantially higher than the 70–71% from automatic extraction, confirming that automatic extraction captures many non-focal, non-significant tests that dilute the ODR. However, the EDR, ERR, and FDR estimates are comparable to the automatically extracted results and fall within their confidence intervals. This is an important robustness check: the key z-curve parameters are not substantially altered by the inclusion of non-focal tests.

4.4 Alpha Adjustment Analysis

The authors examined the effect of lowering the significance threshold on discovery rates and false positive risk. Lowering alpha from .05 to .01 retains approximately half of all significant results while reducing FDR to below 5% for most publication years. Further reductions to .005 or .001 have diminishing returns for FDR reduction but increasingly sacrifice power.
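Soric's bound makes the trade-off explicit: the maximum FDR scales with alpha / (1 - alpha), so a stricter threshold lowers the bound even though the discovery rate at that threshold also drops. A small R illustration with hypothetical discovery rates (not numbers from the paper):

    soric_fdr <- function(dr, alpha) (1 / dr - 1) * (alpha / (1 - alpha))

    soric_fdr(dr = .30, alpha = .05)   # ~ .12 at the conventional threshold
    # If half of the significant results survive alpha = .01, the discovery
    # rate at the stricter threshold is .15 (a hypothetical value):
    soric_fdr(dr = .15, alpha = .01)   # ~ .06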


5. Discussion and Interpretation

The authors frame their results as relatively encouraging for emotion research compared to worst-case scenarios. Key interpretive points:

The FDR of approximately 12% (though with wide CIs) suggests that most published significant results in emotion journals are not false positives. However, the upper bound of the CI leaves open the possibility of rates up to 30%.

The ERR of 65% predicts that most significant results should replicate with the same sample size, but this is optimistic. Adjusting for the estimated FDR, power for true effects may be approximately 72%, close to the conventional 80% benchmark but with substantial heterogeneity — half of studies have less power than this average.

The authors recommend treating results with p-values between .05 and .01 with skepticism, and suggest that alpha = .01 provides a better balance between false positive risk and power loss for the emotion literature specifically. They emphasize this recommendation is for evaluating existing literature, not as a new publication standard.

On effect sizes, the authors warn that selection bias inflates point estimates, making even meta-analytic effect sizes unreliable unless bias correction is applied. They advocate for honest reporting of all results, including non-significant ones, as essential for accurate meta-analysis.


6. Limitations Acknowledged by the Authors

The authors explicitly discuss several limitations:

  1. Z-curve’s selection model assumes that publication probability is a function of power. In reality, questionable research practices (QRPs) can produce significance without real effects, potentially inflating EDR estimates and underestimating selection bias.
  2. Simulation studies of z-curve performance under QRP-generated data are lacking.
  3. The N > 30 exclusion removes some studies, though supplementary analyses with the full sample show similar results.
  4. Automated extraction cannot distinguish focal from non-focal tests (addressed by the hand-coded analysis).
  5. The automated extraction cannot reliably capture statistics from tables or figures.

7. Key Methodological Features Relevant to the Pek et al. Debate

Several aspects of this paper are directly relevant to criticisms raised by Pek et al.:

Independence assumption: Soto & Schimmack explicitly used zcurve_clustered with the “b” method, which samples one test statistic per article during bootstrapping. This directly addresses the concern about within-article dependence. The method section states this clearly.

Focal vs. non-focal tests: The paper includes both automatic extraction (all tests) and hand-coded focal tests, and shows that the z-curve parameters (EDR, ERR, FDR) are comparable across both approaches. This addresses the concern that including non-focal tests distorts results.

Appropriate caveats: The authors consistently describe ERR as optimistic, characterize the true replication rate as lying between EDR and ERR, acknowledge the wide confidence intervals on EDR and FDR, and explicitly discuss the limitations of the selection model assumption.

Asymmetric interpretation: The paper notes that z-curve evaluations of credibility are asymmetric — low values raise concerns about a literature, but high values do not guarantee credibility.


8. Summary Table of All Z-Curve Estimates

Analysis | N tests | N sig | ODR | EDR [95% CI] | ERR [95% CI] | FDR [95% CI]
C&E (auto) | 30,513 | 21,628 | 71% | 30% [14%, 53%] | 66% [59%, 73%] | 12% [5%, 32%]
Emotion (auto) | 35,457 | 24,824 | 70% | 31% [15%, 53%] | 65% [59%, 71%] | 12% [5%, 30%]
Focal (hand-coded) | 241 | 227 | 94% | 27% [10%, 67%] | 65% [53%, 75%] | 14% [3%, 50%]

Summary prepared for analytical discussion purposes. All descriptions reflect the summarizer’s interpretation of the original work. For exact language, figures, and supplementary analyses, consult the published article.

How to Respond to Criticisms of Z-Curve in Peer-Review

The pattern is now familiar. I received another anonymous review by Reviewer 2 of a z-curve article; the review repeated Pek's concerns about the performance of z-curve. To deal with biased reviewers, journals allow authors to name potentially biased reviewers. I suggest doing so for Pek. I also suggest sharing manuscripts with me to ensure proper interpretation of results and to make them "reviewer-safe."

To justify the claim that Pek is biased, researchers can use this rebuttal of Pek’s unscientific claims about z-curve.

Reviewer 2

Reviewer Report:

The manuscript “A review and z-curve analysis of research on the palliative association of system justification” (Manuscript ID 1598066) extends the work of Sotola and Credé (2022), who used Z-curve analysis to evaluate the evidential value of findings related to system justification theory (SJT). The present paper similarly reports estimates of publication bias, questionable research practices (QRPs), and replication rates in the SJT literature using Z-curve. Evaluating how scientific evidence accumulates in the published literature is unquestionably important.

However, there is growing concern about the performance of meta-analytic forensic tools such as p-curve (Simonsohn, Nelson, & Simmons, 2014; see Morey & Davis-Stober, 2025 for a critique) and Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022; see Pek et al., in press for a critique). Independent simulation studies increasingly suggest that these methods may perform poorly under realistic conditions, potentially yielding misleading results.

Justification for a theory or method typically requires subjecting it to a severe test (Mayo, 2019) – that is, assuming the opposite of what one seeks to establish (e.g., a null hypothesis of no effect) and demonstrating that this assumption leads to contradiction. In contrast, the simulation work used to support Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) relies on affirming belief through confirmation, a well-documented cognitive bias.

Findings from Pek et al. (in press) show that when selection bias is present in published p-values (the very scenario to which Z-curve is intended to be applied), estimates of the expected discovery rate (EDR), expected replication rate (ERR), and Sorić's False Discovery Risk (FDR) are themselves biased.

The magnitude and direction of this bias depend on multiple factors (e.g., number of p-values, selection mechanism of p-values) and cannot be corrected or detected from empirical data alone. The manuscript’s main contribution rests on the assumption that Z-curve yields reasonable estimates of the “reliability of published studies,” operationalized as a high ERR, and that the difference between the observed discovery rate (ODR) and EDR quantifies the extent of QRPs and publication bias.

The paper reports an ERR of .76, 95% CI [.53, .91] and concludes that research on the palliative hypothesis may be more reliable than findings in many other areas of psychology. There are several issues with this claim. First, the assertion that Sotola (2023) validated ERR estimates from the Z-curve reflects confirmation bias – I have not read Röseler (2023) and cannot comment on the argument made in it. The argument rests solely on the descriptive similarity between the ERR produced by Z-curve and the replication rate reported by the Open Science Collaboration (2015). However, no formal test of equivalence was conducted, and no consideration was given to estimate imprecision, potential bias in the estimates, or the conditions under which such agreement might occur by chance.

At minimum, if Z-curve estimates are treated as predicted values, some form of cross-validation or prediction interval should be used to quantify prediction uncertainty. More broadly, because ERR estimates produced by Z-curve are themselves likely biased (as shown in Pek et al., in press), and because the magnitude and direction of this bias are unknown, comparisons about ERR values across literatures do not provide a strong evidential basis for claims about the relative reliability of research areas.

Furthermore, the width of the 95% CI spans roughly half of the bounded parameter space of [0, 1], indicating substantial imprecision. Any claims based on these estimates should thus be contextualized with appropriate caution.

Another key result concerns the comparison of EDR = .52, 95% CI [.14, .92], and ODR = .81, 95% CI [.69, .90]. The manuscript states that "When these two estimates are highly discrepant, this is consistent with the presence of questionable research practices (QRPs) and publication bias in this area of research (Brunner & Schimmack, 2020).

But in this case, the 95% CIs for the EDR and ODR in this work overlapped quite a bit, meaning that they may not be significantly different…” (p. 22). There are several issues with such a claim. First, Z curve results cannot directly support claims about the presence of QRPs.

The EDR reflects the proportion of significant p values expected under no selection bias, but it does not identify the source of selection bias (e.g., QRPs, fraud, editorial decisions). Using Z curve requires accepting its assumed missing data mechanism—a strong assumption that cannot be empirically validated.

Second, a descriptive comparison between two estimates cannot be interpreted as a formal test of difference (e.g., eyeballing two estimates of means as different does not tell us whether this difference is not driven by sampling variability). Means can be significantly different even if their confidence intervals overlap (Cumming & Finch, 2005).

A formal test of the difference is required. Third, EDR estimates can be biased. Even under ideal conditions, convergence to the population values requires extremely large numbers of studies (e.g., > 3000, see Figure 1 of Pek et al., in press).

The current study only has 64 tests. Thus, even if a formal test of the difference of ODR – EDR was conducted, little confidence could be placed on the result if the EDR estimate is biased and does not reflect the true population value.

Although I am critical of the outputs of Z curve analysis due to its poor statistical performance under realistic conditions, the manuscript has several strengths. These include adherence to good meta-analytic practices such as providing a PRISMA flow chart, clearly stating inclusion and exclusion criteria, and verifying the calculation of p-values. These aspects could be further strengthened by reporting test–retest reliability (given that a single author coded all studies) and by explicitly defining the population of selected p-values. Because there appears to be heterogeneity in the results, a random-effects meta-analysis may be appropriate, and study-level variables (e.g., type of hypothesis or analysis) could be used to explain between-study variability. Additionally, the independence of p-values has not been clearly addressed; p-values may be correlated within articles or across studies. Minor points: The "reliability" of studies should be explicitly defined. The work by Manapat et al. (2022) should be cited in relation to Nagy et al. (2025). The findings of Simmons et al. (2011) apply only to single studies.

However, most research is published in multi-study sets, and follow-up simulations by Wegener et al. (2024) indicate that the Type I error rate is well-controlled when methodological constraints (e.g., same test, same design, same measures) are applied consistently across multiple studies – thus, the concerns of Simmons et al. (2011) pertain to a very small number of published results.

I could not find the reference to Schimmack and Brunner (2023) cited on p. 17.


Rebuttal to Core Claims in Recent Critiques of z-Curve

1. Claim: z-curve “performs poorly under realistic conditions”

Rebuttal

The claim that z-curve “performs poorly under realistic conditions” is not supported by the full body of available evidence. While recent critiques demonstrate that z-curve estimates—particularly EDR—can be biased under specific data-generating and selection mechanisms, these findings do not justify a general conclusion of poor performance.

Z-curve has been evaluated in extensive simulation studies that examined a wide range of empirically plausible scenarios, including heterogeneous power distributions, mixtures of low- and high-powered studies, varying false-positive rates, different degrees of selection for significance, and multiple shapes of observed z-value distributions (e.g., unimodal, right-skewed, and multimodal distributions). These simulations explicitly included sample sizes as low as k ≈ 100, which is typical for applied meta-research in psychology.

Across these conditions, z-curve demonstrated reasonable statistical properties conditional on its assumptions, including interpretable ERR and EDR estimates and confidence intervals with acceptable coverage in most realistic regimes. Importantly, these studies also identified conditions under which estimation becomes less informative—such as when the observed z-value distribution provides little information about missing nonsignificant results—thereby documenting diagnosable scope limits rather than undifferentiated poor performance.

Recent critiques rely primarily on selective adversarial scenarios and extrapolate from these to broad claims about “realistic conditions,” while not engaging with the earlier simulation literature that systematically evaluated z-curve across a much broader parameter space. A balanced scientific assessment therefore supports a more limited conclusion: z-curve has identifiable limitations and scope conditions, but existing simulation evidence does not support the claim that it generally performs poorly under realistic conditions.


2. Claim: Bias in EDR or ERR renders these estimates uninterpretable or misleading

Rebuttal

The critique conflates the possibility of bias with a lack of inferential value. All methods used to evaluate published literatures under selection—including effect-size meta-analysis, selection models, and Bayesian hierarchical approaches—are biased under some violations of their assumptions. The existence of bias therefore does not imply that an estimator is uninformative.

Z-curve explicitly reports uncertainty through bootstrap confidence intervals, which quantify sampling variability and model uncertainty given the observed data. No evidence is presented that z-curve confidence intervals systematically fail to achieve nominal coverage under conditions relevant to applied analyses. The appropriate conclusion is that z-curve estimates must be interpreted conditionally and cautiously, not that they lack statistical meaning.


3. Claim: Reliable EDR estimation requires “extremely large” numbers of studies (e.g., >3000)

Rebuttal

This claim overgeneralizes results from specific, highly constrained simulation scenarios. The cited sample sizes correspond to conditions in which the observed data provide little identifying information, not to a general requirement for statistical validity.

In applied statistics, consistency in the limit does not imply that estimates at smaller sample sizes are meaningless; it implies that uncertainty must be acknowledged. In the present application, this uncertainty is explicitly reflected in wide confidence intervals. Small sample sizes therefore affect precision, not validity, and do not justify dismissing the estimates outright.


4. Claim: Differences between ODR and EDR cannot support inferences about selection or questionable research practices

Rebuttal

It is correct that differences between ODR and EDR do not identify the source of selection (e.g., QRPs, editorial decisions, or other mechanisms). However, the critique goes further by implying that such differences lack diagnostic value altogether.

Under the z-curve framework, ODR–EDR discrepancies are interpreted as evidence of selection, not of specific researcher behaviors. This inference is explicitly conditional and does not rely on attributing intent or mechanism. Rejecting this interpretation would require demonstrating that ODR–EDR differences are uninformative even under monotonic selection on statistical significance, which has not been shown.


5. Claim: ERR comparisons across literatures lack evidential basis because bias direction is unknown

Rebuttal

The critique asserts that because ERR estimates may be biased with unknown direction, comparisons across literatures lack evidential value. This conclusion does not follow.

Bias does not eliminate comparative information unless it is shown to be large, variable, and systematically distorting rankings across plausible conditions. No evidence is provided that ERR estimates reverse ordering across literatures or are less informative than alternative metrics. While comparative claims should be interpreted cautiously, caution does not imply the absence of evidential content.


6. Claim: z-curve validation relies on “affirming belief through confirmation”

Rebuttal

This characterization misrepresents the role of simulation studies in statistical methodology. Simulation-based evaluation of estimators under known data-generating processes is the standard approach for assessing bias, variance, and coverage across frequentist and Bayesian methods alike.

Characterizing simulation-based validation as epistemically deficient would apply equally to conventional meta-analysis, selection models, and hierarchical Bayesian approaches. No alternative validation framework is proposed that would avoid reliance on model-based simulation.


7. Implicit claim: Effect-size meta-analysis provides a firmer basis for credibility assessment

Rebuttal

Effect-size meta-analysis addresses a different inferential target. It presupposes that studies estimate commensurable effects of a common hypothesis. In heterogeneous literatures, pooled effect sizes represent averages over substantively distinct estimands and may lack clear interpretation.

Moreover, effect-size meta-analysis does not estimate discovery rates, replication probabilities, or false-positive risk, nor does it model selection unless explicitly extended. No evidence is provided that effect-size meta-analysis offers superior performance for evaluating evidential credibility under selective reporting.


Summary

The critiques correctly identify that z-curve is a model-based method with assumptions and scope conditions. However, they systematically extend these points beyond what the evidence supports by:

  • extrapolating from selective adversarial simulations,
  • conflating potential bias with lack of inferential value,
  • overgeneralizing small-sample limitations,
  • and applying asymmetrical standards relative to conventional methods.

A scientifically justified conclusion is that z-curve provides conditionally informative estimates with quantifiable uncertainty, not that it lacks statistical validity or evidential relevance.


Reply to Erik van Zwet: Z-Curve Only Works on Earth

In the 17th century, early telescopic observations of Mars suggested that the planet might be populated. Now imagine a study that aims to examine whether Martians are taller than humans. The problem is obvious: although we may assume that Martians exist, we cannot observe or measure them, and therefore we end up with zero observations of Martian height. Would we blame the t-test for not telling us what we want to know? I hope your answer to this rhetorical question is “No, of course not.”

If you pass this sanity check, the rest of this post should be easy to follow. It responds to criticism by Erik van Zwet (EvZ), hosted and endorsed by Andrew Gelman on his blog: "Concerns about the z-curve method."

EvZ imagines a scenario in which z-curve is applied to data generated by two distinct lines of research. One lab conducts studies that test only true null hypotheses. While exact effect sizes of zero may be rare in practice, attempting to detect extremely small effects in small samples is, for all practical purposes, equivalent. A well-known example comes from early molecular genetic research that attempted to link variation in single genes—such as the serotonin transporter gene—to complex phenotypes like Neuroticism. It is now well established that these candidate-gene studies produced primarily false positive results when evaluated with the conventional significance threshold of α = .05.

In response, molecular genetics fundamentally changed its approach. Researchers began testing many genetic variants simultaneously and adopted much more stringent significance thresholds to control the multiple-comparison problem. In the simplified example used here, I assume α = .001, implying an expected false positive rate of only 1 in 1,000 tests. I further assume that truly associated genetic predictors—single nucleotide polymorphisms (SNPs)—are tested in very large samples, such that sampling error is small and true effects yield z-values around 6. This is, of course, a stylized assumption, but it serves to illustrate the logic of the critique.

Figure 1 illustrates a situation with 1,000 studies from each of these two research traditions. Among the 1,000 candidate-gene studies, only one significant result is expected by chance. Among the genome-wide association studies (GWAS), power to reject the null hypothesis at α = .001 is close to 1, although a small number (3–4 out of 1,000) of studies may still fail to reach significance.
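For readers who want to check these numbers, here is a minimal R sketch of the two research traditions under the stated assumptions (true z-values of 0 for candidate-gene studies, a noncentrality of 6 for GWAS, α = .001). The simulation is illustrative and not based on any real dataset.

```r
set.seed(1)
crit <- qnorm(1 - .001 / 2)            # two-sided alpha = .001 -> z ~ 3.29
z_null <- abs(rnorm(1000, mean = 0))   # 1,000 candidate-gene studies (true nulls)
z_gwas <- abs(rnorm(1000, mean = 6))   # 1,000 GWAS with true z-values around 6
sum(z_null > crit)                     # ~1 false positive expected
sum(z_gwas > crit)                     # ~997 true positives (3-4 misses) expected
```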

At this point, it is essential to distinguish between two scenarios. In the first scenario, all 999 non-significant results are observed and available for analysis. If we could recover the full distribution of results—including non-significant ones—we could fit models to the complete set of z-values. Z-curve can, in principle, be applied to such data, but it was not designed for this purpose.

Z-curve was developed for the second scenario. In this scenario, the light-purple, non-significant results exist only in researchers’ file drawers and are not part of the observed record. This situation—selection for statistical significance—is commonly referred to as publication bias. In psychology, success rates above 90% strongly suggest that statistical significance is a necessary condition for publication (Sterling, 1959). Under such selection, non-significant results provide no observable information, and only significant results remain. In extreme cases, it is theoretically possible that all published significant findings are false positives (Rosenthal, 1979), and in some literatures—such as candidate-gene research or social priming—this possibility is not merely theoretical.

Z-curve addresses uncertainty about the credibility of published significant results by explicitly conditioning on selection for significance and modeling only those results. When success rates approach 90% or higher, there is often no alternative: non-significant results are simply unavailable.

In Figure 1, the light-purple bars represent non-significant results that exist only in file drawers. Z-curve is fitted exclusively to the dark-purple, significant results. Based on these data, the fitted model (red curve), which is centered near the true value of z = 6, correctly infers that the average true power of the studies contributing to the significant results is approximately 99% when α = .001 (corresponding to a critical value of z ≈ 3.3).

Z-curve also estimates the Expected Discovery Rate (EDR). Importantly, the EDR refers to the average power of all studies that were conducted in the process of producing the observed significant results. This conditioning is crucial. Z-curve does not attempt to estimate the total number of studies ever conducted, nor does it attempt to account for studies from populations that could not have produced the observed significant findings. In this example, candidate-gene studies that produced non-significant results—whether published or not—are irrelevant because they did not contribute to the set of significant GWAS results under analysis.

What matters instead is how many GWAS studies failed to reach significance and therefore remain unobserved. Given the assumed power, this number is at most 3–4 out of 1,000 (<1%). Consequently, an EDR estimate of 99% is correct and indicates that publication bias within the relevant population of studies is trivial. Because the false discovery rate is derived from the EDR, the implied false positive risk is effectively zero—again, correctly so for this population.
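The derivation uses Sorić’s (1989) upper bound on the false discovery rate, which z-curve computes from the EDR and α (Bartoš & Schimmack, 2022). A one-line check with the values from this example:

```r
edr <- .99; alpha <- .001
(1 / edr - 1) * alpha / (1 - alpha)  # Soric's maximum FDR: ~1e-05, effectively zero
```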

EvZ’s criticism of z-curve is therefore based on a misunderstanding of the method’s purpose and estimand. He evaluates z-curve against a target that includes large numbers of studies that leave no trace in the observed record and have no influence on the distribution of significant results being analyzed. But no method that conditions on observed significant results can recover information about such studies—nor should it be expected to.

Z-curve is concerned exclusively with the credibility of published significant results. Non-significant studies that originate from populations that do not contribute to those results are as irrelevant to this task as the height of Martians.


Response to van Zwet’s Concerns about Z-Curve


Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, Article e0000130. https://doi.org/10.15626/MP.2022.2981

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology. https://doi.org/10.15626/MP.2018.874

van Zwet, E., Gelman, A., Greenland, S., Imbens, G., Schwab, S., & Goodman, S. N. (2024). A new look at p values for randomized clinical trials. NEJM Evidence, 3(1), EVIDoa2300003. https://doi.org/10.1056/EVIDoa2300003

The Story of Two Z-Curve Models

Erik van Zwet recently posted a critique of the z-curve method on Andrew Gelman’s blog.

Concerns about the z-curve method | Statistical Modeling, Causal Inference, and Social Science

Meaningful discussion of the severity and scope of this critique was difficult in that forum, so I address the issue more carefully here.

van Zwet identified a situation in which z-curve can overestimate the Expected Discovery Rate (EDR) when it is inferred from the distribution of statistically significant z-values. Specifically, when the distribution of significant results is driven primarily by studies with high power, the observed distribution contains little information about the distribution of non-significant results. If those non-significant results are not reported and z-curve is nevertheless used to infer them from the significant results alone, the method can underestimate the number of missing non-significant studies and, as a consequence, overestimate the EDR.

This is a genuine limitation, but it is a conditional and diagnosable one. Crucially, the problematic scenarios are directly observable in the data. Problematic data have an increasing or flat slope of the significant z-value distribution and a mode well above the significance threshold. In such cases, z-curve does not silently fail; it signals that inference about missing studies is weak and that EDR estimates should not be trusted.
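As a rough illustration of such a check (a heuristic for illustration only, not a function of the zcurve package), one can compare the share of significant z-values just above the threshold with the share in the next band:

```r
# Heuristic slope check on a vector of significant z-values (illustrative only).
slope_check <- function(z, crit = qnorm(.975)) {
  z <- z[z > crit]
  near <- mean(z < crit + 0.5)                  # share just above the threshold
  far  <- mean(z >= crit + 0.5 & z < crit + 1)  # share in the next band
  c(near = near, far = far)  # near > far suggests a decreasing slope
}
```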

This is rarely a problem in psychology, where most studies have low power, the mode is at the significance criterion, and the slope decreases, often steeply. This pattern implies a large set of non-significant results, and z-curve provides good estimates in these scenarios. Estimating distributions of unobserved data is inherently difficult, which leads to wide confidence intervals around these estimates, but there is no fixed number of studies that is needed. The relevant question is whether the confidence intervals are informative enough to support meaningful conclusions.

One of the most powerful sets of studies that I have seen comes from epidemiology, where studies often have large samples that estimate effect sizes precisely. In these studies, rejecting the null hypothesis is rarely the main goal, but the data serve as a good example of a set of studies with high power, rather than the low power typical of psychology.

However, even this example shows a decreasing slope and a mode at significance criterion. Fitting z-curve to these data still suggests some selection bias and no underestimation of reported non-significant results. This illustrates how extreme van Zwet’s scenario must be to produce the increasing-slope pattern that undermines EDR estimation.

What about van Zwet’s Z-Curve Method?

It is also noteworthy that van Zwet does not compare our z-curve method (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020) to his own z-curve method that was used to analyze z-values from clinical trials (van Zwet et al., 2024).

The article fits a model to the distribution of absolute z-values (ignoring whether results show a benefit or harm to patients). The key differences between the two approaches are (a) that van Zwet et al.’s model uses all z-values and assumes (implicitly) that there is no selection bias, and (b) that true effect sizes are never zero, so errors can only be sign errors. Based on these assumptions, the article concludes that no more than 2% of clinical trials produce a result that falsely rejects a true hypothesis. For example, a statistically significant result could be treated as an error only if the true effect has the opposite sign (e.g., the true effect increases smoking, but a significant result is used to claim that it reduces smoking).

The advantage of this method is that it does not need to estimate the EDR from the distribution of only significant results, but it achieves this only by assuming that publication bias does not exist. In that case, we can simply count the observed non-significant and significant results and use the observed discovery rate to estimate average power and the false positive risk.
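As a minimal illustration with hypothetical p-values (assuming the complete record is observed):

```r
p <- c(.24, .03, .001, .31, .02, .08, .004)  # hypothetical complete set of results
mean(p < .05)                                # observed discovery rate = average power
```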

The trade-off is clear. z-curve attempts to address selection bias and sometimes lacks sufficient information to do so reliably; van Zwet’s approach achieves stable estimates by assuming the problem away. The former risks imprecision when information is weak; the latter risks bias when its core assumption is violated.

In the example from epidemiology, there is evidence of some publication bias and omission of non-significant results. Using van Zwet’s model would be inappropriate because it would overestimate the true discovery rate. The exclusive focus on sign errors is also questionable and should be clearly stated as a strong assumption. It implies that significant results in the right direction are not errors, even if effect sizes are close to zero. For example, a significant result suggesting that a treatment extends life is considered a true finding, even if the effect is a single day.

False positive rates do not fully solve this problem, but false positive rates that include zero as a hypothetical value for the population effect size are higher, and they treat small effects close to zero as errors rather than counting half of them as correct rejections of the null hypothesis. For example, an intervention that decreases smoking by 1% of all smokers is not meaningfully different from one that increases it by 1%, but a focus on signs treats only the latter as an error.

In short, van Zwet’s critique identifies a boundary condition for z-curve, not a general failure. At the same time, his own method rests on a stronger and untested assumption—no selection bias—whose violation would invalidate its conclusions entirely. No method is perfect and using a single scenario to imply that a method is always wrong is not a valid argument against any method. By the same logic, van Zwet’s own method could be declared “useless” whenever selection bias exists, which is precisely the point: all methods have scope conditions.

Using proper logic, we suggest that all methods work when their assumptions are met; the main task is to test whether they are. We clarified that z-curve estimation of the EDR assumes that enough low-powered studies produced significant results to influence the distribution of significant results. If the slope of the significant results is not decreasing, this assumption does not hold and z-curve should not be used to estimate the EDR. Similarly, users of van Zwet’s method should first test whether selection bias is present and not use the method when it is. They should also consider whether a proportion of studies could have tested practically true null hypotheses and avoid the method when this is a concern.

Finally, the blog post responds to Gelman’s polemic about our z-curve method and earlier work by Jager and Leek (2014) by noting that Gelman’s critiques of other methods exist in parallel to his own work (at least as co-author) that also modeled the distribution of z-values to make claims about power and the risk of false inferences. The assumption of this model that selection bias does not exist is peculiar, given Gelman’s typical writing about low power and the negative effects of selection for significance. A more constructive discussion would apply the same critical standards to all methods, including one’s own.


Frequently Asked Questions about Z-Curve

under development

Can’t find what you are looking for?
1. Ask an AI to search replicationindex.com to find answers that are not here.
2. Send me an email and I will answer your question and add it to the FAQ list.

Link to the tutorial:
🔗 Z-Curve 3.0 Tutorial (Introduction and links to all chapters): https://replicationindex.com/2025/07/08/z-curve-tutorial-introduction/

1. Has z-curve been tested with simulation studies?

Yes. The original z-curve method was evaluated in large-scale simulation studies by Jerry Brunner. The extended z-curve 2.0 was tested by Frantisek Bartoš. The results of these simulation studies were part of the manuscripts submitted for publication. They have also been reproduced by independent researchers. The results and the code to reproduce or run new analyses are shared on the Open Science Framework.

After concerns were raised about performance with small sets of studies, new simulation studies showed good performance with just 50 significant results (see Concerns About Z-Curve: Evidence From New Simulations With Few Studies).

It is important to distinguish between estimation of the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR). The ERR estimates the average true power of the significant results. It is estimated with high accuracy even with fewer than 50 studies. The EDR predicts the distribution of non-significant results that were not observed. This prediction is more difficult and requires more information from the significant results. With small samples, confidence intervals are naturally wide, but their width is data-dependent and informative in itself.

In short, ERR estimates can be obtained even with small sets of significant results. They are also preferable to p-curve’s estimates of average power, which degrade when studies vary in power. EDR estimates are trustworthy with more than 50 significant results.

2. Does z-curve fail when power is homogeneous and falls between grid points?

The discrete mixture model uses fixed components at noncentrality parameters 0 through 6. When all studies share a single noncentrality that falls between two grid points — for example, NCP = 1.5 — the model must approximate a point mass using weights on flanking components. Van Zwet (2026) showed that this maximizes approximation error and can bias the EDR estimate.

Z-curve 3.0 addresses this directly. When low heterogeneity is detected, the algorithm fits a single-component model with a free noncentrality parameter and adds the estimated NCP as an additional component. The EM algorithm then places weight on this data-driven component rather than splitting weight awkwardly across the fixed grid. In simulations with NCP = 1.5, the added component recovers the correct location and the bias disappears.

When power varies across studies — as in any real meta-analysis of conceptual replications — the discrete model performs well without this correction. The mixture weights approximate the underlying distribution of power much like a histogram approximates a smooth density. Simulation studies confirm that the discrete model outperforms a parametric normal model under high heterogeneity, because it makes no assumption about the shape of the power distribution. In this sense, the discrete model is better understood as a distribution-free (nonparametric) estimator, and its flexibility is an advantage rather than a limitation.
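A small simulation illustrates why flanking grid points are an imperfect substitute for a point mass; the 50/50 mixture below is a deliberate simplification of what an EM algorithm would fit, chosen only to match the mean noncentrality of 1.5:

```r
set.seed(1)
crit <- qnorm(.975)                                           # two-sided alpha = .05
z_point <- abs(rnorm(1e6, mean = 1.5))                        # true model: NCP = 1.5
z_mix   <- abs(rnorm(1e6, mean = sample(c(1, 2), 1e6, TRUE))) # flanking grid points
mean(z_point > crit)   # ~.32: true discovery rate
mean(z_mix > crit)     # ~.34: same mean NCP, but the mixture overshoots
```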

3. Does z-curve offer options for small sample (small-N) literatures like animal research?

Short answer:
Yes — z-curve 3.0 adds new transformation methods and a t-curve option that make the method more appropriate for analyses involving small samples (e.g., N < 30). These options are designed to mitigate biases that arise when small-sample test statistics are converted to z-scores using standard normal approximations. Z-curve 3.0 also allows researchers to use t-distributions (t-curve) with a fixed df, which resemble the distributions of test statistics from small samples more closely than the standard normal distribution.

Details:

  • The z-curve 3.0 tutorial (Chapter 8) explains that instead of only converting p-values to z-scores, you can now:
    • Try alternative transformations of t-values that better reflect their sampling distribution, and
    • Use a direct t-curve model that fits t-distributions with specified degrees of freedom instead of forcing a normal approximation. This “t-curve” option is recommended when studies have similar and genuinely small degrees of freedom (like many animal experiments).
  • These improvements help reduce bias introduced by naïve normal transformations (see the sketch after this list), though they don’t completely eliminate all small-sample challenges, and performance can still be unstable when degrees of freedom vary widely or are extremely small.
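The following sketch shows the p-to-z conversion and why small-N t-values need care: treating a t-value as if it were a z-value overstates the evidence, whereas converting through the p-value respects the t-distribution’s heavier tails (numbers are illustrative):

```r
t <- 3.0; df <- 10
p <- 2 * pt(-abs(t), df)   # two-sided p-value from the t-distribution
z <- qnorm(1 - p / 2)      # convert p to an absolute z-value
c(naive = t, converted = round(z, 2))  # 3.0 vs ~2.5
```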


Z-Curve.3.0 Tutorial: Introduction

Links to Additional Resources and Answers to Frequently Asked Questions

Chapters

This post is Chapter 1. The R-code for this chapter can be found on my github:
zcurve3.0/Tutorial.R.Script.Chapter1.R at main · UlrichSchimmack/zcurve3.0
(the picture for this post shows a “finger-plot”; you can make your own with the code)

Chapter 2 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the original studies.
zcurve3.0/Tutorial.R.Script.Chapter2.R at main · UlrichSchimmack/zcurve3.0

Chapter 3 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the replication studies.
zcurve3.0/Tutorial.R.Script.Chapter3.R at main · UlrichSchimmack/zcurve3.0

Chapter 4 shows how you can run simulation studies to evaluate the performance of z-curve for yourself.
zcurve3.0/Tutorial.R.Script.Chapter4.R at main · UlrichSchimmack/zcurve3.0

Chapter 5 uses the simulation from Chapter 4 to compare the performance of z-curve with p-curve, another method that estimates the average power of only the significant results, which is the quantity z-curve uses to estimate the expected replication rate.
zcurve3.0/Tutorial.R.Script.Chapter5.R at main · UlrichSchimmack/zcurve3.0

Chapter 6 uses the simulation from Chapter 4 to compare the performance of the default z-curve method with a z-curve that assumes a normal distribution of population effect sizes. The simulation highlights the problem of making distribution assumptions. One of the strengths of z-curve is that it does not make an assumption about the distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter6.R at main · UlrichSchimmack/zcurve3.0

Chapter 7 uses the simulation from Chapter 4 to compare the performance of z-curve to a Bayesian mixture model (bacon). The aim of bacon is different, but it also fits a mixture model to a set of z-values. The simulation results show that z-curve performs better than the Bayesian mixture model.
zcurve3.0/Tutorial.R.Script.Chapter7.R at main · UlrichSchimmack/zcurve3.0

Chapter 8 uses the simulation from Chapter 4 to examine the performance of z-curve with t-values from small studies (N = 30). It introduces a new transformation method that performs better than the default method from z-curve.2.0 and it introduces the t-curve option to analyze t-values from small studies with t-distributions.
zcurve3.0/Tutorial.R.Script.Chapter8.R at main · UlrichSchimmack/zcurve3.0

Chapter 9 simulates p-hacking by combining small samples with favorable trends into a larger sample with a significant result (patchwork samples). The simulation uses between-subject two-group designs with varying means and SDs of effect sizes and sample sizes. It also examines the ability of z-curve to detect p-hacking and compares the performance of the default z-curve, which makes no assumptions about the distribution of power, with a z-curve model that assumes a normal distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter9.R at main · UlrichSchimmack/zcurve3.0


Brief ChatGPT Generated Summary of Key Points

What Is Z-Curve?

Z-curve is a statistical tool used in meta-analysis, especially for large sets of studies (e.g., more than 100). It can also be used with smaller sets (as few as 10 significant results), but the estimates become less precise.

There are several types of meta-analysis:

  • Direct replication: Studies that test the same hypothesis with the same methods.
    Example: Several studies testing whether aspirin lowers blood pressure.
  • Conceptual replication: Studies that test a similar hypothesis using different procedures or measures.
    Example: Different studies exploring how stress affects memory using different tasks and memory measures.

In direct replications, we expect low variability in the true effect sizes. In conceptual replications, variability is higher due to different designs.

Z-curve was primarily developed for a third type of meta-analysis: reviewing many studies that ask different questions but share a common feature—like being published in the same journal or during the same time period. In these cases, estimating an average effect size isn’t very meaningful because effects vary so much. Instead, z-curve focuses on statistical integrity, especially the concept of statistical power.

What Is Statistical Power?

I define statistical power as the probability that a study will produce a statistically significant result (usually p < .05).

To understand this, we need to review null hypothesis significance testing (NHST):

  1. Researchers test a hypothesis (like exercise increasing lifespan) by conducting a study.
  2. They calculate the effect size (e.g., exercise increases the average lifespan by 2 years) and divide it by the standard error to get a test statistic (e.g., a z-score).
  3. Higher test statistics imply a lower probability of obtaining such a result if the null hypothesis (no effect) were true. If this probability is below the conventional criterion of 5%, the finding is interpreted as evidence of an effect.

Power is the probability of obtaining a significant result, p < .05.

Hypothetical vs. Observed Power

Textbooks often describe power in hypothetical terms. For example, before collecting data, a researcher might assume an effect size and calculate how many participants are needed for 80% power.

But z-curve does something different. It estimates the average true power of a set of studies. It is only possible to estimate average true power for sets of studies because power estimates based on a single study are typically too imprecise to be useful. Z-curve provides estimates of the average true power of a set of studies and the uncertainty in these estimates.

Populations of Studies

Brunner and Schimmack (2020) introduced an important distinction:

  • All studies ever conducted (regardless of whether results were published).
  • Only published studies, which are often biased toward significant results.

If we had access to all studies, we could simply calculate power by looking at the proportion of significant results. For example, if 50% of all studies show p < .05, then the average power is 50%.

In reality, we only see a biased sample—mostly significant results that made it into journals. This is called selection bias (or publication bias), and it can mislead us.

What Z-Curve Does

Z-curve helps us correct for this bias by:

  1. Using the p-values from published studies.
  2. Converting them to z-scores (e.g., p = .05 → z ≈ 1.96).
  3. Modeling the distribution of these z-scores to estimate:
    • The power of the studies we see,
    • The likely number of missing studies,
    • And the amount of bias.

Key Terms in Z-Curve

  • ODR (Observed Discovery Rate): % of studies that report significant results
  • EDR (Expected Discovery Rate): estimated % of significant results we’d expect if there were no selection bias
  • ERR (Expected Replication Rate): estimated % of significant studies that would replicate if repeated exactly
  • FDR (False Discovery Rate): estimated % of significant results that are false positives

Understanding the Z-Curve Plot

Figure 1. Histogram of z-scores from 1,984 significant tests. The solid red line shows the model’s estimated distribution of observed z-values. The dashed line shows what we’d expect without selection bias. The Observed Discovery Rate (ODR) is 100%, meaning all studies shown are significant. However, the Expected Discovery Rate (EDR) is only 40%, suggesting many non-significant results were omitted. The Expected Replication Rate (ERR) is also 40%, indicating that only 40% of these significant results would likely replicate. The False Discovery Rate (FDR) is estimated at 8%.

Notice how the histogram spikes just above z = 2 (i.e., just significant) and drops off below. This pattern signals selection for significance, which is unlikely to occur due to chance alone.


Homogeneity vs. Heterogeneity of Power

Sometimes all studies in a set have similar power (called homogeneity). In that case, the power of significant and non-significant studies is similar.

However, z-curve allows for heterogeneity, where studies have different power levels. This flexibility makes it better suited to real-world data than methods that assume all studies are equally powered.

When power varies, high-power studies are more likely to produce significant results. That’s why, under heterogeneity, the ERR (for significant studies) is often higher than the EDR (for all studies).


Summary of Key Concepts

  • Meta-analysis = Statistical summary of multiple studies.
  • Statistical significance = p < .05.
  • Power = Probability of finding a significant result.
  • Selection bias = Overrepresentation of significant results in the literature.
  • ODR = Observed rate of p < .05.
  • EDR = Expected rate of p < .05 without bias.
  • ERR = Estimated replication success rate of significant results.

Full Introduction

Z-curve is a statistical tool for meta-analysis of larger sets of studies (k > 100). Although it can be used with smaller sets of studies (k > 10 significant results), confidence intervals are likely to be very wide. There are also different types of meta-analysis. The core application of meta-analysis is to combine information from direct replication studies, that is, studies that test the same hypothesis (e.g., the effect of aspirin on blood pressure). The most widely used meta-analytic tools aim to estimate the average effect size for a set of studies with the same research question. A second application is to quantitatively review studies on a specific research topic. These studies are called conceptual replication studies. They test the same or related hypotheses, but with different experimental procedures (paradigms). The main difference between meta-analysis of direct and conceptual replication studies is that we would expect less variability in population effect sizes (not the estimates in specific samples) in direct replications, whereas variability is expected to be higher in conceptual replication studies with different experimental manipulations and dependent variables.

Z-curve can be applied to meta-analysis of conceptual replication studies, but it was mainly developed for a third type of meta-analysis. These meta-analyses examine sets of studies with different hypotheses and research designs. Usually, these studies share a common feature. For example, they may be published in the same journal, belong to a specific scientific discipline or sub-discipline, or a specific time period. The main question of interest here is not the average effect size that is likely to vary widely from study to study. The purpose of a z-curve analysis is to examine the credibility or statistical integrity of a set of studies. The term credibility is a broad term that covers many features of a study. Z-curve focuses on statistical power as one criterion for the credibility of a study. To use z-curve and to interpret z-curve results it is therefore important to understand the concept of statistical power. Unfortunately, statistical power is still not part of the standard education in psychology. Thus, I will provide a brief introduction to statistical power.

Statistical Power

Like many other concepts in statistics, statistical power (henceforth power, the only power that does not corrupt) is a probability. To understand power, it is necessary to understand the basics of null-hypothesis significance testing (NHST). When resources are insufficient to estimate effect sizes precisely, researchers often have to settle for the modest goal of examining whether a predicted positive effect is positive (exercise increases longevity) or a predicted negative effect is negative (aspirin lowers blood pressure). The common approach is to estimate the effect size in a sample, estimate the sampling error, compute the ratio of the two, and then compute the probability that the observed effect size or an even bigger one could have been obtained without an effect; that is, a true effect size of 0. Say the effect of exercise on longevity is an extra 2 years, the sampling error is 1 year, and the test statistic is 2/1 = 2. This value corresponds to a p-value of about .05 for the hypothesis that the true effect is positive (not 2 years, but greater than 0). P-values below .05 are conventionally used to decide against the null hypothesis and to infer that the true effect size is positive if the estimate is positive or negative if the estimate is negative. Now we can define power. Power is the probability of obtaining a significant result, which typically means a p-value below .05. In short,

Power is the probability of obtaining a statistically significant result.

This definition of power differs from the textbook definition of power because we need to distinguish between different types of powers or power calculations. The most common use of power calculations relies on hypothetical population effect sizes. For example, let’s say we want to conduct a study of exercise and longevity without any prior studies. Therefore, we do not know whether exercise has an effect or how big the effect is. This does not stop us from calculating power because we can just make assumptions about the effect size. Let’s say we assume the effect is two years. The main reason to compute hypothetical power is to plan sample sizes of studies. For example, we have information about the standard deviation of people’s life span and can compute power for hypothetical sample sizes. A common recommendation is to plan studies with 80% power to obtain a significant result with the correct sign.
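As a sketch of such a planning calculation with base R (the assumed standard deviation of 15 years is purely illustrative):

```r
# Sample size needed for 80% power to detect a 2-year difference in lifespan
# between two groups (two-sample t-test, alpha = .05).
power.t.test(delta = 2, sd = 15, sig.level = .05, power = .80)
# n ~ 884 per group
```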

It would be silly to compute the hypothetical power for an effect size of zero. First, we know that the probability of a significant result without a real effect is set by the researcher. When researchers use p < .05 as the rule to determine significance, the probability of obtaining a significant result without a real effect is 5%. If they use p < .01, it is 1%. No calculations are needed. Second, researchers conduct power analysis to find evidence for an effect, so it would make no sense to do the power calculation with a value of zero. This is the null hypothesis that researchers want to reject, and they want a reasonable sample size to do so.

All of this means that hypothetical power calculations assume a non-zero effect size, and power is defined as the conditional probability of obtaining a significant result for a specified non-zero effect size. Z-curve is used to compute a different type of power. The goal is to estimate the average true power of a set of studies. This average can be made up of a mix of studies in which the null hypothesis is true or false. Therefore, z-curve estimates are no longer conditional on a true effect. When the null hypothesis is true, power is set by the significance criterion. When there is an effect, power is a function of the size of the effect. All of this discussion of conditional probability is needed only to understand the distinction between the definition of power in hypothetical power calculations and in empirical estimates of power with z-curve. The short and simple definition of power is simply the probability that a study produces a significant result.

Populations of Studies

Brunner and Schimmack (2020) introduce another distinction between power estimates that is important for the understanding of z-curve. One population of studies is all studies that have been conducted, independent of the significance criterion. Let’s assume researchers’ computers were hooked up to the internet and whenever they conduct a statistical analysis, the results are stored in a giant database. The database will contain millions of p-values, some above .05 and others below .05. We could now examine the science-wide average power of null hypothesis significance tests. In fact, it would be very easy to do so. Remember, power is defined as the probability of obtaining a significant result. We can therefore just compute the percentage of significant results to estimate average power. This is no different from averaging the results of 100,000 roulette games to see how often a table produces “red” or “black” as an outcome. If the table is biased and has more power to get “red” results, you could win a lot of money with that knowledge. In short,

The percentage of significant results in a set of studies provides an estimate of the average power of the set of studies that was conducted.
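A minimal simulation makes this concrete; the beta distribution of true power is an arbitrary choice for illustration:

```r
set.seed(123)
power <- rbeta(1e5, 2, 3)       # each study has its own true power (mean = .40)
sig   <- rbinom(1e5, 1, power)  # run the studies
mean(sig)                       # share of significant results ~ .40
mean(power)                     # matches the average true power
```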

We would not need a tool like z-curve, if power estimation were that easy. The reason why we need z-curve is that we do not have access to all statistical tests that were conducted in science, psychology, or even a single lab. Although data sharing is becoming more common, we only see a fraction of results that are published in journal articles or preprints on the web. The published set of results is akin to the proverbial tip of the iceberg, and many results remain unreported and are not available for meta-analysis. This means, we only have a sample of studies.

Whenever statisticians draw conclusions about populations from samples, it is necessary to worry about sampling bias. In meta-analyses, this bias is known as publication bias, but a better term for it is selection bias. Scientific journals, especially in psychology, prefer to publish statistically significant results (exercise increases longevity) over non-significant results (exercise may or may not increase longevity). Concerns about selection bias are as old as meta-analyses, but actual meta-analyses have often ignored the risk of selection bias. Z-curve is one of the few tools that can be used to detect selection bias and quantify the amount of selection bias (the other tool is the selection model for effect size estimation).

To examine selection bias, we need a second approach to estimate average power, other than computing the percentage of significant results. The second approach is to use the exact p-values of a study (e.g., p = .17, .05, .005) and to convert them into z-values (e.g., z ≈ 1.4, 2.0, 2.8). These z-values are a function of the true power of a study (e.g., a study with 50% power has an expected z-value of ~2) and sampling error. Z-curve uses this information to obtain a second estimate of the average power of a set of studies. If there is no selection bias, the two estimates should be similar, especially in reasonably large sets of studies. However, often the percentage of significant results (power estimate 1) is higher than the z-curve estimate (power estimate 2). This pattern of results suggests selection for significance.

In conclusion, there are two ways to estimate the average power of a set of studies. Without selection bias, the two estimates will be similar. With selection bias, the estimate based on counting significant results will be higher than the estimate based on the exact p-values.
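The conversion itself is a one-liner in R, using the two-sided convention adopted here:

```r
p <- c(.17, .05, .005)
z <- qnorm(1 - p / 2)   # two-sided p-values to absolute z-values
round(z, 2)             # 1.37 1.96 2.81
```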

Figure 1 illustrates the extreme scenario that the true power of studies was just 40%, but selection bias filtered out all non-significant results.


Figure 1. Histogram of z-scores from 1,984 significant tests (based on a simulation of 5,000 studies with 40% power). The solid red line represents the z-curve fit to the distribution of observed z-values. The dashed red line shows the expected distribution without selection bias. The vertical red line shows the significance criterion, p < .05 (two-sided, z ~ 2). ODR = Observed Discovery Rate, EDR = Expected Discovery Rate, ERR = Expected Replication Rate. FDR = False Positive Risk, not relevant for the Introduction.


The figure shows a z-curve plot. Understanding this plot is important for the use of z-curve. First, the plot is a histogram of absolute z-values. Absolute z-values are used because in field-wide meta-analyses the sign has no meaning. In one study, researchers predicted a negative result (aspirin decreases blood pressure) and in another study they predicted a positive result (exercise increases longevity). What matters is that the significant result was used to reject the null hypothesis in either direction. Z-values above 6 are not shown because they are very strong and imply nearly 100% power. The critical range of z-scores is between 2 (p = .05, just significant) and 4 (p ≈ .0001).

The z-curve plot makes it easy to spot selection for significance because there are many studies with just significant results (z > 2) and no studies with just non-significant results, which are often called marginally significant because publications use them to reject the null hypothesis with a relaxed criterion. A plot like this cannot be produced by sampling error.

In a z-curve plot, the percentage of significant results is called the observed discovery rate. Discovery is a term used in statistics for a significant result. It does not mean a breaking-news discovery; it just means p < .05. The ODR is 100% because all results are significant. This would imply that all studies tested a true hypothesis with 100% power. However, we know that this is not the case. Z-curve uses the distribution of significant z-scores to estimate power, but there are two populations of studies for which power can be estimated. One population is all studies, including the missing non-significant results. I will explain later how z-curve estimates power. Here it is only important that the estimate is 40%. This estimate is called the expected discovery rate. That is, if we could get access to all missing studies, we would see that only 40% of the studies were significant. Expected therefore means without selection bias and with open access to all studies. The difference between the ODR and EDR quantifies the amount of selection bias. Here, selection bias inflates the ODR from 40% to 100%.

It is now time to introduce another population of studies. This is the population of studies with significant results. We do not have to assume that all of these studies were published. We just assume that the published studies were not selected based on their p-values. This is a common assumption in selection models. We will see later how changing this assumption can change results.

It is well known that selection introduces bias in averages. Selection for significance selects studies that had positive sampling error and produced z-scores greater than 2, while the expected z-score without sampling error is only 1.7, which is not significant on its own. Thus, a simple power calculation for the significant results would overestimate power. Z-curve corrects for this bias and produces an unbiased estimate of the average power of the population of studies with significant results. This estimate of power after selection for significance is called the expected replication rate (ERR). The reason is that the average power of the significant results predicts the percentage of significant results if the studies with significant results were replicated exactly, including the same sample sizes. The outcome of this hypothetical replication project would be 40% significant results. The decrease from 100% to 40% is explained by the effect of selection and regression to the mean. A study with an expected value of 1.7 that produced a significant result because sampling error pushed it to 2.1 is unlikely to benefit from the same sampling error again and produce another significant result.
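A short simulation shows the size of this selection effect for the 40%-power example (noncentrality ≈ 1.71):

```r
set.seed(1)
crit <- qnorm(.975)
ncp  <- crit - qnorm(1 - .40)  # noncentrality that yields 40% power (~1.71)
z    <- rnorm(1e6, mean = ncp)
mean(z[z > crit])              # ~2.7: significant z-scores overestimate the
                               # true expected value of 1.71
```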

At the bottom of z-curve 3.0 plots, you see estimates of local power. These are average power estimates for ranges of z-values. The default is to use steps of z = 0.5. You can see that the strength of the observed z-values does not matter: z-values between 0 and 0.5 are estimated to have 40% power, as are z-values between 5.5 and 6. This happens when all studies have the same power. When studies differ in power, local power increases with the z-values because studies with higher power are more likely to produce larger z-values.

When all studies have the same power, power is said to be homogenous. When studies have different levels of power, power is heterogeneous. Homogeneity or small heterogeneity in power imply that it is easy to infer the power of studies with non-significant results from studies with significant results. The reason is that power is more or less the same. Some selection models like p-curve assume homogeneity. For this reason, it is not necessary to distinguish populations of studies with or without significant results. It is assumed that the true power is the same for all studies, and if the true power is the same for all studies, it is also the same for all subsets of studies. This is different for z-curve. Z-curve allows for heterogeneity in power, and z-curve 3.0 provides a test of heterogeneity. If there is heterogeneity in power, the ERR will be higher than the EDR because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020).

To conclude, this introduction covered the basic statistical concepts that are needed to conduct z-curve analyses and to interpret the results correctly. The key constructs are

Meta-Analysis: the statistical analysis of results from multiple studies
Null Hypothesis Significance Testing
Statistical Significance: p < .05 (alpha)
(Statistical) Power: the probability of obtaining a significant result
Conditional Power: the probability of obtaining a significant result with a true effect
Populations of Studies: A set of studies with a common characteristic
Set of all studies: studies with non-significant and significant results
Selection Bias: An overrepresentation of significant results in a set of studies
(Sub)Set of studies with significant results: Subset of studies with p < .05
Observed Discovery Rate (ODR): the percentage of significant results in a set of studies
Expected Discovery Rate (EDR): the z-curve estimate of the discovery rate based on z-values
Expected Replication Rate (ERR): the z-curve estimate of average power for the subset of significant results.

Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner recently became an emeritus professor in the Department of Statistics at the University of Toronto Mississauga. Jerry started out in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor of statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis, around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responded to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also a co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper, I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.
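For contrast, here is a sketch of the observed-power calculation the reviewer has in mind, a per-study estimate that z-curve never computes (effect size and sample size are made up):

```r
# "Observed power" for a single two-group study: noisy, and not what z-curve does.
d_hat <- 0.40; n <- 50            # sample effect size, per-group sample size
ncp_hat <- d_hat * sqrt(n / 2)    # estimated noncentrality (= 2 here)
1 - pnorm(qnorm(.975) - ncp_hat)  # estimated power ~ .52
```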

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.
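The coin analogy is easy to simulate. This sketch uses an arbitrary uniform distribution of power and shows both stages of randomness as well as the effect of discarding the tails:

```r
set.seed(42)
p_head <- runif(1e5, .05, .95)         # each coin's P(head), i.e., each study's power
heads  <- rbinom(1e5, 1, p_head) == 1  # toss all coins; discard the tails
mean(p_head)                           # mean power of all coins (~.50)
mean(p_head[heads])                    # mean power of the remaining coins (~.64)
```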

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a masters student in statistics. I could give some hints …
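For completeness, here is the derivation for a one-sided z-test with noncentrality δ; other tests follow the same pattern with their noncentral distributions:

```latex
% One-sided z-test: reject for large Z, with Z ~ N(\delta, 1) under the alternative.
% The p-value is P = 1 - \Phi(Z). For 0 < p < 1, its CDF is
F(p) = \Pr(P \le p) = \Pr\big(Z \ge \Phi^{-1}(1-p)\big)
     = 1 - \Phi\big(\Phi^{-1}(1-p) - \delta\big),
% and differentiating with respect to p gives the density
f(p) = \frac{\varphi\big(\Phi^{-1}(1-p) - \delta\big)}{\varphi\big(\Phi^{-1}(1-p)\big)},
% which reduces to the uniform density f(p) = 1 when \delta = 0.
```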

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
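
Concretely, Step 1 is just the inverse-normal transformation; here is a minimal sketch with hypothetical significant p-values:

```r
# Convert significant two-sided p-values into the z statistics that
# would have produced them, had the tests actually been z-tests.
p_sig   <- c(.049, .020, .003, .0004)   # hypothetical p-values
z_equiv <- qnorm(1 - p_sig / 2)         # inverts p = 2 * (1 - pnorm(z))
round(z_equiv, 2)                       # 1.97 2.33 2.97 3.54
```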

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed transforming p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve.1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is biased even when all studies have the same population effect size but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).
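
For readers who have not seen it, Stouffer’s method is essentially a one-liner; here is a minimal sketch with made-up one-sided p-values (not data from any study):

```r
# Stouffer's method: transform p-values to z-scores and combine them.
# Under H0, the sum of k independent standard normal z-scores divided
# by sqrt(k) is again standard normal.
p <- c(.04, .11, .02, .07)            # hypothetical one-sided p-values
z <- qnorm(1 - p)
z_combined <- sum(z) / sqrt(length(z))
p_combined <- 1 - pnorm(z_combined)   # combined one-sided p-value
```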

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they were totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny, and it becomes invisible as sample sizes increase beyond 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so the square root of these is practically a z-statistic. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can take the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? The statistical significance test uses the t-distribution, and sample sizes are often well above 40, so t and z are practically identical. It is therefore not surprising to me that empirical results based on different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over.
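
Before handing back to Jerry, a quick numerical check of these claims in R (my illustration, with arbitrary values):

```r
df <- 38   # N = 40 in a two-group comparison

# t(38) vs. standard normal: the tail probabilities are already close.
2 * pt(-2, df)   # .0526
2 * pnorm(-2)    # .0455

# Chi-square with 1 df is exactly a squared z, so the square root of the
# statistic and the z recovered from its p-value agree perfectly.
x2 <- 6.1
sqrt(x2)                                              # 2.47
qnorm(1 - pchisq(x2, df = 1, lower.tail = FALSE) / 2) # 2.47

# F with 1 numerator df is a squared t, so sqrt(F) behaves like |t|,
# which is itself close to |z| at these sample sizes.
f <- 5.3
sqrt(f)   # 2.30
```

Back to Jerry.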

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually, we did many more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics, all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Z-Curve: An even better p-curve

So far, Simmons, Nelson, and Simonsohn have not commented on this blog post. I have now submitted it as a commentary to JEP-General. Let’s see whether it will be sent out for review and whether they will comment as (anonymous) reviewers.

Abstract

P-curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis after taking selection bias into account. Here I show that p-curve has serious limitations and can provide misleading information about the strength of evidence against the null-hypothesis. I show that all of the information provided by a p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) is also provided by a z-curve analysis (Bartos & Schimmack, 2021). Moreover, z-curve provides additional information about the presence and the amount of selection bias. As z-curve is superior to p-curve, the rational choice is to use z-curve to examine the credibility of significant results.

Keywords: Publication Bias, Selection Bias, Z-Curve, P-Curve, Expected Replication Rate, Expected Discovery Rate, File-Drawer, Power

Introduction

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding (Bem, 2011). Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results (Simmons, Nelson, & Simonsohn, 2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest that humans possess psychic abilities (Bem & Honorton, 1994). Thus, selection bias not only invalidates the results of original studies, it also threatens the validity of conclusions based on meta-analyses that do not take selection bias into account.

Concerns about a replication crisis in psychology led to an increased focus on replication studies. An ambitious project found that only 37% of studies in (cognitive & social) experimental psychology could be replicated (Open Science Collaboration, 2015). This dismal result created a crisis of confidence in published results. To alleviate these concerns, psychologists developed new methods to detect publication bias. These new methods showed that Bem’s paranormal results were obtained with the help of questionable research practices (Francis, 2012; Schimmack, 2012), which explained why replication attempts were unsuccessful (Galak et al., 2012). Furthermore, Francis showed that many published articles in the prestigious journal Psychological Science show signs of publication bias (Francis, 2014). However, the presence of publication bias does not imply that the published results are false (positives). Publication bias may merely inflate effect sizes without invalidating the main theoretical claims. To address the latter question it is necessary to conduct meta-analyses that take publication bias into account. In this article, I compare two methods that were developed for this purpose: p-curve (Simonsohn et al., 2014) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). P-curve was introduced in 2014 and has already been used in many articles. Z-curve was developed in 2015, but was only published recently in a peer-reviewed journal. Experimental psychologists who are familiar with speed-accuracy tradeoffs may not be surprised to learn that z-curve is the superior method. As Brunner and Schimmack (2020) demonstrated with simulation studies, p-curve often produces inflated estimates of the evidential value of original studies. This bias was not detected by the developers of p-curve because they did not evaluate their method with simulation studies. Moreover, their latest version of p-curve was never peer-reviewed. In this article, I first provide a critical review of p-curve's limitations and then show how z-curve addresses all of them.

P-Curve

P-curve is the name for a family of statistical tests that have been combined into an app that researchers can use to conduct p-curve analyses, henceforth simply called p-curve. The latest version of p-curve is 4.06, last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, there are more p-values between 0 and .01 than between .04 and .05. This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), the data lack evidential value; that is, the results are more consistent with the null-hypothesis than with the presence of a real effect.
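
As a toy illustration (simulated data, not output from the app), the five-bin histogram underlying a p-curve plot can be computed in a few lines:

```r
# Simulate p-values from two-sided z-tests with roughly 63% power and
# bin the significant ones into the five intervals used by p-curve.
set.seed(1)
p_all <- 2 * pnorm(-abs(rnorm(5000, mean = 2.3)))
p_sig <- p_all[p_all < .05]
bins  <- cut(p_sig, breaks = seq(0, .05, by = .01))
round(100 * table(bins) / length(p_sig))  # right-skewed: most mass in (0, .01]
```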

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives. The main problem with this significance test is that it does not provide information about effect sizes. A right-skewed p-curve with a significant test result may be due to weak evidence with many false positive results or strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. When studies are heterogeneous (i.e., different sample sizes or effect sizes or both), this estimate is an estimate of mean unconditional power (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). The term unconditional refers to the fact that a significant result may be a false positive result; unconditional power does not condition on the presence of an effect (i.e., on the null-hypothesis being false). When the null-hypothesis is true, a result has a probability of alpha (typically 5%) of being significant. Thus, a p-curve analysis that includes some false positive results includes some studies with a probability of significance equal to alpha and others with probabilities greater than alpha.
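
Mean unconditional power is simply the average over this mixture. A two-line illustration with hypothetical numbers:

```r
# Hypothetical mixture: 40 studies test true nulls (power = alpha) and
# 60 studies test real effects with 80% power.
alpha <- .05
power <- c(rep(alpha, 40), rep(.80, 60))
mean(power)   # mean unconditional power = .50
```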

To illustrate the p-curve app, I conducted a meta-analysis of all published articles by Leif D. Nelson, one of the co-authors of p-curve. I found 119 studies with codable data and coded the most focal hypothesis in each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test for the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The p-curve results are surprising. After all, Nelson openly stated that he used questionable research practices before he became aware of the high false positive risk associated with these practices. “We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255). The impressive estimate of 97% power is in stark contrast to the claim that questionable research practices were used to produce Nelson’s results. A z-curve analysis of the data shows that the p-curve results provide false information about the robustness of Nelson’s published results.

Z-Curve

Like p-curve, z-curve analyses are supplemented by a plot of the data. The main difference is that p-values are converted into z-scores using the formula for the inverse normal distribution: z = qnorm(1-p/2). The second difference is that both significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information in the distribution of z-scores all the way up to z = 6 (p = .000000002; 1/500,000,000). Z-statistics greater than 6 are assigned a power of 1.
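
Reusing the simulated data from the p-curve sketch above, a bare-bones z-curve-style plot takes only a few lines:

```r
# Convert all p-values (significant and non-significant) to absolute
# z-scores and plot a fine-grained histogram with the significance
# threshold marked.
set.seed(1)
p_all <- 2 * pnorm(-abs(rnorm(5000, mean = 2.3)))
z_all <- qnorm(1 - p_all / 2)
hist(z_all[z_all < 6], breaks = 60)
abline(v = qnorm(1 - .05 / 2), col = "red")  # z = 1.96, the red line
```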

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely, clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (.05 to .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definitions of high and low in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plot is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop in frequencies from just significant to just non-significant values is inconsistent with random sampling error. Thus, publication bias is readily visible by visual inspection of a z-curve plot. In contrast, p-curve plots provide no information about publication bias because non-significant results are not shown. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no publication bias or use of questionable research practices (e.g., Rusz, Le Pelley, Kompier, Mait, & Bijleveld, 2020). This misinterpretation of p-curve plots can be easily avoided by inspection of z-curve plots.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the expected discovery rate and the expected replication rate (Bartos & Schimmack, 2021). Other terms for these parameters are mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example in which a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. The outcome produces significant and non-significant p-values. The expected frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power plus 5 for the 100 false hypotheses, which produce 5% significant results when alpha is set to 5%. Thus, we expect 105 significant results and 95 non-significant results. In this example, the discovery rate is 105/200 = 52.5%. With real data, the discovery rate is often not known because not all statistical tests are published. When selection for significance is present, the observed discovery rate is an inflated estimate of the actual discovery rate. For example, if 50 of the 95 non-significant results are missing, the observed discovery rate is 105/150 = 70%. Z-curve.2.0 uses the distribution of the significant z-scores to estimate the discovery rate while taking selection bias into account. That is, it uses the truncated distribution of z-scores greater than 1.96 to estimate the shape of the full distribution (i.e., the grey curve in Figure 2). This produces an estimate of the mean power before selection for significance. As significance is determined by power and sampling error, the estimate of mean power provides an estimate of the expected discovery rate. Figure 2 shows an observed discovery rate of 87%. This is in line with estimates of discovery rates around 90% in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). However, the z-curve estimate of the expected discovery rate is only 27%. The bootstrapped, robust confidence interval around this estimate ranges from 5% to 51%. As this interval does not include the observed discovery rate, the results provide statistically significant evidence that questionable research practices were used to produce 87% significant results. Moreover, the difference between the observed and expected discovery rate is large. This finding is consistent with Nelson’s admission that many questionable research practices were used to achieve significant results (Simmons et al., 2018). In contrast, p-curve provides no information about the presence or amount of selection bias.
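
The arithmetic of this example, spelled out:

```r
# 100 tests of true nulls plus 100 tests with 100% power, alpha = .05.
expected_sig <- 100 * .05 + 100 * 1   # 105 expected significant results
expected_ns  <- 200 - expected_sig    # 95 expected non-significant results
expected_sig / 200                    # true discovery rate: .525

# If 50 of the 95 non-significant results stay in the file drawer,
# the observed discovery rate is inflated:
expected_sig / (expected_sig + expected_ns - 50)  # 105/150 = .70
```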

The power estimate provided by the p-curve app is the mean power of studies with a significant result. Mean power of these studies is equal to or greater than the mean power of all studies because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020). Bartos and Schimmack (2021) refer to mean power after selection for significance as the expected replication rate. To explain this term, it is instructive to see how selection for significance influences mean power in the example with 100 tests of true null-hypotheses and 100 tests of true alternative hypotheses with 100% power. We expect only 5 false positive results and 100 true positive results. The average power of these 105 studies is (5 * .05 + 100 * 1)/105 = 95.5%. This is much higher than the mean power before selection for significance, which was based on 100 rather than just 5 tests of a true null-hypothesis. For Nelson’s data, p-curve produced an estimate of 97% power. Thus, p-curve predicts that 97% of replication attempts of Nelson’s published results would produce a significant result again. The z-curve estimate in Figure 2 shows that this is a dramatically inflated estimate of the expected replication rate. The z-curve estimate is only 52%, with a robust 95% confidence interval ranging from 40% to 68%. Simulation studies show that z-curve estimates are close to the simulated values, whereas p-curve estimates are inflated when the studies are heterogeneous (Brunner & Schimmack, 2020). The p-curve authors have been aware of this bias in p-curve estimates since January 2018 (Simmons, Nelson, & Simonsohn, 2018), but they have not changed their app or warned users about this problem. The present example clearly shows that p-curve estimates can be highly misleading and that it is unscientific to use or interpret p-curve estimates of the expected replication rate.
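
The corresponding calculation for mean power after selection in the same example:

```r
# Of the expected 105 significant results, 5 are false positives with
# true power .05 and 100 are true positives with power 1.
(5 * .05 + 100 * 1) / 105   # mean power after selection: ~.955
```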

Published Example

Since p-curve was introduced, it has been cited in over 500 articles and it has been used in many meta-analyses. While some meta-analyses correctly interpreted p-curve results as demonstrating merely that a set of studies has some evidential value (i.e., that the nil-hypothesis that all significant results are false positives can be rejected), others went further and drew false conclusions from a p-curve analysis. Moreover, meta-analyses that used p-curve missed the opportunity to quantify the amount of selection bias in a literature. To illustrate how meta-analysts can benefit from a z-curve analysis, I reexamined a meta-analysis of the effects of reward stimuli on attention (Rusz et al., 2020).

Using their open data (https://osf.io/rgeb6/), I first reproduced their p-curve analysis using the p-curve app (http://www.p-curve.com/app4/). Figure 3 shows that 42% of the p-values fall between 0 and .01, whereas only 7% fall between .04 and .05. The figure also shows that the observed p-curve is similar to the p-curve predicted by a homogeneous set of studies with 33% power. Nevertheless, power is estimated to be 52%. Rusz et al. (2020) interpret these results as evidence that “this set of studies contains evidential value for reward-driven distraction” and that “It provides no evidence for p-hacking” (p. 886).

Figure 4 shows the z-curve for the same data. Visual inspection of the z-curve plot shows that there are many more just-significant than just-not-significant results. This impression is confirmed by a comparison of the observed discovery rate (74%) with the expected discovery rate (27%). The bootstrapped, robust 95% confidence interval, 8% to 58%, does not include the observed discovery rate. Thus, there is statistically significant evidence that questionable research practices inflated the percentage of significant results. The expected replication rate (37%) is also lower than the p-curve estimate (52%). With an average power of 37%, it is clear that the published studies are underpowered. Based on these results, it is clear that effect-size meta-analyses that do not take selection bias into account produce inflated effect size estimates. Moreover, when the ERR is higher than the EDR, studies are heterogeneous, which means that some studies have even less power than the average of 37%, and some of these may be false positive results. It is therefore unclear which reward stimuli and which attention paradigms show a theoretically significant effect and which do not. However, meta-analysts often falsely generalize an average effect to individual studies. For example, Rusz et al. (2020) concluded from their significant average effect size (d ~ .3) that high-reward stimuli impair cognitive performance “across different paradigms and across different reward cues” (p. 887). This conclusion is incorrect because the mean effect size is inflated and could be driven by subsets of reward stimuli and paradigms. To demonstrate that a specific reward stimulus influences performance on a specific task would require high-powered replication studies for the various combinations of rewards and paradigms. At present, the meta-analysis merely shows that some rewards can interfere with some tasks.

Conclusion

Simonsohn et al. (2014) introduced p-curve as a statistical tool to correct for publication bias and questionable research practices in meta-analyses. In this article, I critically reviewed p-curve and showed several limitations of and biases in p-curve results. The first p-curve methods focused on statistical significance and did not quantify the strength of evidence against the null-hypothesis that all significant results are false positives. This problem was solved by introducing a method that quantified strength of evidence as the mean unconditional power of studies with significant results. However, the estimation method was never validated with simulation studies. Independent simulation studies showed that p-curve systematically overestimates power when effect sizes or sample sizes are heterogeneous. In the present article, this bias inflated mean power for Nelson’s published results from 52% to 97%. This is not a small or negligible deviation. Rather, it shows that p-curve results can be extremely misleading. In an application to a published meta-analysis, the bias was less extreme but still substantial: 37% vs. 52%, a difference of 15 percentage points. As the amount of bias is unknown unless p-curve results are compared to z-curve results, researchers can simply use z-curve to obtain an estimate of mean power after selection for significance, that is, the expected replication rate.

Z-curve not only provides a better estimate of the expected replication rate. It also provides an estimate of the expected discovery rate, that is, the percentage of results that would be significant if all studies were available (i.e., after researchers empty their file drawers). This estimate can be compared to the observed discovery rate to examine whether selection bias is present and how large it is. In contrast, p-curve provides no information about the presence of selection bias or the use of questionable research practices.

In sum, z-curve does everything that p-curve does, does it better, and provides additional information. As z-curve is better than p-curve on all features, the rational choice is to use z-curve in future meta-analyses and to reexamine published p-curve analyses with z-curve. To do so, researchers can use the free R-package zcurve (Bartos & Schimmack, 2020).
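
A hedged sketch of what such an analysis might look like (function and argument names as I understand the zcurve package documentation; check ?zcurve before relying on this):

```r
# install.packages("zcurve")
library(zcurve)

# Hypothetical significant two-sided p-values, converted to z-scores.
p_sig <- c(.001, .004, .012, .023, .034, .041, .049)
fit   <- zcurve(z = qnorm(1 - p_sig / 2))
summary(fit)  # should report EDR and ERR estimates with confidence intervals
plot(fit)
```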

References

Bartoš, F., & Schimmack, U. (2020). zcurve: An R package for fitting z-curves. R package version 1.0.0.

Bartoš, F., & Schimmack, U. (2021). Z-curve.2.0: Estimating the replication and discovery rates. Meta-Psychology, in press.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. http://dx.doi.org/10.1037/a0021524

Bem, D. J., & Honorton, C. (1994). Does psi exist? Replicable evidence for an anomalous process of information transfer. Psychological Bulletin, 115(1), 4–18. https://doi.org/10.1037/0033-2909.115.1.4

Brunner, J. & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, https://doi.org/10.15626/MP.2018.874

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. http://dx.doi.org/10.3758/s13423-012-0227-9

Francis G., (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin and Review, 21, 1180–1187. https://doi.org/10.3758/s13423-014-0601-x

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate. Journal of Personality and Social Psychology, 103, 933–948. http://dx.doi.org/10.1037/a0029709

Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34–58. https://doi.org/10.1037/pspa0000084

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716–aac4716. https://doi.org/10.1126/science.aac4716

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638

Rusz, D., Le Pelley, M. E., Kompier, M. A. J., Mait, L., & Bijleveld, E. (2020). Reward-driven distraction: A meta-analysis. Psychological Bulletin, 146(10), 872–899. https://doi.org/10.1037/bul0000296

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. https://doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. 61 (4), 364-376. https://doi.org/10.1037/cap0000246

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives on Psychological Science, 13(2), 255–259. https://doi.org/10.1177/1745691617698146

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242

Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34. https://doi.org/10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112. https://doi.org/10.2307/2684823