Category Archives: Z-Curve

Power Failure Revisited: A Z-Curve Analysis of Button et al.

Power Failure, False Positives, and The Replication Crisis

Scientists have become increasingly skeptical about the credibility of published results (Baker, 2016). The main concern is that scientists present results as objective facts, even though the results are often influenced by undisclosed subjective decisions that increase the chances of obtaining a desirable result. These degrees of freedom in data analysis are now called questionable research practices or p-hacking.

Ioannidis (2005) showed with hypothetical scenarios that questionable research practices combined with low statistical power and testing of many false hypotheses could lead to more false than true discoveries of statistical regularities (i.e., a statistically significant result).

Awareness of this problem has produced thousands of new articles that discuss it. It has even created a new science called meta-science: the scientific study of science. Some articles have gained prominent status and are foundational to meta-science.

For example, the Reproducibility Project in psychology replicated 100 studies. While 97 of these studies reported a statistically significant result, only 36% of the replication studies showed a significant result. The drop in the success rate can be attributed to questionable research practices that inflated effect size estimates to achieve significance. Honest replications did not have this advantage, and the true population effect sizes were often too small to produce significant results.

The true probability of obtaining a statistically significant result is called statistical power (Cohen, 1988; Neyman & Pearson, 1933). In the long run, a set of studies with an average true power of 50% is expected to produce 50% significant results, even if all studies test different hypotheses (Brunner & Schimmack, 2020). Thus, the success rate of the Reproducibility Project implies that the replication studies had roughly 40% average power. As these studies replicated original studies as closely as possible, with similar sample sizes, this suggests that the average power of the original studies was also around 40%.

This estimate is in line with Cohen’s (1962) seminal estimate of power. Average power around 40% has two implications. First, many attempts to demonstrate an effect in a single study will fail to reject a false null hypothesis that there is no relationship; a false negative result (Cohen, 1988). Concerns about false negatives were the focus of meta-scientific discussions about significance testing in the 1990s (Cohen, 1994).

This shifted when meta-scientists pointed out the consequences of selection for significance and low power (Ioannidis, 2005; Rosenthal, 1979; Sterling et al., 1995). Low statistical power combined with questionable research practices could result in many false discoveries (i.e., statistically significant results without a real effect). In some scenarios, literatures could consist entirely of false discoveries (Rosenthal, 1979) or at least contain more false than true discoveries (Ioannidis, 2005).

Theoretical articles and simulation studies suggested that false positive rates might be uncomfortably high, and replication failures seemed to support this suspicion, although replication failures could also just be false negative results (Maxwell, 2016). Thus, actual replication studies often do not settle conflicting interpretations of the evidence. While some researchers see replication failures as evidence that original results cannot be trusted, others point towards the difficulty of replicating actual studies and false negatives as reasons why original results could not be replicated (Gilbert et al., 2016).

An alternative approach examines false positives for sets of studies rather than a single study. The statistical results of original articles are used to estimate the average power of studies, and these power estimates are used to evaluate the risk of false positive results. One of the first attempts to do so was Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson, and Munafò's (2013) article "Power failure: why small sample size undermines the reliability of neuroscience." The key empirical finding was that the median power of 730 studies from 49 meta-analyses was 21%. The article did not provide an empirical estimate of the false positive rate, but it did illustrate the implications of the power estimate for false positive rates in various scenarios. The authors suggested that "a major implication is that the likelihood that any nominally significant finding actually reflects a true effect is small" (p. 371). This claim has contributed to concerns that many published significant results are unreliable.

Reexamining The Power Failure

More than ten years later, it is possible to revisit the seminal article with the benefit of hindsight. Advances in the estimation of true power have revealed important conceptual problems: estimating the true power of completed studies is different from computing hypothetical power for the purpose of sample size planning (Brunner & Schimmack, 2020; Soto & Schimmack, 2026).

Cohen (1988) defined statistical power as the probability of obtaining a significant result. In the context of sample size planning, however, power is defined as the probability of obtaining a significant result given a hypothetical population effect size greater than zero. This conditional definition of power, given a true hypothesis, is widely used in the power literature and was also used by Ioannidis (2005) in his calculations of false positive rates.

Assuming that only true hypotheses are tested is reasonable for hypothetical scenarios, but not for the estimation of the true power of completed studies. Because the population effect size remains unknown after a study has produced an effect size estimate, it is not possible to assume an effect size greater than zero. Thus, the true probability that a completed study produces a significant result is unconditional and independent of the distinction between H0 and H1. Any estimate of average true power is therefore an estimate of the unconditional probability of producing a significant result. This average can include tests of true null hypotheses.

The distinction between conditional and unconditional probabilities has important implications for Button et al.'s calculations of false positive rates. The median power of 21% is unconditional, but the false positive calculations assume conditional power. This can lead to inflated estimates of false positive rates. For example, mean power of 20% could be made up of 50% tests of true H0 with a 5% probability of producing a (false) significant result and 50% tests of H1 with 35% power. In this scenario, the false positive rate is 2.5% / (2.5% + 17.5%) = 12.5%. Increasing the ratio of true null hypotheses to true alternatives to 4:1 would require conditional power of 80% for tests of H1 to maintain 20% average power. The false positive rate would increase to .04 / (.04 + .16) = 20%. As noted by Soric (1989), we can even compute the maximum false positive rate that is consistent with a given unconditional mean power by assuming conditional power of 1. With mean power of 20%, the maximum ratio of H0 to H1 is 5.25:1 and the maximum false discovery rate is 21% (Table 1); with Button et al.'s median power of 21%, the maximum false discovery rate is 20%.

Table 1

Maximum False Discovery Rate for 20% Unconditional Power (Soric, 1989)

                        Not Significant   Significant    Total
H₁ True                      .000             .160        .160
H₀ True                      .798             .042        .840
Total                        .798             .202       1.000
H₀ : H₁ Ratio               5.25 : 1
False Discovery Rate                          .208

Note. The table shows the maximum false discovery rate when average unconditional power equals 20%. This maximum occurs when conditional power for true hypotheses (H₁) equals 100%. The false discovery rate equals the proportion of significant results that are false positives: .042 / .202 = .208. Any lower conditional power with the same unconditional power of 20% produces a lower false discovery rate.

Soric's formula: max.FDR = (1 / Mean.Power - 1) * (alpha / (1 - alpha))
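The numbers above are easy to verify. Here is a minimal R implementation of Soric's formula (an illustration, not code from the original article):

    # Soric's (1989) maximum false discovery rate for a given unconditional
    # mean power (i.e., discovery rate) and significance level alpha
    max_fdr <- function(mean_power, alpha = .05) {
      (1 / mean_power - 1) * (alpha / (1 - alpha))
    }

    max_fdr(.20)   # ~ .21, the Table 1 scenario
    max_fdr(.21)   # ~ .20, Button et al.'s median power
    max_fdr(.35)   # ~ .10, Button et al.'s mean power (discussed below)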

The 20% maximum false discovery rate implied by 21% median power overestimates the true false positive rate for two reasons. First, Soric's formula assumes that all true hypotheses (H1) are tested with 100% power. Assuming that many tests of small true effect sizes in small samples have low conditional power, the true false positive rate is below 20%. Second, unconditional power has a skewed distribution with many low-powered studies and a few high-powered studies. As a result, mean power is higher than median power. Button et al. provide information about mean power in their analysis of publication bias, which relies on mean power. This analysis suggested that 254 of the 730 studies were expected to produce a significant result, and the expected percentage of significant results is equivalent to mean power (Brunner & Schimmack, 2020). Thus, mean power was estimated to be 254 / 730 = 35%. Based on Soric's formula, the maximum false discovery rate with 35% mean power is 10%.

In conclusion, Button et al.'s estimate of unconditional mean power can be used to draw inferences about false positives in the meta-analyses that they examined, without relying on unknown ratios of true and false hypotheses being tested in neuroscience. Applying Soric's formula to their data suggests that the false positive risk is fairly small.

A Z-Curve Analysis of Button et al.'s Data

Button et al.'s article contributed to a culture of open data sharing, but sharing data was not yet the norm when the article was published. Fortunately, Nord et al. (2017) conducted further analyses of the data and shared power estimates for the 730 studies in an Open Science Framework (OSF) project. The power estimates do not use the effect sizes of individual studies, which are likely to be inflated by publication bias. Rather, they use the sample sizes of individual studies and the meta-analytic effect size to estimate power. This approach corrects for effect size inflation in smaller studies and reduces bias in power estimates. The following analyses used these data.

Based on these data, 28% of the studies were statistically significant. Mean power was 35%, matching Button et al.'s estimate, which suggests that Nord et al.'s power values are indeed based on meta-analytic effect sizes.

I converted power values into z-values and analyzed the z-values with z-curve.3.0 using the default model (Figure 1).
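As an illustration of this step, here is a minimal R sketch, assuming a two-sided test with alpha = .05 and the zcurve R package's documented interface; `nord_power` is a placeholder for the vector of 730 power estimates, and this is not the actual analysis script:

    library(zcurve)

    # Convert true power (two-sided alpha = .05) into the implied mean of a
    # study's z-statistic: power = P(Z > 1.96 | mean = ncp), ignoring the
    # negligible opposite tail.
    power_to_ncp <- function(power, alpha = .05) {
      qnorm(power, mean = qnorm(1 - alpha / 2))
    }

    ncp <- power_to_ncp(nord_power)         # nord_power: 730 power estimates
    z   <- rnorm(length(ncp), mean = ncp)   # one simulated observed z per study
    fit <- zcurve(z, method = "EM", bootstrap = 500)
    summary(fit)
    plot(fit)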

The observed discovery rate (ODR) is simply the percentage of significant results. More important is the bias-corrected estimate of unconditional mean power for all 730 z-values. Z-curve uses the observed distribution of significant z-values and projects the fitted model into the range of non-significant results. As shown in Figure 1, the model predicts the actual distribution of non-significant results fairly well. This suggests that the use of meta-analytic effect sizes corrected inflated effect size estimates and removed publication bias. The estimated mean power for all studies is called the expected discovery rate (EDR). The EDR estimate is close to the ODR, further suggesting that the data are unbiased.

A key problem of estimating the EDR from the significant results alone is that the confidence interval around the point estimate is very wide. When the data show no major bias, more precise estimates can be obtained by fitting the model to all 730 data points (Figure 2).

The key finding is that the point estimate of the false positive risk, FDR = 13%, is in line with the calculations based on Button et al.'s estimate of mean power. The confidence interval caps the FDR at 20%. This is an upper limit because the conditional power of studies with significant results is likely to be less than 100%.

In fact, z-curve makes it possible to estimate the conditional power of significant studies. First, z-curve estimates the unconditional average power of significant studies. This parameter is called the expected replication rate (ERR) because it predicts how many studies would produce a significant result again in a hypothetical replication project that reproduces the original studies exactly with new samples. The ERR is 54% with an upper limit of 60% for the 95% confidence interval. We also know that no more than 20% of these studies are false positives. Assuming 80% true hypotheses, the average conditional power cannot be higher than (.60 - .20 * .05) / .80 = 74%. Thus, Soric's assumption of 100% power is conservative, and the false positive rate is likely to be lower.
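The arithmetic behind this bound can be checked in a few lines of R (values taken from the text above):

    err_upper <- .60   # upper limit of the ERR confidence interval
    fdr_max   <- .20   # maximum share of false positives among significant results
    alpha     <- .05
    # ERR = fdr_max * alpha + (1 - fdr_max) * conditional_power; solve for power:
    (err_upper - fdr_max * alpha) / (1 - fdr_max)   # ~ .74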

In conclusion, a z-curve analysis of Nord et al.’s power estimates for Button et al.’s meta-analyses confirms estimates that could have been obtained by applying Soric’s formula to Button et al.’s estimate of mean power. The true rate of false positive results remains unknown, but it is unlikely to be more than 20%.

Heterogeneity Across Research Areas

Nord et al. (2017) demonstrated that power varies across the different research areas that were included in Button et al.'s sample of meta-analyses. Some of these areas had enough studies to conduct z-curve analyses for these specific areas. The most interesting area is candidate-gene studies, which relate genotypic variation in single genes to phenotypes across participants. With the benefit of hindsight, it is known that variation in a single gene has trivial effects on complex traits and that many of the significant results in these studies were, for all practical purposes, false positive results (Duncan & Keller, 2011). 234 of the 730 studies came from this research area. Figure 3 shows the results. Interestingly, only 11% of the results were statistically significant. Thus, the low average power can be explained by the many studies that reported non-significant results. There is no evidence of publication bias in these meta-analyses.

Using Soric's formula, the low EDR translates into a high false positive risk of 42%, and the upper limit of the 95% confidence interval includes 100%. Thus, z-curve confirms that the rare significant results in this literature could be false positives. Moreover, most significant results are just barely significant. There are hardly any results that provide strong evidence (z > 4) against the null hypothesis.

In short, a large portion of the 730 studies came from a research area that is known to have produced few significant results. This finding implies that other research areas are producing more credible significant results (Nord et al., 2017).

A second set of meta-analyses consisted of clinical trials. Clinical trials have received considerable attention in analyses of Cochrane meta-analyses and of abstracts in original articles that report the key statistical result (Jager & Leek, 2013; Schimmack & Bartos, 2023; van Zwet et al., 2024). The results suggest that unconditional mean power is around 30% and the false positive risk is between 10% and 20%. These results serve as benchmarks for the z-curve analysis of the 145 clinical trials in Button et al.'s study (Figure 4).

The EDR is somewhat lower, 21%, but the 95% confidence interval includes 30%. The FDR is 19%, but the confidence interval includes 13%. Thus, the results are a bit weaker, but mostly consistent with evidence from estimates based on thousands of results. These FDR estimates are notably lower than the false positive rates predicted by Ioannidis's scenarios, which assumed high rates of true null hypotheses.

The third domain consisted of studies from psychology. Psychological scientists have examined the credibility of their research in the wake of replication failures (Open Science Collaboration, 2015). Suddenly, significant results in multiple studies within a single article were no longer attributed to reliable effects, but seen as signs of selection for significance (Schimmack, 2012). Francis (2014) found that over 80% of these multi-study articles showed statistically significant evidence of bias. Large-scale multi-lab replication studies also showed that effect size estimates in the original studies could be inflated by a factor of 10, shrinking effect sizes from d = .6 to d = .06 (Vohs et al., 2019). A z-curve analysis of a representative sample of studies in social psychology estimated the average unconditional power before selection for significance to be EDR = 19%, with a corresponding FDR = 22%. Cohen (1962) already based his power estimates on all reported tests, and estimates are similar for focal and non-focal results. This was also the case in a survey of emotion research (Soto & Schimmack, 2024). Soto and Schimmack (2024) reported an EDR of 30% and a corresponding FDR = 12% (k sig = 21,628) for all automatically extracted tests, and an EDR of 27%, FDR = 14%, for hand-coded focal tests (k sig = 227). These results serve as a comparison standard for the z-curve analysis of the 145 studies classified as psychological research by Nord et al. (2017). The EDR is 49%, FDR = 5%. Even the lower limit of the EDR confidence interval, 39%, implies only 8% false positives among the significant results.


There are several reasons why these results differ from other findings. First, the focus on meta-analyses leads to an unrepresentative sample of the entire literature. Meta-analyses often include many more non-significant results and show less bias than original articles. Second, the specific set of meta-analyses was not representative of the broader literature in psychology. Thus, the results cannot be generalized from the specific studies in Button et al.'s sample to psychology or neuroscience in general. That would require representative sampling or collecting data from all studies using automatic extraction of test statistics.

Discussion

Button et al.'s (2013) article was a first attempt to assess the credibility of empirical results with empirical estimates of power based on meta-analytic effect sizes and sample sizes. The median power was low (21%). The key implication of this finding was that researchers often fail to reject null hypotheses and may use questionable research practices to report significant results in published articles. Low power and bias could lead to many false positive results. The article added to other concerns about the reliability of findings in neuroscience (Vul et al., 2009).

Most citations took Button et al.’s findings and implications at face value. Nord et al. (2017) pointed out that power and false positive rates varied across research areas. Most notably, candidate gene studies have lower power and a much higher false positive risk. Including these studies in the calculation of median power may have led to false perceptions of other research areas.

Here I presented the first serious critical examination of Button et al.'s methodology and inferences and found several problems that undermine their pessimistic assessment of neuroscience. First, they estimated unconditional power, but their false positive calculations require estimates of conditional power. Second, false positive rates depend on mean power, not median power. Mean power was 35%, which is close to the estimate for psychology based on actual replication studies (OSC, 2015). Third, they made unnecessary assumptions about the ratio of true and false hypotheses being tested, when unconditional power alone is sufficient to estimate false positive rates (Soric, 1989). Fourth, they relied on meta-analyses to correct for publication bias, but meta-analyses are not representative of the broader literature.

Meta-science is like other sciences. Ideally, critical analyses reveal problems and new innovations address these problems. Power estimation started in the 1960s with Cohen's seminal article. Cohen (1962) worked with plausible effect sizes, but did not aim to estimate the true power of studies. Moreover, his work and statistical power in general were largely ignored (Cohen, 1990; Sedlmeier & Gigerenzer, 1989).

Conclusion

The replication crisis stimulated renewed interest in methods that use observed results to draw inferences about the power of actual studies (Ioannidis & Trikalinos, 2007; Francis, 2014; Schimmack, 2012; Simonsohn, Nelson, & Simmons, 2014). This work shifted attention from prospective power calculations to the retrospective assessment of evidential strength in published literatures. Two challenges emerged as central. First, selection bias inflates the observed rate of significant results, requiring methods that correct for selection. Second, power varies across studies, requiring models that allow for heterogeneity rather than assuming a single common effect size or power level. Early approaches addressed selection under simplifying assumptions, typically treating power as homogeneous across studies. As a result, their inferences become unreliable when studies differ in sample size, effect size, or both (Brunner & Schimmack, 2020; Schimmack, 2026).

Z-curve extends this line of work by explicitly modeling both selection and heterogeneity, estimating a distribution of power across studies rather than a single average. This provides a framework for quantifying key properties of the literature, including expected discovery and replication rates, and for linking these quantities to false discovery risk (Sorić, 1989). In this sense, z-curve represents a substantive advance in the empirical assessment of the credibility of published findings. Like earlier contributions such as Button et al., it is unlikely to be the final word, but it is currently the most advanced method to estimate true power for sets of studies with heterogeneity in power and selection bias.

Is Z-Curve Just Another P-Curve?

P-curve is a statistical tool that was designed to evaluate the statistical credibility of significant results. When only significant results are published, it is unclear how much selection for significance contributed to the results. In the worst case scenario, all published results are false positives. P-curve uses a variety of approaches to test this worst case scenario. If the null-hypothesis can be rejected, the data are said to have evidential value; that is, at least some of the studies rejected a false null-hypothesis.

P-curve was published without extensive validation research. Critical examination of the method has focused on the estimate of average power (Brunner, 2018; Brunner & Schimmack, 2020). Average power can quantify the strength of evidence against the null hypothesis rather than merely rejecting the null hypothesis of no evidential value. For example, a set of studies could have 18% average power, suggesting that some significant results were true positives, but also showing that the literature contains many studies with low power.

The problem with p-curve is that, contrary to claims by its developers, it produces inflated estimates of power when studies vary in power. For example, it predicts that 91% of replications in the reproducibility project should have been successful (Open Science Collaboration, 2015), when only 36% of the actual replications were. This bias is expected given the large heterogeneity in power across these studies (Schimmack & Soto, 2026). A solution to this problem is to use z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). Z-curve is explicitly designed for heterogeneous data and performs well with low and high heterogeneity (Schimmack & Soto, 2026).

Morey and Davis-Stober (2025) raised further concerns about the statistical properties of p-curve. Given the similar aims of p-curve and z-curve, it is reasonable to wonder whether z-curve suffers from some of the same problems as p-curve, despite its ability to handle heterogeneity well. I asked Claude AI to examine this question and it concluded that z-curve is built on a fundamentally different approach than p-curve that avoids many of p-curve’s pitfalls. Here is a summary of the evaluation.

Full table

Criticism | Heterogeneity-dependent? | Affects power estimation? | Generalizes to z-curve?
EV* inadmissibility (probit/concave acceptance region) | No | Yes (same transform used) | No
Nonmonotonicity (compound half p-curve) | No | No | No
Boundary sensitivity (probit maps boundary to ∞) | No | Yes | No (EM is smooth)
LEV/LEV* large-value blindness | No | Indirectly | No
Power estimation inconsistency | Yes (core mechanism) | Yes (the main finding) | No
Conceptual: not tests of skew | No | Partly | No (z-curve doesn't claim this)
Conceptual: noncentrality ≠ effect size | No | Partly (p-curve conflates them in its framing) | Not applicable (z-curve targets power, not effect size)

P-curve’s problems go beyond heterogeneity

The most fundamental problem is inadmissibility of the core test of evidential value (EV). The core test — the version currently in the p-curve app — uses a probit transformation that produces a concave acceptance region in the test statistic space. By results from Birnbaum (1954) and Marden (1982), this makes the test inadmissible: its power is dominated by other tests for every possible alternative, including the homogeneous case. The 2015 switch from the log to the probit transformation was motivated by wanting robustness to extreme values, but admissibility requires exactly the property that was engineered out — sensitivity to large individual test statistics.

The compound half p-curve rule introduces nonmonotonicity: increasing the evidence in a single study can flip the procedure from rejection to acceptance and back, multiple times, along a monotonically increasing path. This is a purely structural consequence of the hard boundary at αpc/2 combined with the probit transform, and has nothing to do with whether effect sizes are heterogeneous.

Test LEV, which is supposed to detect "lack of evidential value," has an additional pathology: arbitrarily large test statistics contribute zero weight to the sum, because they map to log(1) = 0. A single study with a p-value just below 0.05 can dominate the test and force rejection regardless of how large every other test statistic is. Six studies with Z = ∞ plus one study at Z = 1.97 yield the same test statistic as that single study at Z = 1.97 on its own.

None of these problems affect z-curve. Z-curve uses EM estimation on a mixture of truncated normal distributions, fitting the full shape of the observed z-score distribution above the significance threshold. Large z-scores contribute information proportional to their posterior weight on high-NCP components. The EM likelihood surface is smooth and does not blow up near the truncation boundary. There is no compound decision rule. And because z-curve’s target quantities are replicability (ERR) and discovery rate (EDR) — both functions of noncentrality parameters — there is no conflation of power with effect size.

The Morey and Davis-Stober paper does not mention z-curve. It does not need to. Their formal results simply confirm, from a different direction and with different tools, what simulation studies have shown for years: p-curve’s statistical machinery is not up to the job it advertises. Z-curve was designed from the start to avoid exactly these pitfalls.

In short, z-curve is not just another p-curve. While the aims are similar, the statistical approach and the ability to handle realistic amounts of heterogeneity are very different. Morey and Davis-Stober’s critique is limited to p-curve and does not generalize to z-curve.

A Z-Curve Analysis of Emotion Journals: Soto & Schimmack 2024

For the full article see:

Full citation: Soto, M. D., & Schimmack, U. (2024). Credibility of results in emotion science: A Z-curve analysis of results in the journals Cognition & Emotion and Emotion. Cognition and Emotion. https://doi.org/10.1080/02699931.2024.2443016

OSF repository: https://osf.io/42vxd/

Purpose of this document: This is a detailed analytical summary written entirely in the summarizer’s own words. It is intended to make the paper’s methods, results, and arguments accessible for discussion and analysis without reproducing copyrighted text. Readers should consult the original article for exact language and figures.


Structured Summary

1. Motivation and Research Question

The paper addresses whether the replication crisis — documented most prominently by the Open Science Collaboration (2015), which found only 36% of psychology results replicated — extends to the emotion research literature specifically. The authors note that the OSC findings were limited to articles from 2008 and may not generalize to emotion research, which has its own dedicated journals and traditions.

The two journals examined are Cognition & Emotion (established 1987) and Emotion (established 2001 by APA). The authors aimed to assess: (a) how much selection bias exists in these journals, (b) what proportion of published results might be false positives, (c) what the expected replication rate is, and (d) whether these indicators have improved over time in response to the replication crisis.


2. Z-Curve Method: How It Works

The paper uses Z-curve 2.0 (Bartoš & Schimmack, 2022), which takes a set of test statistics, converts them to absolute z-scores, and fits a finite mixture model to the distribution of statistically significant z-values (those exceeding 1.96). The method produces four key estimates:

Expected Discovery Rate (EDR): An estimate of the average true power of studies before selection for significance. This represents what proportion of all conducted tests (including unpublished ones) would be expected to reach significance. It is conceptually the mean power across the full population of tests.

Expected Replication Rate (ERR): An estimate of mean power after selection for significance — that is, among published significant results. Because significance selection favors higher-powered studies, ERR is always higher than EDR. The authors frame ERR as an optimistic upper bound on expected replication success.

Observed Discovery Rate (ODR): Simply the proportion of extracted test statistics that were statistically significant at p < .05. Comparing ODR to EDR quantifies selection bias: a large gap indicates that many non-significant results went unreported.

False Discovery Risk (FDR): Computed from the EDR using Soric’s (1989) formula, which gives the maximum proportion of significant results that could be false positives given a particular discovery rate.

The authors explicitly note that ERR overestimates actual replication success (comparing z-curve’s ERR for the OSC dataset to the actual 36% rate), and they recommend interpreting the true replication rate as falling somewhere between EDR and ERR, citing Sotola (2023) for empirical support.
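To connect these four estimates, here is a hedged R sketch, assuming the zcurve package's documented interface; `z` is a placeholder for a vector of absolute z-scores like those the authors extracted:

    library(zcurve)

    odr <- mean(z > qnorm(.975))             # observed discovery rate (p < .05)
    fit <- zcurve(z, method = "EM", bootstrap = 500)
    summary(fit)                             # reports EDR and ERR with CIs

    # Soric's (1989) upper bound on the false discovery risk, given the EDR
    soric_fdr <- function(edr, alpha = .05) (1 / edr - 1) * (alpha / (1 - alpha))
    soric_fdr(.30)                           # EDR of 30% -> FDR of at most ~12%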


3. Methods

3.1 Test Statistic Extraction

The authors collected the complete set of published articles from both journals (3,831 from C&E covering 1987–2023; 2,323 from Emotion covering 2001–2023). Using custom R code built on the pdftools package (Ooms, 2024), they automatically extracted reported test statistics: F-tests, t-tests, chi-square tests (with df between 1 and 6 only, to exclude SEM model-fit tests), z-tests, and 95% confidence intervals of odds ratios and regression coefficients.

Chi-square tests with df > 6 were excluded because these typically come from structural equation modeling, where rejecting the null indicates poor model fit rather than a substantive finding. Confidence intervals were excluded when reported alongside test statistics to avoid double-counting. Meta-analysis articles were excluded entirely.

The extraction code was designed to handle various notation formats across journals and was iteratively refined. However, the authors acknowledge that the automated process cannot extract statistics from tables or figures, and cannot distinguish between focal and non-focal hypothesis tests.
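To give a flavor of what such extraction code looks like, here is a minimal sketch using the pdftools package; the regular expression handles only one notation format for t-tests and is purely illustrative, not the authors' code:

    library(pdftools)

    # Extract t-tests reported as "t(df) = value" from one article's PDF and
    # convert them to absolute z-scores via the two-tailed p-value.
    extract_t_tests <- function(pdf_path) {
      text <- paste(pdf_text(pdf_path), collapse = " ")
      m    <- gregexpr("t\\s*\\((\\d+)\\)\\s*=\\s*-?\\d+\\.?\\d*", text)
      hits <- regmatches(text, m)[[1]]
      df   <- as.numeric(sub("t\\s*\\((\\d+)\\).*", "\\1", hits))
      tval <- as.numeric(sub(".*=\\s*", "", hits))
      p    <- 2 * pt(abs(tval), df, lower.tail = FALSE)
      data.frame(df = df, t = tval, z = qnorm(1 - p / 2))
    }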

After exclusions (including test statistics with N < 30, since t-to-z conversion is unreliable at very low df), the final samples were 30,513 z-scores from 1,902 C&E articles and 35,457 z-scores from 1,953 Emotion articles. The majority were F-tests (62% C&E, 53% Emotion) and t-tests (26% C&E, 28% Emotion).

3.2 Statistical Analysis — The Clustering Approach

This is a critical methodological detail. The authors used the zcurve_clustered function with the “b” method. This method works by sampling a single test statistic from each article during model fitting, thereby addressing within-article dependence. This directly addresses concerns about independence violations that arise when multiple test statistics are extracted from the same paper.

The EM algorithm was applied to significant z-values between 1.96 and 6 (values above 6 are treated as having essentially 100% power). The fitted mixture model uses seven discrete components (z = 0 through 6), and the estimated weights are used to compute EDR and ERR. The model then extrapolates the full distribution to estimate what the non-significant portion would look like without selection.
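A hedged sketch of this analysis step; zcurve_clustered and method = "b" are named in the paper, while the data-preparation call and remaining argument names are assumptions based on the zcurve package documentation:

    library(zcurve)

    # stat_strings: reported tests as text, e.g. "t(28) = 2.31" (placeholder)
    # article_id:   one identifier per article, marking dependent tests
    dat <- zcurve_data(data = stat_strings, id = article_id)
    fit <- zcurve_clustered(dat, method = "b", bootstrap = 500)
    summary(fit)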

3.3 Time Trend Analysis

Annual z-curve estimates were computed for each publication year and regressed on linear and quadratic predictors of year. The quadratic term tested whether improvements accelerated after 2011 (when the replication crisis became prominent).
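In R, such a trend analysis might look like the following sketch (assumed data frame `annual` with one row per publication year and a column `edr`; not the authors' script):

    annual$year_c <- annual$year - 2011            # center at the crisis onset
    trend <- lm(edr ~ year_c + I(year_c^2), data = annual)
    summary(trend)                                 # quadratic term: acceleration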

3.4 Hand-Coded Focal Tests

To address the limitation that automatic extraction conflates focal and non-focal tests, the authors also present results from 241 hand-coded articles from 2010 and 2020, drawn from an ongoing project covering 30+ journals and 4,000+ studies (Schimmack, 2020). This sample contained 227 significant tests out of 241 total.


4. Results

4.1 Main Z-Curve Estimates

The two journals produced remarkably similar results:

Parameter | Cognition & Emotion | Emotion
ODR | 71% [70%, 71%] | 70% [70%, 70%]
EDR | 30% [14%, 53%] | 31% [15%, 53%]
ERR | 66% [59%, 73%] | 65% [59%, 71%]
FDR | 12% [5%, 32%] | 12% [5%, 30%]

The ODR-EDR gap (approximately 40 percentage points) provides clear evidence of selection bias in both journals, confirmed visually by a sharp drop in observed z-scores just below the significance threshold of 1.96.

The ERR of approximately 65% suggests that the majority of published significant results should replicate with the same sample size, though the authors stress this is an optimistic estimate. The FDR point estimate of 12% is comparable to medical clinical trial journals (14% per Schimmack & Bartoš, 2023) and substantially lower than the most pessimistic predictions (Ioannidis, 2005). However, the upper bound of the FDR confidence interval (~30%) is high enough to warrant concern.

4.2 Time Trends

Sample sizes (degrees of freedom): Both journals showed significant linear increases over time, with some acceleration (significant quadratic trends). Median within-group df increased from roughly 50 in the early years to over 100 in recent years for Emotion, and showed a particularly sharp increase in C&E’s most recent years.

ODR: Both journals showed significant linear decreases in ODR over time (approximately 0.45 percentage points per year), suggesting that non-significant results are being reported more frequently. However, the quadratic terms were non-significant, meaning this trend preceded the replication crisis rather than being a response to it.

EDR: Both journals showed significant increases in EDR over time, consistent with increasing sample sizes leading to higher power. The combination of decreasing ODR and increasing EDR indicates that selection bias has diminished, though it remains present.

ERR: Increased over time for both journals, with C&E showing a significant acceleration (quadratic trend) suggesting the replication crisis may have prompted improvements.

FDR: Decreased over time as a direct consequence of the increasing EDR.

4.3 Hand-Coded Focal Test Results

The 241 hand-coded focal tests from 2010 and 2020 yielded:

Parameter | Estimate | 95% CI
ODR | 94% | [91%, 97%]
EDR | 27% | [10%, 67%]
ERR | 65% | [53%, 75%]
FDR | 14% | [3%, 50%]

The ODR for focal tests (94%) is substantially higher than the 70–71% from automatic extraction, confirming that automatic extraction captures many non-focal, non-significant tests that dilute the ODR. However, the EDR, ERR, and FDR estimates are comparable to the automatically extracted results and fall within their confidence intervals. This is an important robustness check: the key z-curve parameters are not substantially altered by the inclusion of non-focal tests.

4.4 Alpha Adjustment Analysis

The authors examined the effect of lowering the significance threshold on discovery rates and false positive risk. Lowering alpha from .05 to .01 retains approximately half of all significant results while reducing FDR to below 5% for most publication years. Further reductions to .005 or .001 have diminishing returns for FDR reduction but increasingly sacrifice power.
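Soric's bound makes the trade-off explicit: the maximum FDR scales with alpha / (1 - alpha), so a stricter threshold lowers the bound even though the discovery rate at that threshold also drops. A small R illustration with hypothetical discovery rates (not numbers from the paper):

    soric_fdr <- function(dr, alpha) (1 / dr - 1) * (alpha / (1 - alpha))

    soric_fdr(dr = .30, alpha = .05)   # ~ .12 at the conventional threshold
    # If half of the significant results survive alpha = .01, the discovery
    # rate at the stricter threshold is .15 (a hypothetical value):
    soric_fdr(dr = .15, alpha = .01)   # ~ .06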


5. Discussion and Interpretation

The authors frame their results as relatively encouraging for emotion research compared to worst-case scenarios. Key interpretive points:

The FDR of approximately 12% (though with wide CIs) suggests that most published significant results in emotion journals are not false positives. However, the upper bound of the CI leaves open the possibility of rates up to 30%.

The ERR of 65% predicts that most significant results should replicate with the same sample size, but this is optimistic. Adjusting for the estimated FDR, power for true effects may be approximately 72%, close to the conventional 80% benchmark but with substantial heterogeneity — half of studies have less power than this average.

The authors recommend treating results with p-values between .05 and .01 with skepticism, and suggest that alpha = .01 provides a better balance between false positive risk and power loss for the emotion literature specifically. They emphasize this recommendation is for evaluating existing literature, not as a new publication standard.

On effect sizes, the authors warn that selection bias inflates point estimates, making even meta-analytic effect sizes unreliable unless bias correction is applied. They advocate for honest reporting of all results, including non-significant ones, as essential for accurate meta-analysis.


6. Limitations Acknowledged by the Authors

The authors explicitly discuss several limitations:

  1. Z-curve’s selection model assumes that publication probability is a function of power. In reality, questionable research practices (QRPs) can produce significance without real effects, potentially inflating EDR estimates and underestimating selection bias.
  2. Simulation studies of z-curve performance under QRP-generated data are lacking.
  3. The N > 30 exclusion removes some studies, though supplementary analyses with the full sample show similar results.
  4. Automated extraction cannot distinguish focal from non-focal tests (addressed by the hand-coded analysis).
  5. The automated extraction cannot reliably capture statistics from tables or figures.

7. Key Methodological Features Relevant to the Pek et al. Debate

Several aspects of this paper are directly relevant to criticisms raised by Pek et al.:

Independence assumption: Soto & Schimmack explicitly used zcurve_clustered with the “b” method, which samples one test statistic per article during bootstrapping. This directly addresses the concern about within-article dependence. The method section states this clearly.

Focal vs. non-focal tests: The paper includes both automatic extraction (all tests) and hand-coded focal tests, and shows that the z-curve parameters (EDR, ERR, FDR) are comparable across both approaches. This addresses the concern that including non-focal tests distorts results.

Appropriate caveats: The authors consistently describe ERR as optimistic, characterize the true replication rate as lying between EDR and ERR, acknowledge the wide confidence intervals on EDR and FDR, and explicitly discuss the limitations of the selection model assumption.

Asymmetric interpretation: The paper notes that z-curve evaluations of credibility are asymmetric — low values raise concerns about a literature, but high values do not guarantee credibility.


8. Summary Table of All Z-Curve Estimates

Analysis | N tests | N sig | ODR | EDR [95% CI] | ERR [95% CI] | FDR [95% CI]
C&E (auto) | 30,513 | 21,628 | 71% | 30% [14%, 53%] | 66% [59%, 73%] | 12% [5%, 32%]
Emotion (auto) | 35,457 | 24,824 | 70% | 31% [15%, 53%] | 65% [59%, 71%] | 12% [5%, 30%]
Focal (hand-coded) | 241 | 227 | 94% | 27% [10%, 67%] | 65% [53%, 75%] | 14% [3%, 50%]

Summary prepared for analytical discussion purposes. All descriptions reflect the summarizer’s interpretation of the original work. For exact language, figures, and supplementary analyses, consult the published article.

How to Respond to Criticisms of Z-Curve in Peer-Review

The pattern is now familiar. I received another anonymous review by Reviewer 2 of a z-curve article; the review repeated Pek's concerns about the performance of z-curve. To deal with biased reviewers, journals allow authors to name potentially biased reviewers. I suggest doing so for Pek. I also suggest sharing manuscripts with me to ensure proper interpretation of results and to make them "reviewer-safe."

To justify the claim that Pek is biased, researchers can use this rebuttal of Pek’s unscientific claims about z-curve.

Reviewer 2

Reviewer Report:

The manuscript “A review and z-curve analysis of research on the palliative association of system justification” (Manuscript ID 1598066) extends the work of Sotola and Credé (2022), who used Z-curve analysis to evaluate the evidential value of findings related to system justification theory (SJT). The present paper similarly reports estimates of publication bias, questionable research practices (QRPs), and replication rates in the SJT literature using Z-curve. Evaluating how scientific evidence accumulates in the published literature is unquestionably important.

However, there is growing concern about the performance of meta-analytic forensic tools such as p-curve (Simonsohn, Nelson, & Simmons, 2014; see Morey & Davis-Stober, 2025 for a critique) and Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022; see Pek et al., in press for a critique). Independent simulation studies increasingly suggest that these methods may perform poorly under realistic conditions, potentially yielding misleading results.

Justification for a theory or method typically requires subjecting it to a severe test (Mayo, 2019) – that is, assuming the opposite of what one seeks to establish (e.g., a null hypothesis of no effect) and demonstrating that this assumption leads to contradiction. In contrast, the simulation work used to support Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) relies on affirming belief through confirmation, a well-documented cognitive bias.

Findings from Pek et al. (in press) show that when selection bias is present in published p-values (the very scenario to which Z-curve is intended to be applied), estimates of the expected discovery rate (EDR), expected replication rate (ERR), and Sorić's False Discovery Risk (FDR) are themselves biased.

The magnitude and direction of this bias depend on multiple factors (e.g., number of p-values, selection mechanism of p-values) and cannot be corrected or detected from empirical data alone. The manuscript’s main contribution rests on the assumption that Z-curve yields reasonable estimates of the “reliability of published studies,” operationalized as a high ERR, and that the difference between the observed discovery rate (ODR) and EDR quantifies the extent of QRPs and publication bias.

The paper reports an ERR of .76, 95% CI [.53, .91] and concludes that research on the palliative hypothesis may be more reliable than findings in many other areas of psychology. There are several issues with this claim. First, the assertion that Sotola (2023) validated ERR estimates from the Z-curve reflects confirmation bias – I have not read Röseler (2023) and cannot comment on the argument made in it. The argument rests solely on the descriptive similarity between the ERR produced by Z-curve and the replication rate reported by the Open Science Collaboration (2015). However, no formal test of equivalence was conducted, and no consideration was given to estimate imprecision, potential bias in the estimates, or the conditions under which such agreement might occur by chance.

At minimum, if Z-curve estimates are treated as predicted values, some form of cross-validation or prediction interval should be used to quantify prediction uncertainty. More broadly, because ERR estimates produced by Z-curve are themselves likely biased (as shown in Pek et al., in press), and because the magnitude and direction of this bias are unknown, comparisons about ERR values across literatures do not provide a strong evidential basis for claims about the relative reliability of research areas.

Furthermore, the width of the 95% CI spans roughly half of the bounded parameter space of [0, 1], indicating substantial imprecision. Any claims based on these estimates should thus be contextualized with appropriate caution.

Another key result concerns the comparison of EDR = .52, 95% CI [.14, .92], and ODR = .81, 95% CI [.69, .90]. The manuscript states that "When these two estimates are highly discrepant, this is consistent with the presence of questionable research practices (QRPs) and publication bias in this area of research (Brunner & Schimmack, 2020).

But in this case, the 95% CIs for the EDR and ODR in this work overlapped quite a bit, meaning that they may not be significantly different…” (p. 22). There are several issues with such a claim. First, Z curve results cannot directly support claims about the presence of QRPs.

The EDR reflects the proportion of significant p values expected under no selection bias, but it does not identify the source of selection bias (e.g., QRPs, fraud, editorial decisions). Using Z curve requires accepting its assumed missing data mechanism—a strong assumption that cannot be empirically validated.

Second, a descriptive comparison between two estimates cannot be interpreted as a formal test of difference (e.g., eyeballing two estimates of means as different does not tell us whether this difference is not driven by sampling variability). Means can be significantly different even if their confidence intervals overlap (Cumming & Finch, 2005).

A formal test of the difference is required. Third, EDR estimates can be biased. Even under ideal conditions, convergence to the population values requires extremely large numbers of studies (e.g., > 3000, see Figure 1 of Pek et al., in press).

The current study only has 64 tests. Thus, even if a formal test of the difference of ODR – EDR was conducted, little confidence could be placed on the result if the EDR estimate is biased and does not reflect the true population value.

Although I am critical of the outputs of Z curve analysis due to its poor statistical performance under realistic conditions, the manuscript has several strengths. These include adherence to good meta-analytic practices such as providing a PRISMA flow chart, clearly stating inclusion and exclusion criteria, and verifying the calculation of p-values. These aspects could be further strengthened by reporting test–retest reliability (given that a single author coded all studies) and by explicitly defining the population of selected p-values. Because there appears to be heterogeneity in the results, a random-effects meta-analysis may be appropriate, and study-level variables (e.g., type of hypothesis or analysis) could be used to explain between-study variability. Additionally, the independence of p-values has not been clearly addressed; p-values may be correlated within articles or across studies. Minor points: The "reliability" of studies should be explicitly defined. The work by Manapat et al. (2022) should be cited in relation to Nagy et al. (2025). The findings of Simmons et al. (2011) apply only to single studies.

However, most research is published in multi-study sets, and follow-up simulations by Wegener et al. (2024) indicate that the Type I error rate is well-controlled when methodological constraints (e.g., same test, same design, same measures) are applied consistently across multiple studies – thus, the concerns of Simmons et al. (2011) pertain to a very small number of published results.

I could not find the reference to Schimmack and Brunner (2023) cited on p. 17.


Rebuttal to Core Claims in Recent Critiques of z-Curve

1. Claim: z-curve “performs poorly under realistic conditions”

Rebuttal

The claim that z-curve “performs poorly under realistic conditions” is not supported by the full body of available evidence. While recent critiques demonstrate that z-curve estimates—particularly EDR—can be biased under specific data-generating and selection mechanisms, these findings do not justify a general conclusion of poor performance.

Z-curve has been evaluated in extensive simulation studies that examined a wide range of empirically plausible scenarios, including heterogeneous power distributions, mixtures of low- and high-powered studies, varying false-positive rates, different degrees of selection for significance, and multiple shapes of observed z-value distributions (e.g., unimodal, right-skewed, and multimodal distributions). These simulations explicitly included sample sizes as low as k ≈ 100, which is typical for applied meta-research in psychology.

Across these conditions, z-curve demonstrated reasonable statistical properties conditional on its assumptions, including interpretable ERR and EDR estimates and confidence intervals with acceptable coverage in most realistic regimes. Importantly, these studies also identified conditions under which estimation becomes less informative—such as when the observed z-value distribution provides little information about missing nonsignificant results—thereby documenting diagnosable scope limits rather than undifferentiated poor performance.

Recent critiques rely primarily on selective adversarial scenarios and extrapolate from these to broad claims about “realistic conditions,” while not engaging with the earlier simulation literature that systematically evaluated z-curve across a much broader parameter space. A balanced scientific assessment therefore supports a more limited conclusion: z-curve has identifiable limitations and scope conditions, but existing simulation evidence does not support the claim that it generally performs poorly under realistic conditions.


2. Claim: Bias in EDR or ERR renders these estimates uninterpretable or misleading

Rebuttal

The critique conflates the possibility of bias with a lack of inferential value. All methods used to evaluate published literatures under selection—including effect-size meta-analysis, selection models, and Bayesian hierarchical approaches—are biased under some violations of their assumptions. The existence of bias therefore does not imply that an estimator is uninformative.

Z-curve explicitly reports uncertainty through bootstrap confidence intervals, which quantify sampling variability and model uncertainty given the observed data. No evidence is presented that z-curve confidence intervals systematically fail to achieve nominal coverage under conditions relevant to applied analyses. The appropriate conclusion is that z-curve estimates must be interpreted conditionally and cautiously, not that they lack statistical meaning.


3. Claim: Reliable EDR estimation requires “extremely large” numbers of studies (e.g., >3000)

Rebuttal

This claim overgeneralizes results from specific, highly constrained simulation scenarios. The cited sample sizes correspond to conditions in which the observed data provide little identifying information, not to a general requirement for statistical validity.

In applied statistics, consistency in the limit does not imply that estimates at smaller sample sizes are meaningless; it implies that uncertainty must be acknowledged. In the present application, this uncertainty is explicitly reflected in wide confidence intervals. Small sample sizes therefore affect precision, not validity, and do not justify dismissing the estimates outright.


4. Claim: Differences between ODR and EDR cannot support inferences about selection or questionable research practices

Rebuttal

It is correct that differences between ODR and EDR do not identify the source of selection (e.g., QRPs, editorial decisions, or other mechanisms). However, the critique goes further by implying that such differences lack diagnostic value altogether.

Under the z-curve framework, ODR–EDR discrepancies are interpreted as evidence of selection, not of specific researcher behaviors. This inference is explicitly conditional and does not rely on attributing intent or mechanism. Rejecting this interpretation would require demonstrating that ODR–EDR differences are uninformative even under monotonic selection on statistical significance, which has not been shown.


5. Claim: ERR comparisons across literatures lack evidential basis because bias direction is unknown

Rebuttal

The critique asserts that because ERR estimates may be biased with unknown direction, comparisons across literatures lack evidential value. This conclusion does not follow.

Bias does not eliminate comparative information unless it is shown to be large, variable, and systematically distorting rankings across plausible conditions. No evidence is provided that ERR estimates reverse ordering across literatures or are less informative than alternative metrics. While comparative claims should be interpreted cautiously, caution does not imply the absence of evidential content.


6. Claim: z-curve validation relies on “affirming belief through confirmation”

Rebuttal

This characterization misrepresents the role of simulation studies in statistical methodology. Simulation-based evaluation of estimators under known data-generating processes is the standard approach for assessing bias, variance, and coverage across frequentist and Bayesian methods alike.

Characterizing simulation-based validation as epistemically deficient would apply equally to conventional meta-analysis, selection models, and hierarchical Bayesian approaches. No alternative validation framework is proposed that would avoid reliance on model-based simulation.


7. Implicit claim: Effect-size meta-analysis provides a firmer basis for credibility assessment

Rebuttal

Effect-size meta-analysis addresses a different inferential target. It presupposes that studies estimate commensurable effects of a common hypothesis. In heterogeneous literatures, pooled effect sizes represent averages over substantively distinct estimands and may lack clear interpretation.

Moreover, effect-size meta-analysis does not estimate discovery rates, replication probabilities, or false-positive risk, nor does it model selection unless explicitly extended. No evidence is provided that effect-size meta-analysis offers superior performance for evaluating evidential credibility under selective reporting.


Summary

The critiques correctly identify that z-curve is a model-based method with assumptions and scope conditions. However, they systematically extend these points beyond what the evidence supports by:

  • extrapolating from selective adversarial simulations,
  • conflating potential bias with lack of inferential value,
  • overgeneralizing small-sample limitations,
  • and applying asymmetrical standards relative to conventional methods.

A scientifically justified conclusion is that z-curve provides conditionally informative estimates with quantifiable uncertainty, not that it lacks statistical validity or evidential relevance.


Reply to Erik van Zwet: Z-Curve Only Works on Earth

In the 17th century, early telescopic observations of Mars suggested that the planet might be populated. Now imagine a study that aims to examine whether Martians are taller than humans. The problem is obvious: although we may assume that Martians exist, we cannot observe or measure them, and therefore we end up with zero observations of Martian height. Would we blame the t-test for not telling us what we want to know? I hope your answer to this rhetorical question is “No, of course not.”

If you pass this sanity check, the rest of this post should be easy to follow. It responds to criticism by Erik van Zwet (EvZ), hosted and endorsed by Andrew Gelman on his blog: "Concerns about the z-curve method."

EvZ imagines a scenario in which z-curve is applied to data generated by two distinct lines of research. One lab conducts studies that test only true null hypotheses. While exact effect sizes of zero may be rare in practice, attempting to detect extremely small effects in small samples is, for all practical purposes, equivalent. A well-known example comes from early molecular genetic research that attempted to link variation in single genes—such as the serotonin transporter gene—to complex phenotypes like Neuroticism. It is now well established that these candidate-gene studies produced primarily false positive results when evaluated with the conventional significance threshold of α = .05.

In response, molecular genetics fundamentally changed its approach. Researchers began testing many genetic variants simultaneously and adopted much more stringent significance thresholds to control the multiple-comparison problem. In the simplified example used here, I assume α = .001, implying an expected false positive rate of only 1 in 1,000 tests. I further assume that truly associated genetic predictors—single nucleotide polymorphisms (SNPs)—are tested in very large samples, such that sampling error is small and true effects yield z-values around 6. This is, of course, a stylized assumption, but it serves to illustrate the logic of the critique.

Figure 1 illustrates a situation with 1,000 studies from each of these two research traditions. Among the 1,000 candidate-gene studies, only one significant result is expected by chance. Among the genome-wide association studies (GWAS), power to reject the null hypothesis at α = .001 is close to 1, although a small number (3–4 out of 1,000) of studies may still fail to reach significance.
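For readers who want to check these numbers, here is a minimal R sketch of the two research traditions under the stated assumptions (true z-values of 0 for candidate-gene studies, a noncentrality of 6 for GWAS, α = .001). The simulation is illustrative and not based on any real dataset.

```r
set.seed(1)
crit <- qnorm(1 - .001 / 2)            # two-sided alpha = .001 -> z ~ 3.29
z_null <- abs(rnorm(1000, mean = 0))   # 1,000 candidate-gene studies (true nulls)
z_gwas <- abs(rnorm(1000, mean = 6))   # 1,000 GWAS with true z-values around 6
sum(z_null > crit)                     # ~1 false positive expected
sum(z_gwas > crit)                     # ~997 true positives (3-4 misses) expected
```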

At this point, it is essential to distinguish between two scenarios. In the first scenario, all 999 non-significant results are observed and available for analysis. If we could recover the full distribution of results—including non-significant ones—we could fit models to the complete set of z-values. Z-curve can, in principle, be applied to such data, but it was not designed for this purpose.

Z-curve was developed for the second scenario. In this scenario, the light-purple, non-significant results exist only in researchers’ file drawers and are not part of the observed record. This situation—selection for statistical significance—is commonly referred to as publication bias. In psychology, success rates above 90% strongly suggest that statistical significance is a necessary condition for publication (Sterling, 1959). Under such selection, non-significant results provide no observable information, and only significant results remain. In extreme cases, it is theoretically possible that all published significant findings are false positives (Rosenthal, 1979), and in some literatures—such as candidate-gene research or social priming—this possibility is not merely theoretical.

Z-curve addresses uncertainty about the credibility of published significant results by explicitly conditioning on selection for significance and modeling only those results. When success rates approach 90% or higher, there is often no alternative: non-significant results are simply unavailable.

In Figure 1, the light-purple bars represent non-significant results that exist only in file drawers. Z-curve is fitted exclusively to the dark-purple, significant results. Based on these data, the fitted model (red curve), which is centered near the true value of z = 6, correctly infers that the average true power of the studies contributing to the significant results is approximately 99% when α = .001 (corresponding to a critical value of z ≈ 3.3).

Z-curve also estimates the Expected Discovery Rate (EDR). Importantly, the EDR refers to the average power of all studies that were conducted in the process of producing the observed significant results. This conditioning is crucial. Z-curve does not attempt to estimate the total number of studies ever conducted, nor does it attempt to account for studies from populations that could not have produced the observed significant findings. In this example, candidate-gene studies that produced non-significant results—whether published or not—are irrelevant because they did not contribute to the set of significant GWAS results under analysis.

What matters instead is how many GWAS studies failed to reach significance and therefore remain unobserved. Given the assumed power, this number is at most 3–4 out of 1,000 (<1%). Consequently, an EDR estimate of 99% is correct and indicates that publication bias within the relevant population of studies is trivial. Because the false discovery rate is derived from the EDR, the implied false positive risk is effectively zero—again, correctly so for this population.
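The derivation uses Sorić’s (1989) upper bound on the false discovery rate, which z-curve computes from the EDR and α (Bartoš & Schimmack, 2022). A one-line check with the values from this example:

```r
edr <- .99; alpha <- .001
(1 / edr - 1) * alpha / (1 - alpha)  # Soric's maximum FDR: ~1e-05, effectively zero
```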

EvZ’s criticism of z-curve is therefore based on a misunderstanding of the method’s purpose and estimand. He evaluates z-curve against a target that includes large numbers of studies that leave no trace in the observed record and have no influence on the distribution of significant results being analyzed. But no method that conditions on observed significant results can recover information about such studies—nor should it be expected to.

Z-curve is concerned exclusively with the credibility of published significant results. Non-significant studies that originate from populations that do not contribute to those results are as irrelevant to this task as the height of Martians.


Response to van Zwet’s Concerns about Z-Curve


Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, Article e0000130. https://doi.org/10.15626/MP.2022.2981

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology. https://doi.org/10.15626/MP.2018.874

van Zwet, E., Gelman, A., Greenland, S., Imbens, G., Schwab, S., & Goodman, S. N. (2024). A new look at p values for randomized clinical trials. NEJM Evidence, 3(1), EVIDoa2300003. https://doi.org/10.1056/EVIDoa2300003

The Story of Two Z-Curve Models

Erik van Zwet recently posted a critique of the z-curve method on Andrew Gelman’s blog.

Concerns about the z-curve method | Statistical Modeling, Causal Inference, and Social Science

Meaningful discussion of the severity and scope of this critique was difficult in that forum, so I address the issue more carefully here.

van Zwet identified a situation in which z-curve can overestimate the Expected Discovery Rate (EDR) when it is inferred from the distribution of statistically significant z-values. Specifically, when the distribution of significant results is driven primarily by studies with high power, the observed distribution contains little information about the distribution of non-significant results. If those non-significant results are not reported and z-curve is nevertheless used to infer them from the significant results alone, the method can underestimate the number of missing non-significant studies and, as a consequence, overestimate the EDR.

This is a genuine limitation, but it is a conditional and diagnosable one. Crucially, the problematic scenarios are directly observable in the data. Problematic data have an increasing or flat slope of the significant z-value distribution and a mode well above the significance threshold. In such cases, z-curve does not silently fail; it signals that inference about missing studies is weak and that EDR estimates should not be trusted.
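As a rough illustration of such a check (a heuristic for illustration only, not a function of the zcurve package), one can compare the share of significant z-values just above the threshold with the share in the next band:

```r
# Heuristic slope check on a vector of significant z-values (illustrative only).
slope_check <- function(z, crit = qnorm(.975)) {
  z <- z[z > crit]
  near <- mean(z < crit + 0.5)                  # share just above the threshold
  far  <- mean(z >= crit + 0.5 & z < crit + 1)  # share in the next band
  c(near = near, far = far)  # near > far suggests a decreasing slope
}
```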

This is rarely a problem in psychology, where most studies have low power, the mode is at the significance criterion, and the slope decreases, often steeply. This pattern implies a large set of non-significant results, and z-curve provides good estimates in these scenarios. Estimating distributions of unobserved data is inherently difficult, which leads to wide confidence intervals around these estimates, but there is no fixed number of studies that is needed. The relevant question is whether the confidence intervals are informative enough to support meaningful conclusions.

One of the most powerful sets of studies that I have seen comes from epidemiology, where studies often have large samples that estimate effect sizes precisely. In these studies, rejecting the null hypothesis is rarely the main goal, but the data serve as a good example of a set of studies with high power, rather than the low power typical of psychology.

However, even this example shows a decreasing slope and a mode at significance criterion. Fitting z-curve to these data still suggests some selection bias and no underestimation of reported non-significant results. This illustrates how extreme van Zwet’s scenario must be to produce the increasing-slope pattern that undermines EDR estimation.

What about van Zwet’s Z-Curve Method?

It is also noteworthy that van Zwet does not compare our z-curve method (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020) to his own z-curve method that was used to analyze z-values from clinical trials (van Zwet et al., 2024).

The article fits a model to the distribution of absolute z-values (ignoring whether results show a benefit or harm to patients). The key differences between the two approaches are (a) that van Zwet et al.’s model uses all z-values and assumes (implicitly) that there is no selection bias, and (b) that true effect sizes are never zero, so errors can only be sign errors. Based on these assumptions, the article concludes that no more than 2% of clinical trials produce a result that falsely rejects a true hypothesis. For example, a statistically significant result could be treated as an error only if the true effect has the opposite sign (e.g., the true effect increases smoking, but a significant result is used to claim that it reduces smoking).

The advantage of this method is that it does not need to estimate the EDR from the distribution of only significant results, but it achieves this only by assuming that publication bias does not exist. In that case, we can simply count the observed non-significant and significant results and use the observed discovery rate to estimate average power and the false positive risk.
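As a minimal illustration with hypothetical p-values (assuming the complete record is observed):

```r
p <- c(.24, .03, .001, .31, .02, .08, .004)  # hypothetical complete set of results
mean(p < .05)                                # observed discovery rate = average power
```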

The trade-off is clear. z-curve attempts to address selection bias and sometimes lacks sufficient information to do so reliably; van Zwet’s approach achieves stable estimates by assuming the problem away. The former risks imprecision when information is weak; the latter risks bias when its core assumption is violated.

In the example from epidemiology, there is evidence of some publication bias and omission of non-significant results. Using van Zwet’s model would be inappropriate because it would overestimate the true discovery rate. The exclusive focus on sign errors is also questionable and should be clearly stated as a strong assumption. It implies that significant results in the right direction are not errors, even if effect sizes are close to zero. For example, a significant result suggesting that a treatment extends life is considered a true finding, even if the effect is a single day.

False positive rates do not fully solve this problem, but false positive rates that include zero as a hypothetical value for the population effect size are higher, and they treat small effects close to zero as errors rather than counting half of them as correct rejections of the null hypothesis. For example, an intervention that decreases smoking by 1% of all smokers is not meaningfully different from one that increases it by 1%, but a focus on signs treats only the latter as an error.

In short, van Zwet’s critique identifies a boundary condition for z-curve, not a general failure. At the same time, his own method rests on a stronger and untested assumption—no selection bias—whose violation would invalidate its conclusions entirely. No method is perfect and using a single scenario to imply that a method is always wrong is not a valid argument against any method. By the same logic, van Zwet’s own method could be declared “useless” whenever selection bias exists, which is precisely the point: all methods have scope conditions.

Using proper logic, we suggest that all methods work when their assumptions are met; the main task is to test whether they are. We clarified that z-curve estimation of the EDR assumes that enough low-powered studies produced significant results to influence the distribution of significant results. If the slope of the significant results is not decreasing, this assumption does not hold and z-curve should not be used to estimate the EDR. Similarly, users of van Zwet’s method should first test whether selection bias is present and not use the method when it is. They should also consider whether a proportion of studies could have tested practically true null hypotheses and avoid the method when this is a concern.

Finally, the blog post responds to Gelman’s polemic about our z-curve method and earlier work by Jager and Leek (2014) by noting that Gelman’s critiques of other methods exist in parallel to his own work (at least as co-author) that also modeled the distribution of z-values to make claims about power and the risk of false inferences. The assumption of this model that selection bias does not exist is peculiar, given Gelman’s typical writing about low power and the negative effects of selection for significance. A more constructive discussion would apply the same critical standards to all methods, including one’s own.


Frequently Asked Questions about Z-Curve

under development

Can’t find what you are looking for?
1. Ask an AI to search replicationindex.com to find answers that are not here.
2. Send me an email and I will answer your question and add it to the FAQ list.

Link to the tutorial:
🔗 Z-Curve 3.0 Tutorial (Introduction and links to all chapters): https://replicationindex.com/2025/07/08/z-curve-tutorial-introduction/

1. Has z-curve been tested with simulation studies?

Yes. The original z-curve method was evaluated in large-scale simulation studies by Jerry Brunner. The extended z-curve 2.0 was tested by Frantisek Bartoš. The results of these simulation studies were part of the manuscripts submitted for publication. They have also been reproduced by independent researchers. The results and the code to reproduce or run new analyses are shared on the Open Science Framework.

After concerns were raised about performance with small sets of studies, new simulation studies showed good performance with just 50 significant results (see Concerns About Z-Curve: Evidence From New Simulations With Few Studies).

It is important to distinguish between estimation of the Expected Replication Rate (ERR) and the Expected Discovery Rate (EDR). The ERR estimates the average true power of the significant results. It is estimated with high accuracy even with fewer than 50 studies. The EDR predicts the distribution of non-significant results that were not observed. This prediction is more difficult and requires more information from the significant results. With small samples, confidence intervals are naturally wide, but their width is data-dependent and informative in itself.

In short, ERR estimates can be obtained even with small sets of significant results. They are also preferable to p-curve’s estimates of average power, which degrade when studies vary in power. EDR estimates are trustworthy with more than 50 significant results.

2. Does z-curve fail when power is homogeneous and falls between grid points?

The discrete mixture model uses fixed components at noncentrality parameters 0 through 6. When all studies share a single noncentrality that falls between two grid points — for example, NCP = 1.5 — the model must approximate a point mass using weights on flanking components. Van Zwet (2026) showed that this maximizes approximation error and can bias the EDR estimate.

Z-curve 3.0 addresses this directly. When low heterogeneity is detected, the algorithm fits a single-component model with a free noncentrality parameter and adds the estimated NCP as an additional component. The EM algorithm then places weight on this data-driven component rather than splitting weight awkwardly across the fixed grid. In simulations with NCP = 1.5, the added component recovers the correct location and the bias disappears.

When power varies across studies — as in any real meta-analysis of conceptual replications — the discrete model performs well without this correction. The mixture weights approximate the underlying distribution of power much like a histogram approximates a smooth density. Simulation studies confirm that the discrete model outperforms a parametric normal model under high heterogeneity, because it makes no assumption about the shape of the power distribution. In this sense, the discrete model is better understood as a distribution-free (nonparametric) estimator, and its flexibility is an advantage rather than a limitation.
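A small simulation illustrates why flanking grid points are an imperfect substitute for a point mass; the 50/50 mixture below is a deliberate simplification of what an EM algorithm would fit, chosen only to match the mean noncentrality of 1.5:

```r
set.seed(1)
crit <- qnorm(.975)                                           # two-sided alpha = .05
z_point <- abs(rnorm(1e6, mean = 1.5))                        # true model: NCP = 1.5
z_mix   <- abs(rnorm(1e6, mean = sample(c(1, 2), 1e6, TRUE))) # flanking grid points
mean(z_point > crit)   # ~.32: true discovery rate
mean(z_mix > crit)     # ~.34: same mean NCP, but the mixture overshoots
```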

3. Does z-curve offer options for small sample (small-N) literatures like animal research?

Short answer:
Yes — z-curve 3.0 adds new transformation methods and a t-curve option that make the method more appropriate for analyses involving small samples (e.g., N < 30). These options are designed to mitigate biases that arise when small-sample test statistics are converted to z-scores using standard normal approximations. Z-curve 3.0 also allows researchers to use t-distributions (t-curve) with a fixed df, which resemble the distributions of test statistics from small samples more closely than the standard normal distribution.

Details:

  • The z-curve 3.0 tutorial (Chapter 8) explains that instead of only converting p-values to z-scores, you can now:
    • Try alternative transformations of t-values that better reflect their sampling distribution, and
    • Use a direct t-curve model that fits t-distributions with specified degrees of freedom instead of forcing a normal approximation. This “t-curve” option is recommended when studies have similar and genuinely small degrees of freedom (like many animal experiments).
  • These improvements help reduce bias introduced by naïve normal transformations (see the sketch after this list), though they don’t completely eliminate all small-sample challenges, and performance can still be unstable when degrees of freedom vary widely or are extremely small.
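The following sketch shows the p-to-z conversion and why small-N t-values need care: treating a t-value as if it were a z-value overstates the evidence, whereas converting through the p-value respects the t-distribution’s heavier tails (numbers are illustrative):

```r
t <- 3.0; df <- 10
p <- 2 * pt(-abs(t), df)   # two-sided p-value from the t-distribution
z <- qnorm(1 - p / 2)      # convert p to an absolute z-value
c(naive = t, converted = round(z, 2))  # 3.0 vs ~2.5
```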


Z-Curve.3.0 Tutorial: Introduction

Links to Additional Resources and Answers to Frequently Asked Questions

Chapters

This post is Chapter 1. The R-code for this chapter can be found on my github:
zcurve3.0/Tutorial.R.Script.Chapter1.R at main · UlrichSchimmack/zcurve3.0
(the picture for this post shows a “finger-plot”; you can make your own with the code)

Chapter 2 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the original studies.
zcurve3.0/Tutorial.R.Script.Chapter2.R at main · UlrichSchimmack/zcurve3.0

Chapter 3 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the replication studies.
zcurve3.0/Tutorial.R.Script.Chapter3.R at main · UlrichSchimmack/zcurve3.0

Chapter 4 shows how you can run simulation studies to evaluate the performance of z-curve for yourself.
zcurve3.0/Tutorial.R.Script.Chapter4.R at main · UlrichSchimmack/zcurve3.0

Chapter 5 uses the simulation from Chapter 4 to compare the performance of z-curve with p-curve, another method that estimates the average power of only the significant results, which is the quantity z-curve uses to estimate the expected replication rate.
zcurve3.0/Tutorial.R.Script.Chapter5.R at main · UlrichSchimmack/zcurve3.0

Chapter 6 uses the simulation from Chapter 4 to compare the performance of the default z-curve method with a z-curve that assumes a normal distribution of population effect sizes. The simulation highlights the problem of making distribution assumptions. One of the strengths of z-curve is that it does not make an assumption about the distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter6.R at main · UlrichSchimmack/zcurve3.0

Chapter 7 uses the simulation from Chapter 4 to compare the performance of z-curve to a Bayesian mixture model (bacon). The aim of bacon is different, but it also fits a mixture model to a set of z-values. The simulation results show that z-curve performs better than the Bayesian mixture model.
zcurve3.0/Tutorial.R.Script.Chapter7.R at main · UlrichSchimmack/zcurve3.0

Chapter 8 uses the simulation from Chapter 4 to examine the performance of z-curve with t-values from small studies (N = 30). It introduces a new transformation method that performs better than the default method from z-curve.2.0 and it introduces the t-curve option to analyze t-values from small studies with t-distributions.
zcurve3.0/Tutorial.R.Script.Chapter8.R at main · UlrichSchimmack/zcurve3.0

Chapter 9 simulates p-hacking by combining small samples with favorable trends into a larger sample with a significant result (patchwork samples). The simulation uses between-subject two-group designs with varying means and SDs of effect sizes and sample sizes. It also examines the ability of z-curve to detect p-hacking and compares the performance of the default z-curve, which makes no assumptions about the distribution of power, with a z-curve model that assumes a normal distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter9.R at main · UlrichSchimmack/zcurve3.0


Brief ChatGPT Generated Summary of Key Points

What Is Z-Curve?

Z-curve is a statistical tool used in meta-analysis, especially for large sets of studies (e.g., more than 100). It can also be used with smaller sets (as few as 10 significant results), but the estimates become less precise.

There are several types of meta-analysis:

  • Direct replication: Studies that test the same hypothesis with the same methods.
    Example: Several studies testing whether aspirin lowers blood pressure.
  • Conceptual replication: Studies that test a similar hypothesis using different procedures or measures.
    Example: Different studies exploring how stress affects memory using different tasks and memory measures.

In direct replications, we expect low variability in the true effect sizes. In conceptual replications, variability is higher due to different designs.

Z-curve was primarily developed for a third type of meta-analysis: reviewing many studies that ask different questions but share a common feature—like being published in the same journal or during the same time period. In these cases, estimating an average effect size isn’t very meaningful because effects vary so much. Instead, z-curve focuses on statistical integrity, especially the concept of statistical power.

What Is Statistical Power?

I define statistical power as the probability that a study will produce a statistically significant result (usually p < .05).

To understand this, we need to review null hypothesis significance testing (NHST):

  1. Researchers test a hypothesis (like exercise increasing lifespan) by conducting a study.
  2. They calculate the effect size (e.g., exercise increases the average lifespan by 2 years) and divide it by the standard error to get a test statistic (e.g., a z-score).
  3. Higher test statistics imply a lower probability of obtaining such a result if the null hypothesis (no effect) were true. If this probability is below the conventional criterion of 5%, the finding is interpreted as evidence of an effect.

Power is the probability of obtaining a significant result, p < .05.

Hypothetical vs. Observed Power

Textbooks often describe power in hypothetical terms. For example, before collecting data, a researcher might assume an effect size and calculate how many participants are needed for 80% power.

But z-curve does something different. It estimates the average true power of a set of studies. It is only possible to estimate average true power for sets of studies because power estimates based on a single study are typically too imprecise to be useful. Z-curve provides estimates of the average true power of a set of studies and the uncertainty in these estimates.

Populations of Studies

Brunner and Schimmack (2020) introduced an important distinction:

  • All studies ever conducted (regardless of whether results were published).
  • Only published studies, which are often biased toward significant results.

If we had access to all studies, we could simply calculate power by looking at the proportion of significant results. For example, if 50% of all studies show p < .05, then the average power is 50%.

In reality, we only see a biased sample—mostly significant results that made it into journals. This is called selection bias (or publication bias), and it can mislead us.

What Z-Curve Does

Z-curve helps us correct for this bias by:

  1. Using the p-values from published studies.
  2. Converting them to z-scores (e.g., p = .05 → z ≈ 1.96).
  3. Modeling the distribution of these z-scores to estimate:
    • The power of the studies we see,
    • The likely number of missing studies,
    • And the amount of bias.

Key Terms in Z-Curve

  • ODR (Observed Discovery Rate): % of studies that report significant results
  • EDR (Expected Discovery Rate): estimated % of significant results we’d expect if there were no selection bias
  • ERR (Expected Replication Rate): estimated % of significant studies that would replicate if repeated exactly
  • FDR (False Discovery Rate): estimated % of significant results that are false positives

Understanding the Z-Curve Plot

Figure 1. Histogram of z-scores from 1,984 significant tests. The solid red line shows the model’s estimated distribution of observed z-values. The dashed line shows what we’d expect without selection bias. The Observed Discovery Rate (ODR) is 100%, meaning all studies shown are significant. However, the Expected Discovery Rate (EDR) is only 40%, suggesting many non-significant results were omitted. The Expected Replication Rate (ERR) is also 40%, indicating that only 40% of these significant results would likely replicate. The False Discovery Rate (FDR) is estimated at 8%.

Notice how the histogram spikes just above z = 2 (i.e., just significant) and drops off below. This pattern signals selection for significance, which is unlikely to occur due to chance alone.


Homogeneity vs. Heterogeneity of Power

Sometimes all studies in a set have similar power (called homogeneity). In that case, the power of significant and non-significant studies is similar.

However, z-curve allows for heterogeneity, where studies have different power levels. This flexibility makes it better suited to real-world data than methods that assume all studies are equally powered.

When power varies, high-power studies are more likely to produce significant results. That’s why, under heterogeneity, the ERR (for significant studies) is often higher than the EDR (for all studies).


Summary of Key Concepts

  • Meta-analysis = Statistical summary of multiple studies.
  • Statistical significance = p < .05.
  • Power = Probability of finding a significant result.
  • Selection bias = Overrepresentation of significant results in the literature.
  • ODR = Observed rate of p < .05.
  • EDR = Expected rate of p < .05 without bias.
  • ERR = Estimated replication success rate of significant results.

Full Introduction

Z-curve is a statistical tool for meta-analysis of larger sets of studies (k > 100). Although it can be used with smaller sets of studies (k > 10 significant results), confidence intervals are likely to be very wide. There are also different types of meta-analysis. The core application of meta-analysis is to combine information from direct replication studies, that is, studies that test the same hypothesis (e.g., the effect of aspirin on blood pressure). The most widely used meta-analytic tools aim to estimate the average effect size for a set of studies with the same research question. A second application is to quantitatively review studies on a specific research topic. These studies are called conceptual replication studies. They test the same or related hypotheses, but with different experimental procedures (paradigms). The main difference between meta-analysis of direct and conceptual replication studies is that we would expect less variability in population effect sizes (not the estimates in specific samples) in direct replications, whereas variability is expected to be higher in conceptual replication studies with different experimental manipulations and dependent variables.

Z-curve can be applied to meta-analysis of conceptual replication studies, but it was mainly developed for a third type of meta-analysis. These meta-analyses examine sets of studies with different hypotheses and research designs. Usually, these studies share a common feature. For example, they may be published in the same journal, belong to a specific scientific discipline or sub-discipline, or a specific time period. The main question of interest here is not the average effect size that is likely to vary widely from study to study. The purpose of a z-curve analysis is to examine the credibility or statistical integrity of a set of studies. The term credibility is a broad term that covers many features of a study. Z-curve focuses on statistical power as one criterion for the credibility of a study. To use z-curve and to interpret z-curve results it is therefore important to understand the concept of statistical power. Unfortunately, statistical power is still not part of the standard education in psychology. Thus, I will provide a brief introduction to statistical power.

Statistical Power

Like many other concepts in statistics, statistical power (henceforth power, the only power that does not corrupt) is a probability. To understand power, it is necessary to understand the basics of null-hypothesis significance testing (NHST). When resources are insufficient to estimate effect sizes precisely, researchers often have to settle for the modest goal of examining whether a predicted positive effect is positive (exercise increases longevity) or a predicted negative effect is negative (aspirin lowers blood pressure). The common approach is to estimate the effect size in a sample, estimate the sampling error, compute the ratio of the two, and then compute the probability that the observed effect size or an even bigger one could have been obtained without an effect; that is, a true effect size of 0. Say the effect of exercise on longevity is an extra 2 years, the sampling error is 1 year, and the test statistic is 2/1 = 2. This value corresponds to a p-value of about .05 for the hypothesis that the true effect is positive (not 2 years, but greater than 0). P-values below .05 are conventionally used to decide against the null hypothesis and to infer that the true effect size is positive if the estimate is positive or negative if the estimate is negative. Now we can define power. Power is the probability of obtaining a significant result, which typically means a p-value below .05. In short,

Power is the probability of obtaining a statistically significant result.

This definition of power differs from the textbook definition of power because we need to distinguish between different types of powers or power calculations. The most common use of power calculations relies on hypothetical population effect sizes. For example, let’s say we want to conduct a study of exercise and longevity without any prior studies. Therefore, we do not know whether exercise has an effect or how big the effect is. This does not stop us from calculating power because we can just make assumptions about the effect size. Let’s say we assume the effect is two years. The main reason to compute hypothetical power is to plan sample sizes of studies. For example, we have information about the standard deviation of people’s life span and can compute power for hypothetical sample sizes. A common recommendation is to plan studies with 80% power to obtain a significant result with the correct sign.
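As a sketch of such a planning calculation with base R (the assumed standard deviation of 15 years is purely illustrative):

```r
# Sample size needed for 80% power to detect a 2-year difference in lifespan
# between two groups (two-sample t-test, alpha = .05).
power.t.test(delta = 2, sd = 15, sig.level = .05, power = .80)
# n ~ 884 per group
```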

It would be silly to compute the hypothetical power for an effect size of zero. First, we know that the probability of a significant result without a real effect is set by the researcher. When researchers use p < .05 as the rule to determine significance, the probability of obtaining a significant result without a real effect is 5%. If they use p < .01, it is 1%. No calculations are needed. Second, researchers conduct power analysis to find evidence for an effect, so it would make no sense to do the power calculation with a value of zero. This is the null hypothesis that researchers want to reject, and they want a reasonable sample size to do so.

All of this means that hypothetical power calculations assume a non-zero effect size, and power is defined as the conditional probability of obtaining a significant result for a specified non-zero effect size. Z-curve is used to compute a different type of power. The goal is to estimate the average true power of a set of studies. This average can be made up of a mix of studies in which the null hypothesis is true or false. Therefore, z-curve estimates are no longer conditional on a true effect. When the null hypothesis is true, power is set by the significance criterion. When there is an effect, power is a function of the size of the effect. All of this discussion of conditional probability is needed only to understand the distinction between the definition of power in hypothetical power calculations and in empirical estimates of power with z-curve. The short and simple definition of power is simply the probability that a study produces a significant result.

Populations of Studies

Brunner and Schimmack (2020) introduce another distinction between power estimates that is important for the understanding of z-curve. One population of studies is all studies that have been conducted, independent of the significance criterion. Let’s assume researchers’ computers were hooked up to the internet and whenever they conduct a statistical analysis, the results are stored in a giant database. The database will contain millions of p-values, some above .05 and others below .05. We could now examine the science-wide average power of null hypothesis significance tests. In fact, it would be very easy to do so. Remember, power is defined as the probability of obtaining a significant result. We can therefore just compute the percentage of significant results to estimate average power. This is no different from averaging the results of 100,000 roulette games to see how often a table produces “red” or “black” as an outcome. If the table is biased and has more power to get “red” results, you could win a lot of money with that knowledge. In short,

The percentage of significant results in a set of studies provides an estimate of the average power of the set of studies that was conducted.
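A minimal simulation makes this concrete; the beta distribution of true power is an arbitrary choice for illustration:

```r
set.seed(123)
power <- rbeta(1e5, 2, 3)       # each study has its own true power (mean = .40)
sig   <- rbinom(1e5, 1, power)  # run the studies
mean(sig)                       # share of significant results ~ .40
mean(power)                     # matches the average true power
```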

We would not need a tool like z-curve, if power estimation were that easy. The reason why we need z-curve is that we do not have access to all statistical tests that were conducted in science, psychology, or even a single lab. Although data sharing is becoming more common, we only see a fraction of results that are published in journal articles or preprints on the web. The published set of results is akin to the proverbial tip of the iceberg, and many results remain unreported and are not available for meta-analysis. This means, we only have a sample of studies.

Whenever statisticians draw conclusions about populations from samples, it is necessary to worry about sampling bias. In meta-analyses, this bias is known as publication bias, but a better term for it is selection bias. Scientific journals, especially in psychology, prefer to publish statistically significant results (exercise increases longevity) over non-significant results (exercise may or may not increase longevity). Concerns about selection bias are as old as meta-analyses, but actual meta-analyses have often ignored the risk of selection bias. Z-curve is one of the few tools that can be used to detect selection bias and quantify the amount of selection bias (the other tool is the selection model for effect size estimation).

To examine selection bias, we need a second approach to estimate average power, other than computing the percentage of significant results. The second approach is to use the exact p-values of a study (e.g., p = .17, .05, .005) and to convert them into z-values (e.g., z ≈ 1.4, 2.0, 2.8). These z-values are a function of the true power of a study (e.g., a study with 50% power has an expected z-value of ~2) and sampling error. Z-curve uses this information to obtain a second estimate of the average power of a set of studies. If there is no selection bias, the two estimates should be similar, especially in reasonably large sets of studies. However, often the percentage of significant results (power estimate 1) is higher than the z-curve estimate (power estimate 2). This pattern of results suggests selection for significance.

In conclusion, there are two ways to estimate the average power of a set of studies. Without selection bias, the two estimates will be similar. With selection bias, the estimate based on counting significant results will be higher than the estimate based on the exact p-values.
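The conversion itself is a one-liner in R, using the two-sided convention adopted here:

```r
p <- c(.17, .05, .005)
z <- qnorm(1 - p / 2)   # two-sided p-values to absolute z-values
round(z, 2)             # 1.37 1.96 2.81
```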

Figure 1 illustrates the extreme scenario that the true power of studies was just 40%, but selection bias filtered out all non-significant results.


Figure 1. Histogram of z-scores from 1,984 significant tests (based on a simulation of 5,000 studies with 40% power). The solid red line represents the z-curve fit to the distribution of observed z-values. The dashed red line shows the expected distribution without selection bias. The vertical red line shows the significance criterion, p < .05 (two-sided, z ~ 2). ODR = Observed Discovery Rate, EDR = Expected Discovery Rate, ERR = Expected Replication Rate. FDR = False Positive Risk, not relevant for the Introduction.


The figure shows a z-curve plot. Understanding this plot is important for the use of z-curve. First, the plot is a histogram of absolute z-values. Absolute z-values are used because in field-wide meta-analyses the sign has no meaning. In one study, researchers predicted a negative result (aspirin decreases blood pressure) and in another study they predicted a positive result (exercise increases longevity). What matters is that the significant result was used to reject the null hypothesis in either direction. Z-values above 6 are not shown because they are very strong and imply nearly 100% power. The critical range of z-scores is between 2 (p = .05, just significant) and 4 (p ≈ .0001).

The z-curve plot makes it easy to spot selection for significance because there are many studies with just significant results (z > 2) and no studies with just non-significant results, which are often called marginally significant because publications use them to reject the null hypothesis with a relaxed criterion. A plot like this cannot be produced by sampling error.

In a z-curve plot, the percentage of significant results is called the observed discovery rate. Discovery is a term used in statistics for a significant result. It does not mean a breaking-news discovery; it just means p < .05. The ODR is 100% because all results are significant. This would imply that all studies tested a true hypothesis with 100% power. However, we know that this is not the case. Z-curve uses the distribution of significant z-scores to estimate power, but there are two populations of studies for which power can be estimated. One population is all studies, including the missing non-significant results. I will explain later how z-curve estimates power. Here it is only important that the estimate is 40%. This estimate is called the expected discovery rate. That is, if we could get access to all missing studies, we would see that only 40% of the studies were significant. Expected therefore means without selection bias and with open access to all studies. The difference between the ODR and EDR quantifies the amount of selection bias. Here, selection bias inflates the ODR from 40% to 100%.

It is now time to introduce another population of studies. This is the population of studies with significant results. We do not have to assume that all of these studies were published. We just assume that the published studies were not selected based on their p-values. This is a common assumption in selection models. We will see later how changing this assumption can change results.

It is well known that selection introduces bias in averages. Selection for significance selects studies that had positive sampling error and produced z-scores greater than 2, while the expected z-score without sampling error is only 1.7, which is not significant on its own. Thus, a simple power calculation for the significant results would overestimate power. Z-curve corrects for this bias and produces an unbiased estimate of the average power of the population of studies with significant results. This estimate of power after selection for significance is called the expected replication rate (ERR). The reason is that the average power of the significant results predicts the percentage of significant results if the studies with significant results were replicated exactly, including the same sample sizes. The outcome of this hypothetical replication project would be 40% significant results. The decrease from 100% to 40% is explained by the effect of selection and regression to the mean. A study with an expected value of 1.7 that produced a significant result because sampling error pushed it to 2.1 is unlikely to benefit from the same sampling error again and produce another significant result.
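A short simulation shows the size of this selection effect for the 40%-power example (noncentrality ≈ 1.71):

```r
set.seed(1)
crit <- qnorm(.975)
ncp  <- crit - qnorm(1 - .40)  # noncentrality that yields 40% power (~1.71)
z    <- rnorm(1e6, mean = ncp)
mean(z[z > crit])              # ~2.7: significant z-scores overestimate the
                               # true expected value of 1.71
```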

At the bottom of z-curve 3.0 plots, you see estimates of local power. These are average power estimates for ranges of z-values. The default is to use steps of z = 0.5. You can see that the strength of the observed z-values does not matter: z-values between 0 and 0.5 are estimated to have 40% power, as are z-values between 5.5 and 6. This happens when all studies have the same power. When studies differ in power, local power increases with the z-values because studies with higher power are more likely to produce larger z-values.

When all studies have the same power, power is said to be homogenous. When studies have different levels of power, power is heterogeneous. Homogeneity or small heterogeneity in power imply that it is easy to infer the power of studies with non-significant results from studies with significant results. The reason is that power is more or less the same. Some selection models like p-curve assume homogeneity. For this reason, it is not necessary to distinguish populations of studies with or without significant results. It is assumed that the true power is the same for all studies, and if the true power is the same for all studies, it is also the same for all subsets of studies. This is different for z-curve. Z-curve allows for heterogeneity in power, and z-curve 3.0 provides a test of heterogeneity. If there is heterogeneity in power, the ERR will be higher than the EDR because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020).

To conclude, this introduction covered the basic statistical concepts that are needed to conduct z-curve analyses and to interpret the results correctly. The key constructs are

Meta-Analysis: the statistical analysis of results from multiple studies
Null Hypothesis Significance Testing
Statistical Significance: p < .05 (alpha)
(Statistical) Power: the probability of obtaining a significant result
Conditional Power: the probability of obtaining a significant result with a true effect
Populations of Studies: A set of studies with a common characteristic
Set of all studies: studies with non-significant and significant results
Selection Bias: An overrepresentation of significant results in a set of studies
(Sub)Set of studies with significant results: Subset of studies with p < .05
Observed Discovery Rate (ODR): the percentage of significant results in a set of studies
Expected Discovery Rate (EDR): the z-curve estimate of the discovery rate based on z-values
Expected Replication Rate (ERR): the z-curve estimate of average power for the subset of significant results.

Guest Post by Jerry Brunner: Response to an Anonymous Reviewer

Introduction

Jerry Brunner recently became an emeritus professor in the Department of Statistics at the University of Toronto Mississauga. Jerry started out in psychology, but was frustrated by the unscientific practices he observed in graduate school. He went on to become a professor of statistics. Thus, he is not only an expert in statistics; he also understands the methodological problems in psychology.

Sometime in the wake of the replication crisis, around 2014/15, I went to his office to talk to him about power and bias detection. Working with Jerry was educational and motivational. Without him, z-curve would not exist. We spent years trying different methods and thinking about the underlying statistical assumptions. Simulations often shattered our intuitions. The Brunner and Schimmack (2020) article summarizes all of this work.

A few years later, the method is being used to examine the credibility of published articles across different research areas. However, not everybody is happy about a tool that can reveal publication bias, the use of questionable research practices, and a high risk of false positive results. An anonymous reviewer dismissed z-curve results based on a long list of criticisms (Post: Dear Anonymous Reviewer). It was funny to see how ChatGPT responded to these criticisms (Comment). However, the quality of ChatGPT responses is difficult to evaluate. Therefore, I am pleased to share Jerry’s response to the reviewer’s comments here. Let’s just say that the reviewer was wise to make their comments anonymously. Posting the review and the response in public also shows why we need open reviews like the ones published in Meta-Psychology by the reviewers of our z-curve article. Hidden and biased reviews are just one more reason why progress in psychology is so slow.

Jerry Brunner’s Response

This is Jerry Brunner, the “Professor of Statistics” mentioned in the post. I am also a co-author of Brunner and Schimmack (2020). Since the review Uli posted is mostly an attack on our joint paper, I thought I’d respond.

First of all, z-curve is sort of a moving target. The method described by Brunner and Schimmack is strictly a way of estimating population mean power based on a random sample of tests that have been selected for statistical significance. I’ll call it z-curve 1.0. The algorithm has evolved over time, and the current z-curve R package (available at https://cran.r-project.org/web/packages/zcurve/index.html) implements a variety of diagnostics based on a sample of p-values. The reviewer’s comments apply to z-curve 1.0, and so do my responses. This is good from my perspective, because I was in on the development of z-curve 1.0, and I believe I understand it pretty well. When I refer to z-curve in the material that follows, I mean z-curve 1.0. I do believe z-curve 1.0 has some limitations, but they do not overlap with the ones suggested by the reviewer.

Here are some quotes from the review, followed by my answers.

(1) “… z-curve analysis is based on the concept of using an average power estimate of completed studies (i.e., post hoc power analysis). However, statisticians and methodologists have written about the problem of post hoc power analysis …”

This is not accurate. Post-hoc power analysis is indeed fatally flawed; z-curve is something quite different. For later reference, in the “observed” power method, sample effect size is used to estimate population effect size for a single study. Estimated effect size is combined with observed sample size to produce an estimated non-centrality parameter for the non-central distribution of the test statistic, and estimated power is calculated from that, as an area under the curve of the non-central distribution. So, the observed power method produces an estimated power for an individual study. These estimates have been found to be too noisy for practical use.

The confusion of z-curve with observed power comes up frequently in the reviewer’s comments. To be clear, z-curve does not estimate effect sizes, nor does it produce power estimates for individual studies.
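For contrast, here is a sketch of the observed-power calculation the reviewer has in mind, a per-study estimate that z-curve never computes (effect size and sample size are made up):

```r
# "Observed power" for a single two-group study: noisy, and not what z-curve does.
d_hat <- 0.40; n <- 50            # sample effect size, per-group sample size
ncp_hat <- d_hat * sqrt(n / 2)    # estimated noncentrality (= 2 here)
1 - pnorm(qnorm(.975) - ncp_hat)  # estimated power ~ .52
```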

(2) “It should be noted that power is not a property of a (completed) study (fixed data). Power is a performance measure of a procedure (statistical test) applied to an infinite number of studies (random data) represented by a sampling distribution. Thus, what one estimates from completed study is not really “power” that has the properties of a frequentist probability even though the same formula is used. Average power does not solve this ontological problem (i.e., misunderstanding what frequentist probability is; see also McShane et al., 2020). Power should always be about a design for future studies, because power is the probability of the performance of a test (rejecting the null hypothesis) over repeated samples for some specified sample size, effect size, and Type I error rate (see also Greenland et al., 2016; O’Keefe, 2007). z-curve, however, makes use of this problematic concept of average power (for completed studies), which brings to question the validity of z-curve analysis results.”

The reviewer appears to believe that once the results of a study are in, the study no longer has a power. To clear up this misconception, I will describe the model on which z-curve is based.

There is a population of studies, each with its own subject population. One designated significance test will be carried out on the data for each study. Given the subject population, the procedure and design of the study (including sample size), significance level and the statistical test employed, there is a probability of rejecting the null hypothesis. This probability has the usual frequentist interpretation; it’s the long-term relative frequency of rejection based on (hypothetical) repeated sampling from the particular subject population. I will use the term “power” for the probability of rejecting the null hypothesis, whether or not the null hypothesis is exactly true.

Note that the power of the test — again, a member of a population of tests — is a function of the design and procedure of the study, and also of the true state of affairs in the subject population (say, as captured by effect size).

So, every study in the population of studies has a power. It’s the same before any data are collected, and after the data are collected. If the study were replicated exactly with a fresh sample from the same population, the probability of observing significant results would be exactly the power of the study — the true power.

This takes care of the reviewer’s objection, but let me continue describing our model, because the details will be useful later.

For each study in the population of studies, a random sample is drawn from the subject population, and the null hypothesis is tested. The results are either significant, or not. If the results are not significant, they are rejected for publication, or more likely never submitted. They go into the mythical “file drawer,” and are no longer available. The studies that do obtain significant results form a sub-population of the original population of studies. Naturally, each of these studies has a true power value. What z-curve is trying to estimate is the population mean power of the studies with significant results.

So, we draw a random sample from the population of studies with significant results, and use the reported results to estimate population mean power — not of the original population of studies, but only of the subset that obtained significant results. To us, this roughly corresponds to the mean power in a population of published results in a particular field or sub-field.

Note that there are two sources of randomness in the model just described. One arises from the random sampling of studies, and the other from random sampling of subjects within studies. In an appendix containing the theorems, Brunner and Schimmack liken designing a study (and choosing a test) to the manufacture of a biased coin with probability of heads equal to the power. All the coins are tossed, corresponding to running the subjects, collecting the data and carrying out the tests. Then the coins showing tails are discarded. We seek to estimate the mean P(Head) for all the remaining coins.
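The coin analogy is easy to simulate. This sketch uses an arbitrary uniform distribution of power and shows both stages of randomness as well as the effect of discarding the tails:

```r
set.seed(42)
p_head <- runif(1e5, .05, .95)         # each coin's P(head), i.e., each study's power
heads  <- rbinom(1e5, 1, p_head) == 1  # toss all coins; discard the tails
mean(p_head)                           # mean power of all coins (~.50)
mean(p_head[heads])                    # mean power of the remaining coins (~.64)
```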

(3) “In Brunner and Schimmack (2020), there is a problem with ‘Theorem 1 states that success rate and mean power are equivalent …’ Here, the coin flip with a binary outcome is a process to describe significant vs. nonsignificant p-values. Focusing on observed power, the problem is that using estimated effect sizes (from completed studies) have sampling variability and cannot be assumed to be equivalent to the population effect size.”

There is no problem with Theorem 1. The theorem says that in the coin tossing experiment just described, suppose you (1) randomly select a coin from the population, and (2) toss it — so there are two stages of randomness. Then the probability of observing a head is exactly equal to the mean P(Heads) for the entire set of coins. This is pretty cool if you think about it. The theorem makes no use of the concept of effect size. In fact, it’s not directly about estimation at all; it’s actually a well-known result in pure probability, slightly specialized for this setting. The reviewer says “Focusing on observed power …” But why would he or she focus on observed power? We are talking about true power here.

(4) “Coming back to p-values, these statistics have their own distribution (that cannot be derived unless the effect size is null and the p-value follows a uniform distribution).”

They said it couldn’t be done. Actually, deriving the distribution of the p-value under the alternative hypothesis is a reasonable homework problem for a masters student in statistics. I could give some hints …
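For completeness, here is the derivation for a one-sided z-test with noncentrality δ; other tests follow the same pattern with their noncentral distributions:

```latex
% One-sided z-test: reject for large Z, with Z ~ N(\delta, 1) under the alternative.
% The p-value is P = 1 - \Phi(Z). For 0 < p < 1, its CDF is
F(p) = \Pr(P \le p) = \Pr\big(Z \ge \Phi^{-1}(1-p)\big)
     = 1 - \Phi\big(\Phi^{-1}(1-p) - \delta\big),
% and differentiating with respect to p gives the density
f(p) = \frac{\varphi\big(\Phi^{-1}(1-p) - \delta\big)}{\varphi\big(\Phi^{-1}(1-p)\big)},
% which reduces to the uniform density f(p) = 1 when \delta = 0.
```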

(5) “Now, if the counter argument taken is that z-curve does not require an effect size input to calculate power, then I’m not sure what z-curve calculates because a value of power is defined by sample size, effect size, Type I error rate, and the sampling distribution of the statistical procedure (as consistently presented in textbooks for data analysis).”

Indeed, z-curve uses only p-values, from which useful estimates of effect size cannot be recovered. As previously stated, z-curve does not estimate power for individual studies. However, the reviewer is aware that p-values have a probability distribution. Intuitively, shouldn’t the distribution of p-values and the distribution of power values be connected in some way? For example, if all the null hypotheses in a population of tests were true so that all power values were equal to 0.05, then the distribution of p-values would be uniform on the interval from zero to one. When the null hypothesis of a test is false, the distribution of the p-value is right skewed and strictly decreasing (except in pathological artificial cases), with more of the probability piling up near zero. If average power were very high, one might expect a distribution with a lot of very small p-values. The point of this is just that the distribution of p-values surely contains some information about the distribution of power values. What z-curve does is to massage a sample of significant p-values to produce an estimate, not of the entire distribution of power after selection, but just of its population mean. It’s not an unreasonable enterprise, in spite of what the reviewer thinks. Also, it works well for large samples of studies. This is confirmed in the simulation studies reported by Brunner and Schimmack.

(6) “The problem of Theorem 2 in Brunner and Schimmack (2020) is assuming some distribution of power (for all tests, effect sizes, and sample sizes). This is curious because the calculation of power is based on the sampling distribution of a specific test statistic centered about the unknown population effect size and whose variance is determined by sample size. Power is then a function of sample size, effect size, and the sampling distribution of the test statistic.”

Okay, no problem. As described above, every study in the population of studies has its own test statistic, its own true (not estimated) effect size, its own sample size — and therefore its own true power. The relative frequency histogram of these numbers is the true population distribution of power.

(7) “There is no justification (or mathematical derivation) to show that power follows a uniform or beta distribution (e.g., see Figure 1 & 2 in Brunner and Schimmack, 2000, respectively).”

Right. These were examples, illustrating the distribution of power before versus after selection for significance — as given in Theorem 2. Theorem 2 applies to any distribution of true power values.

(8) “If the counter argument here is that we avoid these issues by transforming everything into a z-score, there is no justification that these z-scores will follow a z-distribution because the z-score is derived from a normal distribution – it is not the transformation of a p-value that will result in a z-distribution of z-scores … it’s weird to assume that p-values transformed to z-scores might have the standard error of 1 according to the z-distribution …”

The reviewer is objecting to Step 1 of constructing a z-curve estimate, given on page 6 of Brunner and Schimmack (2020). We start with a sample of significant p-values, arising from a variety of statistical tests, various F-tests, chi-squared tests, whatever — all with different sample sizes. Then we pretend that all the tests were actually two-sided z-tests with the results in the predicted direction, equivalent to one-sided z-tests with significance level 0.025. Then we transform the p-values to obtain the z statistics that would have generated them, had they actually been z-tests. Then we do some other stuff to the z statistics.
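
Concretely, Step 1 is just the inverse-normal transformation; here is a minimal sketch with hypothetical significant p-values:

```r
# Convert significant two-sided p-values into the z statistics that
# would have produced them, had the tests actually been z-tests.
p_sig   <- c(.049, .020, .003, .0004)   # hypothetical p-values
z_equiv <- qnorm(1 - p_sig / 2)         # inverts p = 2 * (1 - pnorm(z))
round(z_equiv, 2)                       # 1.97 2.33 2.97 3.54
```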

But as the reviewer notes, most of the tests probably are not z-tests. The distributions of their p-values, which depend on the non-central distributions of their test statistics, are different from one another, and also different from the distribution for genuine z-tests. Our paper describes it as an approximation, but why should it be a good approximation? I honestly don’t know, and I have given it a lot of thought. I certainly would not have come up with this idea myself, and when Uli proposed it, I did not think it would work. We both came up with a lot of estimation methods that did not work when we tested them out. But when we tested this one, it was successful. Call it a brilliant leap of intuition on Uli’s part. That’s how I think of it.

Uli’s comment.
It helps to know your history. Well before psychologists focused on effect sizes for meta-analysis, Fisher already had a method to meta-analyze p-values. P-curve is just a meta-analysis of p-values with a selection model. However, p-values have ugly distributions, and Stouffer proposed transforming p-values into z-scores to conduct meta-analyses. This method was used by Rosenthal to compute the fail-safe N, one of the earliest methods to evaluate the credibility of published results (Fail-Safe-N). Ironically, even the p-curve app started using this transformation (p-curve changes). Thus, p-curve is really a version of z-curve. The problem with p-curve is that it has only one parameter and cannot model heterogeneity in true power. This is the key advantage of z-curve.1.0 over p-curve (Brunner & Schimmack, 2020). P-curve is biased even when all studies have the same population effect size but different sample sizes, which leads to heterogeneity in power (Brunner, 2018).
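
For readers who have not seen it, Stouffer’s method is essentially a one-liner; here is a minimal sketch with made-up one-sided p-values (not data from any study):

```r
# Stouffer's method: transform p-values to z-scores and combine them.
# Under H0, the sum of k independent standard normal z-scores divided
# by sqrt(k) is again standard normal.
p <- c(.04, .11, .02, .07)            # hypothetical one-sided p-values
z <- qnorm(1 - p)
z_combined <- sum(z) / sqrt(length(z))
p_combined <- 1 - pnorm(z_combined)   # combined one-sided p-value
```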

Such things are fairly common in statistics. An idea is proposed, and it seems to work. There’s a “proof,” or at least an argument for the method, but the proof does not hold up. Later on, somebody figures out how to fill in the missing technical details. A good example is Cox’s proportional hazards regression model in survival analysis. It worked great in a large number of simulation studies, and was widely used in practice. Cox’s mathematical justification was weak. The justification starts out being intuitively reasonable but not quite rigorous, and then deteriorates. I have taught this material, and it’s not a pleasant experience. People used the method anyway. Then decades after it was proposed by Cox, somebody else (Aalen and others) proved everything using a very different and advanced set of mathematical tools. The clean justification was too advanced for my students.

Another example (from mathematics) is Fermat’s last theorem, which took over 300 years to prove. I’m not saying that z-curve is in the same league as Fermat’s last theorem, just that statistical methods can be successful and essentially correct before anyone has been able to provide a rigorous justification.

Still, this is one place where the reviewer is not completely mixed up.

Another Uli comment
Undergraduate students are often taught different test statistics and distributions as if they were totally different. However, most tests in psychology are practically z-tests. Just look at a t-distribution with N = 40 (df = 38) and try to see the difference from a standard normal distribution. The difference is tiny, and it becomes invisible as sample sizes increase beyond 40! And F-tests? F-values with 1 numerator degree of freedom are just squared t-values, so the square root of these is practically a z-statistic. But what about chi-square? Well, with 1 df, chi-square is just a squared z-score, so we can take the square root and have a z-score. But what if we don’t have two groups, but compute correlations or regressions? The statistical significance test uses the t-distribution, and sample sizes are often well above 40, so t and z are practically identical. It is therefore not surprising to me that empirical results based on different test statistics can be approximated with the standard normal distribution. We could make teaching statistics so much easier, instead of confusing students with F-distributions. The only exceptions are complex designs with 3 x 4 x 5 ANOVAs, but they don’t really test anything and are just used to p-hack. Rant over.
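
Before handing back to Jerry, a quick numerical check of these claims in R (my illustration, with arbitrary values):

```r
df <- 38   # N = 40 in a two-group comparison

# t(38) vs. standard normal: the tail probabilities are already close.
2 * pt(-2, df)   # .0526
2 * pnorm(-2)    # .0455

# Chi-square with 1 df is exactly a squared z, so the square root of the
# statistic and the z recovered from its p-value agree perfectly.
x2 <- 6.1
sqrt(x2)                                              # 2.47
qnorm(1 - pchisq(x2, df = 1, lower.tail = FALSE) / 2) # 2.47

# F with 1 numerator df is a squared t, so sqrt(F) behaves like |t|,
# which is itself close to |z| at these sample sizes.
f <- 5.3
sqrt(f)   # 2.30
```

Back to Jerry.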

(9) “It is unclear how Theorem 2 is related to the z-curve procedure.”

Theorem 2 is about how selection for significance affects the probability distribution of true power values. Z-curve estimates are based only on studies that have achieved significant results; the others are hidden, by a process that can be called publication bias. There is a fundamental distinction between the original population of power values and the sub-population belonging to studies that produce significant results. The theorems in the appendix are intended to clarify that distinction. The reviewer believes that once significance has been observed, the studies in question no longer even have true power values. So, clarification would seem to be necessary.

(10) “In the description of the z-curve analysis, it is unclear why z-curve is needed to calculate “average power.” If p < .05 is the criterion of significance, then according to Theorem 1, why not count up all the reported p-values and calculate the proportion in which the p-values are significant?”

If there were no selection for significance, this is what a reasonable person would do. But the point of the paper, and what makes the estimation problem challenging, is that all we can observe are statistics from studies with p < 0.05. Publication bias is real, and z-curve is designed to allow for it.

(11) “To beat a dead horse, z-curve makes use of the concept of “power” for completed studies. To claim that power is a property of completed studies is an ontological error …”

Wrong. Power is a feature of the design of a study, the significance test, and the subject population. All of these features still exist after data have been collected and the test is carried out.

Uli and Jerry comment:
Whenever a psychologist uses the word “ontological,” be very skeptical. Most psychologists who use the word understand philosophy as well as this reviewer understands statistics.

(12) “The authors make a statement that (observed) power is the probability of exact replication. However, there is a conceptual error embedded in this statement. While Greenwald et al. (1996, p. 1976) state “replicability can be computed as the power of an exact replication study, which can be approximated by [observed power],” they also explicitly emphasized that such a statement requires the assumption that the estimated effect size is the same as the unknown population effect size which they admit cannot be met in practice.”

Observed power (a bad estimate of true power) is not the probability of significance upon exact replication. True power is the probability of significance upon exact replication. It’s based on true effect size, not estimated effect size. We were talking about true power, and we mistakenly thought that was obvious.

(13) “The basis of supporting the z-curve procedure is a simulation study. This approach merely confirms what is assumed with simulation and does not allow for the procedure to be refuted in any way (cf. Popper’s idea of refutation being the basis of science.) In a simulation study, one assumes that the underlying process of generating p-values is correct (i.e., consistent with the z-curve procedure). However, one cannot evaluate whether the p-value generating process assumed in the simulation study matches that of empirical data. Stated a different way, models about phenomena are fallible and so we find evidence to refute and corroborate these models. The simulation in support of the z-curve does not put the z-curve to the test but uses a model consistent with the z-curve (absent of empirical data) to confirm the z-curve procedure (a tautological argument). This is akin to saying that model A gives us the best results, and based on simulated data on model A, we get the best results.”

This criticism would have been somewhat justified if the simulations had used p-values from a bunch of z-tests. However, they did not. The simulations reported in the paper are all F-tests with one numerator degree of freedom, and denominator degrees of freedom depending on the sample size. This covers all the tests of individual regression coefficients in multiple regression, as well as comparisons of two means using two-sample (and even matched) t-tests. Brunner and Schimmack say (p. 8):

Because the pattern of results was similar for F-tests and chi-squared tests and for different degrees of freedom, we only report details for F-tests with one numerator degree of freedom; preliminary data mining of the psychological literature suggests that this is the case most frequently encountered in practice. Full results are given in the supplementary materials.

So I was going to refer the reader (and the anonymous reviewer, who is probably not reading this post anyway) to the supplementary materials. Fortunately I checked first, and found that the supplementary materials include a bunch of OSF stuff like the letter submitting the article for publication, and the reviewers’ comments and so on — but not the full set of simulations. Oops.

All the code and the full set of simulation results are posted at

https://www.utstat.utoronto.ca/brunner/zcurve2018

You can download all the material in a single file at

https://www.utstat.utoronto.ca/brunner/zcurve2018.zip

After expanding, just open index.html in a browser.

Actually, we did many more simulation studies than this, but you have to draw the line somewhere. The point is that z-curve performs well for large numbers of studies with chi-squared test statistics as well as F statistics, all with varying degrees of freedom.

(14) “The simulation study was conducted for the performance of the z-curve on constrained scenarios including F-tests with df = 1 and not for the combination of t-tests and chi-square tests as applied in the current study. I’m not sure what to make of the z-curve performance for the data used in the current paper because the simulation study does not provide evidence of its performance under these unexplored conditions.”

Now the reviewer is talking about the paper that was actually under review. The mistake is natural, because of our (my) error in not making sure that the full set of simulations was included in the supplementary materials. The conditions in question are not unexplored; they are thoroughly explored, and the accuracy of z-curve for large samples is confirmed.

(15+) There are some more comments by the reviewer, but these are strictly about the paper under review, and not about Brunner and Schimmack (2020). So, I will leave any further response to others.

Z-Curve: An even better p-curve

So far, Simmons, Nelson, and Simonsohn have not commented on this blog post. I have now submitted it as a commentary to JEP-General. Let’s see whether it will be sent out for review and whether they will comment as (anonymous) reviewers.

Abstract

P-curve was a first attempt to take the problem of selection for significance seriously and to evaluate whether a set of studies provides credible evidence against the null-hypothesis after taking selection bias into account. Here I show that p-curve has serious limitations and can provide misleading information about the strength of evidence against the null-hypothesis. I show that all of the information provided by a p-curve analysis (Simonsohn, Nelson, & Simmons, 2014) is also provided by a z-curve analysis (Bartos & Schimmack, 2021). Moreover, z-curve provides additional information about the presence and the amount of selection bias. As z-curve is superior to p-curve, the rational choice is to use z-curve to examine the credibility of significant results.

Keywords: Publication Bias, Selection Bias, Z-Curve, P-Curve, Expected Replication Rate, Expected Discovery Rate, File-Drawer, Power

Introduction

In 2011, it dawned on psychologists that something was wrong with their science. Daryl Bem had just published an article with nine studies that showed an incredible finding (Bem, 2011). Participants’ responses were influenced by random events that had not yet occurred. Since then, the flaws in research practices have become clear and it has been shown that they are not limited to mental time travel (Schimmack, 2020). For decades, psychologists assumed that statistically significant results reveal true effects and reported only statistically significant results (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). However, selective reporting of significant results undermines the purpose of significance testing to distinguish true and false hypotheses. If only significant results are reported, most published results could be false positive results (Simmons, Nelson, & Simonsohn, 2011).

Selective reporting of significant results also undermines the credibility of meta-analyses (Rosenthal, 1979), which explains why meta-analyses also suggest that humans possess psychic abilities (Bem & Honorton, 1994). Thus, selection bias not only invalidates the results of original studies, it also threatens the validity of conclusions based on meta-analyses that do not take selection bias into account.

Concerns about a replication crisis in psychology led to an increased focus on replication studies. An ambitious project found that only 37% of studies in (cognitive & social) experimental psychology could be replicated (Open Science Collaboration, 2015). This dismal result created a crisis of confidence in published results. To alleviate these concerns, psychologists developed new methods to detect publication bias. These new methods showed that Bem’s paranormal results were obtained with the help of questionable research practices (Francis, 2012; Schimmack, 2012), which explained why replication attempts were unsuccessful (Galak et al., 2012). Furthermore, Francis showed that many published articles in the prestigious journal Psychological Science show signs of publication bias (Francis, 2014). However, the presence of publication bias does not imply that the published results are false (positives). Publication bias may merely inflate effect sizes without invalidating the main theoretical claims. To address the latter question it is necessary to conduct meta-analyses that take publication bias into account. In this article, I compare two methods that were developed for this purpose: p-curve (Simonsohn et al., 2014) and z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). P-curve was introduced in 2014 and has already been used in many articles. Z-curve was developed in 2015, but was only published recently in a peer-reviewed journal. Experimental psychologists who are familiar with speed-accuracy tradeoffs may not be surprised to learn that z-curve is the superior method. As Brunner and Schimmack (2020) demonstrated with simulation studies, p-curve often produces inflated estimates of the evidential value of original studies. This bias was not detected by the developers of p-curve because they did not evaluate their method with simulation studies. Moreover, their latest version of p-curve was never peer-reviewed. In this article, I first provide a critical review of p-curve's limitations and then show how z-curve addresses all of them.

P-Curve

P-curve is the name for a family of statistical tests that have been combined into an app that researchers can use to conduct p-curve analyses, henceforth simply called p-curve. The latest version of p-curve is 4.06, last updated on November 30, 2017 (p-curve.com).

The first part of a p-curve analysis is a p-curve plot. A p-curve plot is a histogram of all significant p-values where p-values are placed into five bins, namely p-values ranging from 0 to .01, .01 to .02, .02 to .03, .03 to .04, and .04 to .05. If the set of studies contains mostly studies with true effects that have been tested with moderate to high power, there are more p-values between 0 and .01 than between .04 and .05. This pattern has been called a right-skewed distribution by the p-curve authors. If the distribution is flat or reversed (more p-values between .04 and .05 than between 0 and .01), the data lack evidential value; that is, the results are more consistent with the null-hypothesis than with the presence of a real effect.
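
As a toy illustration (simulated data, not output from the app), the five-bin histogram underlying a p-curve plot can be computed in a few lines:

```r
# Simulate p-values from two-sided z-tests with roughly 63% power and
# bin the significant ones into the five intervals used by p-curve.
set.seed(1)
p_all <- 2 * pnorm(-abs(rnorm(5000, mean = 2.3)))
p_sig <- p_all[p_all < .05]
bins  <- cut(p_sig, breaks = seq(0, .05, by = .01))
round(100 * table(bins) / length(p_sig))  # right-skewed: most mass in (0, .01]
```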

The main limitation of p-curve plots is that it is difficult to evaluate ambiguous cases. To aid in the interpretation of p-curve plots, p-curve also provides statistical tests of evidential value. One test is a significance test against the null-hypothesis that all significant p-values are false positive results. If this null-hypothesis can be rejected with the traditional alpha criterion of .05, it is possible to conclude that at least some of the significant results are not false positives. The main problem with this significance test is that it does not provide information about effect sizes. A right-skewed p-curve with a significant test result may be due to weak evidence with many false positive results or strong evidence with few false positives.

To address this concern, the p-curve app also provides an estimate of statistical power. When studies are heterogeneous (i.e., different sample sizes or effect sizes or both), this estimate is an estimate of mean unconditional power (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). The term unconditional refers to the fact that a significant result may be a false positive result; unconditional power does not condition on the presence of an effect (i.e., on the null-hypothesis being false). When the null-hypothesis is true, a result has a probability of alpha (typically 5%) of being significant. Thus, a p-curve analysis that includes some false positive results includes some studies with a probability of significance equal to alpha and others with probabilities greater than alpha.
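
Mean unconditional power is simply the average over this mixture. A two-line illustration with hypothetical numbers:

```r
# Hypothetical mixture: 40 studies test true nulls (power = alpha) and
# 60 studies test real effects with 80% power.
alpha <- .05
power <- c(rep(alpha, 40), rep(.80, 60))
mean(power)   # mean unconditional power = .50
```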

To illustrate the p-curve app, I conducted a meta-analysis of all published articles by Leif D. Nelson, one of the co-authors of p-curve. I found 119 studies with codable data and coded the most focal hypothesis in each of these studies. I then submitted the data to the online p-curve app. Figure 1 shows the output.

Visual inspection of the p-curve plot shows a right-skewed distribution with 57% of the p-values between 0 and .01 and only 6% of p-values between .04 and .05. The statistical test against the null-hypothesis that all of the significant p-values are false positives is highly significant. Thus, at least some of the p-values are likely to be true positives. Finally, the power estimate is very high, 97%, with a tight confidence interval ranging from 96% to 98%. Somewhat redundant with this information, the p-curve app also provides a significance test for the hypothesis that power is less than 33%. This test is not significant, which is not surprising given the estimated power of 97%.

The p-curve results are surprising. After all, Nelson openly stated that he used questionable research practices before he became aware of the high false positive risk associated with these practices. “We knew many researchers—including ourselves—who readily admitted to dropping dependent variables, conditions, or participants to achieve significance.” (Simmons, Nelson, & Simonsohn, 2018, p. 255). The impressive estimate of 97% power is in stark contrast to the claim that questionable research practices were used to produce Nelson’s results. A z-curve analysis of the data shows that the p-curve results provide false information about the robustness of Nelson’s published results.

Z-Curve

Like p-curve, z-curve analyses are supplemented by a plot of the data. The main difference is that p-values are converted into z-scores using the formula for the inverse normal distribution: z = qnorm(1-p/2). The second difference is that both significant and non-significant p-values are plotted. The third difference is that z-curve plots have a much finer resolution than p-curve plots. Whereas p-curve bins all z-scores from 2.58 to infinity into one bin (p < .01), z-curve uses the information in the distribution of z-scores all the way up to z = 6 (p = .000000002; 1/500,000,000). Z-statistics greater than 6 are assigned a power of 1.
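
Reusing the simulated data from the p-curve sketch above, a bare-bones z-curve-style plot takes only a few lines:

```r
# Convert all p-values (significant and non-significant) to absolute
# z-scores and plot a fine-grained histogram with the significance
# threshold marked.
set.seed(1)
p_all <- 2 * pnorm(-abs(rnorm(5000, mean = 2.3)))
z_all <- qnorm(1 - p_all / 2)
hist(z_all[z_all < 6], breaks = 60)
abline(v = qnorm(1 - .05 / 2), col = "red")  # z = 1.96, the red line
```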

Visual inspection of the z-curve plot reveals something that the p-curve plot does not show, namely, clear evidence for the presence of selection bias. Whereas p-curve suggests that “highly” significant results (0 to .01) are much more common than “just” significant results (.04 to .05), z-curve shows that just significant results (.05 to .005) are much more frequent than highly significant (p < .005) results. The difference is due to the implicit definitions of high and low in the two plots. The high frequency of highly significant (p < .01) results in the p-curve plot is due to the wide range of values that are lumped together into this bin. Once it is clear that many p-values are clustered just below .05 (z > 1.96, the vertical red line), it is immediately notable that there are too few just non-significant (z < 1.96) values. This steep drop in frequencies from just significant to just non-significant values is inconsistent with random sampling error. Thus, publication bias is readily visible by visual inspection of a z-curve plot. In contrast, p-curve plots provide no information about publication bias because non-significant results are not shown. Even worse, right-skewed distributions are often falsely interpreted as evidence that there is no publication bias or use of questionable research practices (e.g., Rusz, Le Pelley, Kompier, Mait, & Bijleveld, 2020). This misinterpretation of p-curve plots can be easily avoided by inspection of z-curve plots.

The second part of a z-curve analysis uses a finite mixture model to estimate two statistical parameters of the data. These parameters are called the expected discovery rate and the expected replication rate (Bartos & Schimmack, 2021). Other terms for these parameters are mean power before selection and mean power after selection for significance (Brunner & Schimmack, 2020). The meaning of these terms is best understood with a simple example in which a researcher tests 100 false hypotheses and 100 true hypotheses with 100% power. The outcome produces significant and non-significant p-values. The expected frequency of significant p-values is 100 for the 100 true hypotheses tested with 100% power plus 5 for the 100 false hypotheses, which produce 5% significant results when alpha is set to 5%. Thus, we expect 105 significant results and 95 non-significant results. In this example, the discovery rate is 105/200 = 52.5%. With real data, the discovery rate is often not known because not all statistical tests are published. When selection for significance is present, the observed discovery rate is an inflated estimate of the actual discovery rate. For example, if 50 of the 95 non-significant results are missing, the observed discovery rate is 105/150 = 70%. Z-curve.2.0 uses the distribution of the significant z-scores to estimate the discovery rate while taking selection bias into account. That is, it uses the truncated distribution of z-scores greater than 1.96 to estimate the shape of the full distribution (i.e., the grey curve in Figure 2). This produces an estimate of the mean power before selection for significance. As significance is determined by power and sampling error, the estimate of mean power provides an estimate of the expected discovery rate. Figure 2 shows an observed discovery rate of 87%. This is in line with estimates of discovery rates around 90% in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). However, the z-curve estimate of the expected discovery rate is only 27%. The bootstrapped, robust confidence interval around this estimate ranges from 5% to 51%. As this interval does not include the observed discovery rate, the results provide statistically significant evidence that questionable research practices were used to produce 87% significant results. Moreover, the difference between the observed and expected discovery rate is large. This finding is consistent with Nelson’s admission that many questionable research practices were used to achieve significant results (Simmons et al., 2018). In contrast, p-curve provides no information about the presence or amount of selection bias.
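
The arithmetic of this example, spelled out:

```r
# 100 tests of true nulls plus 100 tests with 100% power, alpha = .05.
expected_sig <- 100 * .05 + 100 * 1   # 105 expected significant results
expected_ns  <- 200 - expected_sig    # 95 expected non-significant results
expected_sig / 200                    # true discovery rate: .525

# If 50 of the 95 non-significant results stay in the file drawer,
# the observed discovery rate is inflated:
expected_sig / (expected_sig + expected_ns - 50)  # 105/150 = .70
```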

The power estimate provided by the p-curve app is the mean power of studies with a significant result. Mean power of these studies is equal to or greater than the mean power of all studies because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020). Bartos and Schimmack (2021) refer to mean power after selection for significance as the expected replication rate. To explain this term, it is instructive to see how selection for significance influences mean power in the example with 100 tests of true null-hypotheses and 100 tests of true alternative hypotheses with 100% power. We expect only 5 false positive results and 100 true positive results. The average power of these 105 studies is (5 * .05 + 100 * 1)/105 = 95.5%. This is much higher than the mean power before selection for significance, which was based on 100 rather than just 5 tests of a true null-hypothesis. For Nelson’s data, p-curve produced an estimate of 97% power. Thus, p-curve predicts that 97% of replication attempts of Nelson’s published results would produce a significant result again. The z-curve estimate in Figure 2 shows that this is a dramatically inflated estimate of the expected replication rate. The z-curve estimate is only 52%, with a robust 95% confidence interval ranging from 40% to 68%. Simulation studies show that z-curve estimates are close to the simulated values, whereas p-curve estimates are inflated when the studies are heterogeneous (Brunner & Schimmack, 2020). The p-curve authors have been aware of this bias in p-curve estimates since January 2018 (Simmons, Nelson, & Simonsohn, 2018), but they have not changed their app or warned users about this problem. The present example clearly shows that p-curve estimates can be highly misleading and that it is unscientific to use or interpret p-curve estimates of the expected replication rate.
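
The corresponding calculation for mean power after selection in the same example:

```r
# Of the expected 105 significant results, 5 are false positives with
# true power .05 and 100 are true positives with power 1.
(5 * .05 + 100 * 1) / 105   # mean power after selection: ~.955
```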

Published Example

Since p-curve was introduced, it has been cited in over 500 articles and it has been used in many meta-analyses. While some meta-analyses correctly interpreted p-curve results as demonstrating merely that a set of studies has some evidential value (i.e., that the nil-hypothesis that all significant results are false positives can be rejected), others went further and drew false conclusions from a p-curve analysis. Moreover, meta-analyses that used p-curve missed the opportunity to quantify the amount of selection bias in a literature. To illustrate how meta-analysts can benefit from a z-curve analysis, I reexamined a meta-analysis of the effects of reward stimuli on attention (Rusz et al., 2020).

Using their open data (https://osf.io/rgeb6/), I first reproduced their p-curve analysis using the p-curve app (http://www.p-curve.com/app4/). Figure 3 shows that 42% of the p-values fall between 0 and .01, whereas only 7% fall between .04 and .05. The figure also shows that the observed p-curve is similar to the p-curve predicted by a homogeneous set of studies with 33% power. Nevertheless, power is estimated to be 52%. Rusz et al. (2020) interpret these results as evidence that “this set of studies contains evidential value for reward-driven distraction” and that “It provides no evidence for p-hacking” (p. 886).

Figure 4 shows the z-curve for the same data. Visual inspection of the z-curve plot shows that there are many more just-significant than just-not-significant results. This impression is confirmed by a comparison of the observed discovery rate (74%) with the expected discovery rate (27%). The bootstrapped, robust 95% confidence interval, 8% to 58%, does not include the observed discovery rate. Thus, there is statistically significant evidence that questionable research practices inflated the percentage of significant results. The expected replication rate (37%) is also lower than the p-curve estimate (52%). With an average power of 37%, it is clear that the published studies are underpowered. Based on these results, it is clear that effect-size meta-analyses that do not take selection bias into account produce inflated effect size estimates. Moreover, when the ERR is higher than the EDR, studies are heterogeneous, which means that some studies have even less power than the average of 37%, and some of these may be false positive results. It is therefore unclear which reward stimuli and which attention paradigms show a theoretically significant effect and which do not. However, meta-analysts often falsely generalize an average effect to individual studies. For example, Rusz et al. (2020) concluded from their significant average effect size (d ~ .3) that high-reward stimuli impair cognitive performance “across different paradigms and across different reward cues” (p. 887). This conclusion is incorrect because the mean effect size is inflated and could be driven by subsets of reward stimuli and paradigms. To demonstrate that a specific reward stimulus influences performance on a specific task would require high-powered replication studies for the various combinations of rewards and paradigms. At present, the meta-analysis merely shows that some rewards can interfere with some tasks.

Conclusion

Simonsohn et al. (2014) introduced p-curve as a statistical tool to correct for publication bias and questionable research practices in meta-analyses. In this article, I critically reviewed p-curve and showed several limitations of and biases in p-curve results. The first p-curve methods focused on statistical significance and did not quantify the strength of evidence against the null-hypothesis that all significant results are false positives. This problem was solved by introducing a method that quantified strength of evidence as the mean unconditional power of studies with significant results. However, the estimation method was never validated with simulation studies. Independent simulation studies showed that p-curve systematically overestimates power when effect sizes or sample sizes are heterogeneous. In the present article, this bias inflated mean power for Nelson’s published results from 52% to 97%. This is not a small or negligible deviation. Rather, it shows that p-curve results can be extremely misleading. In an application to a published meta-analysis, the bias was less extreme but still substantial: 37% vs. 52%, a difference of 15 percentage points. As the amount of bias is unknown unless p-curve results are compared to z-curve results, researchers can simply use z-curve to obtain an estimate of mean power after selection for significance, that is, the expected replication rate.

Z-curve not only provides a better estimate of the expected replication rate. It also provides an estimate of the expected discovery rate, that is, the percentage of results that would be significant if all studies were available (i.e., after researchers empty their file drawers). This estimate can be compared to the observed discovery rate to examine whether selection bias is present and how large it is. In contrast, p-curve provides no information about the presence of selection bias or the use of questionable research practices.

In sum, z-curve does everything that p-curve does, does it better, and provides additional information. As z-curve is better than p-curve on all features, the rational choice is to use z-curve in future meta-analyses and to reexamine published p-curve analyses with z-curve. To do so, researchers can use the free R-package zcurve (Bartos & Schimmack, 2020).
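
A hedged sketch of what such an analysis might look like (function and argument names as I understand the zcurve package documentation; check ?zcurve before relying on this):

```r
# install.packages("zcurve")
library(zcurve)

# Hypothetical significant two-sided p-values, converted to z-scores.
p_sig <- c(.001, .004, .012, .023, .034, .041, .049)
fit   <- zcurve(z = qnorm(1 - p_sig / 2))
summary(fit)  # should report EDR and ERR estimates with confidence intervals
plot(fit)
```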

References

Bartoš, F., & Schimmack, U. (2020). zcurve: An R package for fitting z-curves. R package version 1.0.0.

Bartoš, F., & Schimmack, U. (2021). Z-curve.2.0: Estimating the replication and discovery rates. Meta-Psychology, in press.

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100, 407–425. http://dx.doi.org/10.1037/a0021524

Bem, D. J., & Honorton, C. (1994). Does psi exist? Replicable evidence for an anomalous process of information transfer. Psychological Bulletin, 115(1), 4–18. https://doi.org/10.1037/0033-2909.115.1.4

Brunner, J. & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, https://doi.org/10.15626/MP.2018.874

Francis, G. (2012). Too good to be true: Publication bias in two prominent studies from experimental psychology. Psychonomic Bulletin & Review, 19, 151–156. http://dx.doi.org/10.3758/s13423-012-0227-9

Francis G., (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin and Review, 21, 1180–1187. https://doi.org/10.3758/s13423-014-0601-x

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2012). Correcting the past: Failures to replicate. Journal of Personality and Social Psychology, 103, 933–948. http://dx.doi.org/10.1037/a0029709

Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology, 113(1), 34–58. https://doi.org/10.1037/pspa0000084

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716–aac4716. https://doi.org/10.1126/science.aac4716

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638–641. https://doi.org/10.1037/0033-2909.86.3.638

Rusz, D., Le Pelley, M. E., Kompier, M. A. J., Mait, L., & Bijleveld, E. (2020). Reward-driven distraction: A meta-analysis. Psychological Bulletin, 146(10), 872–899. https://doi.org/10.1037/bul0000296

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566. https://doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. 61 (4), 364-376. https://doi.org/10.1037/cap0000246

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2018). False-positive citations. Perspectives on Psychological Science, 13(2), 255–259. https://doi.org/10.1177/1745691617698146

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242

Sterling, T. D. (1959). Publication decision and the possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association, 54, 30–34. https://doi.org/10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112. https://doi.org/10.2307/2684823