All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Dear authors, don’t waste our and your own time.

March 5, 2026 | Uncategorized | Ulrich Schimmack

Publish or perish. I heard this in the 1990s, but it is even more true today. Submitting a manuscript for publication has gotten easier, too. It cost me real money to mail three copies of a manuscript from Germany to the United States (Schimmack, 1996). Now, you just need to check all the boxes on a submission portal. Not an easy task, but virtually cost free.

This system is like a lottery, where tickets are cheap and winnings can be rewarding. No wonder authors are playing the lottery and submitting manuscripts in large numbers, even if the chances of rejection are high. Maybe journals should charge for submissions rather than for publications.

Anyhow, I just reviewed a manuscript in 30 minutes. It was conceptually flawed. More importantly, my own AI – trained on this area of research – also spotted the conceptual problem, and several others that I didn’t even bother to read as it would take too long for a human reader to do so (life is short at age 60). It also wrote a nice and detailed review, much better than most human reviews. Of course, it had the advantage of being trained on this research area, but I also submitted the manuscript to a generic AI with no special knowledge. It also spotted the fatal conceptual mistake. This brings me to the main point of this rant.

Dear authors, do yourself and others a favor. Use AI to review your paper before you submit it. Even better, ask it to evaluate the paper from the perspective of the legendary Reviewer 2 and address critical issues before you submit it to a journal. You save yourself time and effort, but more importantly, you are a good citizen and do not clog the peer-review system with flawed manuscripts in the hope that they pass peer-review despite major problems.

Thank you for your attention.

A Z-Curve Analysis of Emotion Journals: Soto & Schimmack 2024

March 1, 2026 | Credibility, Z-Curve, Zcurve | Ulrich Schimmack

For the full article see:

Full citation: Soto, M. D., & Schimmack, U. (2024). Credibility of results in emotion science: A Z-curve analysis of results in the journals Cognition & Emotion and Emotion. Cognition and Emotion. https://doi.org/10.1080/02699931.2024.2443016

OSF repository: https://osf.io/42vxd/

Purpose of this document: This is a detailed analytical summary written entirely in the summarizer’s own words. It is intended to make the paper’s methods, results, and arguments accessible for discussion and analysis without reproducing copyrighted text. Readers should consult the original article for exact language and figures.

Structured Summary

1. Motivation and Research Question

The paper addresses whether the replication crisis — documented most prominently by the Open Science Collaboration (2015), which found only 36% of psychology results replicated — extends to the emotion research literature specifically. The authors note that the OSC findings were limited to articles from 2008 and may not generalize to emotion research, which has its own dedicated journals and traditions.

The two journals examined are Cognition & Emotion (established 1987) and Emotion (established 2001 by APA). The authors aimed to assess: (a) how much selection bias exists in these journals, (b) what proportion of published results might be false positives, (c) what the expected replication rate is, and (d) whether these indicators have improved over time in response to the replication crisis.

2. Z-Curve Method: How It Works

The paper uses Z-curve 2.0 (Bartoš & Schimmack, 2022), which takes a set of test statistics, converts them to absolute z-scores, and fits a finite mixture model to the distribution of statistically significant z-values (those exceeding 1.96). The method produces four key estimates:

Expected Discovery Rate (EDR): An estimate of the average true power of studies before selection for significance. This represents what proportion of all conducted tests (including unpublished ones) would be expected to reach significance. It is conceptually the mean power across the full population of tests.

Expected Replication Rate (ERR): An estimate of mean power after selection for significance, that is, among published significant results. Because significance selection favors higher-powered studies, ERR is at least as high as EDR, and strictly higher whenever power varies across studies. The authors frame ERR as an optimistic upper bound on expected replication success.

Observed Discovery Rate (ODR): Simply the proportion of extracted test statistics that were statistically significant at p < .05. Comparing ODR to EDR quantifies selection bias: a large gap indicates that many non-significant results went unreported.

False Discovery Risk (FDR): Computed from the EDR using Soric’s (1989) formula, which gives the maximum proportion of significant results that could be false positives given a particular discovery rate.

The authors explicitly note that ERR overestimates actual replication success (comparing z-curve’s ERR for the OSC dataset to the actual 36% rate), and they recommend interpreting the true replication rate as falling somewhere between EDR and ERR, citing Sotola (2023) for empirical support.
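Soric's bound is simple enough to check by hand. The sketch below is my own illustration, not code from the paper; the function name soric_fdr is hypothetical. It applies the formula FDR = (1/EDR - 1) × α/(1 - α) to the EDR point estimates reported in this summary and returns 12%, 12%, and 14%, matching the FDR values in the results tables.

```r
# Soric's (1989) upper bound on the false discovery risk,
# given a discovery rate and a significance level alpha.
# The function name is hypothetical, chosen for this illustration.
soric_fdr <- function(edr, alpha = 0.05) {
  stopifnot(all(edr > 0 & edr <= 1), alpha > 0, alpha < 1)
  (1 / edr - 1) * (alpha / (1 - alpha))
}

# EDR point estimates from the summary (C&E, Emotion, hand-coded focal tests)
edr <- c(CE = 0.30, Emotion = 0.31, Focal = 0.27)
round(100 * soric_fdr(edr))
```

Because the bound decreases as the EDR increases, the lower limit of the EDR confidence interval maps onto the upper limit of the FDR interval. For example, soric_fdr(0.14) is about 32%, the upper FDR bound reported for Cognition & Emotion.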

3. Methods

3.1 Test Statistic Extraction

The authors collected the complete set of published articles from both journals (3,831 from C&E covering 1987–2023; 2,323 from Emotion covering 2001–2023). Using custom R code built on the pdftools package (Ooms, 2024), they automatically extracted reported test statistics: F-tests, t-tests, chi-square tests (with df between 1 and 6 only, to exclude SEM model-fit tests), z-tests, and 95% confidence intervals of odds ratios and regression coefficients.

Chi-square tests with df > 6 were excluded because these typically come from structural equation modeling, where rejecting the null indicates poor model fit rather than a substantive finding. Confidence intervals were excluded when reported alongside test statistics to avoid double-counting. Meta-analysis articles were excluded entirely.

The extraction code was designed to handle various notation formats across journals and was iteratively refined. However, the authors acknowledge that the automated process cannot extract statistics from tables or figures, and cannot distinguish between focal and non-focal hypothesis tests.

After exclusions (including test statistics with N < 30, since t-to-z conversion is unreliable at very low df), the final samples were 30,513 z-scores from 1,902 C&E articles and 35,457 z-scores from 1,953 Emotion articles. The majority were F-tests (62% C&E, 53% Emotion) and t-tests (26% C&E, 28% Emotion).
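For readers who want to see what the conversion to z-scores involves, here is a minimal sketch. It assumes the standard route through two-sided p-values; the helper names (p_to_z, t_to_z, f_to_z) are hypothetical and this is not the authors' extraction code.

```r
# Convert test statistics to absolute z-scores through their two-sided p-values.
# Working on the log scale avoids p-values underflowing to zero for large statistics.
p_to_z <- function(log_p_two_sided) {
  # two-sided p = 2 * upper-tail area, so the tail area is p / 2
  -qnorm(log_p_two_sided - log(2), log.p = TRUE)
}

t_to_z <- function(t, df) {
  log_p <- log(2) + pt(abs(t), df, lower.tail = FALSE, log.p = TRUE)
  p_to_z(log_p)
}

f_to_z <- function(f, df1, df2) {
  log_p <- pf(f, df1, df2, lower.tail = FALSE, log.p = TRUE)
  p_to_z(log_p)
}

# A t-test and the equivalent F-test (F = t^2, numerator df = 1) give the same z
t_to_z(2.5, df = 40)
f_to_z(2.5^2, df1 = 1, df2 = 40)
```

With small degrees of freedom the z-score is noticeably smaller than the t-value, because the t-distribution has heavier tails than the normal distribution.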

3.2 Statistical Analysis — The Clustering Approach

This is a critical methodological detail. The authors used the zcurve_clustered function with the “b” method. This method works by sampling a single test statistic from each article during model fitting, thereby addressing within-article dependence. This directly addresses concerns about independence violations that arise when multiple test statistics are extracted from the same paper.
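The idea of drawing one test per article can be sketched in a few lines of base R. This is an illustration of the sampling step only, with made-up data and a hypothetical helper name; it is not the zcurve package's implementation.

```r
# Hypothetical extracted data: one row per test statistic, with an article id.
set.seed(1)
tests <- data.frame(
  article = c("a1", "a1", "a1", "a2", "a3", "a3"),
  z       = c(2.1, 3.4, 0.8, 2.6, 4.2, 1.5)
)

# Draw one z-value per article. Indexing with sample.int() avoids the
# sample() pitfall where a single number x is treated as the range 1:x.
draw_one_per_article <- function(data) {
  by_article <- split(data$z, data$article)
  vapply(by_article, function(z) z[sample.int(length(z), 1)], numeric(1))
}

# Each replicate is a fresh draw with exactly one test per article.
replicates <- replicate(500, draw_one_per_article(tests))
dim(replicates)  # articles x replicates
```

Within any single replicate, no article contributes more than one value, so the values entering a fit are independent across articles even when tests within an article are correlated.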

The EM algorithm was applied to significant z-values between 1.96 and 6 (values above 6 are treated as having essentially 100% power). The fitted mixture model uses seven discrete components (z = 0 through 6), and the estimated weights are used to compute EDR and ERR. The model then extrapolates the full distribution to estimate what the non-significant portion would look like without selection.
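To make the step from weights to estimates concrete, here is a simplified sketch. The weights are invented for illustration, the EM fitting itself is not shown, and the sketch ignores the sign of the effect and the separate handling of z-values above 6, so it should not be read as the package's exact computation.

```r
crit <- qnorm(1 - 0.05 / 2)   # two-sided critical value for alpha = .05
ncp  <- 0:6                   # component means

# Power of each component: probability that |Z| exceeds crit when Z ~ N(ncp, 1)
power <- pnorm(-crit, mean = ncp) + pnorm(crit, mean = ncp, lower.tail = FALSE)

# Hypothetical weights of the components among significant results (sum to 1)
w_sig <- c(0.10, 0.25, 0.30, 0.20, 0.10, 0.04, 0.01)

# ERR: mean power among the significant results
err <- sum(w_sig * power)

# EDR: undo selection. A component with power p is over-represented among
# significant results by a factor proportional to p, so the weights before
# selection are proportional to w_sig / power.
w_all <- (w_sig / power) / sum(w_sig / power)
edr   <- sum(w_all * power)

round(c(EDR = edr, ERR = err), 2)
```

In this simplified form the EDR is the harmonic mean of power under the post-selection weights and the ERR is the arithmetic mean, which is why the ERR cannot fall below the EDR.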

3.3 Time Trend Analysis

Annual z-curve estimates were computed for each publication year and regressed on linear and quadratic predictors of year. The quadratic term tested whether improvements accelerated after 2011 (when the replication crisis became prominent).
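The form of this regression can be sketched as follows. The annual estimates are simulated, not taken from the paper, and centering the year variable is my assumption to keep the linear and quadratic predictors from being highly correlated; the paper may have coded year differently.

```r
# Simulated annual estimates (NOT the paper's data), purely to show the model form.
set.seed(2)
year <- 2001:2023
edr  <- 0.25 + 0.004 * (year - 2001) + rnorm(length(year), sd = 0.03)

# Center year so the linear and quadratic predictors are less correlated.
year_c <- year - mean(year)
fit <- lm(edr ~ year_c + I(year_c^2))

# The linear term captures the overall trend, the quadratic term its acceleration.
coef(summary(fit))
```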

3.4 Hand-Coded Focal Tests

To address the limitation that automatic extraction conflates focal and non-focal tests, the authors also present results from 241 hand-coded articles from 2010 and 2020, drawn from an ongoing project covering 30+ journals and 4,000+ studies (Schimmack, 2020). This sample contained 227 significant tests out of 241 total.

4. Results

4.1 Main Z-Curve Estimates

The two journals produced remarkably similar results:

Parameter	Cognition & Emotion	Emotion
ODR	71% [70%, 71%]	70% [70%, 70%]
EDR	30% [14%, 53%]	31% [15%, 53%]
ERR	66% [59%, 73%]	65% [59%, 71%]
FDR	12% [5%, 32%]	12% [5%, 30%]

The ODR-EDR gap (approximately 40 percentage points) provides clear evidence of selection bias in both journals, confirmed visually by a sharp drop in observed z-scores just below the significance threshold of 1.96.
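The gap is easy to recompute from the counts reported in the summary table (section 8). The sketch below uses only numbers given in this summary; the variable names are mine.

```r
# Counts of extracted and significant tests, and EDR point estimates, from the summary
n_tests <- c(CE = 30513, Emotion = 35457)
n_sig   <- c(CE = 21628, Emotion = 24824)
edr     <- c(CE = 0.30,  Emotion = 0.31)

odr <- n_sig / n_tests   # observed discovery rate
gap <- odr - edr         # excess of observed over expected discoveries

round(100 * cbind(ODR = odr, EDR = edr, gap = gap))
```

The rounded gaps are 41 and 39 percentage points, which is where the "approximately 40" comes from.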

The ERR of approximately 65% suggests that the majority of published significant results should replicate with the same sample size, though the authors stress this is an optimistic estimate. The FDR point estimate of 12% is comparable to medical clinical trial journals (14% per Schimmack & Bartoš, 2023) and substantially lower than the most pessimistic predictions (Ioannidis, 2005). However, the upper bound of the FDR confidence interval (~30%) is high enough to warrant concern.

4.2 Time Trends

Sample sizes (degrees of freedom): Both journals showed significant linear increases over time, with some acceleration (significant quadratic trends). Median within-group df increased from roughly 50 in the early years to over 100 in recent years for Emotion, and showed a particularly sharp increase in C&E’s most recent years.

ODR: Both journals showed significant linear decreases in ODR over time (approximately 0.45 percentage points per year), suggesting that non-significant results are being reported more frequently. However, the quadratic terms were non-significant, meaning this trend preceded the replication crisis rather than being a response to it.

EDR: Both journals showed significant increases in EDR over time, consistent with increasing sample sizes leading to higher power. The combination of decreasing ODR and increasing EDR indicates that selection bias has diminished, though it remains present.

ERR: Increased over time for both journals, with C&E showing a significant acceleration (quadratic trend) suggesting the replication crisis may have prompted improvements.

FDR: Decreased over time as a direct consequence of the increasing EDR.

4.3 Hand-Coded Focal Test Results

The 241 hand-coded focal tests from 2010 and 2020 yielded:

Parameter	Estimate	95% CI
ODR	94%	[91%, 97%]
EDR	27%	[10%, 67%]
ERR	65%	[53%, 75%]
FDR	14%	[3%, 50%]

The ODR for focal tests (94%) is substantially higher than the 70–71% from automatic extraction, confirming that automatic extraction captures many non-focal, non-significant tests that dilute the ODR. However, the EDR, ERR, and FDR estimates are comparable to the automatically extracted results and fall within their confidence intervals. This is an important robustness check: the key z-curve parameters are not substantially altered by the inclusion of non-focal tests.

4.4 Alpha Adjustment Analysis

The authors examined the effect of lowering the significance threshold on discovery rates and false positive risk. Lowering alpha from .05 to .01 retains approximately half of all significant results while reducing FDR to below 5% for most publication years. Further reductions to .005 or .001 have diminishing returns for FDR reduction but increasingly sacrifice power.
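To see what the stricter thresholds mean on the z-scale, the sketch below computes the two-sided critical values and then counts how many results significant at .05 survive each one. The z-scores are simulated, not the paper's data, so the retained shares are illustrative only.

```r
alphas <- c(0.05, 0.01, 0.005, 0.001)
crit   <- qnorm(1 - alphas / 2)   # two-sided critical z-values
round(crit, 2)                    # 1.96 2.58 2.81 3.29

# Simulated absolute z-scores (NOT the paper's data)
set.seed(3)
z   <- abs(rnorm(5000, mean = 2, sd = 1.5))
sig <- z[z > crit[1]]

# Share of results significant at .05 that remain significant at each stricter alpha
retained <- sapply(crit, function(k) mean(sig > k))
names(retained) <- alphas
round(retained, 2)
```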

5. Discussion and Interpretation

The authors frame their results as relatively encouraging for emotion research compared to worst-case scenarios. Key interpretive points:

The FDR of approximately 12% (though with wide CIs) suggests that most published significant results in emotion journals are not false positives. However, the upper bound of the CI leaves open the possibility of rates up to 30%.

The ERR of 65% predicts that most significant results should replicate with the same sample size, but this is optimistic. Adjusting for the estimated FDR, power for true effects may be approximately 72%, close to the conventional 80% benchmark but with substantial heterogeneity — half of studies have less power than this average.

The authors recommend treating results with p-values between .05 and .01 with skepticism, and suggest that alpha = .01 provides a better balance between false positive risk and power loss for the emotion literature specifically. They emphasize this recommendation is for evaluating existing literature, not as a new publication standard.

On effect sizes, the authors warn that selection bias inflates point estimates, making even meta-analytic effect sizes unreliable unless bias correction is applied. They advocate for honest reporting of all results, including non-significant ones, as essential for accurate meta-analysis.

6. Limitations Acknowledged by the Authors

The authors explicitly discuss several limitations:

Z-curve’s selection model assumes that publication probability is a function of power. In reality, questionable research practices (QRPs) can produce significance without real effects, potentially inflating EDR estimates and underestimating selection bias.
Simulation studies of z-curve performance under QRP-generated data are lacking.
The exclusion of test statistics with N < 30 removes some studies, though supplementary analyses with the full sample show similar results.
Automated extraction cannot distinguish focal from non-focal tests (addressed by the hand-coded analysis).
The automated extraction cannot reliably capture statistics from tables or figures.

7. Key Methodological Features Relevant to the Pek et al. Debate

Several aspects of this paper are directly relevant to criticisms raised by Pek et al.:

Independence assumption: Soto & Schimmack explicitly used zcurve_clustered with the “b” method, which samples one test statistic per article during bootstrapping. This directly addresses the concern about within-article dependence. The method section states this clearly.

Focal vs. non-focal tests: The paper includes both automatic extraction (all tests) and hand-coded focal tests, and shows that the z-curve parameters (EDR, ERR, FDR) are comparable across both approaches. This addresses the concern that including non-focal tests distorts results.

Appropriate caveats: The authors consistently describe ERR as optimistic, characterize the true replication rate as lying between EDR and ERR, acknowledge the wide confidence intervals on EDR and FDR, and explicitly discuss the limitations of the selection model assumption.

Asymmetric interpretation: The paper notes that z-curve evaluations of credibility are asymmetric — low values raise concerns about a literature, but high values do not guarantee credibility.

8. Summary Table of All Z-Curve Estimates

Analysis	N tests	N sig	ODR	EDR [95% CI]	ERR [95% CI]	FDR [95% CI]
C&E (auto)	30,513	21,628	71%	30% [14%, 53%]	66% [59%, 73%]	12% [5%, 32%]
Emotion (auto)	35,457	24,824	70%	31% [15%, 53%]	65% [59%, 71%]	12% [5%, 30%]
Focal (hand-coded)	241	227	94%	27% [10%, 67%]	65% [53%, 75%]	14% [3%, 50%]

Summary prepared for analytical discussion purposes. All descriptions reflect the summarizer’s interpretation of the original work. For exact language, figures, and supplementary analyses, consult the published article.

Anti Psychological Science (APS)

February 18, 2026 | Uncategorized | Ulrich Schimmack

No polite ChatGPT edits. Unfiltered raw Schimmack. Love it or hate it.

It was supposed to be the American Psychological Society (APS), but international researchers complained – especially those who want to publish in prestigious American journals – and APS became the Association for Psychological Science.

Psychological Science is now a brand name and many departments have been renamed to be Departments of Psychological Science. However, you do not become a science, just because you call yourself one, you actually have to behave like a science. And that seems to be something that many psychologists do not want to do because it would mean giving data to decide about the truth. Just like William James, many psychologists like their theories more than truth. So, they continue to conduct silly statistical rituals (Gigerenzer) that are biased to show either evidence for their beliefs (p < .05) or no evidence against them (p > .05) and justify another biased test.

Every generation there have been a few psychologists who were frustrated by the futility of this and made suggestions to improve things (Meehl, Cohen, Gigerenzer) or just also fake the data (Stapel). You have to give it to Stapel. Why collect data if their only purpose is to add p < .05 to any claim one wants to make?

Since the early 2010s, thanks to Bargh and Bem, more people are calling for change, but progress is slow and stalling. Meanwhile, most published articles continue to report claims with p-values below .05.

A cynical approach to this sad state of affairs would be to say “fuck it”, “burn it all down,” and enjoy life. However, some people just can’t let go. We (Brunner, Bartoš, Schimmack) developed a statistical method that helps readers to distinguish between good and bad significant results. Good ones come from studies with high statistical power that are likely to replicate. Bad ones are studies with low power or even false positive results that will not replicate. Of course, there is no hard line, but we can identify subsets of good studies, if they exist.

You would think an aspirational science would welcome a tool that can salvage good results from decades of research with mostly significant results. Which ones are trustworthy? Which ones are like pornception (Bem, 2011)?

But being a science would mean that we have to expose the fact that some results were made up – not like Stapel on his laptop – but by collecting and analyzing data, year after year, painstaking work to get significant results – and many unpublished failures. No, we cannot have this. Therefore, we have to fight the method that can distinguish good and bad research.

To fight this method, we need to get a peer-reviewed article that claims “the method does not work.” To do so, the article does not have to be evaluated by statisticians or present good arguments. All we need is a quotable peer-reviewed article, because peer-reviewed equals truth, which is also why extrasensory perception is true (Bem, 2011, JPSP).

Now reviewers can quote the criticism – and not cite evidence that contradicts these claims – and editors can use the peer-review to reject the article. The key feature of science is to fight motivational biases. If a system just amplifies misinformation and glorifies misinformation that passed peer-review, it is not a science. Maybe APS really means Anti-Psychological Science.

The question is how long this game of self- and other-deception can continue? At what point will public interest in psychology wane because it never produces any useful results that advance society, health, and wellbeing? Science is worth defending against the attacks by Trumpians, but I am not sure psychological science is part of this.

“Valid Replications Require Valid Methods—And Originals Don’t?”

February 13, 2026 | Replication, Replication Crisis, Replication Failures, Social Psychology | Amodio, Contextual Sensitivity, Defending P-Hacking, Harmon-Jones, Motivated Biases, Repression, Schmeichel, Self-Serving Attribution | Ulrich Schimmack

Harmon-Jones, E., Harmon-Jones, C., Amodio, D. M., Gable, P. A., & Schmeichel, B. J. (2025). Valid replications require valid methods: Recommendations for best methodological practices with lab experiments. Motivation Science, 11(3), 235–245

“Far from over.” (Frank Wang, tennis buddy when he is down 2:5)

The replication crisis shook social psychology in the 2010s. Heated debates—often on social media—divided critics, reformers, and defenders of the published record. The heat has cooled, but the crisis is far from over. The central empirical problems remain: unusually high rates of statistically significant results in journals, implausible success rates given typical power, and repeated failures to reproduce headline findings under rigorous conditions.

A striking pattern in parts of the methodological commentary that followed is explanatory asymmetry. Replication failures are readily attributed to contextual factors, subtle procedural differences, or “messy methods,” while the same standards are not applied with equal force to original studies. If minor contextual differences can wipe out an effect, then original results should also be unstable—yet the published record historically looks unnaturally successful. Any account that explains failure must also explain success.

There is also an ironic subtext: some of the strongest defenses of fragile effects come from researchers who study motivation and bias, yet methodological narratives can display their own motivated reasoning—favoring interpretations that protect prior conclusions. None of this requires imputing bad faith. It is enough to recognize that professional stakes and identity can shape what kinds of explanations feel plausible.

To avoid my own biases, I asked ChatGPT to evaluate bias in this article. More importantly, ChatGPT also provided an explanation for the rating.

Bias Evaluation

Harmon-Jones et al. (2025) argue that many replication failures in motivation and emotion research arise not from invalid theories or false positives, but from “messy methods.” They provide extensive practical recommendations regarding laboratory setup, experimenter behavior, manipulation strength, measurement sensitivity, replication design, data management, and statistical interpretation. The article is methodologically rich and offers useful guidance for improving internal validity in lab experiments.

However, when situated within the broader replication debate, the paper exhibits a consistent asymmetry in explanatory framing. On a scale from –10 (strongly defensive of existing literature) to +10 (strongly skeptical that most results are true), this article falls around –4 to –5: moderately biased in defense of established findings.

The basis for this rating is outlined below.

Core Contribution: Internal Validity Matters

The article’s strongest contribution is its detailed emphasis on internal validity. The authors correctly note that laboratory experiments are sensitive systems in which:

Subtle environmental cues may influence participant motivation.
Experimenter demeanor and appearance can affect outcomes.
Manipulations must be strong and construct-valid.
Dependent variables must be sensitive and properly timed.
Multilab projects introduce coordination risk.
Data handling errors can contaminate results.

These are real methodological concerns. The paper provides concrete, experience-based guidance that would likely improve experimental rigor if widely adopted. It is especially valuable as a practical resource for researchers conducting lab-based motivation studies.

Asymmetry in Causal Attribution

The principal concern is not methodological advice but explanatory direction.

Replication failures are repeatedly attributed to:

Context sensitivity
Weak or improperly implemented manipulations
Insensitive measures
Experimenter variability
Procedural deviations in multilab collaborations
Data management errors

These are legitimate explanations in some cases. However, the article does not apply equivalent scrutiny to original studies.

There is little engagement with:

Publication bias
Inflated effect sizes
Researcher degrees of freedom
Selective reporting
Power deficiencies in original work
Theory elasticity

The explanatory burden for null replications is placed largely on replication implementation rather than on possible inflation or fragility in the original literature.

This directional asymmetry is what produces the defensive tilt.

Context Sensitivity as a Buffer

The authors cite contextual sensitivity as a key explanation for replication variability. Conceptually, psychological effects can depend on time, culture, and population. However, the article treats contextual sensitivity as supporting evidence for interpreting replication failures, without addressing debate over the empirical robustness of this claim.

More importantly, the paper does not quantify how strong contextual sensitivity would need to be to account for large-scale null findings in well-powered, preregistered, multilab studies. If minor environmental differences are sufficient to eliminate effects, then those effects are fragile by definition. That implication is not confronted directly.

Treatment of Ego Depletion

The article references the large preregistered multisite ego-depletion test (Vohs et al., 2021), which included proponents of the theory. Rather than interpreting the null results as evidence about true effect size, the authors emphasize coordination errors and deviations across labs.

While procedural complications can occur, the study was high-powered and preregistered. A pattern of near-zero effects across many sites cannot be explained solely by minor procedural noise without implying extreme fragility.

The possibility that the true effect is very small or nonexistent is not seriously engaged. This reinforces the asymmetry in explanatory weighting.

The “Psychological Sledgehammer” Standard

The recommendation that manipulations should function like a “psychological sledgehammer” raises an additional issue. If only very strong manipulations count as valid tests, then many real-world operationalizations will be deemed insufficient. This narrows the acceptable domain of theory testing and increases the probability that null findings are attributed to weakness of implementation rather than limitations of theory.

That standard shifts the evidentiary burden in a way that implicitly protects established effects.

The Excess Success Gap

A major omission concerns the historically high rate of statistically significant findings in psychology journals—often described as exceeding 90%.

If effects are highly context-sensitive and fragile, then original studies should also frequently fail. Minor variations in lab setup, experimenter behavior, and measurement sensitivity would generate many null outcomes. Yet the published literature overwhelmingly reports positive results.

These two claims cannot comfortably coexist without additional explanation.

There are only a few ways to reconcile fragile effects with excess success:

Many effects are actually robust and high-powered.
Publication bias and selective reporting filter out null results.
Researchers iteratively tune operationalizations and analyses until significance is obtained.
Journals selectively publish successful implementations.

The article does not engage this macro-level constraint. It does not integrate publication bias or excess-success analysis into its explanatory framework. As a result, replication failures are treated as methodologically suspect, while the structural inflation of original literatures remains largely unaddressed.

This omission materially strengthens the case for a defensive bias rating.

What Prevents a More Extreme Rating

Despite these concerns, the article does not:

Deny p-hacking or questionable practices.
Reject replication as essential.
Claim that all replication failures are invalid.
Dismiss preregistration.
Attack statistical reform movements.

It offers constructive methodological advice and acknowledges complexity in statistical inference. The tilt is moderate, not extreme.

Overall Assessment

Harmon-Jones et al. provide valuable, concrete guidance on improving internal validity in laboratory research. Their emphasis on methodological nuance is important and often neglected in replication debates.

However, the paper consistently places greater explanatory weight on replication imperfections than on possible inflation or fragility in original findings. It does not reconcile its fragility narrative with the excess success of published psychology, nor does it engage deeply with quantitative evidence regarding effect size shrinkage and false discovery rates.

For these reasons, the article can be fairly characterized as moderately biased in defense of existing literature — approximately –4 to –5 on a –10 to +10 scale.

P.S. Why my bias rating would be more extreme

The most important cue for bias is that success rates over 90% in psychology journals have been documented repeatedly since Sterling (1959). An article that avoids talking about this implausible result, which undermines the meaning of statistical significance, is usually biased and downplays the amount of selection bias in psychology. Insiders know that only significant results can be published. This unscientific incentive structure is the root cause of the replication crisis, not contextual sensitivity. Failure to mention Sterling is a red flag.

The second red flag is the citation of van Bavel as a reference for contextual sensitivity. Van Bavel and colleagues claimed to have shown that contextual sensitivity explains the lower replication rate in social psychology. However, Inbar showed that they did not present the critical test of an interaction and that this test was not significant. There is no evidence that contextual sensitivity contributes to low success rates in social psychology. Rather, social psychologists never ran direct replications and used contextual sensitivity as a way to protect their theories from disconfirming evidence. Change something trivial and you get significance again: great, the effect is robust. If not, clearly the effect was real before but not in this context. Now publish only the significant results and claim that the theory is universally true across time, place, and populations. This is how it was done, and it was wrong. Sadly, some social psychologists cannot just say, sorry, we messed up, now let’s move on.

An Introduction to Z-Curve 3.0 Options

February 5, 2026 | Uncategorized | Ulrich Schimmack

All options are set as global variables when the functions are loaded with source(zcurve3). Afterwards, they can be changed like any other R object.

1. Curve Type: Default Z-Values, Option Fit t-Distributions with a Fixed df

CURVE.TYPE <- "z" # Set to "t" for t-distribution
df = c() # set to the df of the t-distribution

2. Speed Control Parameters

parallel <- FALSE # Placeholder – parallel functionality not yet implemented
max_iter <- 1e6 # Max iterations for model estimation
max_iter_boot <- 1e5 # Max iterations for bootstrapped estimates

EM.criterion <- 1e-3 # Convergence threshold for EM algorithm
EM.max.iter <- 1000 # Max iterations for EM

Plot.Fitting <- FALSE # Plot fitting curve (only for Est.Method = "OF" or "EXT")

PLOT SETTINGS

Title <- "" # Optional plot title

letter.size <- 1 # Text size in plots
letter.size.1 <- letter.size # Used for version labels in plot
y.line.factor <- 3 # Controls spacing of plot text

x.lim.min <- 0 # X-axis lower bound
x.lim.max <- 6 # X-axis upper bound
ymax <- 0.6 # Y-axis upper bound
ymin <- 0 # OUTDATED Y-axis lower bound (for label space)

Show.Histogram <- TRUE # Toggle histogram in plot
Show.Text <- TRUE # Toggle model results in plot
Show.Curve.All <- TRUE # Show predicted z-curve
Show.Curve.Sig <- FALSE # Option: show z-curve only for significant values
Show.Significance <- TRUE # Show z = critical value line
Show.KD <- FALSE # Toggle kernel density overlay (density method only)

sig.levels <- c() # Optional: mark additional p-value thresholds on plot

int.loc <- 0.5 # Plot local power intervals below x-axis (set 0 to disable)
hist.bar.width <- 0.2 # Width of histogram bars
bw.draw <- 0.10 # Smoothing for kernel density display

CONSOLE OUTPUT

Show.Iterations <- TRUE # Show iterations for slow procedures (e.g., EXT, TEST4HETEROGENEITY)

MODEL PARAMETERS

alpha <- 0.05 # Significance level
crit <- qnorm(1 - alpha / 2) # Corresponding two-sided critical z

two.sided <- TRUE # Assume two-sided z-values (use abs(z)); not yet compatible with signed z-values

Color scheme

col.curve <- "violetred3"
col.hist <- "blue3"
col.kd <- "green3"

Est.Method <- "OF" # Estimation method: "OF", "EM", or "EXT"
# Clustered Data: "CLU-W" (weighted), "CLU-B" (bootstrap)
Int.Beg <- 1.96 # Default: critical value for alpha = .05
Int.End <- 6 # End of modeling interval (z > 6 = power = 1)

ncp <- 0:6 # Component locations (z-values at which densities are centered)
components <- length(ncp) # Number of components
zsd <- 1 # SD of standard normal z-distribution
zsds = rep(zsd,components) # one SD for each component

just <- 0.8 # Cutoff for “just significant” z-values (used in optional bias test)

ZSDS.FIXED <- FALSE # Fix SD values for EXT method
NCP.FIXED <- FALSE # Fix non-centrality parameter (NCP) means for EXT method
W.FIXED <- FALSE # Fix weights for EXT method

fixed.false.positives <- 0 # If > 0, constrains proportion of false positives (e.g., weight for z = 0 component)

DENSITY-BASED SETTINGS (Only used with Est.Method = “OF”)

n.bars <- 512 # Number of bars in histogram

Augment <- TRUE # Apply correction for bias at lower bound
Augment.Regression <- FALSE # Use Slope for Augmentation
Augment.Factor <- 1 # Amount of augmentation

bw.est <- 0.05 # Bandwidth for kernel density (lower = less smoothing, higher = more smoothing)
bw.aug <- .20 # Width of Augmentation interval

INPUT RESTRICTIONS

MAX.INP.Z <- Inf # Optionally restrict very large z-values (set Inf to disable)

CONFIDENCE INTERVALS / BOOTSTRAPS

boot.iter <- 0 # Number of bootstrap iterations (suggest 500+ for final models)
ERR.CI.adjust <- 0.03 # Conservative widening of confidence intervals for ERR
EDR.CI.adjust <- 0.05 # Conservative widening for EDR

CI.ALPHA <- 0.05 # CI level (default = 95%)

CI levels for Heterogeneity Test

fit.ci <- c(.01, .025, .05, .10, .17, .20, .50, .80, .83, .90, .95, .975, .99) # CI levels for model fit test

TEST4BIAS <- FALSE # Enable optional bias test
TEST4HETEROGENEITY <- 0 # Optional heterogeneity test (slow) — set number of bootstrap iterations
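Because the options are ordinary global objects, changing one is a plain reassignment after the functions have been sourced. The sketch below uses only option names from the listing above. It assumes, without confirmation from the listing, that derived values such as crit and Int.Beg are not recomputed automatically when alpha changes, so it updates them by hand.

```r
# Assumption: the z-curve 3.0 functions and defaults were already loaded with source().
alpha <- 0.01                  # stricter significance level than the default .05
crit <- qnorm(1 - alpha / 2)   # recompute the two-sided critical z (about 2.58)
Int.Beg <- crit                # start the modeling interval at the new critical value
Est.Method <- "EM"             # switch from the default "OF" to the EM estimator
boot.iter <- 500               # bootstrap iterations for confidence intervals
```

If the functions do recompute these values internally, the manual updates are harmless but redundant.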

Concerns About Z-Curve: Evidence From New Simulations With Few Studies

February 3, 2026 | Erik van Zwet, Pek | Coverage, Erik van Zwet, Pek, Validation, Z-Curve | Ulrich Schimmack

Scientific progress depends on criticism, especially when it is used to identify limitations of statistical methods and to improve them. z-curve is no exception. Over the past year, several critiques have raised questions about the robustness of z-curve estimates, particularly with respect to the expected discovery rate (EDR). These critiques deserve careful examination, but they also require accurate characterization of what z-curve assumes, what it estimates, and under which conditions its estimates are informative.

Two recent lines of criticism are worth distinguishing. First, Pek et al. (2025) show that z-curve estimates can be biased when the publication process deviates from the assumed selection model. The default z-curve model assumes that selection operates primarily on statistical significance at the conventional α = .05 threshold (z = 1.96). Pek et al. demonstrate that if researchers also suppress statistically significant results with small effect sizes—for example, not publishing a result with p = .04 because the standardized mean difference is only d = .40—then z-curve estimates can become optimistic. This result is correct: z-curve cannot diagnose selective reporting based on effect size rather than statistical significance.

There is limited direct evidence that routine selection on effect-size magnitude (beyond statistical significance) is widespread; the QRPs most commonly reported in self-surveys are largely significance-focused (John et al., 2012). In any case, imperfect correction is not a reason to ignore selection bias entirely, because uncorrected meta-analyses can markedly overestimate population effects and replicability (Carter et al., 2019).

Moreover, the selection mechanism examined by Pek et al. has a clear directional implication: when statistically significant results are additionally filtered by effect size, z-curve’s estimates of EDR and ERR can be biased upward. This matters for interpretation. If z-curve already yields low EDR or ERR estimates, then the type of misspecification studied by Pek et al. would, if present in practice, imply that the underlying parameters could be even lower. For example, an estimated EDR of 20% under the default selection model could correspond to a substantially lower true discovery rate if significant-but-small effects are systematically suppressed. Whether such effect-size–based suppression is common enough to materially affect typical applications remains an empirical question.

A second critique has been advanced by Erik van Zwet, a biostatistician who has applied models of z-value distributions developed in genomics to meta-analyses of medical trials. These models were designed for settings in which the full set of test statistics is observed and therefore do not incorporate selection bias. When applied to literatures where selection bias is present, such models can yield biased estimates. In contrast, z-curve is explicitly designed to assess the presence of selection bias and to correct for it when it is present. When no bias is present, z-curve can also be fitted to the full distribution of z-values, including non-significant results.

Van Zwet has published a few blog posts arguing that z-curve performs poorly when estimating the expected discovery rate (EDR). Importantly, his simulations do not show problems for the expected replication rate (ERR). Thus, z-curve’s ability to estimate the average true power of published significant results is not in question. The disputed issue concerns inference about the broader population of studies, including unpublished nonsignificant results.

Some aspects of this critique require clarification. Van Zwet has suggested that z-curve was evaluated only in a small number of simulations. This is incorrect. Prior work includes two large simulation studies (one conducted by František Bartoš and one conducted by me) that examined EDR confidence-interval coverage across a wide range of conditions. Based on these results, the nominal 95% confidence intervals were conservatively widened by ±5 percentage points to achieve near-nominal coverage across a wide range of realistic scenarios (see details below). Thus, EDR interval estimation was already empirically validated across many conditions with 100 or more significant results.

However, these simulations did not examine performance of z-curve with small sets of significant results. Because z-curve can technically be fit with as few as 10 significant results, it is reasonable to ask whether EDR confidence-interval coverage remains adequate when the number of significant studies is substantially smaller than 100. To address this question directly, I conducted a new simulation study focusing on the case of 50 significant results.

In addition, I introduced two diagnostics designed to assess when EDR estimation is likely to be weakly identified. Estimation of the EDR relies disproportionately on significant results from low-powered studies or false positives, because these observations provide information about the number of missing nonsignificant results. When nearly all significant results come from highly powered studies, the observed z-value distribution contains little information about what is missing. The first diagnostic therefore counts how many significant z-values fall in the interval from 1.96 to 2.96. Very small counts in this range signal that EDR estimates are driven by limited information. The second diagnostic examines the slope of the z-value density in this interval. A decreasing slope indicates information consistent with a mixture that includes low-powered studies, whereas an increasing slope reflects dominance of high-powered studies and weak identification of the EDR.
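The two diagnostics can be sketched in a few lines of base R. This is an illustration, not the code used for the simulations: the function name edr_diagnostics, the number of bins, and the use of a linear regression of bin counts on bin midpoints as the "slope" are all assumptions made for this example.

```r
# Hypothetical helper: count just-significant z-values and estimate the
# direction of the density slope between 1.96 and 2.96.
edr_diagnostics <- function(z, lower = 1.96, upper = 2.96, n_bins = 5) {
  z <- abs(z)
  just_sig <- z[z >= lower & z <= upper]

  # Bin the just-significant values into equal-width intervals.
  breaks <- seq(lower, upper, length.out = n_bins + 1)
  bins <- cut(just_sig, breaks = breaks, include.lowest = TRUE)
  counts <- as.numeric(table(bins))
  mids <- (breaks[-1] + breaks[-length(breaks)]) / 2

  # Slope of bin counts over bin midpoints (assumed operationalization).
  slope <- unname(coef(lm(counts ~ mids))[2])
  direction <- if (slope < 0) "decreasing" else if (slope > 0) "increasing" else "flat"

  list(n_just_sig = length(just_sig), slope = slope, direction = direction)
}

# Example: many values just above 1.96 and few near 2.96.
z_example <- c(rep(2.0, 10), rep(2.3, 6), rep(2.5, 3), 2.9, 4.2)
diag_example <- edr_diagnostics(z_example)
```

A small count or an increasing slope would flag a data set in which the EDR estimate rests on little information.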

Reproducible Results of Simulation Study with 50 Significant Results

The simulation used a fully crossed factorial design with four values for each of four parameters, yielding 192 conditions. Population-level standardized mean differences were set to 0, .2, .4, or .6. Heterogeneity was modeled using normally distributed effect sizes with standard deviations (τ) of 0, .2, .4, or .6. In addition, a separate population of true null studies was included, with the proportion of false discoveries among significant results set to 0, .2, .4, or .6. Sample sizes varied across conditions, starting at n = 50 (25 observations per group). For each condition, simulations were run with exactly 50 significant results.
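To make the design concrete, here is a minimal sketch of how one condition could be generated. It is not the simulation code linked below. The function names, the approximation se = sqrt(2/n) for the standardized mean difference, and the choice to draw studies one at a time until the target number of significant results is reached are assumptions for illustration.

```r
# Hypothetical helper: draw studies until k significant z-values are observed.
draw_significant <- function(k, d, tau, se, crit) {
  z_sig <- numeric(0)
  n_tests <- 0
  while (length(z_sig) < k) {
    delta <- rnorm(1, mean = d, sd = tau)          # true effect of this study
    z <- abs(rnorm(1, mean = delta / se, sd = 1))  # observed absolute z-value
    n_tests <- n_tests + 1
    if (z > crit) z_sig <- c(z_sig, z)
  }
  list(z = z_sig, n_tests = n_tests)
}

# One condition: a share `fdr` of the significant results comes from true nulls.
simulate_condition <- function(k_sig = 50, d = 0.4, tau = 0.2, fdr = 0.2,
                               n_per_group = 25, alpha = 0.05) {
  crit <- qnorm(1 - alpha / 2)
  se <- sqrt(2 / n_per_group)                      # equals 2 / sqrt(N) for N = 2n
  k_null <- round(k_sig * fdr)
  null_part <- draw_significant(k_null, 0, 0, se, crit)
  true_part <- draw_significant(k_sig - k_null, d, tau, se, crit)
  n_tests <- null_part$n_tests + true_part$n_tests
  list(z = c(null_part$z, true_part$z),
       discovery_rate = k_sig / n_tests)           # share of all tests that were significant
}

set.seed(123)
sim <- simulate_condition()
```

The returned discovery_rate is the realized share of significant results among all simulated tests, which is the quantity an EDR estimate based only on sim$z tries to recover.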

The simulation code is available here. The results are available here.

Across all scenarios, coverage is 96%. The percentage is higher than the nominal 95% because the conservative adjustment leads to higher coverage in less challenging scenarios.

The slope diagnostic works as expected. When the slope is decreasing, coverage is 97%, but when the slope is increasing it drops to 83%. Increasing slopes are more likely to lead to an overestimation than underestimation of the EDR (75%). Increasing slopes occurred in only 5% of all simulations because these scenarios assume that the majority of studies have over 50% power, which requires large samples and moderate to large effect sizes.

The number of z-values in the range between 1.96 and 2.96 also matters. At least 12 values in this range are needed to have 95% coverage. However, the slope criterion is more diagnostic than the number of z-values in this range.

A logistic regression with CI coverage (yes = 1, no = 0) as outcome and slope direction, d, SD, the standard error (se = 2/sqrt(N)), and FDR proportion as predictors showed strong effects of slope direction and FDR, and a slope direction × FDR interaction. Based on these results, I limited the analysis to scenarios with decreasing or flat slopes.

The effect of FDR remained significant (b = 3.55, SE = 1.47), as did the main effect of effect size (b = −2.33, SE = 1.01) and the effect size × SD interaction (b = 6.93, SE = 2.99), indicating systematic variation in coverage across conditions.

These effects are explained by how the design parameters shape the distribution of observed z-values in the critical range used to estimate the EDR (1.96–2.96). Higher FDR values imply a larger proportion of true null effects, which produces a steeper declining slope in the truncated z-distribution and increases information about the mass of missing non-significant results. In contrast, larger effect sizes generate a greater share of high-powered studies with z-values well above the truncation point, which reduces the relative influence of marginally significant results and makes the EDR less identifiable from the observed distribution.

The significant effect size × SD interaction reflects the moderating role of heterogeneity. When heterogeneity is present, even large average effect sizes produce a mixture of moderate- and high-power studies, increasing the density of z-values near the significance threshold and partially restoring information about missing results. As a consequence, the adverse effect of large average effect sizes on coverage is attenuated when heterogeneity is non-zero.

Overall, the most challenging scenarios for EDR estimation are characterized by low heterogeneity and shallow slopes in the just-significant range. In these settings, the observed z-distribution contains limited information about the unobserved, non-significant portion of the distribution, so EDR is weakly identified from the selected data alone.

Inspection of the 192 design cells indicates that the largest coverage shortfalls are concentrated in homogeneous conditions, especially when SD = 0 and FDR = 0. This limitation of the default discrete mixture approximation under near-homogeneity has been documented previously (Brunner & Schimmack, 2020). In practice, it can be addressed by fitting a homogeneity-appropriate specification, such as a single-component model with a free mean and normally distributed heterogeneity (with SD allowed to approach 0), as implemented in z-curve 3.0.

Restricting attention to scenarios with heterogeneous data (SD > .2), 89% of conditions achieved at least 95% coverage, and only 2 conditions (1.4%) fell below 90% coverage. Thus, even with adjusted confidence intervals, nominal coverage is not guaranteed in all edge cases. The remaining coverage problems arise for two reasons: (a) the selected z-distribution can be nearly uninformative about the amount of missing, non-significant evidence when the just-significant slope is shallow, and (b) the default heterogeneous specification can be misspecified when applied to approximately homogeneous data. A conservative response to the lower coverage in some cases is to report the interval as an 85% confidence interval, even though average coverage exceeds 95%.

Conclusion

Pek et al. and van Zwet have raised broad concerns about z-curve’s estimates of the expected discovery rate (EDR), which is used to assess publication bias and to quantify the extent of missing nonsignificant results. Their arguments rely heavily on a small set of stylized scenarios. These scenarios do not show that z-curve generally produces untrustworthy results. In contrast, prior large-scale simulation studies, together with the present extension to datasets with only 50 significant results, indicate that z-curve’s EDR confidence intervals achieve near-nominal coverage across a wide range of conditions, including many that are plausible for applied research.

Importantly, the new simulations also validate simple diagnostics that indicate when EDR estimation is likely to be less reliable. In particular, the shape of the significant z-value distribution in the critical range from 1.96 to 2.96—especially whether the density decreases or increases just above the significance threshold—helps identify weak-information regimes in which EDR may be overestimated and confidence-interval coverage may be reduced. Users can therefore compare the near-threshold shape of their observed z-value distribution to the patterns observed in simulation to assess whether EDR should be interpreted cautiously in a given application.

Overall, these results support the conclusion that z-curve provides credible estimates of the expected replication rate (ERR) and, under diagnostically identifiable conditions, the expected discovery rate (EDR), and that these quantities remain useful for evaluating the credibility of literatures in which selection on statistical significance is present.

How to Avoid Pek as Reviewer 2

February 3, 2026 | Z-Curve, Zcurve | Anonymous Reviews, Closed Review, Concerns about z-curve, Pek, Reviewer 2, Reviewer Bias, Z-Curve, zcurve | Ulrich Schimmack

The pattern is now familiar. I received another anonymous review by Reviewer 2 of a z-curve article that repeated Pek’s concerns about the performance of z-curve. To deal with biased reviewers, journals allow authors to name reviewers they consider biased. I suggest doing so for Pek. I also suggest sharing a manuscript with me to ensure proper interpretation of results and to make it “reviewer-safe.”

To justify the claim that Pek is biased, researchers can use this rebuttal of Pek’s unscientific claims about z-curve.

Reviewer 2 (either Pek or a Pek parrot)

Reviewer Report:

The manuscript “A review and z-curve analysis of research on the palliative association of system justification” (Manuscript ID 1598066) extends the work of Sotola and Credé (2022), who used Z-curve analysis to evaluate the evidential value of findings related to system justification theory (SJT). The present paper similarly reports estimates of publication bias, questionable research practices (QRPs), and replication rates in the SJT literature using Z-curve. Evaluating how scientific evidence accumulates in the published literature is unquestionably important.

However, there is growing concern about the performance of meta-analytic forensic tools such as p-curve (Simonsohn, Nelson, & Simmons, 2014; see Morey & Davis-Stober, 2025 for a critique) and Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022; see Pek et al., in press for a critique). Independent simulation studies increasingly suggest that these methods may perform poorly under realistic conditions, potentially yielding misleading results.

Justification for a theory or method typically requires subjecting it to a severe test (Mayo, 2019) – that is, assuming the opposite of what one seeks to establish (e.g., a null hypothesis of no effect) and demonstrating that this assumption leads to contradiction. In contrast, the simulation work used to support Z-curve (Brunner & Schimmack, 2020; Bartoš & Schimmack, 2022) relies on affirming belief through confirmation, a well-documented cognitive bias.

Findings from Pek et al. (in press) show that when selection bias is presented in published p-values — the very scenario Z-curve was intended to be applied — estimates of the expected discovery rate (EDR), expected replication rate (ERR), and Sorić’s False Discovery Risk (FDR) are themselves biased.

The magnitude and direction of this bias depend on multiple factors (e.g., number of p-values, selection mechanism of p-values) and cannot be corrected or detected from empirical data alone. The manuscript’s main contribution rests on the assumption that Z-curve yields reasonable estimates of the “reliability of published studies,” operationalized as a high ERR, and that the difference between the observed discovery rate (ODR) and EDR quantifies the extent of QRPs and publication bias.

The paper reports an ERR of .76, 95% CI [.53, .91] and concludes that research on the palliative hypothesis may be more reliable than findings in many other areas of psychology. There are several issues with this claim. First, the assertion that Sotola (2023) validated ERR estimates from the Z-curve reflects confirmation bias – I have not read Röseler (2023) and cannot comment on the argument made in it. The argument rests solely on the descriptive similarly between the ERR produced by Z-curve and the replication rate reported by the Open Science Collaboration (2015). However, no formal test of equivalence was conducted, and no consideration was given to estimate imprecision, potential bias in the estimates, or the conditions under which such agreement might occur by chance.

At minimum, if Z-curve estimates are treated as predicted values, some form of cross-validation or prediction interval should be used to quantify prediction uncertainty. More broadly, because ERR estimates produced by Z-curve are themselves likely biased (as shown in Pek et al., in press), and because the magnitude and direction of this bias are unknown, comparisons about ERR values across literatures do not provide a strong evidential basis for claims about the relative reliability of research areas.

Furthermore, the width of the 95% CI spans roughly half of the bounded parameter space of [0, 1], indicating substantial imprecision. Any claims based on these estimates should thus be contextualized with appropriate caution.

Another key result concerns the comparison of EDR = .52, 95% CO [.14, .92], and ODR = .81, 95% CI = [.69, .90]. The manuscript states that “When these two estimates are highly discrepant, this is consistent with the presence of questionable research practices (QRPS) and publication bias in this area of research (Brunner & Schimmack, 2020).

But in this case, the 95% CIs for the EDR and ODR in this work overlapped quite a bit, meaning that they may not be significantly different…” (p. 22). There are several issues with such a claim. First, Z curve results cannot directly support claims about the presence of QRPs.

The EDR reflects the proportion of significant p values expected under no selection bias, but it does not identify the source of selection bias (e.g., QRPs, fraud, editorial decisions). Using Z curve requires accepting its assumed missing data mechanism—a strong assumption that cannot be empirically validated.

Second, a descriptive comparison between two estimates cannot be interpreted as a formal test of difference (e.g., eyeballing two estimates of means as different does not tell us whether this difference is not driven by sampling variability). Means can be significantly different even if their confidence intervals overlap (Cumming & Finch, 2005).

A formal test of the difference is required. Third, EDR estimates can be biased. Even under ideal conditions, convergence to the population values requires extremely large numbers of studies (e.g., > 3000, see Figure 1 of Pek et al., in press).

The current study only has 64 tests. Thus, even if a formal test of the difference of ODR – EDR was conducted, little confidence could be placed on the result if the EDR estimate is biased and does not reflect the true population value.

Although I am critical of the outputs of Z curve analysis due to its poor statistical performance under realistic conditions, the manuscript has several strengths. These include adherence to good meta analytic practices such as providing a PRISMA flow chart, clearly stating inclusion and exclusion criteria, and verifying the calculation of p values. These aspects could be further strengthened by reporting test–retest reliability (given that a single author coded all studies) and by explicitly defining the population of selected p values. Because there appears to be heterogeneity in the results, a random effects meta analysis may be appropriate, and study level variables (e.g., type of hypothesis or analysis) could be used to explain between study variability. Additionally, the independence of p values has not been clearly addressed; p values may be correlated within articles or across studies. Minor points: The “reliability” of studies should be explicitly defined. The work by Manapat et al. (2022) should be cited in relation to Nagy et al. (2025). The findings of Simmons et al. (2011) applies only to single studies.

However, most research is published in multi-study sets, and follow-up simulations by Wegener at al. (2024) indicate that the Type I error rate is well-controlled when methodological constraints (e.g., same test, same design, same measures) are applied consistently across multiple studies – thus, the concerns of Simmons et al. (2011) pertain to a very small number of published results.

I could not find the reference to Schimmack and Brunner (2023) cited on p. 17.

Rebuttal to Core Claims in Recent Critiques of z-Curve

1. Claim: z-curve “performs poorly under realistic conditions”

Rebuttal

The claim that z-curve “performs poorly under realistic conditions” is not supported by the full body of available evidence. While recent critiques demonstrate that z-curve estimates—particularly EDR—can be biased under specific data-generating and selection mechanisms, these findings do not justify a general conclusion of poor performance.

Z-curve has been evaluated in extensive simulation studies that examined a wide range of empirically plausible scenarios, including heterogeneous power distributions, mixtures of low- and high-powered studies, varying false-positive rates, different degrees of selection for significance, and multiple shapes of observed z-value distributions (e.g., unimodal, right-skewed, and multimodal distributions). These simulations explicitly included sample sizes as low as k ≈ 100, which is typical for applied meta-research in psychology.

Across these conditions, z-curve demonstrated reasonable statistical properties conditional on its assumptions, including interpretable ERR and EDR estimates and confidence intervals with acceptable coverage in most realistic regimes. Importantly, these studies also identified conditions under which estimation becomes less informative—such as when the observed z-value distribution provides little information about missing nonsignificant results—thereby documenting diagnosable scope limits rather than undifferentiated poor performance.

Recent critiques rely primarily on selective adversarial scenarios and extrapolate from these to broad claims about “realistic conditions,” while not engaging with the earlier simulation literature that systematically evaluated z-curve across a much broader parameter space. A balanced scientific assessment therefore supports a more limited conclusion: z-curve has identifiable limitations and scope conditions, but existing simulation evidence does not support the claim that it generally performs poorly under realistic conditions.

2. Claim: Bias in EDR or ERR renders these estimates uninterpretable or misleading

Rebuttal

The critique conflates the possibility of bias with a lack of inferential value. All methods used to evaluate published literatures under selection—including effect-size meta-analysis, selection models, and Bayesian hierarchical approaches—are biased under some violations of their assumptions. The existence of bias therefore does not imply that an estimator is uninformative.

Z-curve explicitly reports uncertainty through bootstrap confidence intervals, which quantify sampling variability and model uncertainty given the observed data. No evidence is presented that z-curve confidence intervals systematically fail to achieve nominal coverage under conditions relevant to applied analyses. The appropriate conclusion is that z-curve estimates must be interpreted conditionally and cautiously, not that they lack statistical meaning.

3. Claim: Reliable EDR estimation requires “extremely large” numbers of studies (e.g., >3000)

Rebuttal

This claim overgeneralizes results from specific, highly constrained simulation scenarios. The cited sample sizes correspond to conditions in which the observed data provide little identifying information, not to a general requirement for statistical validity.

In applied statistics, consistency in the limit does not imply that estimates at smaller sample sizes are meaningless; it implies that uncertainty must be acknowledged. In the present application, this uncertainty is explicitly reflected in wide confidence intervals. Small sample sizes therefore affect precision, not validity, and do not justify dismissing the estimates outright.

4. Claim: Differences between ODR and EDR cannot support inferences about selection or questionable research practices

Rebuttal

It is correct that differences between ODR and EDR do not identify the source of selection (e.g., QRPs, editorial decisions, or other mechanisms). However, the critique goes further by implying that such differences lack diagnostic value altogether.

Under the z-curve framework, ODR–EDR discrepancies are interpreted as evidence of selection, not of specific researcher behaviors. This inference is explicitly conditional and does not rely on attributing intent or mechanism. Rejecting this interpretation would require demonstrating that ODR–EDR differences are uninformative even under monotonic selection on statistical significance, which has not been shown.

5. Claim: ERR comparisons across literatures lack evidential basis because bias direction is unknown

Rebuttal

The critique asserts that because ERR estimates may be biased with unknown direction, comparisons across literatures lack evidential value. This conclusion does not follow.

Bias does not eliminate comparative information unless it is shown to be large, variable, and systematically distorting rankings across plausible conditions. No evidence is provided that ERR estimates reverse ordering across literatures or are less informative than alternative metrics. While comparative claims should be interpreted cautiously, caution does not imply the absence of evidential content.

6. Claim: z-curve validation relies on “affirming belief through confirmation”

Rebuttal

This characterization misrepresents the role of simulation studies in statistical methodology. Simulation-based evaluation of estimators under known data-generating processes is the standard approach for assessing bias, variance, and coverage across frequentist and Bayesian methods alike.

Characterizing simulation-based validation as epistemically deficient would apply equally to conventional meta-analysis, selection models, and hierarchical Bayesian approaches. No alternative validation framework is proposed that would avoid reliance on model-based simulation.

7. Implicit claim: Effect-size meta-analysis provides a firmer basis for credibility assessment

Rebuttal

Effect-size meta-analysis addresses a different inferential target. It presupposes that studies estimate commensurable effects of a common hypothesis. In heterogeneous literatures, pooled effect sizes represent averages over substantively distinct estimands and may lack clear interpretation.

Moreover, effect-size meta-analysis does not estimate discovery rates, replication probabilities, or false-positive risk, nor does it model selection unless explicitly extended. No evidence is provided that effect-size meta-analysis offers superior performance for evaluating evidential credibility under selective reporting.

Summary

The critiques correctly identify that z-curve is a model-based method with assumptions and scope conditions. However, they systematically extend these points beyond what the evidence supports by:

extrapolating from selective adversarial simulations,
conflating potential bias with lack of inferential value,
overgeneralizing small-sample limitations,
and applying asymmetrical standards relative to conventional methods.

A scientifically justified conclusion is that z-curve provides conditionally informative estimates with quantifiable uncertainty, not that it lacks statistical validity or evidential relevance.

Reply to Erik van Zwet: Z-Curve Only Works on Earth

January 27, 2026 | Z-Curve, Zcurve | Andrew Gelman, Causal Inference, Concerns, Concerns about the z-curve method, Erik van Zwet, Statistical Modeling, Z-cruve | Ulrich Schimmack

In the 17th century, early telescopic observations of Mars suggested that the planet might be populated. Now imagine a study that aims to examine whether Martians are taller than humans. The problem is obvious: although we may assume that Martians exist, we cannot observe or measure them, and therefore we end up with zero observations of Martian height. Would we blame the t-test for not telling us what we want to know? I hope your answer to this rhetorical question is “No, of course not.”

If you pass this sanity check, the rest of this post should be easy to follow. It responds to criticism by Erik van Zwet (EvZ), hosted and endorsed by Andrew Gelman on his blog,

“Concerns about the z-curve method.”

EvZ imagines a scenario in which z-curve is applied to data generated by two distinct lines of research. One lab conducts studies that test only true null hypotheses. While exact effect sizes of zero may be rare in practice, attempting to detect extremely small effects in small samples is, for all practical purposes, equivalent. A well-known example comes from early molecular genetic research that attempted to link variation in single genes—such as the serotonin transporter gene—to complex phenotypes like Neuroticism. It is now well established that these candidate-gene studies produced primarily false positive results when evaluated with the conventional significance threshold of α = .05.

In response, molecular genetics fundamentally changed its approach. Researchers began testing many genetic variants simultaneously and adopted much more stringent significance thresholds to control the multiple-comparison problem. In the simplified example used here, I assume α = .001, implying an expected false positive rate of only 1 in 1,000 tests. I further assume that truly associated genetic predictors—single nucleotide polymorphisms (SNPs)—are tested in very large samples, such that sampling error is small and true effects yield z-values around 6. This is, of course, a stylized assumption, but it serves to illustrate the logic of the critique.

Figure 1 illustrates a situation with 1,000 studies from each of these two research traditions. Among the 1,000 candidate-gene studies, only one significant result is expected by chance. Among the genome-wide association studies (GWAS), power to reject the null hypothesis at α = .001 is close to 1, although a small number (3–4 out of 1,000) of studies may still fail to reach significance.
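The expected counts in this stylized example can be checked with a few lines of Python. This is a minimal sketch, assuming two-sided tests and GWAS test statistics that are normally distributed around z = 6 with unit variance. The variable names are illustrative and are not part of any z-curve software.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution

alpha = 0.001      # significance threshold used in the example
n_studies = 1000   # studies per research tradition
true_z = 6.0       # assumed mean z-value for true GWAS effects

# Two-sided critical value for alpha = .001
z_crit = std_normal.inv_cdf(1 - alpha / 2)

# Candidate-gene studies test true nulls, so expected false positives = alpha * n
expected_false_positives = alpha * n_studies

# Power when z ~ N(true_z, 1); the opposite tail is negligible but included
power = (1 - std_normal.cdf(z_crit - true_z)) + std_normal.cdf(-z_crit - true_z)
expected_misses = (1 - power) * n_studies

print(f"critical z:                   {z_crit:.2f}")
print(f"expected false positives:     {expected_false_positives:.1f}")
print(f"GWAS power:                   {power:.4f}")
print(f"expected non-significant GWAS: {expected_misses:.1f}")
```

Under these assumptions the critical value is close to 3.3 and the expected number of non-significant GWAS results falls between 3 and 4 per 1,000.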

At this point, it is essential to distinguish between two scenarios. In the first scenario, all 999 non-significant results are observed and available for analysis. If we could recover the full distribution of results—including non-significant ones—we could fit models to the complete set of z-values. Z-curve can, in principle, be applied to such data, but it was not designed for this purpose.

Z-curve was developed for the second scenario. In this scenario, the light-purple, non-significant results exist only in researchers’ file drawers and are not part of the observed record. This situation—selection for statistical significance—is commonly referred to as publication bias. In psychology, success rates above 90% strongly suggest that statistical significance is a necessary condition for publication (Sterling, 1959). Under such selection, non-significant results provide no observable information, and only significant results remain. In extreme cases, it is theoretically possible that all published significant findings are false positives (Rosenthal, 1979), and in some literatures—such as candidate-gene research or social priming—this possibility is not merely theoretical.

Z-curve addresses uncertainty about the credibility of published significant results by explicitly conditioning on selection for significance and modeling only those results. When success rates approach 90% or higher, there is often no alternative: non-significant results are simply unavailable.

In Figure 1, the light-purple bars represent non-significant results that exist only in file drawers. Z-curve is fitted exclusively to the dark-purple, significant results. Based on these data, the fitted model (red curve), which is centered near the true value of z = 6, correctly infers that the average true power of the studies contributing to the significant results is approximately 99% when α = .001 (corresponding to a critical value of z ≈ 3.3).

Z-curve also estimates the Expected Discovery Rate (EDR). Importantly, the EDR refers to the average power of all studies that were conducted in the process of producing the observed significant results. This conditioning is crucial. Z-curve does not attempt to estimate the total number of studies ever conducted, nor does it attempt to account for studies from populations that could not have produced the observed significant findings. In this example, candidate-gene studies that produced non-significant results—whether published or not—are irrelevant because they did not contribute to the set of significant GWAS results under analysis.

What matters instead is how many GWAS studies failed to reach significance and therefore remain unobserved. Given the assumed power, this number is at most 3–4 out of 1,000 (<1%). Consequently, an EDR estimate of 99% is correct and indicates that publication bias within the relevant population of studies is trivial. Because the false discovery rate is derived from the EDR, the implied false positive risk is effectively zero—again, correctly so for this population.
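The step from EDR to false discovery rate can be made concrete with Sorić's (1989) upper bound, FDR ≤ (1/EDR − 1) · α/(1 − α). The text above does not spell out the formula, so its use here is an assumption, and the function name is hypothetical. A minimal sketch:

```python
def soric_max_fdr(edr: float, alpha: float) -> float:
    """Upper bound on the false discovery rate implied by a discovery rate."""
    if not 0 < edr <= 1:
        raise ValueError("edr must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1 / edr - 1) * alpha / (1 - alpha)
    return min(bound, 1.0)  # a rate cannot exceed 1


# GWAS example from the text: EDR of 99% at alpha = .001
gwas_fdr = soric_max_fdr(edr=0.99, alpha=0.001)
print(f"maximum FDR for the GWAS example: {gwas_fdr:.6f}")

# Sanity check: if the discovery rate equals alpha, every discovery may be false
print(f"maximum FDR when EDR equals alpha: {soric_max_fdr(0.05, 0.05):.2f}")
```

With an EDR of 99% the bound is on the order of one in 100,000, which is what "effectively zero" means here.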

EvZ’s criticism of z-curve is therefore based on a misunderstanding of the method’s purpose and estimand. He evaluates z-curve against a target that includes large numbers of studies that leave no trace in the observed record and have no influence on the distribution of significant results being analyzed. But no method that conditions on observed significant results can recover information about such studies—nor should it be expected to.

Z-curve is concerned exclusively with the credibility of published significant results. Non-significant studies that originate from populations that do not contribute to those results are as irrelevant to this task as the height of Martians.

On the Interpretation of Z-Curve Coverage in an Extreme Simulation Scenario

January 24, 2026 · Uncategorized · Ulrich Schimmack

Abstract

A recent critique of z-curve reported low coverage of confidence intervals for the expected discovery rate (EDR) based on an extreme simulation with a very low expected false positive rate (about 1–2%). This conclusion conflates expected values with realized data. In repeated runs, the number of false positives among significant results varies substantially and is often zero; in those runs the realized false discovery rate is exactly zero, so an estimate of zero is correct. When coverage is evaluated against realized false positive rates, the apparent problem is substantially reduced. Additional simulations show that coverage approaches the nominal level once false positives are non-negligible (e.g., 5%) and improves further with larger numbers of significant results. Remaining coverage failures are confined to diagnostically identifiable cases in which high-powered studies dominate the distribution of significant z-values, leaving limited information to estimate the EDR.

On Evaluating Evidence and Interpreting Simulation Results

Science advances through skepticism. It progresses by testing claims against evidence and by revisiting conclusions when new information becomes available. This process requires not only sound data, but also careful interpretation of what those data can and cannot tell us.

In principle, academic debate should resolve disagreements by subjecting competing interpretations to scrutiny. In practice, however, disagreements often persist. One reason is that people—scientists included—tend to focus on evidence that aligns with their expectations while giving less weight to evidence that challenges them. Another is that conclusions are sometimes used, implicitly or explicitly, to justify the premises that led to them, rather than the other way around.

These concerns are not personal; they are structural. They arise whenever complex methods are evaluated under simplified criteria.

Context of the Current Discussion

Z-curve was developed to evaluate the credibility of a set of statistically significant results. It operates on the distribution of significant test statistics and estimates quantities such as the expected replication rate (ERR), the expected discovery rate (EDR), and the false discovery rate (FDR). Its performance has been evaluated using extensive simulation studies covering hundreds of conditions that varied effect sizes, heterogeneity, and false positive rates.

A recent critique raised concerns about z-curve based on a simulation in which confidence intervals for the EDR showed low coverage. From this result, it was suggested that the method is unreliable (“concerns about z-curve”).

It is useful to examine carefully what this simulation does and how its results are interpreted.

Expected Values and Realized Data

The simulation assumes two types of studies: some that test true null hypotheses and others that test false null hypotheses with very high power. From this setup, one can compute expected values—for example, the expected number of false positives or the expected discovery rate.

Expected values, however, are averages over many hypothetical repetitions. In individual simulation runs, the realized number of false positives varies. In particular, when the expected number of false positives is close to one, it is common for some runs to contain no false positives among the significant results. In those runs, the observed significant record contains no false discoveries, and the realized false discovery rate for that record is exactly zero.

Evaluating coverage by comparing z-curve estimates to a fixed expected value in every run overlooks this variability. It treats a population-level expectation as if it were the true value for each realized dataset, even when the realized data are inconsistent with that expectation. This issue is most pronounced in near-boundary settings, where the quantities of interest are weakly identifiable from truncated data.

The simulation uses an extreme configuration to illustrate a potential limitation of z-curve. The setup assumes two populations of studies: one repeatedly tests a true null hypothesis (H0), and the other tests a false null hypothesis with very high power (approximately 98%, corresponding to z ≈ 4). Z-curve is applied only to statistically significant results, consistent with its intended use.

In the specific configuration, there are 25 tests of a true H0 and 75 tests of a false H0 with 98% power. From this design, one can compute expected values: on average, 25 × .05 = 1.25 false positives are expected, implying a false discovery rate of about 1.6% among significant results. However, these values are expectations across repeated samples; they are not fixed quantities that hold in every simulation run.
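The expected values for this design follow from simple arithmetic. The sketch below uses only the numbers stated above (25 null tests, 75 tests with 98% power, α = .05); the variable names are illustrative. It also computes the probability that a single run contains no false positive at all, which is the quantity that matters for the argument about realized data.

```python
n_null = 25    # tests of a true H0
n_alt = 75     # tests of a false H0
alpha = 0.05   # significance threshold
power = 0.98   # power of the tests of a false H0

expected_fp = n_null * alpha   # expected false positives
expected_tp = n_alt * power    # expected true positives

# Expected false discovery rate among significant results
expected_fdr = expected_fp / (expected_fp + expected_tp)

# Expected discovery rate across all 100 tests
expected_edr = (expected_fp + expected_tp) / (n_null + n_alt)

# Probability that none of the 25 null tests is significant in a given run
p_zero_fp = (1 - alpha) ** n_null

print(f"expected false positives: {expected_fp:.2f}")
print(f"expected FDR:             {expected_fdr:.4f}")
print(f"expected EDR:             {expected_edr:.4f}")
print(f"P(no false positive):     {p_zero_fp:.3f}")
```

With these rounded inputs the expected FDR comes out between 1.6% and 1.7%, and more than a quarter of all runs are expected to contain no false positive.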

Because the expected number of false positives is close to one, sampling variability is substantial. In some runs, no false positive enters the set of significant results at all. In those runs, it is not an error if z-curve assigns zero weight to the null component and estimates an FDR of zero; that estimate matches the realized composition of the observed significant results.

When I reproduced the simulation and counted the number of false positives among the significant results, I found that the realized count ranged from 0 to 5, and that 152 out of 500 runs contained no false positives. This matters for interpreting coverage: comparing z-curve estimates in these runs to the expected false discovery rate of 1.6% treats a population-level expectation as if it were the true value for each realized dataset. As a result, the reported undercoverage is driven by a mismatch between the evaluation target and the realized data in a substantial subset of runs, rather than by a general failure of z-curve.
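The variability in realized false positives is easy to see in a small Monte Carlo sketch. It does not fit z-curve and does not reproduce the original simulation code; it only counts how many of the 25 null tests reach significance in each of 500 runs. The seed and the helper name are arbitrary assumptions, so the counts will differ somewhat from the 152 runs reported above.

```python
import random
from collections import Counter


def count_false_positives(n_null, alpha, rng):
    """Number of true-null tests that reach significance in one simulated run."""
    return sum(rng.random() < alpha for _ in range(n_null))


rng = random.Random(2026)  # arbitrary seed for reproducibility
n_runs = 500

counts = Counter(count_false_positives(25, 0.05, rng) for _ in range(n_runs))
zero_runs = counts[0]

for k in sorted(counts):
    print(f"{k} false positives: {counts[k]} runs")
print(f"share of runs without any false positive: {zero_runs / n_runs:.2f}")
```

In every run with zero false positives, the realized false discovery rate of the significant record is exactly zero, regardless of the 1.6% expectation.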

Reexamining Z-curve Performance with Extreme Mixtures

To examine z-curve’s performance with extreme mixtures of true and false H0, I ran a new simulation that sampled 5 significant results from tests of true H0 and 95 significant results from tests of false H0 with 98% power. I used a false positive rate of 5% because it may be considered the boundary value for an acceptable error rate. Importantly, a higher false positive rate would only benefit z-curve, because low-powered hypothesis tests become easier to detect as their share increases.

As expected, the coverage of the EDR increased. In fact, it was just shy of the nominal level of 95%, 471/500 (94%). Thus, low coverage is limited to data with fewer than 5% false positive results. For example, the model may suggest no false positives, but the true false positive rate is 4%.

It is also possible to diagnose data that can create problems with coverage. First, a decreasing slope from significance to z = 3 implies a large number of missing non-significant results that can be identified by their influence on the distribution of significant z-values. In contrast, a flat or positive slope suggests that high-powered studies have a stronger influence on the distribution of z-values between 2 and 3. I computed the slope by estimating the kernel density of the observed data and regressing the densities on the z-values. A positive slope perfectly predicted bad coverage, 29/29 (100%).
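The slope diagnostic can be sketched as follows. The post describes only the general recipe (kernel density, then a regression of density on z), so the Gaussian kernel, the rule-of-thumb bandwidth, the evaluation grid, and all function names below are assumptions rather than the implementation used for the reported results. The data are simulated for illustration.

```python
import math
import random
from statistics import stdev


def simulate_significant(mean_z, n_sig, rng, crit=1.96):
    """Draw z ~ N(mean_z, 1) and keep absolute values above crit (selection for significance)."""
    kept = []
    while len(kept) < n_sig:
        z = abs(rng.gauss(mean_z, 1.0))
        if z > crit:
            kept.append(z)
    return kept


def kde_at(data, x, bandwidth):
    """Gaussian kernel density estimate of data, evaluated at x."""
    scale = len(data) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data) / scale


def density_slope(z_values, lower=1.96, upper=3.0, n_grid=50):
    """Least-squares slope of the kernel density on z over [lower, upper]."""
    # Rule-of-thumb bandwidth (an assumption, not taken from the post)
    bandwidth = 1.06 * stdev(z_values) * len(z_values) ** -0.2
    grid = [lower + i * (upper - lower) / (n_grid - 1) for i in range(n_grid)]
    dens = [kde_at(z_values, x, bandwidth) for x in grid]
    mean_x = sum(grid) / n_grid
    mean_y = sum(dens) / n_grid
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(grid, dens))
    sxx = sum((x - mean_x) ** 2 for x in grid)
    return sxy / sxx


rng = random.Random(1)  # arbitrary seed
null_slope = density_slope(simulate_significant(0.0, 200, rng))        # only false positives
high_power_slope = density_slope(simulate_significant(4.0, 200, rng))  # only high-powered tests

print(f"slope, significant results from true nulls:        {null_slope:.2f}")
print(f"slope, significant results from high-powered tests: {high_power_slope:.2f}")
```

The simple kernel estimate is not corrected for the truncation at the significance threshold, which lowers the density right at the boundary. A boundary-corrected estimator would be a reasonable refinement.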

Another diagnostic is the ERR. A high ERR implies that most studies have high power and that there are few low-powered studies with significant results to estimate the EDR. All failures occurred when the ERR was above 90%.

Finally, we can use the weights of the low-powered components (z = 0, z = 1). When these weights are zero, it is possible that the model had problems estimating these components. In all failures, both weights were zero.

Importantly, these results also show that z-curve does not inevitably fail under this type of mixture. The issue is not the false positive rate per se, but the amount of information available to estimate it. With the same false positive rate of 5%, but a larger number of significant results—for example, 50 false positives out of 1,000—z-curve reliably detects the presence of missing non-significant results, even when the slope is increasing and the ERR is high. In this case, the weight of the z = 0 component was estimated at approximately 52%. By contrast, when the estimated weight is zero and the FDR estimate is zero, the true false discovery rate may still be as high as 5%, reflecting weak identifiability rather than estimator bias.

Conclusion

The low coverage reported in this simulation is largely an evaluation artifact. In this extreme setup, the expected false positive rate (about 1–2%) is an average across runs, but the realized number of false positives among significant results varies; in many runs it is zero. In those runs, the realized FDR is exactly zero, so an estimate of zero is not an error. Treating the expected rate as the “true value” in every run mechanically produces undercoverage.

When the false discovery rate is modest (e.g., 5%) and the number of significant results is larger, coverage is close to nominal and improves further as information increases. The remaining failures are confined to diagnostically identifiable cases in which high-powered studies dominate the significant z-values, leaving too little information to estimate the EDR.

P-Hacking Preregistered Studies Can Be Detected

January 18, 2026 · QRP, Questionable Research Practices · Bias, P-Hacking, Power, Preregistration, Questionable Preregistration, Questionable Research Practices, Test of Insufficient Variance, TIVA · Ulrich Schimmack

One major contribution to the growing awareness that psychological research is often unreliable was an article by Daryl Bem (2011), which reported nine barely statistically significant results to support the existence of extrasensory perception—most memorably, that extraverts could predict the future location of erotic images (“pornception”).

Subsequent replication attempts quickly failed to reproduce these findings (Galak et al., 2012). This outcome was not especially newsworthy; few researchers believed the substantive claim. The more consequential question was how seemingly strong statistical evidence could be produced for a false conclusion.

Under the conventional criterion of $p < .05$, a false positive is expected by chance in roughly 1 out of every 20 tests of a true null hypothesis. However, obtaining statistically significant results in nine out of nine studies purely by chance is extraordinarily unlikely (Schimmack, 2012). This pattern strongly suggests that the data-generating process was biased toward significance.

Schimmack (2018) argued that the observed bias in Bem’s (2011) findings was best explained by questionable research practices (John et al., 2012). For example, unpromising studies may be abandoned and later characterized as pilot work, whereas more favorable results may be selectively aggregated or emphasized, increasing the likelihood of statistically significant outcomes. Following the publication of the replication failures, a retraction was requested. In response, the then editor, Shinobu Kitayama, declined to pursue retraction, citing that the practices in question were widespread in social psychology at the time and were not treated as clear violations of prevailing norms (Kitayama, 2018).

After more than a decade of methodological debate and reform, ignorance is no longer a credible defense for the continued use of questionable research practices. This is especially true when articles invoke open science practices—such as preregistration, transparent reporting, and data sharing—to signal credibility: these practices raise the expected standard of methodological competence and disclosure, not merely the appearance of rigor.

Nevertheless, there are growing concerns that preregistration alone is not sufficient to ensure valid inference. Preregistered studies can still yield misleading conclusions if auxiliary assumptions are incorrect, analytic choices are poorly justified, or deviations and contingencies are not transparently handled (Soto & Schimmack, 2025).

Against this backdrop, Francis (2024) published a statistical critique of Ongchoco, Walter-Terrill, and Scholl’s (2023) PNAS article reporting seven preregistered experiments on visual event boundaries and anchoring. Using a Test of Excess Significance (“excess success”) argument, Francis concluded that the uniformly significant pattern, particularly the repeated significant interaction effects, was unlikely under a no-bias, correctly specified model, reporting $p = .011$. This result does not establish the use of questionable research practices; it shows only that the observed pattern of results is improbable under the stated assumptions, though chance cannot be ruled out.

Ongchoco, Walter-Terrill, and Scholl (2024) responded by challenging both the general validity of excess-success tests and their application to a single article. In support, they cite methodological critiques—especially Simonsohn (2012, 2013)—arguing that post hoc excess-success tests can generate false alarms when applied opportunistically or when studies address heterogeneous hypotheses.

They further emphasize preregistration, complete reporting of preregistered studies, and a preregistered replication with increased sample size as reasons their results should be considered credible—thereby raising the question of whether the significant findings themselves show evidential value, independent of procedural safeguards.

The appeal to Simonsohn is particularly relevant here because Simonsohn, Nelson, and Simmons (2014) introduced p-curve as a tool for assessing whether a set of statistically significant findings contains evidential value even in the presence of selective reporting or p-hacking. P-curve examines the distribution of reported significant p-values (typically those below .05). If the underlying effect is null and significance arises only through selection, the distribution is expected to be approximately uniform across the .00–.05 range. If a real effect is present and studies have nontrivial power, the distribution should be right-skewed, with a greater concentration of very small p-values (e.g., < .01).
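The expected shape of the p-curve can be illustrated numerically. The sketch below is not the formal right-skew test; it only computes, for a two-sided z-test, the share of significant p-values that fall below .01 when the null is true and when the test has 80% power. The 80% scenario and the function names are assumptions chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()


def prob_p_below(p, ncp):
    """P(two-sided p-value < p) for a z-test with noncentrality ncp."""
    z = std_normal.inv_cdf(1 - p / 2)
    return (1 - std_normal.cdf(z - ncp)) + std_normal.cdf(-z - ncp)


def share_below(p_cut, ncp, alpha=0.05):
    """Share of significant p-values (p < alpha) that are below p_cut."""
    return prob_p_below(p_cut, ncp) / prob_p_below(alpha, ncp)


# Null effect: significant p-values are uniform on (0, .05)
null_share = share_below(0.01, ncp=0.0)

# Noncentrality that gives about 80% power at alpha = .05 (two-sided)
ncp_80 = std_normal.inv_cdf(0.975) + std_normal.inv_cdf(0.80)
powered_share = share_below(0.01, ncp=ncp_80)

print(f"share of significant p-values below .01, null effect: {null_share:.2f}")
print(f"share of significant p-values below .01, 80% power:   {powered_share:.2f}")
```

Under the null, one fifth of significant p-values fall below .01. With 80% power, the clear majority do, which is the right skew that p-curve looks for.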

I therefore conducted a p-curve analysis to assess the evidential value of the statistically significant results reported in this research program. Following Simonsohn et al. (2014), I focused on the focal interaction tests bearing directly on the core claim that crossing a visual event boundary (e.g., walking through a virtual doorway) attenuates anchoring effects. Specifically, I extracted the reported p-values for the anchoring-by-boundary interaction terms across the preregistered experiments in Ongchoco, Walter-Terrill, and Scholl (2023) and evaluated whether their distribution showed the right-skew expected under genuine evidential value.

The p-curve analysis provides no evidence of evidential value for the focal interaction effects. Although all seven tests reached nominal statistical significance, the distribution of significant p-values does not show the right-skew expected when results are driven by a genuine effect. Formal tests for right-skewness were non-significant (full p-curve: $p = .212$; half p-curve: $p = .431$), indicating that the results cannot be distinguished from patterns expected under selective success or related model violations.

Consistent with this pattern, the p-curve-based estimate of average power is low (13%). Although the confidence interval is wide (5%–57%), the right-skew tests already imply failure to reject the null hypothesis of no evidential value. Moreover, even under the most generous interpretation, assuming 57% power for each test, the probability of obtaining seven statistically significant results out of seven is approximately $0.57^7 \approx .020$. Thus, invoking Simonsohn’s critiques of excess-success testing is not sufficient, on its own, to restore confidence in the evidential value of the reported interaction effects.
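The probability of a perfect record follows directly from the power of the individual tests. A minimal sketch, assuming independent studies with equal power (the function name is illustrative):

```python
def prob_all_significant(power: float, n_studies: int) -> float:
    """Probability that every one of n independent studies is significant."""
    return power ** n_studies


# Point estimate (13%) and upper limit of the confidence interval (57%) from the p-curve analysis
for power in (0.13, 0.57):
    prob = prob_all_significant(power, 7)
    print(f"power = {power:.2f}: P(7 of 7 significant) = {prob:.6f}")
```

Even at the upper limit of the interval, seven successes in seven attempts would occur in only about 2% of such research programs.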

Some criticisms of Francis’s single-article bias tests also require careful handling. A common concern is selective targeting: if a critic applies a bias test to many papers but publishes commentaries only when the test yields a small p-value, the published set of critiques will overrepresent “positive” alarms. Importantly, this publication strategy does not invalidate any particular p-value; it affects what can be inferred about the prevalence of bias findings from the published subset.

Francis (2014) applied an excess-success test to multi-study articles in Psychological Science (2009–2012) and reported that a large proportion exhibited patterns consistent with excess success (often summarized as roughly 82% of eligible multi-study articles). Under a high-prevalence view—i.e., if such model violations are common—an individual statistically significant bias-test result is less likely to be a false alarm than under a low-prevalence view. The appropriate prevalence for preregistered studies, however, remains uncertain.

Additional diagnostics help address this uncertainty. The “lucky-bounce” test (Schimmack, unpublished) illustrates the improbability of observing only marginally significant results when studies are reasonably powered. Under a conservative assumption of 80% power, the probability that all seven interaction effects fall in the “just significant” range (.005–.05) is approximately .00022. Although this heuristic test is not peer-reviewed, it highlights the same improbability identified by other methods.
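Because the lucky-bounce test is unpublished, its exact computation is not documented here. The sketch below is one plausible reading, assuming two-sided z-tests, independent studies, and exactly 80% power: it computes the probability that a single test lands in the "just significant" range and raises it to the seventh power. The function name is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def prob_just_significant(power, p_low=0.005, p_high=0.05):
    """P(p_low < p < p_high) for a z-test with the given power at alpha = p_high."""
    z_high = std_normal.inv_cdf(1 - p_high / 2)  # critical value for p = .05
    z_low = std_normal.inv_cdf(1 - p_low / 2)    # critical value for p = .005
    # Noncentrality that yields the requested power (opposite tail ignored)
    ncp = z_high + std_normal.inv_cdf(power)
    return std_normal.cdf(z_low - ncp) - std_normal.cdf(z_high - ncp)


single = prob_just_significant(0.80)
all_seven = single ** 7

print(f"P(just significant) for one study at 80% power: {single:.3f}")
print(f"P(all seven just significant):                  {all_seven:.5f}")
```

Under these assumptions the result lands close to the value quoted above; small differences can arise from rounding or from details of the unpublished procedure.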

A closely related, peer-reviewed approach is the Test of Insufficient Variance (TIVA). TIVA does not rely on significance thresholds; instead, it tests whether a set of independent test statistics (expressed as $z$-values) exhibits at least the variance expected under a standard-normal model ($\mathrm{Var}(z) \ge 1$). Conceptually, it is a left-tailed chi-square test on the variance of $z$-scores. Because heterogeneity in power or true effects typically increases variance, evidence of insufficient variance is conservative. With the large sample sizes in these studies, transforming $F$-values to $t$-values and approximate $z$-values is reasonable. Applying TIVA to the seven interaction tests yields $p = .002$, indicating that the dispersion of the test statistics is unusually small under the assumption of independent tests.
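The logic of TIVA can be sketched in a few lines. The z-values below are hypothetical and chosen only to show the mechanics; they are not the test statistics from the article under discussion. The chi-square CDF is computed with a standard series for the regularized lower incomplete gamma function so that the example needs no third-party packages.

```python
import math
from statistics import variance


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= half_x / (a + n)
        total += term
        n += 1
    return total * math.exp(-half_x + a * math.log(half_x) - math.lgamma(a))


def tiva(z_values):
    """Left-tailed chi-square test of Var(z) >= 1 for independent z-values."""
    k = len(z_values)
    var_z = variance(z_values)             # sample variance (k - 1 in the denominator)
    chi2_stat = (k - 1) * var_z            # expected variance under the model is 1
    p_value = chi2_cdf(chi2_stat, k - 1)   # small variance gives a small p-value
    return var_z, p_value


# Hypothetical z-values that all sit just above the significance threshold
example_z = [2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6]
var_z, p_value = tiva(example_z)

print(f"variance of z-values: {var_z:.3f}")
print(f"TIVA p-value:         {p_value:.5f}")
```

A set of z-values that cluster this tightly has far less variance than sampling error alone would produce, which is what a small left-tailed p-value signals.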

These results do not establish that the seven statistically significant findings are all false positives, nor do they identify a specific mechanism. They do show, however, that perfect significance can coexist with weak evidential value: even in preregistered research, a uniformly significant pattern can be statistically inconsistent with the assumptions required for straightforward credibility.

Given these results, an independent, well-powered replication is warranted. The true power of the reported studies is unlikely to approach 80% even with sample sizes of 800 participants; if it did, at least one p-value would be expected below .005. Absent such evidence, perfect success should not be taken as evidence that a robust effect has been established.

In conclusion, the replication crisis has sharpened awareness that researchers face strong incentives to publish and that journals—especially prestigious outlets such as PNAS—prefer clean, internally consistent narratives. Open science practices have improved transparency, but it remains unclear whether they are sufficient to prevent the kinds of model violations that undermined credibility before the crisis. Fortunately, methodological reform has also produced more informative tools for evaluating evidential value.

For researchers seeking credible results, the practical implication is straightforward: avoid building evidential claims on many marginally powered studies. Rather than running seven underpowered experiments in the hope of success, conduct one adequately powered study—and, if necessary, a similarly powered preregistered replication (Schimmack, 2012). Multi-study packages are not inherently problematic, but when “picture-perfect” significance becomes the implicit standard, they increase the risk of selective success and overinterpretation. Greater awareness that such patterns can be detected statistically may help authors, reviewers, and editors better weigh these trade-offs.