Z-Curve.3.0 Tutorial: Chapter 4

You can find links to the other chapters on the post with Chapter 1.

Chapter 4 examines simulation studies that evaluate z-curve.3.0. The R script for this chapter is available on GitHub: zcurve3.0/Tutorial.R.Script.Chapter4.R (UlrichSchimmack/zcurve3.0).

Introduction to Chapter 4

I assume that you are familiar with z-curve, but you may wonder how well it works. Nowadays it is easy to get information with an internet search. AI makes it even easier, but it also uses a lot of resources. Here is a summary of the current collective knowledge about z-curve’s performance.


“Z‑curve has been validated through numerous simulation studies—often more extensive than what’s published—and shows high accuracy and better bias control than p‑curve when studies differ in power or effect size. Confidence intervals generally maintain coverage. In real-world replication datasets, z‑curve predictions align closely with observed replication rates—better than p‑curve. Its performance is weaker only in extreme conditions with small or very homogeneous study sets.”

The problem is that this knowledge is based on simulation studies conducted by us, the developers of z-curve (Bartos, Brunner, & Schimmack). Maybe we hacked our simulations to show good performance and hide cases with bad performance. Yes, simulation hacking is a thing. The question is: did we hack our simulations?

Science is not built on trust. It is built on trustworthy evidence from independent replications. This is difficult for empirical studies, but it is easy for simulation studies. If you run the same code with the same seed, you are merely reproducing the published results because you generate the same random data with the same sampling error. However, if you use a different seed (or no seed), you are replicating the results with a new roll of the dice. With large samples, the difference is negligible because sampling error is small. This tutorial allows you to reproduce or replicate simulations of z-curve to see for yourself how well or poorly it performs.

The Simulation

This tutorial uses the most basic simulation of z-value distributions: normal distributions with a standard deviation of 1. This assumes that the tests are z-tests and that they fulfil the assumption of normally distributed sampling error. This is the ideal setting for z-curve. If test assumptions are violated or other test statistics are converted into z-values, performance may suffer. This is a topic for another chapter that focuses on the transformation of test results into z-values.

The simulation assumes that there are three sets of z-tests. Some test a true nil hypothesis. This is the standard normal distribution centered at zero. With alpha = .05, these studies have an unconditional power of 5%. That is, they have a 5% probability of producing a (false) significant result. The second set has moderate power. A z-value of 2 is practically right at the criterion for significance with a two-sided z-test, z = 1.96. This means that sampling error will produce significant and non-significant results about equally often. In other words, the test has about 50% power. Finally, z-values of 4 have very high power. The reason is that a normal distribution centered at 4 has only about 2.5% of values below 2, the significance criterion. So, power is about 97.5%.
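These three power values can be checked directly from the normal CDF. Here is a minimal sketch in Python (the tutorial’s own code is in R); the function name `power` is just for illustration:

```python
from statistics import NormalDist

Z = NormalDist()                      # standard normal, sd = 1
crit = Z.inv_cdf(1 - 0.05 / 2)        # two-sided criterion, ~1.96

def power(mu):
    """Probability that |z| exceeds the criterion when z ~ N(mu, 1)."""
    return (1 - Z.cdf(crit - mu)) + Z.cdf(-crit - mu)

print(round(power(0), 3))   # ~0.05  (false positives under the nil)
print(round(power(2), 3))   # ~0.516 (moderate power)
print(round(power(4), 3))   # ~0.979 (very high power)
```

Folding the distribution (taking |z|) is why both tails count toward significance.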

To simulate different z-curves, the three distributions are combined with different weights ranging from 0 to 1 in steps of .1. You can change that to larger or smaller steps. This produces 66 unique combinations of the three components.
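The count of 66 follows from enumerating all non-negative weight triples on a 0.1 grid that sum to 1. A quick sanity check (Python sketch rather than the tutorial’s R code):

```python
# Enumerate weights (w0, w2, w4) in steps of .1 that sum to 1.
step = 10  # work in tenths to avoid floating-point drift
combos = [(w0 / 10, w2 / 10, (step - w0 - w2) / 10)
          for w0 in range(step + 1)
          for w2 in range(step + 1 - w0)]

print(len(combos))  # 66 unique combinations
```

Smaller steps grow the design quickly: a 0.05 grid already gives 231 combinations.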

Figure 1 shows the folded density distributions of the three components and the density of the mixture distribution with weights 0.6 for the z0 component, 0.1 for the z2 component, and 0.3 for the z4 component. You can create your own mixtures to see the different shapes of z-curves. Broadly speaking, z-curves that decrease have a strong weight on the z0 component and low average power, and z-curves that increase have a high weight on the z4 component and high average power.

The simulation program samples from the three distributions. To examine systematic biases, the simulation creates 50,000 z-values. You can change this, but with smaller samples performance becomes a function of both systematic bias and sampling error. Performance in smaller, more realistic sample sizes will be examined in the evaluation of the confidence intervals provided by z-curve.
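A bare-bones version of this sampling step might look as follows (a Python sketch, not the tutorial’s R code; the weights are the Figure 1 mixture, and the seed is only set so that a run reproduces itself; drop the seed to replicate instead):

```python
import random

random.seed(1)                       # reproduce; remove the seed to replicate
k = 50_000
means = [0, 2, 4]                    # component locations
weights = [0.6, 0.1, 0.3]            # Figure 1 mixture

# Draw a component per study, then a z-value with sd = 1; fold to |z|.
z = [abs(random.gauss(random.choices(means, weights)[0], 1))
     for _ in range(k)]

odr = sum(zi > 1.96 for zi in z) / k  # observed discovery rate
print(round(odr, 3))                  # close to the true EDR of about .375
```

With 50,000 draws the observed discovery rate sits within a few tenths of a percentage point of the true value, which is what makes systematic bias visible.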

Z-curve uses two methods to fit the model to the data. The first method, developed by Brunner and Schimmack, first fits a kernel density model to the z-curve of the significant results (z > 2). Then the z-curve model is fitted to this kernel density distribution. The key problem with this approach is that kernel density estimation has a downward bias when values are truncated at 2. To avoid this bias, the curve is extended while taking the slope around 2 into account. The technical details are not so important. This method is fast and works well for the estimation of the Expected Replication Rate (ERR). However, the expected discovery rate (EDR) projects the distribution of significant results into the region of non-significant results. Here even small biases around z = 2 can have large effects on the estimates.

The second method was developed by Bartos and Schimmack for zcurve.2.0. It uses the Expectation Maximization (EM) method that directly fits all of the data points to the model. This method tends to work better for the EDR in some cases. The main drawback is that it can be slow when a large number of z-values are fitted to the model. For this reason, I prefer the old-fashioned density method to explore the data and use the EM algorithm for the final results. The simulation here compares the results of both methods.

That is, the same data are analyzed with both methods. To save you time, I also provide the saved data from this simulation study.

Results

Expected Replication Rate

Each simulation keeps non-significant and significant results, but the model is only fitted to the significant results. Keeping the non-significant results helps to evaluate the ability of z-curve to predict the distribution of the non-significant results. This is illustrated in Figure 2.

Simulation #52 has the weights 0.6 for z0, 0.0 for z2, and 0.4 for z4. The true EDR is 43% and the true ERR is 91%. The observed discovery rate is the percentage of significant results in the 50,000 tests. It is close to the expected value of 43%. Visual inspection shows that z-curve guessed the distribution of non-significant results rather well, with a slight tendency to underestimate non-significant results. This leads to a slight overestimation of the EDR by 1 percentage point. The model also correctly estimates the true ERR of 91%.
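The quoted true values can be approximated from the component powers: the EDR is the weighted mean power, and the ERR weights each component’s power by its chance of reaching significance in the first place. A back-of-envelope Python sketch (the published 43% presumably reflects rounding or the tutorial’s exact setup; this arithmetic lands at about 42%):

```python
from statistics import NormalDist

Z = NormalDist()

def power(mu):                        # two-sided power at alpha = .05
    return (1 - Z.cdf(1.96 - mu)) + Z.cdf(-1.96 - mu)

weights = {0: 0.6, 2: 0.0, 4: 0.4}   # Simulation #52

# EDR: weighted mean power. ERR: power weighted by discovery probability.
edr = sum(w * power(mu) for mu, w in weights.items())
err = sum(w * power(mu) ** 2 for mu, w in weights.items()) / edr

print(round(edr, 2), round(err, 2))  # roughly .42 and .91
```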

Of course, I am showing you an example that works well. So, don’t trust the picture. Trust the results of all 66 simulations or your own simulations.

Figure 3 shows the fit for both methods (OF = purple, EM = green). It is hard to tell the two methods apart because they both fit the data very well and are clustered on the diagonal that represents perfect fit.

These results merely confirm what extensive simulation studies by Brunner and Schimmack (2020) showed. ERR estimates are f^&%%$ good. The scientific way to say this is to compute the root mean square error (RMSE) of the difference between true and estimated ERR values. The RMSE for the OF method is .007. The value for the EM method is also .007.

The minimum directional error was -2.7 percentage points for OF and -2.3 percentage points for EM. The maximum was 0.4 for OF and 0.3 for EM. These differences have no practical significance in any estimates of the ERR.
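For readers who want to score their own replications: the RMSE and the directional errors are simple summaries of the estimate-minus-truth differences. A sketch with made-up toy numbers (not the simulation’s actual estimates):

```python
from math import sqrt

# Toy numbers for illustration; the real values come from the 66 simulations.
true_err = [0.50, 0.70, 0.91, 0.30]
est_err  = [0.51, 0.68, 0.91, 0.31]

errors = [e - t for e, t in zip(est_err, true_err)]
rmse = sqrt(sum(d * d for d in errors) / len(errors))

print(round(rmse, 3))                               # 0.012
print(round(min(errors), 3), round(max(errors), 3)) # directional bias range
```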

It is not clear how z-curve’s critics stumbled on a limited number of scenarios that all showed poor performance. Here I showed that z-curve predicts true values of the ERR in 66 simulations that cover the full range of average power. These simulations were not chosen to produce good performance because they represent a complete design. Moreover, you can use the R-code to test other scenarios, as we have done in many other simulations without finding notable biases in ERR estimates.

Expected Discovery Rate

Estimating the discovery rate solely on the basis of the distribution of significant z-values is harder. As a result, the estimates are not as good as the ERR estimates, but how good or bad are they?


Visual inspection of the plot shows biases for medium power, whereas high and low EDR estimates are better. The RMSE values show slightly better performance for the EM method, RMSE = .051, than for the OF method, RMSE = .063. It is therefore recommended to use the EM method for final results.

The biggest underestimation (downward bias) was -11 percentage points for OF and -7 percentage points for EM. The biggest overestimation (upward bias) was 17 percentage points for OF and 16 percentage points for EM.

The next figure shows a follow-up simulation for the scenario with the largest overestimation (Simulation #41). To reduce sampling error further, I used k = 200,000 studies and the fast OF method (takes about 5 seconds).

The bias is much smaller, showing that z-curve estimates of the EDR also have high large-sample accuracy and that the larger bias in the previous simulation was still influenced by sampling error. Of course, real datasets are smaller and will produce noisier estimates, but that will be reflected in wide confidence intervals; a topic for another chapter.

The biggest upward bias was observed in a condition when most values were from the z2 component.

Conclusion

Motivated biases are one of the best documented phenomena in psychology. Simulation studies help researchers to test their intuitions against reality. I have learned a lot from simulations. However, in public, simulations are often used to sell a specific method, or to discredit a method, by picking scenarios where it works well or performs poorly. It is therefore best to see the performance of a method with your own eyes. This tutorial makes it easy to run simulation studies and, with a bit of R-coding or help from an AI, to modify the simulation to try out other scenarios.

These results are limited to simulations with normal distributions and do not address the uncertainty of estimates in smaller samples. Thus, good performance in these simulations is necessary, but not sufficient, to trust z-curve results “in the wild.” Necessary means that bad performance in these simulations would have been the end of z-curve. Z-curve only exists because it does well in these simulations. Insufficient means that we need other simulations that resemble real data to see how z-curve performs in those settings.

At the same time, these results are sufficient to point out that some academics (teachers at universities, but not scientists who follow a code of ethics) make false claims about the performance of z-curve; claims that are falsified by the evidence in published articles and by the results in this tutorial:

“To date, we have only examined a limited set of scenarios. Yet, we have found relatively few instances in which estimates from the typical use of the 𝑍-curve approach (i.e., analyzing only 𝑝-values < .05) have performed well.”

So, this tutorial also taught you the importance of evaluating conflicting claims based on scientific evidence. Never trust a scientist. Trust the science. When you ask an AI, always ask for the supporting evidence and ask it to challenge itself. Often AI is more trustworthy because it does not love or hate z-curve. It doesn’t give a shit about z-curve and that makes it a better judge of z-curve’s performance.

What Gelman Gets Wrong About Psychology

This blog post was created in collaboration with ChatGPT. I asked ChatGPT to read Gelman’s blog posts about psychological research and discussed his criticism with ChatGPT. Based on the evidence retrieved by ChatGPT and our discussion, ChatGPT wrote the following summary of our discussion. You could do the same with an AI for my blog posts. The quality of the end product depends on the quality of the questions. I believe that this is an accurate description of Gelman’s blog posts: quick, strong opinions with little attention to empirical evidence.

* I thank ChatGPT for its help with this blog post. All mistakes are of course made by ChatGPT. LOL

Why Overgeneralized Skepticism Misses the Evidence in Experimental Research

Andrew Gelman’s blog has become a go-to source for sharp, often scathing critiques of empirical research — especially in psychology. His core concerns are valid and well known: small sample sizes, flexibility in data analysis (“garden of forking paths”), publication bias, and the misuse of statistical significance. Many in the field welcome this type of scrutiny. But Gelman’s commentary often goes beyond fair criticism. Instead of offering a balanced view, he paints much of psychology as inherently unreliable, even garbage.

This post explains why that view is too sweeping, often unsupported by the very kind of data-driven inquiry Gelman himself advocates, and rooted — at least in part — in disciplinary biases shaped by his background in political science.


1. 📎 Cherry-Picked Failures ≠ Systemic Collapse

Gelman frequently cites high-profile failures (e.g., power posing, embodied metaphors, beauty-voting correlations) as emblematic of the entire field. But these examples, while real, are not representative of psychology as a whole. He rarely highlights:

  • Highly replicable subfields (e.g., perception, psychophysics, cognitive neuroscience),
  • Areas with strong internal consistency across decades (e.g., visual search, delay discounting),
  • The many methodological reforms underway (preregistration, Registered Reports, replication projects).

Psychology is a broad discipline, and focusing exclusively on its weakest claims distorts the picture.


2. 🧠 Not All Psychology Is Noisy

Gelman often portrays psychology as relying on noisy, underpowered studies. But many psychological studies — especially those using within-subject designs with repeated measures — have very high reliability and low error variance. In perception research, effects observed with N = 5 can replicate precisely across labs. Psychology spans a continuum from highly stable low-level phenomena to complex social behavior. Collapsing this range into a single judgment of unreliability misses the nuance entirely.


3. 🔁 Psychologists Rarely Rely on Single Studies for Effect Size Claims

A major oversight in Gelman’s critique is the assumption that psychologists treat small, noisy studies as definitive evidence about effect size magnitude. In fact:

  • Psychologists are well aware that small studies yield unstable estimates.
  • Claims about effect sizes are rarely made based on individual experiments.
  • Instead, researchers use meta-analysis to combine results across studies and reduce sampling noise.

And importantly: In well-controlled studies with p < .05, sign errors (wrong direction of the effect) are rare — as shown in projects like the Open Science Collaboration’s reproducibility study. This means that even small studies typically get the direction of an effect right, even if they are imprecise about its size. Psychologists know this. Gelman often implies they don’t.


4. 🔬 He Overlooks Tools that Handle Heterogeneity

Gelman claims that “there is no one true effect size” in many psychological literatures — a point that is actually well-acknowledged by psychologists themselves. Tools like random-effects meta-analysis, Bayesian model averaging, and z-curve explicitly model heterogeneity across studies and estimate how much variation exists in true effect sizes. In fact, analyses correcting for publication bias (e.g., Maier et al., 2022 using RoBMA) often show that some psychological interventions (like nudging) may have modest average effects but substantial variability — meaning that some contexts do work.

Rather than garbage, the data often show a mix of gold and dross.


5. 🧭 Bias Against Generalization May Reflect Disciplinary Training

Gelman’s own background is in political science and survey statistics — fields that deal with singular, unrepeatable events and complex causal inference (e.g., “Why did Obama win in 2008?”). In that world, generalizing across contexts is precarious, and every event is historically unique. Psychology, especially experimental psychology, operates differently. It isolates regularities in human behavior — how long it takes to name a color, how memory decays over time — using tightly controlled designs and repeated trials. These are not historically unique events; they are law-like patterns observable across time, people, and settings.

What seems naïve to a political scientist may simply reflect a different (and valid) epistemology.


6. 🧪 Gelman Sometimes Critiques Without Engaging the Data

Perhaps most concerning is that Gelman occasionally dismisses entire literatures — such as the nudging literature — without directly engaging with the available evidence. In the case of the meta-analysis by Mertens et al. (2022), he calls it garbage, while more nuanced reanalyses (e.g., using RoBMA) show a small average effect but substantial heterogeneity, with an estimated 25% of studies likely showing true effects greater than d = 0.4. These are data-driven findings. Ignoring them in favor of blanket skepticism contradicts the evidence-based reasoning Gelman himself promotes.


🎯 Conclusion: Skepticism ≠ Cynicism

Andrew Gelman has done a great service to science by helping uncover many statistical flaws. But his broad-brush skepticism of psychology often veers into cynicism, undermining valid work and demoralizing reformers within the field. A better path forward is critical realism — recognizing the flaws, but also recognizing the value, nuance, and complexity of a diverse empirical discipline.

Psychology is not garbage. It’s a field in progress — flawed, yes, but full of replicable insights into how minds and behavior work.


When Logic Changes, Should the Record? Requesting the Retraction of Schimmack (2012)

Cite: Schimmack, U. & ChatGPT (2025). When logic changes, should the record? Requesting the retraction of Schimmack (2012). Replication Index.



In 2012, I published an article in Psychological Methods titled “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles” (Schimmack, 2012). The paper introduced a simple but powerful idea: if a set of published studies reports a higher proportion of significant results than would be expected based on the estimated average post-hoc power, this suggests the results may be “too good to be true.” That is, the success rate of these studies may reflect publication bias, p-hacking, or other questionable research practices (QRPs) rather than honest science.

This logic, based on Sterling et al., 1995, has been foundational to several developments in meta-science and replicability research over the past decade. It also aligns with statistical common sense: if the probability of success is low, and we still see nearly universal success, something about the process likely isn’t transparent or unbiased.

But in 2024, Psychological Methods published an article by Pek et al. that challenges this logic at its core. They argue that using observed outcomes to evaluate the expected success rate — and then inferring bias when the observed rate is too high — is not just statistically questionable. They call it an ontological error: a fundamental mistake in the nature of inference itself, because (they claim) we cannot assign probabilities to events that have already occurred.

I disagree with this argument. Like most statisticians and meta-scientists, I believe that statistical inference is inherently about comparing observed outcomes to expectations under a model. That’s what a p-value does. That’s what every goodness-of-fit test does. That’s what replication rates and power estimates are meant to assess. The logic of comparing the observed to the expected is the backbone of empirical science — not a metaphysical error.

However, in the hypothetical world in which Pek et al. are correct, and more importantly, in a world where Psychological Methods treats their position as settled and unchallenged, my 2012 article becomes indefensible on its own terms. If it is indeed an ontological error to compare observed success rates to expected ones, then my article’s entire logic — and its main conclusion — are invalid.

We submitted a commentary in 2025 defending the logic of my 2012 paper and challenging the categorical claim made by Pek et al. That commentary was rejected without invitation to revise, indicating that the editor — and thus the journal — considers the matter settled. No debate, no dialogue. Just closure.

So, I’ve taken the logical next step. If the journal has adopted the position that the logic of my article is invalid at the most fundamental level, then the appropriate action is retraction. Not correction. Not a commentary. Retraction.

Below is the letter I sent to Fred Oswald, Editor of Psychological Methods, on July 23, 2025:


Subject: Formal Request to Retract Published Article
To: Fred Oswald, Editor, Psychological Methods
From: Dr. Ulrich Schimmack
Date: July 23, 2025

Dear Fred Oswald,

I am writing to formally request the retraction of the following article:

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487

This article argued that the credibility of multiple-study papers can be evaluated by comparing the observed rate of significant results to the rate expected based on average post-hoc power. It concluded that when the observed success rate substantially exceeds the expected rate, the results are “too good to be true,” suggesting publication bias or questionable research practices.

In 2024, Psychological Methods published an article by Pek et al. that characterized this type of inferential reasoning as a fundamental ontological error. Specifically, they argue that statistical inferences based on observed outcomes are invalid because probabilities cannot be assigned to events that have already occurred. According to their reasoning, the method used in the 2012 article is not just flawed but conceptually incoherent.

A commentary was submitted to the journal defending the validity of this approach and challenging the ontological framing advanced by Pek et al. That commentary was rejected without invitation for revision, signaling that the journal considers the matter settled and sides with Pek et al.’s position.

If Psychological Methods now endorses the view that the central logic of the 2012 article is invalid, then the article no longer meets the journal’s standards for methodological soundness. Its continued presence in the scholarly record, without correction or rebuttal, misleads readers into believing that its conclusions are supported by valid statistical reasoning.

Therefore, I request the retraction of the article. This is not a concession of error by the author, but the necessary course of action given the journal’s editorial stance. To hold that the logic is invalid yet leave the article uncorrected would be editorially inconsistent and undermine the integrity of the journal’s standards.

Sincerely,
Ulrich Schimmack
Department of Psychology
University of Toronto Mississauga


If the journal wants to stand by Pek et al.’s claim that comparing observed outcomes to expected frequencies is a category error, then it must have the editorial courage to follow through: retract past articles that used that logic.

If, instead, the journal is unwilling to retract, then it implicitly concedes that the issue is not settled and deserves ongoing debate. In that case, our rejected commentary should be reconsidered for publication.

Anonymous Peer Review Is Biased and Useless: Trust AI

Pek et al. continue their fight against z-curve and were able to publish a confusing and misleading article on power in the journal Psychological Methods. We drafted a careful rebuttal of their main criticisms and used ChatGPT to try out our arguments. It initially agreed with Pek et al., based on many misleading claims about observed power or post-hoc power in the literature. With good arguments and an example of coin tosses, we were able to clarify our position, and ChatGPT agreed that it is possible to ESTIMATE the average TRUE power from a set of observed statistical results (p-values) using a model that corrects for selection bias.

We submitted our criticism to the journal as a commentary and waited three months for a response from the editor. We received three ANONYMOUS reviews and a rejection. The main editor upheld this decision after spending a good hour making it.

The commentary is now published as a preprint and under review at a scientific journal that invites open peer-reviews: https://doi.org/10.31234/osf.io/dvebm_v1

For the sake of open science, I share reviews of my work. I am also pleased to share an evaluation of the reviews by ChatGPT. My ChatGPT is not unbiased because it had many conversations with me. Feel free to ask your AI for an evaluation of these reviews and post the answer in the comments section. My ChatGPT thinks that the reviewers missed the main point of our comment and just repeated false claims about average power, which remains a valuable tool to see how psychologists achieve 90% success rates in their studies (hint, they do not test 100% true hypotheses with 90% power).

ChatGPT


🔍 Rebuttal to Reviewer Comments

Manuscript Title: An Average Power Primer: Clarifying Misconceptions about Average Power and Replicability
Journal: Psychological Methods
Outcome: Rejected
Purpose: A factual assessment of whether the reviewers’ objections meaningfully challenge the central claims of the manuscript.


1. “Average power is not a meaningful or informative parameter”

This is incorrect. Average power is the expected success rate of a set of studies under repeated sampling, assuming the same design. It is a population parameter (or estimand) that characterizes the statistical reliability of a research corpus.

  • It plays a central role in estimating the maximum false discovery rate (Sorić, 1989).
  • It predicts replication outcomes under exact replications.
  • It is empirically diagnostic: Low average power implies high false-positive risk and low replication success—patterns confirmed in projects like the Reproducibility Project: Psychology (Open Science Collaboration, 2015).

Reviewer 1’s analogy (“an average of 50% tells me very little”) mischaracterizes how summary statistics work. Averages, by definition, collapse heterogeneity. That does not make them uninformative. It makes them summaries. No one claims average power captures the full distribution of power. That is the function of methods like z-curve.
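To make the Sorić point above concrete: assuming the standard formulation of Sorić’s (1989) bound, a discovery rate (average unconditional power under selection) converts into a maximum false discovery rate like this (Python sketch):

```python
def max_fdr(edr, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(round(max_fdr(0.42), 3))  # an EDR of .42 caps the FDR at ~.073
print(round(max_fdr(0.05), 3))  # when EDR = alpha, all discoveries may be false
```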


2. “The authors commit an ontological error by applying power to completed studies”

No. The manuscript explicitly avoids this error.

  • It distinguishes between the estimand (true average power) and the estimate (inferred from observed data).
  • It makes clear that we are not assigning probabilities to fixed outcomes, but estimating population-level properties from completed studies.

Power is a property of a study design, not a specific result. Estimating average power from observed studies is methodologically equivalent to estimating a population mean from a sample. All that is required is the assumption of a data-generating process. That assumption is the foundation of all frequentist inference.


3. “‘Credibility’ and ‘evidential value’ are vague terms”

These terms are well-established in the literature, but we agree they should have been formally defined. In our usage:

  • Credibility refers to the likelihood that reported significant results reflect true underlying effects.
  • Evidential value refers to the extent to which a set of results departs from what would be expected under the null hypothesis alone.

Both terms describe interpretable characteristics of research literatures that relate to statistical inference and replication.


4. “Average power ignores heterogeneity”

Yes — and that’s acknowledged. But that’s not a reason to discard it.

All averages ignore heterogeneity. Meta-analytic effect size estimates also summarize across diverse designs and effects. No reviewer would suggest abandoning mean effect size estimates for that reason. The same logic applies to average power.

When heterogeneity is important, it should be modeled — which is exactly what z-curve does by estimating the full distribution of power, not just its average.


5. “Z-curve is an emerging, unvalidated method”

This is inaccurate.

  • Z-curve is based on finite mixtures of truncated normal distributions and incorporates selection for significance.
  • It has been validated through extensive simulation studies (e.g., Bartoš & Schimmack, 2022).
  • It has been applied in high-visibility empirical applications, including analyses of ego depletion, terror management theory, and reproducibility datasets.

If validation is judged only by mathematical proofs, then widely used tools like PET-PEESE, trim-and-fill, and even p-curve should also be excluded. Simulation-based validation is standard for new methods in meta-analysis and remains the most appropriate test of performance under realistic conditions.


6. “Observed z-values are just transformations of p-values, so z-curve is built on ‘problematic inputs’”

This misunderstands the role of observed statistics.

  • Z-curve models the distribution of observed z-values under selection, not individual p-values or “observed power.”
  • The fact that p-values, z-values, and observed power are monotonic transformations is correct but irrelevant. Z-curve does not interpret individual z-values — it uses their distributional shape to infer the underlying power distribution.

This is exactly how selection models operate.


7. “The authors redefine power unnecessarily”

We do not redefine power. We clarify standard distinctions between:

  • True power: based on the true (possibly unknown) effect size
  • Observed power: a transformation of the observed test statistic
  • Average power: the mean true power across studies in a set

Clarifying these concepts is essential in a literature where the misuse of “post hoc power” is widespread. Reviewer 1’s reference to Lehmann and Romano supports our definitions, not contradicts them.


8. “Confidence intervals around average power are misleading”

This critique misunderstands how confidence intervals behave for bounded parameters like proportions.

Yes, CI width varies depending on the point estimate (e.g., it is narrower near 0 or 1, wider near 0.5). That is a general property of binomial-like estimates — not a flaw. The bootstrap method used by z-curve is a standard approach to estimating uncertainty under selection and mixture models.
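The narrowing near the bounds is easy to see with a toy percentile bootstrap for a plain proportion (a stand-in for illustration only, not z-curve’s actual bootstrap, which resamples under the fitted mixture model):

```python
import random

random.seed(2)

def boot_ci(successes, n, reps=2000, level=0.95):
    """Percentile bootstrap CI for a simple proportion."""
    p_hat = successes / n
    boots = sorted(sum(random.random() < p_hat for _ in range(n)) / n
                   for _ in range(reps))
    lo = boots[int(reps * (1 - level) / 2)]
    hi = boots[int(reps * (1 + level) / 2) - 1]
    return lo, hi

mid = boot_ci(50, 100)    # estimate near .5 -> wider interval
edge = boot_ci(95, 100)   # estimate near 1  -> narrower interval
print(mid, edge)
```

The interval around .95 comes out roughly half as wide as the one around .5, exactly the bounded-parameter behavior the reviewers treated as a flaw.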


9. “The coin analogy is flawed”

The analogy is pedagogical, not literal. It illustrates that while we cannot infer the success probability of a single study from one outcome, we can infer the average success probability across a population of studies — just as one might estimate the average fairness of a population of coins.

This mirrors how z-curve infers average power: not by analyzing individual studies in isolation, but by modeling the ensemble.
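The many-coins argument can be simulated in a few lines (a hypothetical Python setup): no single coin’s bias is recoverable from its one flip, yet the ensemble’s average bias is.

```python
import random

random.seed(3)
n_coins = 100_000

# Each coin gets its own heads-probability; we observe ONE flip per coin.
biases = [random.uniform(0.3, 0.9) for _ in range(n_coins)]  # true mean = .6
flips = [random.random() < p for p in biases]

est_mean_bias = sum(flips) / n_coins
print(round(est_mean_bias, 2))  # ~0.6, despite one flip per coin
```

A single flip carries almost no information about its own coin, but 100,000 flips pin down the average bias of the population to within a fraction of a percentage point.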


10. “This work lacks novelty”

Clarification of foundational concepts is a valid scholarly contribution — especially when misunderstanding remains widespread.

  • Average power continues to be mischaracterized in the literature.
  • Authors such as Pek et al. argue it is invalid to estimate power from completed studies — a position we show to be logically and statistically incorrect.
  • Our manuscript synthesizes definitions, clarifies use cases, and corrects misinterpretations.

This is exactly what a commentary should do.


✅ Conclusion

None of the reviewers’ critiques undermine the central argument:

  • Average power is a meaningful population parameter.
  • It can be validly estimated from published studies using methods like z-curve.
  • These estimates are informative about replication, false discovery risk, and the reliability of research literatures.

We stand by the manuscript’s reasoning and hope it will contribute to better understanding in a different venue.


The Action Letter and Reviews

FW: Your Submission MET-2025-0121 – [EMID:1f61f4216ccfe419]

MET-2025-0121
An Average Power Primer: Clarifying Misconceptions about Average Power and Replicability
Psychological Methods

Dear Ms Soto,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods.

Your manuscript certainly touches on an important topic at the intersection of several current methodological research areas of interest to psychologists: power, replication, evaluating research literatures, and study selection. The interest in clarifying areas of confusion is commendable. Despite this, the three reviewers raised several concerns and all three ultimately recommended rejection of the current work. Their comments are thorough, so I will not reiterate them here. I hope they will be helpful to you as you revise this for another outlet.

If I were to make a suggestion in addition to the guidance offered by the reviewers, I might consider a different structure or format for the manuscript. The current submission seemed to be trying to accomplish several different goals: a primer on concepts related to average power, a directed commentary to specific points raised in papers by Pek and McShane, and a demonstration of z-curve – it might be better to focus more. If your goal is to develop a tutorial, I think that the scope would need to be expanded and the language made more precise to increase accessibility given the established earlier work on the topic. For example, the way different types of power are described differs slightly throughout the paper, which may lead to some confusion (see Reviewer 1’s comment on this). While I like that you used a coin example to teach the concept, I would suggest more carefully clarifying how using average observed power differs from using observed power on a single study (e.g. many coins vs 1 coin). Changing the definition of power upfront to focus on significance regardless of the true effect could also use additional clarification. Finally, replication should also be more explicitly defined, particularly in the context of there being multiple options (a few used by the open science framework study you use throughout the paper). Considering the paper as more of a commentary, I understand from your cover letter that you initially considered AMPPS – given that AMPPS specifically encourages commentaries, I might suggest reconsidering AMPPS, taking into account the critiques offered by the reviewers (e.g., considering what is gained by using only an average without considering variability).

For your guidance, I append the reviewers’ comments below and hope they will be useful to you.

Thank you for giving us the opportunity to consider your submission.

Respectfully,

Samantha F. Anderson, Ph.D.
Associate Editor
Psychological Methods

Reviewers’ comments:

Reviewer #1: This commentary has four aims. In particular, as explicitly discussed on page 3 of the introduction, it seeks to:

[1] Clarify the definition of average power.

[2] Argue that point and interval estimates of average power “can be used to assess the credibility of the original results despite uncertainty.”

[3] Provide a case study illustrating [2] in the context of data from the terror management literature.

Less explicitly, the commentary also seeks to:

[4] Serve as an apologia for a nascent forensic meta-analytic procedure dubbed “z-curve.”

I will discuss each of these in turn. Before doing so, I note that for the purposes of this review, I will not question the commentary’s focus on “statistical significance” and the purported notion that the goal of a study or replication thereof is to attain “statistical significance.” I disagree with this perspective but that is a different discussion and so I will for the most part adopt the stance of the commentary with regard to this matter.

[1] Definition of Average Power

The commentary defines average power and estimates thereof in a number of places including:

Abstract: “the hypothetical outcome if original researchers had to replicate their studies with new samples.”

Page 3: “[I]f researchers repeated their original studies exactly, using the same methods and sample sizes, how likely would they be to obtain significant results again in a new sample with new sampling error?”

Page 5: “estimate the expected success rate if the same study were repeated under identical conditions with a new random sample.”

Page 6-7: “hypothetical prospective replication studies that by definition are identical to the completed studies…The key question is what results one would expect in a hypothetical replication project where the original authors redo their studies exactly as they were done, but with a new sample.”

Page 9: “to wonder how many significant results terror management researchers would find if they redid their studies the same way with new samples.”

Page 16: “Average power is an estimate of the success rate of hypothetical exact replication studies that only differ in sampling error.”

These definitions are more or less correct with the last being the most precise of all because it invokes the notion of differing only with respect to sampling error. Nonetheless, even this definition is not quite correct because it confuses (as does the commentary in general) estimand and estimate.

Recall the following definitions:

– Estimand: Some quantity one wants to estimate (e.g., the average height of women in Spain).

– Estimator: A procedure applied to data that yields an estimate (e.g., the mean is the procedure that sums the data values and divides by the number of them).

– Estimate: The result of applying an estimator to data (e.g., 1.63 meters).

Therefore, I correct the definition given on page 16 to distinguish between estimand and estimate:

“Average power is the success rate of a set of hypothetical exact replication studies that only differ in sampling error. An estimate of average power is an estimate of this success rate.”
[where “success rate” is taken as meaning percentage of the replication studies in the set that yield “statistical significance”]

Importantly, the commentary also says what average power is not in a number of places including:

Abstract: “not to predict the outcome of future replication studies”

Page 3: “the primary purpose of estimating average power is not to predict outcomes of future replication studies”

Page 5: “The goal is not to assign a probability to the set of realized studies with a fixed outcome”

Page 7: “predicting outcomes of new replication studies is not the primary goal of estimating average power”

These all seem correct although I would omit “primary” from the last quotation.

I can see a place for a very short note that clarifies what precisely average power is and what it is not. This is valuable.

However, such a note is not necessary: this has already been done in the second page of the article by McShane, Bockenholt, and Hansen (2020) cited by the commentary (page 186 of that article using the journal page numbers).

[2] Use of Average Power and Estimates Thereof

I do not find average power useful and for a reason that is made abundantly clear by its definition in terms of differing only with respect to sampling error.

Sampling error is a fiction and we cannot repeat studies in a manner such that they “only differ in sampling error” as the definition on page 16 posits.

We are, as the commentary makes clear, in the realm of the hypothetical.

Or, as McShane, Bockenholt, and Hansen put it on their page 186, “average power is relevant to replicability if and only if replication is defined in terms of statistical significance within the classical frequentist repeated sampling framework.” In doing so, they concur that this framework means replication studies that differ only with respect to sampling error and emphasize that that framework is “purely hypothetical and ontologically impossible” (and that replication success need not be defined in terms of “statistical significance”).

Since everyone, including the authors of the commentary, seems to agree on this, I find it puzzling that the authors find estimates of average power useful and argue that:

Page 3: “average power estimates and their corresponding confidence intervals can be used to assess the credibility of original results despite uncertainty.”

I mean this may be narrowly true in the “purely hypothetical and ontologically impossible” fashion discussed above, but I am not sure what to do with it.

To illustrate this, let’s suppose that I know that the average power of some original set of studies is 50%. In doing so, let’s set aside the notion of estimation uncertainty and suppose that I am correct: average power really is 50% and I know it to be exactly this value.

How am I to use this to evaluate the credibility of a set of original studies even in a purely hypothetical fashion?

I do not think that an average of 50% tells me very much.

It could be that the average is 50% because all of the original studies had 50% power (let’s also set aside the objection about whether it is meaningful to speak of the power of a single study) in which case it would seem that studies in this domain should use larger sample sizes.

It could alternatively be that the average is 50% because half of the original studies were “null” and thus rejected 5% of the time and the other half had 95% power, in which case there is some moderator that one needs to identify to distinguish the two.

It could be many, many other things!

Average power simply cannot distinguish among them.

This is why I really enjoyed the authors’ discussion of coin flips on pages 5-6:

The authors correctly state that based on a single toss of a coin with probability p of heads (0 < p < 1), we cannot infer anything about p. However, based on a single toss each of many coins each with their own probability p_i, we can infer something about the average of the p_i.

However, as my 50% example above demonstrates, averages are often not so useful on their own. It is important to also know something about the variation in (or the distribution of) the p_i.

Unfortunately, with only a single flip of each coin, we cannot infer anything about the variation in (or the distribution of) the p_i. Therefore, we cannot distinguish between possibilities such as all 50%, half 5% and half 95%, and others that yield the same average.

To do so, we would need more granular information: we would need multiple flips per coin which we do not have.

[Of course, in the analogy, the coins are studies and a flip determines “statistical (non)significance” and we never get to “flip” a given study more than a single time.]
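[This point can be checked by simulation: two populations with the same average p_i but very different distributions produce statistically indistinguishable single-flip data. A sketch with hypothetical numbers:]

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Population A: every coin has p = 0.5.
p_a = np.full(n, 0.5)
# Population B: half the coins have p = 0.05, half have p = 0.95 (same average).
p_b = np.repeat([0.05, 0.95], n // 2)

flips_a = (rng.random(n) < p_a).astype(float)
flips_b = (rng.random(n) < p_b).astype(float)

# One flip per coin: both populations produce ~50% heads, and because a single
# Bernoulli outcome carries no information beyond its mean, the flip-level
# variances match as well (p(1-p) with p = 0.5 in both cases).
print(round(flips_a.mean(), 3), round(flips_b.mean(), 3))
print(round(flips_a.var(), 3), round(flips_b.var(), 3))
```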

As a final comment, in the quotation from page 3 given above, the authors discuss the “credibility of original results” and they also elsewhere discuss the “evidential value of published research” (page 1, page 3, page 7). These are ambiguous and undefined terms with no formal meaning in a technical literature. If the authors would like to continue using these terms, it would be helpful if they provided formal definitions. I am left wondering what precisely these vague concepts represent in the authors’ minds.

[3] Application to Terror Management

Given that the authors and I disagree on [2], it will be unsurprising that I do not find this application very compelling. That follows as a logical consequence.

However, I find average power particularly strange in this application. With 852 studies, it seems especially bizarre to me to focus on an average. There must be variation across these studies and known moderators that associate with or “explain” the variation. It seems strange to ignore this valuable additional information.

Also, I strongly disagree with the following claim on page 12:

“The important empirical conclusion is that the data do not rule out the possibility that the entire literature rests on false positive results. The fact that the confidence interval is very wide does not undermine this conclusion because the burden of proof is on researchers who want to provide evidence for their theory.”

This is rather presumptuous: Who are the authors to arrogate to themselves the right to choose on whom the burden of proof lies? Such haughtiness and superciliousness has no place in research, and each person can decide for himself or herself on whom to place the burden of proof.

Indeed, Neyman was quite clear when elaborating his decision theory that there was subjectivity involved in choosing which of two hypotheses would be the tested one and which would be the alternative one. He emphasized that two different people could have different perspectives and one person might reasonably choose one of the two hypotheses to be the tested hypothesis while the other person might reasonably choose the other of the two hypotheses to be the tested hypothesis (see, for example, page 106 of Neyman (1977), “Frequentist probability and frequentist statistics,” Synthese).

The authors also make a confusing statement on page 14 when they write “if all studies had an average power of 10%, selection for significance could not select for more powerful studies.” As mentioned above, it is not clear whether it is meaningful to speak of the power of a single study. However, it is certainly not meaningful to speak of the average power of a single study as in this quotation. What are you getting at here? The average of a single value is the value itself and so it is not meaningful to talk of an average of a single value as you appear to be doing here.

Finally, as discussed in greater detail below, this application moves beyond average power and so all parts of the application that do so are not germane to an “Average Power Primer” commentary. Please remain firmly focused on average power.

[4] Apologia pro z-curve

This commentary is about the concept of average power. It is not the place to showcase let alone mount a full-throated endorsement of a method that could at best be described as “emerging.” The z-curve is a new and unproven method. The papers introducing it are light on mathematical statistics. They instead make heavy use of analogy and metaphor and they use small numbers of (arguably questionably-parameterized) simulation studies to make rather broad and sweeping claims. Simulation studies are no substitute for mathematical statistics and are constrained by the creativity of the simulator as well as the motivation of the simulator to make a method look good or bad. Like all methods, the z-curve requires deep formal investigation before it is ready for the wild.

Another problem with ad hoc, improvised “emerging” methods like the z-curve is that they are a moving target. For example, should there be three unspecified components estimated along with their weights by minimizing the sum of the squared distance of a kernel density estimate and a theoretical quantity as in the z-curve of Brunner and Schimmack 2020? Should there be seven pre-specified components (at 0, 1, 2, 3, 4, 5, and 6) with weights estimated by maximum likelihood via the expectation-maximization algorithm as in the z-curve of Bartos and Schimmack 2022? Should it be whatever Schimmack and colleagues (or for that matter someone else) decide it should be at some future date? It is hard to know: the method is still under development and refinement and we should cautiously step back until it is fully baked.

Another reason the z-curve is out of place in a commentary on average power is that the purpose of the z-curve is not to estimate average power. Instead, the purpose of the z-curve is to estimate the distribution of power (the estimate of the average and other quantities is simply a byproduct of estimating the distribution).

Therefore, the wonderful coin flip analogy that the authors provide on pages 5-6 does not apply to the z-curve because the z-curve does not make use of a binary input (was the original study “statistically significant” or not) but rather makes use of a more granular continuous input (what was the observed power of the original study).

The current authors write on page 5 that observed power of a single study is “problematic”. It is unclear why they think that a method that takes in a bunch of “problematic” inputs like the z-curve can somehow launder them and provide something of value.

If your response is that the z-curve takes as input observed z-statistics rather than observed power, this may be technically true but it is not a relevant objection: observed z-statistics, observed p-values, and observed power are all one-to-one transformations of one another so it all amounts to the same thing.
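[The one-to-one mapping is straightforward to compute; a sketch using the usual normal-approximation formulas, with only the standard library:]

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

z_crit = 1.959964   # two-sided critical value at alpha = .05

def p_from_z(z):
    """Two-sided p-value from an absolute z-statistic."""
    return 2 * (1 - Phi(abs(z)))

def observed_power(z):
    """'Observed power': power computed as if |z| were the true noncentrality."""
    z = abs(z)
    return (1 - Phi(z_crit - z)) + Phi(-z_crit - z)

# z, p, and observed power carry exactly the same information; in particular,
# z = 1.96 corresponds to p = .05 and observed power of 50%.
for z in (1.0, 1.96, 2.8):
    print(round(z, 2), round(p_from_z(z), 4), round(observed_power(z), 3))
```

Each quantity is a monotone transformation of the others, which is the sense in which “it all amounts to the same thing.”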

More focused comments:

1. Page 4: The authors redefine power, which is rather curious. It also seems unnecessary. Regardless, in doing so, they mix up concepts from Neyman-Pearson decision theory. When the tested hypothesis and alternative hypothesis are both simple point hypotheses, power refers to a probability under the alternative. However, once either the tested or alternative is composite, we no longer talk of power in that way. Instead, we simply talk of the power function where one of the input parameters to the power function, which I will denote theta (consistent with the commentary authors), takes any value in the space Theta regardless of whether that value belongs to the partition being tested or to the alternative. See any mathematical statistics text for this, although I would point to Chapter 3 of Lehmann and Romano (page 57):

“The probability of rejection (3.2) evaluated for a given θ in Ω_K is called the power of the test against the alternative θ. Considered as a function of θ for all θ ∈ Ω, the probability (3.2) is called the power function of the test and is denoted by β(θ).”

There is thus no need to redefine power or discuss conditional versus unconditional power or any of this.

2. Page 4: After distinguishing among hypothetical power, observed power, and true power, the authors write “We refrain from the use of terms such as a prior and post-hoc power because power calculations can be conducted before and after a study, and power calculations after a study can use hypothetical values or observed data.”

I find this a bit weaselly or overly rhetorical or something. Specifically, with regard to the schema the authors introduce:

– Hypothetical power can be evaluated a priori or post hoc.

– Observed power is necessarily post hoc.

– True power is simply not relevant because it is never known (whether a priori or post hoc).

Average power is simply an average of post hoc powers (filtered through—of course and as always—some model for the data).

Also, given that the authors seem to abandon this schema right after introducing it, I am not sure that they need introduce it in the first place.

3. Page 5: You criticize Pek here and throughout but I am not sure it is so on the mark. For instance, consider the following from page 364 of Greenland (especially the bit about the horses):

“Among the problems with power computed from completed studies are these:
1. Irrelevance: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.
2. Arbitrariness: There is no convention governing the free parameters (parameters that must be specified by the analyst) in power calculations beyond the alpha-level.
3. Opacity: Power is more counterintuitive to interpret correctly than P values and confidence limits. In particular, high power plus ‘nonsignificance’ does not imply that the data or evidence favors the null (6).”

Greenland (2012), “Nonsignificance Plus High Power Does Not Imply Support for the Null Over the Alternative.” Annals of Epidemiology.

4. Page 5, Page 8, etc.: You frequently label power a parameter or liken it unto one. Power is not a parameter or akin to one!

5. Page 10: Why did Chen only code |z| > 1.96 studies? Can you elaborate? This seems like a poor choice.

6. Page 12: You write “Power is limited by alpha and cannot be less than 5%.” This is not the case. See, for example, Chapter 4 of Lehmann and Romano.

7. Page 14: What is “zing”?



Reviewer #2: The manuscript “An average power primer: Clarifying misconceptions about average power and replicability” (MET-2025-121) engages with the perspectives presented by McShane et al. (2020) and Pek et al. (2024). The authors forward that: (a) using observed data to estimate true power to evaluate the credibility of published studies does not constitute an ontological error and (b) high uncertainty of an estimate does not negate its diagnostic value.

The manuscript forwards the use of computing average power with the Z-curve, but certain aspects of the interpretation of average power could benefit from additional clarification to mitigate potential ontological misunderstandings. Furthermore, the concepts of “credibility” and “evidentiary value,” although associated with average power, currently lack formal definitions, which may impact their interpretability. Lastly, the rationale for maintaining the diagnostic relevance of estimates with high uncertainty would be strengthened by elaboration. Below, we expand on the points, seeking more clarity on the author(s)’s position while pointing to relevant findings by McShane et al. (2020).

ONTOLOGICAL ERROR. An ontological error occurs when power is ascribed in a way that does not align with its natural (ontological) status. Classical power quantifies the pre-data performance of a study/design/procedure/test over hypothetical random data. Throughout, we use the term test in place of study/design/procedure/test. Average power (i.e., the average of a population of heterogeneous tests) also quantifies test performance over hypothetical random data. An ontological error occurs when the probability of power over random data (i.e., in the context of pre-data) is interpreted as a post-data probability (i.e., applied to fixed/observed data).

The authors recognize the ontological error in the following statements:
p. 5: “We also do not discuss the problematic use of observed power to evaluate the results of a single study (Hoenig & Heisey, 2001).”

p. 5: “The goal is not to assign a probability to a set of realized studies with a fixed outcome, but to estimate the expected success rate if the same study were repeated under identical conditions with a new random sample.”

However, some instances of vague phrasing could potentially be misinterpreted in a way that suggests an ontological error, where the pre-data concept of power might appear to be applied post-data.

The following phrases may unintentionally contribute to this misunderstanding.

p. 4 – 5: “…power calculations after a study can use hypothetical values or observed data.” It would be helpful to clarify whether these calculations are interpreted as pre-data power or have a post-data interpretation.

p. 6: “… the long run probability of studies to produce the observed outcome.” It is unclear in this statement whether power is treated as a pre-data concept or a post-data concept (i.e., observed power).

p. 7: The following sentence, “Taken at face value, this finding implies that studies had an average power of 97%” could be made clearer by being specific about whether power is applied to pre-data or post-data studies.

Perhaps a helpful distinction to make would be the difference between average (pre-data) power versus average observed power, in which the latter (like observed power) is a transformation of the p-value, as described in Hoenig & Heisey (2001).

CREDIBILITY AND EVIDENTIARY VALUE. On p. 3, the authors state “We clarify that the primary purpose of estimating average power is not to predict outcomes of future replication studies, but to evaluate the credibility of the published studies by estimating their true average power.” Later, on p. 7, they further explain: “Estimates of average power, however, make it possible to distinguish credible results that were driven by true effect sizes from sets of studies with low average power that may contain a large percentage of false positive results (Sorić, 1989).” The term “credibility” is referenced throughout the manuscript but has not been formally defined, leaving room for multiple interpretations. Clarification on how pre-data average power or average observed power are conceptually linked to credibility would help ensure a more precise understanding of its application. A clarification of “true power” and how it helps distinguish “credible results” would also strengthen the argument.
Additionally, on p. 7, the authors write: “In short, average power estimation provides a diagnostic tool for evaluating the evidential value of sets of studies.” Similar to the term “credibility,” the term “evidential value” has not been explicitly defined. A more detailed explanation of this concept in relation to pre-data average power and credibility would enhance interpretability.

UNCERTAINTY. Including more information in terms of multiple tests in average power might decrease uncertainty in estimation (shown by McShane et al. 2020). In average power, there are two sources of uncertainty in its hierarchical setup in which tests are nested within a population of tests. Sampling variability is termed epistemic uncertainty in that it is reduced by increasing the number of sampling units (i.e., tests). However, between-test variability does not decrease as the number of tests increases. This kind of uncertainty is called aleatory uncertainty. When more tests are sampled, the average power estimate becomes more precise with diminishing sampling variability, but the distribution representing the heterogeneity between tests (aleatory uncertainty) is only better estimated, not reduced. A key point to consider is whether the average is a good summary statistic of a heterogeneous population. Heterogeneous populations might have multiple modes, a skewed or even a uniform distribution. It would help to clarify when the mean is a reasonable summary of a heterogeneous distribution.
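[The epistemic/aleatory distinction can be illustrated by simulation, using a hypothetical Beta distribution of true powers:]

```python
import numpy as np

rng = np.random.default_rng(7)

def study(k, reps=500):
    """Draw k tests from a hypothetical heterogeneous true-power distribution,
    repeated reps times. Returns (SD of the estimated mean across repetitions,
    average within-sample SD of true power)."""
    means = np.empty(reps)
    sds = np.empty(reps)
    for i in range(reps):
        power = rng.beta(2, 3, size=k)    # assumed true-power distribution
        means[i] = power.mean()
        sds[i] = power.std()
    return means.std(), sds.mean()

se_small, spread_small = study(50)
se_large, spread_large = study(5000)

# Epistemic uncertainty (the SE of the estimated mean) shrinks with more tests;
# the aleatory spread (between-test SD) stays essentially constant.
print(round(se_small, 4), round(se_large, 4))
print(round(spread_small, 4), round(spread_large, 4))
```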

Furthermore, McShane et al.’s (2020) concern about uncertainty is that the width of the CIs about average power depends on the corresponding point estimate of average power. Thus, tight CIs will be obtained at low and high estimated values of average power, and large CIs will be observed at estimates of average power close to .50. As stated by McShane et al. (2020), a narrow CI does not purely indicate estimate precision, but that the estimated average power is close to its theoretical bounds; in contrast, a wide CI would indicate that the average power estimate is close to .50. The CIs are inherently tied to the point estimate and do not serve as an independent measure of estimate precision. Thus, they cannot be interpreted as such.

Minor points:
The definition used in the manuscript – specifically, requiring p < .05 twice for similarly implemented studies (cf. Open Science Collaboration, 2015) – is relatively narrow. Several methodologists (e.g., Anderson & Maxwell, 2016; Fabrigar & Wegener, 2016, 2017; Fife & Rodgers, 2021) have raised concerns about the limitations of this definition and have proposed alternative perspectives on replication. Acknowledging these differing viewpoints would provide a more comprehensive discussion and enhance the manuscript’s engagement with the broader literature on replication. Additionally, Gigerenzer (2018) has critiqued the emphasis on statistical significance (cf. the concept of power), characterizing it as a ritual that might lead to misconceptions. Considering this critique, as well as similar points raised by McShane et al. (2020), further discussion on how the author(s)’ perspective aligns with or diverges from these concerns could be valuable. How might the author(s)’ perspective on replication address these other points made by the discipline?

p. 7: Define “false positive risk.”

A technical issue: The EDR has a CI with a lower limit of 0.05. If one were to use this CI to conduct a hypothesis test, H_0: EDR = 0.05 (not one-sided) cannot be rejected. Because 0.05 is the lowest possible EDR, EDR = 0.05 implies that the distribution of the noncentrality parameter is concentrated at a single value of zero. However, the ERR has a CI whose lower limit is 0.12. This is contradictory because a zero noncentrality means the ERR should be 0.05 (or even lower if directional). The reason for this inconsistency is likely the bootstrapping of the CI, which may not be technically sound near a parameter boundary.

p. 7: “selection bias systematically inflates the observed success rate.” This statement makes an untestable assumption that selection bias only decreases the publication rate of insignificant results. However, it is also possible that the publication rate for significant results is also imperfect.
For example, suppose the true EDR is 0.6; without any publication bias this can be written as 0.6 = 0.6 * 1 / (0.6 * 1 + 0.4 * 1). Now suppose that the publication rate for significant studies is 0.5 and that for insignificant studies is 0.6; then the ODR is 0.6 * 0.5 / (0.6 * 0.5 + 0.4 * 0.6) = 0.56 < 0.6.
So, for ODR to be less than EDR, it seems that we need to assume that the selection function is monotone (more likely to select significant studies for publication than insignificant ones).
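[This arithmetic generalizes to any selection function; a small sketch, where the helper name odr is purely illustrative and not part of the zcurve package:]

```python
def odr(edr, pub_sig, pub_nonsig):
    """Observed discovery rate given the true EDR and the publication rates
    for significant and nonsignificant results."""
    sig = edr * pub_sig
    nonsig = (1 - edr) * pub_nonsig
    return sig / (sig + nonsig)

print(round(odr(0.6, 1.0, 1.0), 2))   # no selection: ODR equals EDR
print(round(odr(0.6, 0.5, 0.6), 2))   # non-monotone selection: ODR < EDR
print(round(odr(0.6, 1.0, 0.3), 2))   # monotone selection: ODR > EDR
```

Only a monotone selection function (significant results more likely to be published than nonsignificant ones) guarantees ODR > EDR, which is the reviewer's point.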

p. 8: “From this finding, we can also infer that effect size estimates are likely inflated because the true population effect size would not produce a significant result. This conclusion is implied by the statistical fact that a p-value equal to alpha (.05) corresponds to 50% power.” These statements could benefit from additional clarification. The true effect size and power are population quantities whereas statistical significance and p-values pertain to sample quantities. Given this distinction, further explanation of how a direct correspondence between the two is established would enhance the clarity of the argument.

p. 8: The phrase “observed parameters” seems contradictory. Parameters are often used to describe population values that are unobserved, whereas observed estimates describe statistics calculated from sample data (which are estimates of population parameters).

p.2: The authors claim that the Z-curve is based on solid mathematical foundations and has been validated with extensive simulation studies. The inferential targets of Z-curve (EDR, ERR, and FDR) could be more carefully examined because analytics and the reported simulations do not consider potential violations of model assumptions in empirical data (e.g., how publication bias occurs; see MacCallum, 2003 on working with imperfect models). It is unlikely that these model assumptions match empirical data, and it remains to be examined how Z-curve performs under less-than-optimal conditions that could better reflect reality.

The development of the Z-curve might be better contextualized with related forensic meta-analytic procedures, which are designed for assessing the quality of the evidence in a set of results (Morey & Davis-Stober, under review). There is much discussion on the usefulness of such forensic methods (e.g., Gelman & O’Rourke, 2014), and it would be helpful to readers to provide a review of this literature. Papers on related forensic meta-analytic procedures include Morey (2013), Pek et al. (2022), Bishop and Thompson (2016), Erdfelder and Heck (2019), Montoya, Kershaw, and Jergens (2024), and Ulrich and Miller (2018).

p. 12: The author(s) report a range of possible false discovery risk (FDR) values from .26 to 1.0. It is important to recognize that the FDR is an upper bound to the false finding rate (FFR; Ioannidis, 2005; Pashler & Harris, 2012). However, further clarification is needed to determine whether FDR functions as a tight upper bound to FFR. Additionally, the statement “FDR of 100% creates reasonable doubt about the credibility of evidence” may benefit from more nuanced wording. Since 100% is an upper bound of any lower value, including 0%, the actual FFR remains uncertain based solely on information about the FDR. Given this uncertainty, it may be more prudent to interpret the lower bound of the FDR cautiously when assessing the likelihood of false findings.

References
Anderson, S. F., & Maxwell, S. E. (2016). There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance. Psychological Methods, 21, 1-12. https://doi.org/10.1037/met0000051

Bishop, D. V., & Thompson, P. A. (2016). Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential value. PeerJ, 4, e1715. https://doi.org/10.7717/peerj.1715

Erdfelder, E., & Heck, D. W. (2019). Detecting evidential value and p-hacking with the p-curve tool. Zeitschrift Für Psychologie, 227(4), 249-260. https://doi.org/10.1027/2151-2604/a000383

Fabrigar, L. R., & Wegener, D. T. (2016). Conceptualizing and evaluating the replication of research results. Journal of Experimental Social Psychology, 66, 68-80. https://doi.org/10.1016/j.jesp.2015.07.009

Fabrigar, L. R., & Wegener, D. T. (2017). Further considerations on conceptualizing and evaluating the replication of research results. Journal of Experimental Social Psychology, 69, 241-243. https://doi.org/10.1016/j.jesp.2016.09.003

Fife, D. A., & Rodgers, J. L. (2021). Understanding the exploratory/confirmatory data analysis continuum: Moving beyond the “replication crisis”. American Psychologist. https://doi.org/10.1037/amp0000886

Gelman, A., & O’Rourke, K. (2014). Discussion: Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics, 15(1), 18-23. https://doi.org/10.1093/biostatistics/kxt034

Gigerenzer, G. (2018). Statistical Rituals: The Replication Delusion and How We Got There. Advances in Methods and Practices in Psychological Science, 1(2), 198-218. https://doi.org/10.1177/2515245918771329

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19-24. https://doi.org/10.1198/000313001300339897

Ioannidis, J. P. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124

MacCallum, R. C. (2003). 2001 presidential address: Working with imperfect models. Multivariate Behavioral Research, 38(1), 113-139. https://doi.org/10.1207/S15327906MBR3801_5

McShane, B. B., Böckenholt, U., & Hansen, K. T. (2020). Average Power: A Cautionary Note. Advances in Methods and Practices in Psychological Science, 3(2), 185-199. https://doi.org/10.1177/2515245920902370

Montoya, R. M., Kershaw, C., & Jurgens, C. T. (2024). The inconsistency of p-curve: Testing its reliability using the power pose and HPA debates. PLoS ONE, 19(7), e0305193. https://doi.org/10.1371/journal.pone.0305193

Morey, R. D. (2013). The consistency test does not-and cannot-deliver what is advertised: A comment on Francis (2013). Journal of Mathematical Psychology, 57(5), 180-183. https://doi.org/10.1016/j.jmp.2013.03.004

Morey, R. D., & Davis-Stober, C. P. (2024). On the poor statistical properties of the p-curve meta-analytic procedure. Unpublished manuscript, School of Psychology, Cardiff University, Cardiff, United Kingdom.

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. https://doi.org/10.1126/science.aac4716

Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments examined. Perspectives on Psychological Science, 7(6), 531-536. https://doi.org/10.1177/1745691612463401

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2022). Avoiding Questionable Research Practices Surrounding Statistical Power Analysis. In W. O’Donohue, A. Masuda, & S. Lilienfeld (Eds.), Avoiding Questionable Practices in Applied Psychology (pp. 243-267). Springer. https://doi.org/10.1007/978-3-031-04968-2_11

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2024). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. https://doi.org/10.1037/met0000577

Sorić, B. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association, 84(406), 608-610. https://doi.org/10.1080/01621459.1989.10478811

Ulrich, R., & Miller, J. (2018). Some properties of p-curves, with an application to gradual publication bias. Psychological Methods, 23(3), 546-560. https://doi.org/10.1037/met0000125





Reviewer #3: An Average Power Primer: Clarifying Misconceptions about Average Power and Replicability

This is an interesting paper that leverages published work and an R package (and an alternative) to help readers understand replication success in terms of average power. The ideas are not original here; they come from prior work. The context is important. The article aims to address issues with this general approach that others have downplayed and that the authors espouse, citing recent literature on the topic.

The manuscript (ms.) discusses replication studies and their importance, but it is itself doing something different (estimating the average power of published studies). My understanding from recent work is that the power of an original study is calculated differently than the power of a replication; there are corrections. For example, in this journal Anderson and Kelley have essentially a treatise on study design for replication, both from a power perspective and an accuracy perspective. They make clear why power analysis for a replication – where publication bias is very likely to have occurred – needs to be done differently. Thus, replication studies should not be planned like original studies, which I think the authors here would agree with. Work by Anderson and colleagues also provides methods for actually doing this planning, which involves finding a maximum likelihood estimate of the true parameter and using it for study design. The authors here are not doing that, per se, but they are trying to back into the power of a research area from the published studies. I think this is easier said than done: the necessary assumptions are neither well specified nor reasonable, in the sense that a researcher planning a study is in a particular context, whereas the collection of published studies in an area will come from multiple contexts, populations, various treatment effect sizes, etc.

The use of power or sample size planning more generally clearly has many advantages in original research. As well as when planning replications. But there are multiple ways of thinking about sample size planning. And there are multiple ways of thinking about power, even if the average power from a given area is worth estimating (realizing each study is not an exact replication itself).

The authors say “We clarify that the primary purpose of estimating average power is not to predict outcomes of future replication studies, but to evaluate the credibility of the published studies by estimating their true average power.” Calling into question the credibility of a corpus of studies from an area because the average power is low would be akin to calling into question the health of individuals writ large in a city in which the average health is not good. Of course the average is useful, but it misses important things (namely the various upper percentiles). I am not even sure we need “true average power,” as studies tend to differ on various dimensions. Average power says nothing about the most properly conducted and planned studies. Sure, we can average over all of that, but that does not address (a) how to plan an original study or (b) how to plan a replication study. It also would not be good to talk about the replication effectiveness of an entire area because of the average power in published research. If few studies are exact replications, does the average power of them even make sense?

The authors say “if researchers repeated their original studies exactly, using the same methods and sample sizes, how likely would they be to obtain significant results again in a new sample with new sampling error?” Well, it depends on whether the original studies are those that were simply finished or those that were published (and have significant findings). The average power of significant findings that are published is clearly not the same as the average power of all studies ever conducted on the topic (notwithstanding the point about stepping into the same river twice, which maybe can or maybe cannot be done; McShane and Böckenholt, this journal).

Design considerations beyond power would strengthen the manuscript. The z-curve work is not new here, and Psychological Methods is not, generally speaking, the place for illustrations of existing methods.

The authors define power as follows: “We therefore define power as the unconditional probability of producing a significant result (Bartoš & Schimmack, 2022).” But power depends on the null hypothesis being false (definitionally). They state that “The conditioning on a non-zero effect size makes sense for a priori power analysis, but it cannot be applied to estimates of true power because some studies may have population effect sizes of zero.” To the extent that the idea they are after should be quantified, the quantity they reference should be “average power,” not “power.” As different studies consist of different populations and different contexts, mixing multiple theoretically different values of power into an average is limited, at best. They go on to define three types of power, which is somewhat useful, but they also get into a confusing set of terms. A table would be useful here.

Regarding the replication project and post hoc power, the authors state that “The goal is not to assign a probability to the set of realized studies with a fixed outcome, but to estimate the expected success rate if the same study were repeated under identical conditions with a new random sample.” But to do this, one needs more than just the p-value of the obtained study of interest. It is not the same as estimating the true proportion of a binary outcome because of publication bias.

The coin example is not good because, again, conditional probability is important. If the null hypothesis is true, the probability of rejecting it is 5% (or alpha more generally). But if the specified parameter setting and assumptions are satisfied, 50% power does mean that obtaining significance would be like a coin flip. But only in this conditional situation: not only must the null be false, it must be false to exactly the tune specified in the power analysis.
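The conditional nature of this claim can be checked numerically. A quick illustration (my own sketch, using a two-sided normal test rather than any test from the manuscript): power equals 50% exactly when the true expected z-value coincides with the critical value, while under the null the rejection probability is alpha.

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)  # two-sided critical value, about 1.96

# If the true noncentrality (expected z-value) equals the critical value,
# the probability of exceeding it -- i.e., power -- is exactly 50%
# (ignoring the negligible mass in the opposite tail).
power = norm.sf(z_crit - z_crit)
print(round(power, 2))  # 0.5

# Under the null (noncentrality 0), the rejection probability is alpha.
type1 = norm.sf(z_crit) + norm.cdf(-z_crit)
print(round(type1, 3))  # 0.05
```

The coin-flip analogy therefore holds only under the exact parameter setting assumed; any other true effect size yields a different rejection probability.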

The authors might benefit from discussing in more detail what is meant by the ontological error.

The authors state: “The original studies used for the Reproducibility Project had a 97% success rate (Open Science Collaboration, 2015). Taken at face value, this finding implies that the studies had an average power of 97%.” But I do not agree with this. This statement needs to be nuanced considerably. The issue is that there is a selection effect, and selection effects change meanings. The published studies were in part (if not wholly) selected due to their significant findings.

The authors aim to “argue that the primary purpose of average power estimation is not to predict the outcome of future replication studies, but the hypothetical outcome if original researchers had to replicate their studies with new samples.” But I disagree with this. Average power is only getting at the mean power (unconditionally as no covariates are used) of published studies in the literature.

The authors seek to “clarify that the primary purpose of estimating average power is not to predict outcomes of future replication studies, but to evaluate the credibility of the published studies by estimating their true average power.” But consider only a single study is done and published. How does this help us? How many studies of exactly the same scenario need to exist in order to have something reasonable to say? As most published studies (all?) are in some way different (population, age, treatment, time, context, etc.), averaging studies for power and then making an assessment about credibility is filled with potential issues and nuances.

The authors have done published work on this topic already. Other than the terror research findings, which are more aptly described in that literature or only used as an example (as is the case here), I do not find the manuscript to be contributing new information to the Psychological Methods readership. I am sympathetic with the sentiment of Pek et al. (2024) regarding post hoc power. In any instance we have a single draw from some hypothetical distribution. And to the extent to which a study is held exactly constant, we could learn something from many more draws from that study. Of course, one could attempt a meta-analysis, too. But regardless, I do not find that there is enough here for the Psychological Methods readership.

Fake Novel Meta-Science: Why Publish Crap When You have Tenure?

Meta-science is not some special science or Über-science. It works just like other science, and in psychology that means it is not working very well. The biggest problem is that theoretically and statistically oriented researchers are subject to the same capitalistic pressure to produce output that can be measured and that contributes to GDP by selling their work to publishers, who earn profits, which makes researchers’ work at least economically valuable. I mean, most of us have tenure and could spend the time we are not doing our paid job (teaching) sailing or playing tennis, but for some reason some engage in meaningless pseudo-scientific work, flood scientific journals with garbage, and waste other people’s time reading their garbage instead of playing tennis. In short, I am frustrated, and here is why.

I am looking for articles that discuss Vevea’s random-effects selection model implemented in the R package weightr. This summer I found that this is the best way to conduct effect-size meta-analyses in psychology because it models heterogeneity in effect sizes across conceptual replication studies and models the selection bias that is pretty much always present in psychology.

This is how I found the article “Effect Size Estimation From t-Statistics in the Presence of Publication Bias” by Ulrich, Miller, and Erdfelder. I once contacted Ulrich about z-curve, but he was not interested. Erdfelder was behind G*Power and deserves credit for this free program that helped me to understand power back in 2013.

The article is a long and boring review of different methods to correct for publication bias in effect size estimation. The main point of this review is only to conclude that they all suck and that this is the reason why they developed their own model.

“In sum, all of these approaches are either highly complex or make strong and somewhat implausible assumptions about the weight functions, thus calling for an alternative approach that is based on weaker and more flexible assumptions but still relies on a simple selection model with clear-cut interpretations of the parameters. This motivated our own approach presented in the following section.”

Then follows a long introduction of their model with lots of formulas and figures that make it look really, really scientific, and probably scared the shit out of the reviewers at “Zeitschrift für Psychologie,” where this nonsense was published after it was probably rejected from serious method journals.

The “new” model is a simple fixed-effect model that assumes all studies have the same population effect size. This makes the model as useless as p-curve or p-uniform for the analysis of most real datasets in psychology, which typically have a high amount of heterogeneity. Any article with headings like “Maximum Likelihood Estimation of True Effect Size” is not worth reading because there is no True Effect Size in psychology. The only scenario that works well for a fixed-effect model is a set of studies where all studies have no effect and significant results were obtained with p-hacking. When there are real effects, they are not all the same across studies. At a minimum, a test of heterogeneity should be conducted before a fixed-effect model is used.

To show how good their new method is, they apply it to a meta-analysis of priming studies (Shanks et al., 2015). Priming studies are prime candidates for finding no effect and a lot of p-hacking. Their new, super-duper, so much better model estimated a TRUE EFFECT SIZE of d = .44, 95% CI = .31 to .57. Wow, priming works. We were all wrong and should apologize to Bargh. Maybe Einstein primes also make people really smarter. Where is the barf emoji in WordPress?

Of course, they do not compare the result to other methods, which is often recommended because “no method works well in all circumstances” (Carter et al., 2018), but hey, their review showed the other methods suck, so why bother. Fortunately, the Shanks data are openly available (really, open data sharing is the best thing that has come out of the replication crisis), and I can present the results.

Let’s start with the default random-effects selection model implemented in weightr. This model is called the 3PSM model because it has three parameters: one for the selection bias at p < .05 (two-sided), and the mean and standard deviation of the population effect sizes (i.e., without sampling error).

There is surprisingly little heterogeneity. The standard deviation of effect sizes is estimated to be .09. Thus, 95% of effect sizes are assumed to fall within about .2 (two standard deviations) of the mean. The mean is estimated at d = .44, 95% CI = .33 to .55. Maybe that is why Ulrich and co did not report model comparisons. Who needs a new model if the old model works just fine? No publication. Where is the crying-face emoji?

But it gets better. Many of the non-significant effect sizes are marginally significant and were reported as evidence against the null hypothesis. We can model this with the random-effects selection model by adding a step at p = .05 (p = .10, two-sided, is typically the upper value to get away with marginal significance). Indeed, 8 non-significant results are marginal and only 2 are not, because they are marginally-marginally significant. With a small dataset it is difficult to estimate multiple parameters. One way to deal with marginally significant results would be to fit the model only to “significant” results. Another is to lower the significance criterion to treat marginal results as significant. All p-values were below p = .14, two-sided, so I set the significance criterion to .14. In this model, the mean was d = .27, 95% CI = .18 to .36. Using only the results with p-values below .025 (one-sided) produced an estimate of d = .33, 95% CI = .22 to .44.

In short, the article fails to show that the new model is needed or better than previous models. In fact, it is worse because it does not estimate heterogeneity in effect sizes. This is not a major problem for the application to Shanks’s priming data, but it would be a problem for other meta-analyses. Using a fixed-effect model with heterogeneous data is scientific malpractice and leads to wrong conclusions.

But what do we do with priming? Should we really believe that priming works in these studies? How robust are these results? Could they be replicated? With most sample sizes below N = 100, effect sizes in a single study are inflated, and correction models can only do so much to estimate the true values.

Another way to look at these data is to use z-curve. Being generous, we simply compute the ratio of effect sizes over sampling error and use the t-values as if they were z-values. This produces an upward bias in studies with small samples.
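The “generous” shortcut described above, and the more careful alternative, can be sketched as follows. This is my own illustration, not the z-curve code: the naive route treats a t-value as a z-value, while the exact route converts through the two-sided p-value, which removes the small-sample bias mentioned in the text.

```python
from scipy.stats import t as t_dist, norm

def t_to_z_naive(t_value):
    """Shortcut used above: treat the t-value as if it were a z-value.
    Overstates the evidence in small samples (t has fatter tails than z)."""
    return t_value

def t_to_z_exact(t_value, df):
    """Convert via the two-sided p-value, so the resulting z-value implies
    the same p-value as the original t-test."""
    p_two_sided = 2 * t_dist.sf(abs(t_value), df)
    return norm.isf(p_two_sided / 2)

# With a small sample (df = 38), the exact z is noticeably below the t-value.
print(t_to_z_naive(2.5), round(t_to_z_exact(2.5, 38), 2))
```

For large df the two conversions converge, which is why the shortcut is tolerable for large-N studies but biased upward for small ones.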

Z-curve makes it clear that non-significant results (left of z = 1.96) are clustered just below the criterion for significance. Treating these observations as honestly reported non-significant results would be a mistake and would create an upward bias in effect size estimates. It also shows that most significant values are clustered close to the criterion for significance, which suggests p-hacking. Based on this distribution of z-values, the model infers that many more priming studies were conducted but not reported because they failed to show support for priming. The expected discovery rate (the estimated percentage of significant results among all conducted studies) is only 18%, but this estimate is uncertain: it could be as high as 29% or as low as 5%. Because 5% is expected by chance alone, the false positive risk is 100%. That is, we cannot rule out that all of these results were obtained without a real effect, which is not implausible when we look at priming studies.

The local power estimates below the x-axis also show that even studies with a z-value of 3 (observed power > 80%) have only 24% power to show a significant effect again. The only studies worthwhile to follow up on are 3 results with z > 4, but I would not bet money on these results to replicate successfully. After all, it is priming research.

Ok, now it is time to go and play tennis and have some fun. I am not getting paid enough to spend more time on this shit, but maybe I can save some people from wasting time on this article or other articles by authors with such a low degree of trustworthiness. If you care about effect-size meta-analysis, read some tutorials on weightr. It is the best program we have so far.

Mindful Meta-Analysis: How to Interpret Heterogeneity in Meta-Analyses of Conceptual Replication Studies

It is widely recognized that no single study is conclusive. Credible evidence requires successful replications by independent researchers (Cohen, 1994). The credibility crisis in psychology has revealed that many hypothesis tests lack robust support for two reasons. First, direct replication studies are rare because they are difficult to publish. Journals have a strong bias to publish new findings rather than reporting that a finding can be successfully replicated. Second, journals have a strong bias in favor of statistically significant results. Thus, it is also hard to publish replication failures with a non-significant result. As a result, psychology journals mostly publish new discoveries supported by statistically significant results, p < .05.

Statisticians have long noted that selective reporting of significant results undermines the credibility of significance testing (Sterling, 1959). When only significant findings are published, statistical significance loses its ability to distinguish between true and false claims. Under such conditions, even implausible results—like time-reversed causality—can appear credible (Bem, 2011). By the early 2010s, researchers recognized that more than half of published findings might be false positives (Simmons, Nelson, & Simonsohn, 2011). This realization helped trigger psychology’s replication crisis. As psychologists began conducting and reporting direct replications, the success rate proved to be alarmingly low (Open Science Collaboration, 2015).

One reason it took so long to recognize psychology’s replication problem is that meta-analyses suggested false positives were rare and that replication failures were mostly false negatives caused by low statistical power. A meta-analysis in the prestigious journal Psychological Bulletin even suggested that paranormal phenomena are real (Bem & Honorton, 1994). During the credibility crisis, it became clear that many meta-analytic findings are unreliable because publication bias inflates the combined effect size estimates. To address this issue, researchers have developed statistical methods that adjust for publication bias (see Carter et al., 2019 for an overview of methods). These methods sometimes reveal that the corrected effect size is not statistically significant, implying that the true effect may be zero (Carter et al., 2019; Meier et al., 2022).

While bias correction is an important improvement, these methods often overlook another major issue: the studies included in psychological meta-analyses are rarely direct replications. Instead, they are typically conceptual replications, varying in procedures and dependent variables. As a result, it is likely that the true effect sizes differ across studies. Indeed, when tested, many meta-analyses reveal substantial heterogeneity (Mertens et al., 2022), yet this variability is often ignored in interpreting results.

In this paper, I present four examples showing how conclusions based on the average effect size change when heterogeneity is taken into account. These examples also serve as a tutorial for investigating heterogeneity in meta-analyses and offer guidance for reanalyzing existing work and conducting more meaningful meta-analyses in the future.

A Brief Introduction to Meta-Analysis

Meta-analyses are used to combine the results of multiple studies to obtain better estimates of population effect sizes. Original studies often have low statistical power to reject false null hypotheses when effect sizes are small. Combining the results of several studies with low power creates a single dataset with high statistical power. For example, after 10 studies with N = 40 participants have been published, a fixed-effect meta-analysis has the power of a single study with N = 400 participants (Schimmack, 2012).
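The gain from pooling can be illustrated with a normal-approximation power calculation. The effect size (d = .5) and the equal split into two groups are my own assumptions for illustration, not values from the text.

```python
from math import sqrt
from scipy.stats import norm

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison,
    using the normal approximation to the t-test."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = d * sqrt(n_per_group / 2)  # expected z-value
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

d = 0.5  # assumed medium effect
# One study: N = 40 (20 per group); pooled data: N = 400 (200 per group).
print(round(power_two_group(d, 20), 2))   # modest power for a single study
print(round(power_two_group(d, 200), 2))  # near 1 for the pooled data
```

A single N = 40 study has only about one-in-three power for this effect, while the pooled N = 400 dataset is virtually certain to detect it.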

Meta-analyses work well when (a) studies are sufficiently similar so that all studies have similar population effect sizes, and (b) there is no bias in the reporting of results. A good example are the Cochrane reports in medicine (Higgins et al., 2023). Cochrane reports use only studies that test the same treatment on the same outcome. Surprisingly, they also show no sign of publication bias. As a result, a fixed effects meta-analysis of these studies solves the problem of low power in clinical trials and provides useful estimates of the true treatment effect.

In contrast, the application of standard meta-analysis to psychological results ignores that psychological studies are rarely direct replications. To solve this problem, psychologists started using random-effects meta-analysis (Hedges & Vevea, 1998). However, the use of a random effects model did not substantially alter the interpretation of meta-analyses. The main focus remained on the estimate of the average effect size. Another problem is that the standard random effects model does not take publication bias into account and produces inflated estimates when bias is present (Carter et al., 2019).

The problem with methods that correct for bias is that they are essentially fixed-effect methods. That is, they estimate a single effect size and do not estimate heterogeneity of effect sizes. Thus, meta-analysts have to choose between methods that estimate heterogeneity and methods that correct for bias. This creates problems when models show evidence of both heterogeneity and bias (Mertens et al., 2022). Fortunately, there is one method that can estimate bias, correct for bias, and estimate heterogeneity (Carter et al., 2019; Hedges & Vevea, 1996). These models are known as random-effects selection models (RESM). However, the model was difficult to implement, and a user-friendly R package was only published in 2016 (Coburn et al., 2019). Unfortunately, tutorials and applications of the RESM continue to focus on the average effect size estimate and ignore the estimate of heterogeneity (Carter et al., 2019; Meier, VanderWeele, & Mathur, 2022). Here I show how information about heterogeneity in effect sizes can be interpreted and why it is misleading to focus on the average effect size when interpreting meta-analyses of conceptual replication studies.

A Simulation of Heterogeneity and Selection Bias with an Average Effect Size of Zero

Carter et al. (2019) used simulation studies to evaluate the performance of several meta-analytic methods, including the RESM. They used an RESM with three parameters (3PSM). The first parameter estimates the average effect size. The second one estimates heterogeneity, and the third one estimates the amount of bias in reporting of non-significant results.
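The structure of the 3PSM can be made concrete with its likelihood. The sketch below is my own minimal illustration (not Carter et al.’s simulation code or the weightr implementation): mu is the average effect size, tau the heterogeneity SD, and w the relative publication probability of non-significant results; the likelihood of each published effect is its selection weight times its density, normalized by the expected publication weight.

```python
import numpy as np
from scipy.stats import norm

def neg_log_lik_3psm(params, d, se, alpha=0.05):
    """Negative log-likelihood of a 3-parameter selection model.
    params: (mu, tau, w) = mean effect, SD of true effects, and relative
            publication probability of non-significant results.
    d, se:  observed effect sizes and their standard errors (arrays)."""
    mu, tau, w = params
    if tau < 0 or not (0 < w <= 1):
        return np.inf
    sd = np.sqrt(tau**2 + se**2)           # marginal SD of an observed effect
    z_crit = norm.ppf(1 - alpha / 2)
    sig = np.abs(d / se) > z_crit          # which observed results are significant
    dens = norm.pdf(d, mu, sd)             # density before selection
    # Probability that a study from this population comes out significant.
    p_sig = norm.sf(z_crit * se, mu, sd) + norm.cdf(-z_crit * se, mu, sd)
    A = p_sig + w * (1 - p_sig)            # expected publication weight
    weight = np.where(sig, 1.0, w)
    return -np.sum(np.log(weight) + np.log(dens) - np.log(A))
```

In practice this function would be handed to an optimizer; weightr does the analogous fitting in R. Even without fitting, one can verify that the likelihood favors the data-generating parameters over badly wrong ones on simulated selected data.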

The simulation study simulated standardized effect sizes for between-group differences (Cohen’s d). I focus on the simulation with high heterogeneity because it is most similar to meta-analyses of conceptual replication studies. The simulation assumes that population effect sizes have a normal distribution with a mean of 0.0 and a standard deviation of 0.4 (Figure 1).

Figure 1 also shows the distribution of the effect size estimates in 5,000 simulated studies using Carter et al.’s (2019) simulation of high selection bias and no p-hacking. The distribution is shifted towards positive results. The small bump of negative results is due to the simulation’s weaker bias against significant negative results. All of the negative results are statistically significant.
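A selection mechanism of this kind can be reproduced in a few lines. This is a sketch with my own assumed sample size and publication probabilities, not Carter et al.’s exact settings; it shows how selection shifts the published distribution to the right even though the true mean is zero.

```python
import numpy as np

rng = np.random.default_rng(42)
k = 5000
pop_d = rng.normal(0.0, 0.4, k)       # true effects: mean 0, SD 0.4
n_per_group = 50                       # assumed; Carter et al. varied this
se = np.sqrt(2 / n_per_group)          # approximate SE of Cohen's d
obs_d = pop_d + rng.normal(0.0, se, k) # observed effect sizes

sig_pos = obs_d / se > 1.96            # significant in the predicted direction
sig_neg = obs_d / se < -1.96           # significant in the "wrong" direction

# Strong selection: positive significant results are always published,
# negative significant ones rarely, non-significant ones very rarely.
p_publish = np.where(sig_pos, 1.0, np.where(sig_neg, 0.2, 0.05))
published = rng.uniform(size=k) < p_publish

print(round(obs_d[published].mean(), 2))  # well above the true mean of 0
```

Under these assumed weights, the published literature averages a clearly positive effect size despite a true average of zero, and almost all published negative results are significant, mirroring the pattern in Figure 1.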

Carter et al. (2019) fitted the three-parameter selection model (3PSM) to these data and found that the model underestimates the population average of zero. I replicated this finding in my simulation, mean d = -.21. More importantly, the model produced a good estimate of the true heterogeneity, SD = .38. Thus, there is clear evidence that the simulated effect sizes are heterogeneous.

The negative bias in the estimated mean is partially explained by the presence of significant negative results, which are not modeled in the 3PSM. To allow for weaker selection against significant negative results, the model can be modified by adding a step at p = .975, one-tailed. In addition, there are some non-significant positive results but no non-significant negative results, which suggests different selection biases for positive and negative results. This can be modeled with a step at p = .5 (one-tailed). Using this 5PSM model, the estimated mean is d = -.07, and the estimated standard deviation is .34.

Figure 2 shows how the estimated mean and standard deviation can be used to evaluate model fit. The figure shows that the 3PSM model does not fit the data as well as the 5PSM model. Even though the estimated mean and SD of the 5PSM model differ from the true values, the model predicts the true effect sizes rather well. A formal test of fit can be conducted by computing the root mean square error (RMSE) of the differences between the true and predicted effect sizes. The RMSE for the 5PSM model is .05, which is considered acceptable fit.

The predicted distribution of population effect sizes can be used to estimate the percentage of results with small, medium, or large effect sizes. Even with a negative average, 21% of studies have effect sizes of d > .2. This information helps to avoid the mistake of interpreting the negative average as evidence that the effect size of all studies is zero or even negative.
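This percentage follows directly from the estimated normal distribution of population effect sizes: with a mean of -.07 and an SD of .34, the mass above d = .2 is about 21%.

```python
from scipy.stats import norm

mu, sd = -0.07, 0.34  # 5PSM estimates reported in the text
share_above_small = norm.sf(0.2, loc=mu, scale=sd)
print(round(share_above_small * 100))  # about 21 percent
```

The same calculation can be repeated for any cutoff (e.g., d > .5 for medium effects) to characterize the full predicted distribution rather than just its mean.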

A Simulation of Heterogeneity and Selection Bias with Only Positive Effect Sizes

Carter et al. (2019) simulated heterogeneity with normal distributions. This is a problem for two reasons. First, the selection model assumes normal distributions and benefits from the fact that the simulated data match the model’s assumptions (Brunner & Schimmack, 2020). Second, simulations with an average effect size of zero assume an equal number of positive and negative effect sizes because the normal distribution is symmetrical over the mean.

In actual meta-analyses of conceptual replication studies, however, it is rather unlikely that there is an equal number of studies with positive and negative results. The reason is that meta-analyses code studies in terms of their consistency with the predicted direction of an effect. Positive results confirm a prediction, whereas negative results are opposite to a prediction. Thus, an average of zero implies that researchers are unable to predict the direction of an effect. This is even more unlikely when hypothesis tests use a two-sided significance test and researchers sometimes make predictions after they have seen the results (Kerr, 1998). Thus, there are two possible explanations for missing negative results when the selection model predicts negative results (Figure 2): either the negative results were suppressed or the distribution of effect sizes is not normal.

To examine the performance of the random effects selection model (RESM) without true negative results, I used the same simulation as before, but deleted studies with negative population effect sizes. This changed the simulated data only slightly because there were only a few cases with negative results. However, it changed the true distribution of population effect sizes from a full normal to a half-normal distribution, which changes the true mean and standard deviation of the population effect sizes. The true mean is d = .32 and the true SD is .24.
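The effect of deleting negative population effect sizes can be illustrated with a small simulation. The generating parameters below are hypothetical, chosen only to show that truncation at zero raises the mean and shrinks the SD:

```python
import random
import statistics

# Deleting negative population effect sizes turns a full normal into a
# truncated (half-)normal distribution. The generating mean and SD here are
# made up for illustration; they are not the tutorial's simulation values.
random.seed(1)
full = [random.gauss(0.25, 0.30) for _ in range(100_000)]
positive_only = [d for d in full if d >= 0]

# Truncation shifts the mean upward and reduces the standard deviation.
print(round(statistics.mean(full), 2), round(statistics.stdev(full), 2))
print(round(statistics.mean(positive_only), 2),
      round(statistics.stdev(positive_only), 2))
```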

The 3PSM model estimates a mean of d = .34 and a standard deviation of .24. These estimates are very similar to the true values, but they could be misinterpreted as biased under the assumption that negative results were suppressed and the true value is zero. Figure 3 shows the true density distribution compared to the distribution predicted by the 3PSM model.

The RMSE is .24. This is not considered good fit, but it is the best the selection model can do given the false distribution assumption. Despite the poor fit, the results do provide meaningful information about the population effect sizes that produced the observed effect size estimates. The question of whether studies with negative results were not reported or never conducted is irrelevant for the evaluation of the evidence in the meta-analysis. The observed effects were produced by positive population effect sizes.

Ego Depletion

Carter et al. (2019) used their simulations to interpret conflicting results for a meta-analysis of ego depletion. Figure 3 shows the distribution of effect sizes. There are relatively few negative effect sizes, and only 5 negative effect sizes were significant. Carter et al. (2019) reported an average effect size estimate of d = .33, and I reproduced this result. In addition, I found that the estimated standard deviation was .38. I also fitted the 4PSM model with a step at .5 to allow for suppression of negative results. However, the model showed equal weights for non-significant positive and negative results, so the estimated mean and SD were unchanged. Figure 3 also shows the implied distribution of population effect sizes, which implies that 63% of the population effect sizes are greater than .2.

Carter et al. (2019) came to a different conclusion. They argued that the 3PSM model is biased in the conditions that are most realistic for ego-depletion research, including high heterogeneity. They favored the PET regression model, although the inventor of this bias-correction model warned against the use of PET regression when data are heterogeneous (Stanley, 2017). Based on a negative estimate of d = -.27, Carter et al. (2019) concluded that “the true ego-depletion effect is not practically or significantly different from zero” (p. 139). This conclusion is evidently false because there is no single true ego-depletion effect in a meta-analysis of conceptual replication studies with heterogeneous effect size estimates. The conclusion reveals the mindless interpretation of a random effects meta-analysis as if it were a fixed effects meta-analysis with a single population effect size. Figure 3 suggests that at least some of the studies produced significant results with effect sizes greater than .2.

A Meta-Analysis of Nudging

Mertens, Herberz, Hahnel, and Brosch (2022) conducted an extensive meta-analysis of nudging. Figure 8 shows the distribution of effect size estimates. There are few negative results, and only 3 out of 447 results were negative and significant. This either means that there is strong selection against negative results or that the distribution of effect sizes is not normal. Mertens et al. used traditional random effects meta-analysis and showed the presence of heterogeneity. They also conducted some bias analyses and found inconclusive evidence of bias, but they did not use a random effects model that corrects for bias. Here are the results.

The standard 3PSM model showed mild evidence of bias, that is, selection against non-significant results, weight = .72, 95%CI = .50 to .94 (a value of 1 would mean no bias). The estimated mean and standard deviation were .35 and .40, respectively. A modified 4PSM model that allows for selection against negative results (step at p = .5) showed clear evidence of selection against negative results, weight = .11, 95%CI = .04 to .17. The estimated mean was now close to zero, .07, and not statistically significant, 95%CI = -.07 to .21. The estimated standard deviation increased to .52.

These data are extremely helpful to illustrate the problem of focusing on the estimated average effect size. Ignoring heterogeneity and assuming a fixed effect size for all studies would suggest that, corrected for bias, these studies show no evidence that nudging manipulations have an effect. A totally different conclusion emerges when heterogeneity is taken into account (see Figure 5).

The models differ in their prediction of negative results. The 3PSM model predicts fewer negative results than the 4PSM model because the 4PSM model assumes that negative results were obtained and not published. However, on the positive side the two models are much more similar because the 4PSM compensates for the lower mean with larger heterogeneity. The 3PSM model implies that 65% of the population effect sizes are greater than d = .2 (a small effect, Cohen, 1988). The 4PSM estimate is lower, but 40% of the population effect sizes are still greater than d = .2. This is very different from the conclusion that all evidence for nudging was obtained without a real effect, which follows from interpreting the average effect size in a random effects meta-analysis as a fixed-effect estimate of a single population effect size.
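These percentages can be verified from the reported means and SDs under the models' normality assumption; a standard-library Python sketch (not part of the original analysis):

```python
import math

# Share of population effect sizes above a threshold under an assumed
# normal distribution with the reported mean and SD.
def pct_above(threshold, mean, sd):
    return 1 - 0.5 * (1 + math.erf((threshold - mean) / (sd * math.sqrt(2))))

# The 4PSM's lower mean (.07 vs .35) is offset by its larger SD (.52 vs .40),
# so both models imply a substantial share of effects above d = .2.
print(round(100 * pct_above(0.2, 0.35, 0.40)))  # 3PSM: ~65%
print(round(100 * pct_above(0.2, 0.07, 0.52)))  # 4PSM: ~40%
```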

The lack of negative results makes it impossible to determine whether negative results were obtained and not published or whether the distribution of effect sizes is not normal and the predicted negative effect sizes in the 4PSM model are phantom estimates of studies that were never conducted. Extending the selection model to allow for non-normal distributions would be helpful to address this question.

Conclusion

Meta-analysis is a powerful tool to integrate results of direct replication studies to increase power and obtain more precise effect size estimates. Publication bias threatens the validity of meta-analysis, but several methods exist to correct for bias in meta-analysis of studies with the same or similar effect sizes. However, most of these methods assume that all studies have the same population effect size.

Yet meta-analyses in psychology often combine evidence from conceptual replication studies that use different paradigms and have high heterogeneity in effect sizes. When this is the case, estimating a single value is not informative and often leads to false conclusions. It is therefore mandatory for valid scientific inferences to estimate the amount of heterogeneity in a dataset. The problem is that standard random effects meta-analysis ignores publication bias. Thus, the only valid tool that can be used at the moment is the random effects selection model. The key limitation of selection models is the assumption of a normal distribution of effect sizes, but I showed that the normal distribution can often approximate other distributions reasonably well and estimate the average and standard deviation of positive population effect sizes.

There are many directions for future research. First, simulation studies need to evaluate models in terms of their ability to recover the true distribution of effect sizes, not only bias in the estimate of the average effect. Second, more work on the specification of selection models needs to be done. The default 3PSM model is not always the best option, but researchers have little guidance in the specification of selection models. Third, the selection model is currently the only effect size model that models heterogeneity and bias. More research needs to be done to see whether other models can improve on this model. Finally, more work needs to be done on the interpretation of heterogeneity. One problem is that the distribution of population effect sizes is not directly tied to the observed effect sizes. A study with a small sample and a large effect size estimate may have a small population effect size. It is therefore not immediately clear how evidence of large population effect sizes can be used to identify studies with substantial effects that can be replicated. However, the first step towards these goals is to be more mindful of heterogeneity of effect sizes in psychology.

References

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425. https://doi.org/10.1037/a0021524

Bem, D. J., & Honorton, C. (1994). Does psi exist? Replicable evidence for an anomalous process of information transfer. Psychological Bulletin, 115(1), 4–18. https://doi.org/10.1037/0033-2909.115.1.4

Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for bias in psychology: A comparison of meta-analytic methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196

Coburn, K. M., & Vevea, J. L. (2019). weightr: Estimating weight‐function models for publication bias [R package version 2.0.2]. https://CRAN.R-project.org/package=weightr

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997

Hedges, L. V., & Vevea, J. L. (1998). Fixed‑ and random‑effects models in meta‑analysis. Psychological Methods, 3(4), 486–504.

Higgins, J. P. T., Thomas, J., Chandler, J., Cumpston, M., Li, T., Page, M. J., & Welch, V. A. (Eds.). (2023). Cochrane handbook for systematic reviews of interventions (Version 6.4). Cochrane. https://training.cochrane.org/handbook

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2(3), 196–209. https://doi.org/10.1207/s15327957pspr0203_4

Maier, M., van der Weele, T. J., & Mathur, M. B. (2022). A robust method for adjusting meta-analyses for publication bias. Nature Human Behaviour, 6, 1033–1044. https://doi.org/10.1038/s41562-022-01373-0

Mertens, S., Herberz, M., Hahnel, U. J. J., & Brosch, T. (2022). The effectiveness of nudging: A meta-analysis. Proceedings of the National Academy of Sciences, 119(1), e2107346118. https://doi.org/10.1073/pnas.2107346118

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Stanley, T. D. (2017). Limitations of PET-PEESE and other meta-analysis methods. Social Psychological and Personality Science, 8(5), 581–587. https://doi.org/10.1177/1948550617693062

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34. https://doi.org/10.2307/2282137

Reviewer A is Back

Last year, Maria Soto and I published a z-curve analysis of emotion research in Cognition and Emotion (Soto & Schimmack, 2024). During the review process, we received critical comments from Reviewer A (A = anonymous or A = %$#%^). Critical comments are the main point of peer review, but these comments misrepresented z-curve. In the second round of reviews, Reviewer A used simulations with sample sizes of N = 2 to question our findings that were based on real samples with sample sizes typical in psychology, not N = 2. Not surprisingly, our results held up when we conducted sensitivity analyses (Questionable Reviewer Practices). I am (not) pleased to report that Reviewer A (actually a team of authors) is back with new simulations that gaslight readers who do not have the time or motivation to delve deeply into z-curve. Fortunately, this time they are trying to publish a commentary, and their identity is known. To maintain confidentiality during the review process, I am going to continue to call them Reviewer A.

All of the simulations presented in this manuscript serve only to motivate the foregone conclusion that z-curve results should not be trusted. I will show that z-curve results are not perfect, but they provide credible information. Rather, it is Reviewer A who should not be trusted.

“To date, we have only examined a limited set of scenarios. Yet, we have found relatively few instances in which estimates from the typical use of the 𝑍-curve approach (i.e., analyzing only 𝑝-values < .05) have performed well. As a result, conclusions based on a 𝑍-curve analysis should currently be taken with caution, especially given that the extent and form of publication bias is likely to remain unknown in most application settings.”

This statement implies that z-curve lacks proper validation with simulation studies. Reviewer A does not tell readers that z-curve has been tested with hundreds of scenarios in Brunner and Schimmack (2020), Bartos and Schimmack (2022), and Schimmack and Bartos (2023). They also do not mention that it performs better than other selection models that try to correct for bias (Brunner & Schimmack, 2020). When Jerry and I developed z-curve, he would create simulations in which he knew the true values and I did not. I would then run z-curve and send him the results. We only proceeded to develop the method because it did well in these blind simulation tests.

So, the question is who is more trustworthy: reviewer A or z-curve?

Now let’s examine the simulations that Reviewer A presented to support the claim that it is difficult to find situations where z-curve performs well. There are essentially three scenarios, all equally unlikely to match any real data.

Scenario 1: 10% of studies test a true null hypothesis, 90% of studies test a true alternative hypothesis with 80% power on average (no two studies have exactly the same power).

Scenario 2: 50% of studies have low power (all 17%) and 50% of studies have very high power (94%).

Scenario 3: 50% of studies have low power (all 17%) and 50% of studies have even higher power (99%).

I will not bore you with the results of these simulation studies because they are irrelevant for the evaluation of publications in psychology journals. None of these scenarios are even close to real datasets. Also, z-curve handles these scenarios well.

Selection Bias

The real reason why the authors are able to show bias in z-curve estimates is that they simulate a gradual selection bias for non-significant and significant results. For non-significant results, this implies that p-values of .4 are more likely to be reported than p-values of .8. This is as ridiculous as it is irrelevant because z-curve does not use the distribution of non-significant results, which are often missing anyway. It is well known that focal hypothesis tests have a 90% success rate and that most of the non-significant results are missing. Thus, the simulation of gradual bias for non-significant results is irrelevant (see Figure 1).

So, let us examine the one relevant and well-known fact about selection models. All selection models, z-curve included (but also p-curve, the random effects selection model, etc.), assume that selection is based on the dichotomous criterion of statistical significance. Results that do not support a hypothesis (p > .05 or sometimes p > .10) are not reported, and p-values that support a hypothesis (p < .05) are reported. When this assumption is not true, a selection model will produce biased results. It is true that this has not been sufficiently examined, so we can use Reviewer A’s simulation as a first step towards examining how gradual selection of significant results influences z-curve estimates.

It is plausible that researchers may be reluctant to report a p-value of .04 or even .02. They may use some additional p-hacking to get the p-value below .01. However, Reviewer A’s simulation is based on the assumption that selection bias extends all the way to z-scores of 4 that correspond to p-values of .00006. Accordingly, researchers would still p-hack or not publish results with a p-value of .005 or even .001. The only reason to make such an implausible assumption is to exaggerate the bias in z-curve estimates. However, let’s examine the consequences of this assumption with a realistic scenario that reproduces the observed distribution of z-values in hand-coded focal tests in the emotion literature.
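To make this scenario concrete, here is a sketch of a gradual selection rule of this kind. The linear ramp between z = 1.96 and z = 4 is an assumed functional form for illustration; Reviewer A's actual simulation code may differ:

```python
import random

# Sketch of "gradual" selection bias: the probability of reporting a result
# ramps up linearly between z = 1.96 and z = 4. The linear shape is an
# assumption made here for illustration, not Reviewer A's exact rule.
def report_prob(z, z_low=1.96, z_high=4.0):
    if z < z_low:
        return 0.0            # non-significant results fully suppressed
    if z >= z_high:
        return 1.0            # very strong results always reported
    return (z - z_low) / (z_high - z_low)

# Apply the selection rule to simulated observed z-values (made-up parameters).
random.seed(42)
observed = [z for z in (random.gauss(2.5, 1.0) for _ in range(10_000))
            if random.random() < report_prob(z)]
print(len(observed) < 10_000)  # selection removed part of the sample
```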

I focus on the hand-coded focal tests because Reviewer A also dismissed our results based on automatically extracted test statistics. To further evaluate the trustworthiness of Reviewer A, you need to know that Reviewer A simply ignored that we also published the results of these hand-coded data. They simply do not exist in the commentary.

Take a look at the plot. What looks strange in this plot? Yes, observed z-scores disappear as soon as we look at z-values below 1.96 that are not significant at p < .05. Even marginally significant results (.10 > p > .05) that are often used to reject a null hypothesis are rare. So, there is no evidence of gradual bias for non-significant results. They are just not reported or p-hacked to be significant. We do not need a formal bias test to notice that bias is present.

We also see that the frequency of z-scores above 1.96 decreases quickly. This decrease implies that the mode (peak) of the distribution of all results (including unpublished non-significant results) is in the range of non-significant results. So, it is clearly not a value of 2.8, 3.5, or 4.5 which was used in the useless simulations of Reviewer A.

The assumption that z-values between 2 and 4 are subject to gradual selection bias cannot be tested because this bias merely flattens the slope of the histogram. That is, if this bias is present in these data, data without this bias would show an even steeper decline in z-values between 2 and 4. As a result, a model that does not correct for the gradual selection bias will overestimate the expected discovery rate (EDR; the average power of all studies) and the expected replication rate (ERR), and underestimate the false discovery risk (FDR; the maximum percentage of false positive results).

I used Reviewer A’s simulation code to create data that produce similar results to the ones in the emotion literature AFTER the selection bias was applied.

The EDR is 23% rather than 27%, the ERR is 61% rather than 65%, and the FDR is 18% rather than 14%. Close enough, in my opinion.
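These FDR values are consistent with Soric's maximum false discovery rate formula, which links the FDR to the EDR; a quick standard-library check (assuming alpha = .05):

```python
# Soric's upper bound on the false discovery rate, given the expected
# discovery rate (EDR) and the significance criterion alpha.
def soric_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(round(100 * soric_fdr(0.27)))  # ~14%, the published estimate
print(round(100 * soric_fdr(0.23)))  # ~18%, the simulated-data estimate
```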

Then, I fitted the model to the data BEFORE the selection. These results tell us how much selection bias that matches the simulation could have biased our published z-curve results. As we do not know the actual biases, this is considered a sensitivity analysis.

The results show that the EDR could be 15%, the ERR could be 51%, and the FDR could be 30%. Thus, our published results MAY underestimate the risk of false positive results by 16 percentage points and overestimate the replicability of published results by 10 percentage points. This does not mean that our published results were wrong. We just do not know how researchers produced significant results.

One of our recommendations based on our FDR estimate of 14% was to lower alpha to maintain a relatively low false discovery risk. We recommended alpha = .01, as it produced an FDR below 5%. When we use alpha = .01 with these simulated data, the FDR is 6%, a bit above 5%, but as a famous statistician once pointed out, God likes .04 just as much as .06. To get below .05, we could lower alpha to .005.

Conclusion

We published a z-curve analysis of results in two leading emotion journals to provide emotion researchers with some empirical information about the false positive risk and replicability of results in these journals.

Reviewer A claims that these results provide no meaningful information about the credibility of emotion research. They also do not offer results based on a better model. Apparently, they prefer that emotion researchers either trust all results or assume that most significant results are false positives. As Kuhn noted in his work on scientific paradigms, this message is not going to have an impact on the field because paradigms only shift when a new paradigm gives researchers something else to do.

In the world of empirical meta-analysis, selection models are the way forward. We can improve them or replace them with something better, but we cannot go back to the good old days that assumed publication bias is not a problem.

Reviewer A made a valid observation. Selection models will produce biased estimates when the actual selection process differs from the assumptions of the selection model. All selection models assume a simple step function and equal probabilities for all significant results. This is clearly an oversimplification and future applications of z-curve and other selection models should conduct sensitivity analyses to examine how estimates change under various biases (see Carter et al., 2019, for a study of effect size meta-analyses). We are actually working on this issue right now.

Here I presented a sensitivity analysis for the emotion literature, using Reviewer A’s simulation of selection bias. While the results change notably, a key recommendation remains: the false positive risk with alpha = .05 is too high, but it can be reduced dramatically by lowering alpha to .01. While meta-scientists have argued over the best alpha level based on theoretical assumptions, we provided an empirical answer to this question. The answer remains valid even when we allow for gradual selection up to z = 4. The reason is that power is heterogeneous and the pile of just significant results between .05 and .01 stems from studies with low power. In contrast, the long tail and the 20% of results with z > 6 stem from high-powered studies. Alpha = .01 is a reasonable value to separate these two populations of studies.

On a final note, it is still true that you should trust the science, even if science does not provide simple TRUE vs. FALSE statements. This does not mean that you should trust scientists. It would be foolish for me to claim that I am unbiased. Fortunately, you do not have to trust me. You can download the data and z-curve, write your own simulation of selection bias, and conduct sensitivity analyses. Just don’t make the mistake of trusting Reviewer A’s claim that meta-analyses of published studies are useless because we do not know how the sausage was made (i.e., researchers p-hacked their data). Forensic meta-analysis is not perfect and not as conclusive as the detection of data fabrication or manipulation, but it can help to identify a subset of credible studies, if they exist.

P.S. We have indirectly communicated with Reviewer A through responses to their reviews and in public blog posts that reveal the flaws in their arguments. So far, Reviewer A has shown no willingness to engage directly with us and simply continues to write misinformed and misleading articles about z-curve. This blog post is just another open and transparent response to their criticism. If only there were a badge for open exchange of arguments. 🙂

Z-Curve.3.0 Tutorial: Chapter 3

Chapter 3 shows the use of z-curve.3.0 with the p-values of the replication studies from the Open Science Collaboration Reproducibility Project (Science, 2015). The R code can be found on my GitHub page.
zcurve3.0/Tutorial.R.Script.Chapter3.R at main · UlrichSchimmack/zcurve3.0

Here are the links to Chapter 1 and to the other chapters.

Introduction to Chapter 3

Z-curve was developed to examine the credibility (publication bias, replicability, false positive risk) of articles that report statistical results, typically null-hypothesis significance tests. The need for such a tool became apparent in the early 2010s, when concerns about replication failures and high false positive risks led to a crisis of confidence in published results.

Another remarkable investigation of the credibility of psychological science was the Reproducibility Project of the Open Science Collaboration (Science, 2015). Nearly 100 results published in three influential journals were replicated as closely as possible. The key finding that has been cited in thousands of articles was that the percentage of significant results in the replication studies was much lower, and that effect sizes were much smaller as well.

In line with the emphasis on transparency, the project also made the data from this study openly available. The data provide a valuable learning tool to illustrate the use of z-curve.3.0. The data from this project are unique in that z-curve results based on the original results can be compared to the results of the replication studies. Normally, the “truth” is unknown or has to be created with simulation studies. Here, the replication studies serve as an approximation of the truth. For example, the replicability estimate based on the p-values of the original studies can be compared to the actual outcome of the replication studies. Chapter 2 analyzed the original data. Chapter 3 analyzes the replication data. The replication data are unusual in that results were reported independent of the outcome of a study. This means that there should be no bias against non-significant results. Z-curve tests whether this is actually the case. The most interesting application of z-curve is to estimate the false positive risk of the original studies based on the replication data.

3.0 First Examination of the Z-curve Plot

The first step is to run a quick z-curve analysis with the fast density method and no bootstrap and then change parameters to adjust the y-axis and the width of the histogram bars to make the figure visually appealing.

Visual inspection of the plot suggests that there is no selection bias. This is confirmed by the comparison of the observed discovery rate (ODR, i.e., the percentage of significant results) and the expected discovery rate (EDR, i.e., the percentage of significant results predicted by the model). The ODR is 33% and the EDR is 32%. The two estimates are practically identical. Even if there were a small statistically significant difference in a large sample, selection bias would be small and would not affect the results.

Z-curve.3 provides a simple test of bias that showed evidence of bias in the original studies (Chapter 2). This test assumes that there is no bias (the null-hypothesis). Under this assumption, z-curve is fitted to all z-values, not just significant ones. Bias will produce too many just significant results. The default range for just significant results is 2 to 2.6 (about p = .05 to .01).
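For reference, the default z-value cut points map onto two-tailed p-values as follows (a quick standard-library check, not part of the z-curve package):

```python
import math

# Two-tailed p-value for an absolute z-value, using the normal CDF.
def two_tailed_p(z):
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# The default "just significant" range of z = 2 to 2.6 corresponds roughly
# to p = .05 to p = .01 (two-tailed).
print(round(two_tailed_p(2.0), 3))   # ~.046, close to .05
print(round(two_tailed_p(2.6), 3))   # ~.009, close to .01
```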

Figure 2 shows that the model fits the data well. The test of excessive just significant results is not significant, p = .7082. Another test examines whether p-hacking is present. It is hardly necessary to run this test with these data, but I am presenting the results anyway because this test was significant for the original results.

To test for p-hacking, we fit z-curve to the “really” significant results (z > 6) and let the model predict the distribution of “just” significant results. This time, evidence of too many just significant results would suggest that p-hacking added them.

Figure 3 shows the results. The test is also not significant, p = .6789. Thus, there is no evidence of bias in these data.

The next test examines heterogeneity of power. With some successful and many failed replications, it is plausible that the studies are heterogeneous. For non-significant results the true power is likely to be low, or the null-hypothesis may even be true. However, significant results in a replication study imply that the null hypothesis is unlikely to be true and that the study had a low false negative risk (i.e., adequate power).

Surprisingly, the heterogeneity test of the original studies did not show evidence of heterogeneity. This may be due to the problem of estimating heterogeneity with strong bias. Here the test has more power because we can fit the model to all results. This increases the set of studies from ~30 to ~100.

To illustrate the heterogeneity test, I generated z-curve plots that are not included in the typical heterogeneity test. The first plot shows the results for a model that assumes homogeneity. As a result, we have only one component with a fixed SD of 1, while the mean estimates the power of all studies, assuming equal power for all studies.

We see that the model fails to predict the z-values of significant results.

The second model freely estimates the SD of the component. Sampling error alone produces a standard deviation of 1. Values above 1 suggest that the non-central z-values of studies also vary. The test of heterogeneity uses the 95%CI of the SD estimates. If the lower limit of the CI is above 1, the data are heterogeneous. The heterogeneity of the non-central z-values is simply (SD – 1).
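The logic of this test can be illustrated with a small simulation: observed z-values are non-central means plus standard normal sampling error, so their SD exceeds 1 only when the non-central means themselves vary. The parameters below are made up for illustration:

```python
import random
import statistics

# Observed z-values = non-central mean + N(0, 1) sampling error.
random.seed(7)

# Homogeneous case: every study has the same non-central z-value (2.0),
# so the SD of observed z-values is produced by sampling error alone (~1).
homogeneous = [random.gauss(2.0, 1.0) for _ in range(20_000)]

# Heterogeneous case: two populations of non-central z-values (0.5 and 3.0),
# so the SD of observed z-values exceeds 1.
means = [random.choice([0.5, 3.0]) for _ in range(20_000)]
heterogeneous = [random.gauss(m, 1.0) for m in means]

print(round(statistics.stdev(homogeneous), 2))    # close to 1
print(round(statistics.stdev(heterogeneous), 2))  # clearly above 1
```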

The figure shows that the model better fits the significant results. The estimated SD is 1.84 and the 95% confidence interval is 1.06 to 2.21. Thus, there is evidence of heterogeneity.

The other model does not make assumptions about the distribution of the non-central z-values. This is the advantage of z-curve over other models. To test heterogeneity, a z-curve model with two free components and fixed SD of 1 is fitted to the data. If the model fits better than the model with a single component, there is evidence of heterogeneity. Moreover, model fit is compared to the fit of the model with one component and free SD.

The two-component model suggests that there are two populations of studies. 73% of studies have a population mean of 0.45, which corresponds to 7% power. The remaining 27% of studies have a population mean of 3.28, which corresponds to 91% power.
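The power values implied by these non-central means can be checked with a two-tailed normal power calculation (a Python sketch using only the standard library, not the z-curve package):

```python
import math

# Power implied by a non-central z-value mean: the probability that the
# observed z-value exceeds the two-tailed criterion of 1.96 in either tail.
def power_from_ncz(mean, crit=1.96):
    phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))  # normal CDF
    return (1 - phi(crit - mean)) + phi(-crit - mean)

print(round(100 * power_from_ncz(0.45)))  # ~7% power
print(round(100 * power_from_ncz(3.28)))  # ~91% power
```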

The comparison of model fit is inconclusive. The 95%CI ranges from -.001 to .014. Using a more liberal criterion, however, would favor the two-component model, 90%CI = .000 to .012. Another consideration is that the free component model does not make assumptions that could bias results. If both models produced the same results, it would not be important to pick one model over the other. However, the two-component model has a much higher estimate of the ERR. To test this prediction, it would be necessary to replicate the replication studies again. The normal distribution model predicts only a 33% success rate, whereas the 2-component model predicts a success rate of 76%. The truth is unknown, but I believe that the two-component model is closer to the truth.

The final model is estimated with the “EM” algorithm implemented in the z-curve package with the default of seven fixed components at z = 0 to 6. The results are similar to the previous results.

The results suggest that there is no bias in the replication studies and that the average power of all original studies was 35%. The 95% confidence interval allows for 49% significant results, but not 90%, which was the success rate of the original studies.

The EDR is used to estimate the false positive risk (FDR). However, the FDR of 10% tells us only that 10% of the significant results in the replication studies may be false positive results. The more interesting question is how many of the original studies could be false positive results. The analysis of the original studies was inconclusive because the confidence interval ranged from 3% to 88%.

To use the replication data to estimate the FDR of the original studies, we can limit the analysis to the non-significant results and estimate the true average power of these studies. This estimate can then be used to estimate the FDR of studies with non-significant results.

The estimated EDR of non-significant replication studies is 6%, just 1 percentage point over 5%, which is expected if all results were false positives in the original studies. This implies a high FDR of 76% and the 95% confidence interval includes 100%. The lower limit is 27%. Therefore, we cannot conclusively say that most of these studies tested a true null-hypothesis, but we can say that they provided no evidence against the null-hypothesis, despite significant results in the original studies.

The replication failures in the OSC reproducibility project have been discussed in many articles, and some articles correctly pointed out that replication failures do not show that the original result was a false positive. This is true, but the current results suggest that the evidence of these studies is so weak that they do not rule out this possibility.

Z-Curve.3.0 Tutorial: Chapter 2

You can find links to the other chapters on the post with Chapter 1.

Chapter 2 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the original studies.
zcurve3.0/Tutorial.R.Script.Chapter2.R at main · UlrichSchimmack/zcurve3.0

Introduction to Chapter 2

Z-curve was developed to examine the credibility (publication bias, replicability, false positive risk) of articles that report statistical results, typically null-hypothesis significance tests. The need for such a tool became apparent in the early 2010s, when concerns about replication failures and high false positive risks led to a crisis of confidence in published results.

Another remarkable investigation of the credibility of psychological science was the Reproducibility Project of the Open Science Collaboration (Science, 2015). Nearly 100 results published in three influential journals were replicated as closely as possible. The key finding, which has been cited in thousands of articles, was that the percentage of significant results in the replication studies was much lower and that effect sizes were much smaller as well.

In line with the emphasis on transparency, the project also made its data openly available. The data provide a valuable learning tool to illustrate the use of z-curve.3.0. They are unique in that z-curve results based on the original studies can be compared to the results of the replication studies. Normally, the “truth” is unknown or has to be created artificially in simulation studies. Here, the replication studies serve as an approximation of the truth. For example, the replicability estimate based on the p-values of the original studies can be compared to the actual outcome of the replication studies. Chapter 2 analyzes the original data. These analyses serve as a blueprint for typical applications of z-curve. Chapter 3 shows how z-curve analysis of replication studies can also provide useful information.

2.0 First Examination of the Z-curve Plot

The first step is to run a quick z-curve analysis with the fast density function and no bootstrap and then change parameters to adjust the y-axis and the width of the histogram bars to make the figure visually appealing.

Visual inspection of the plot suggests that there is selection bias, as there are many more just significant results than non-significant results. By default, the model is fitted to the significant results only. The model then predicts the distribution of non-significant results. The actual data show 91% significant results. This is the observed discovery rate (ODR). The model estimates an expected discovery rate of only 41%. This also suggests publication bias, but point estimates are inconclusive.

Z-curve.3 provides a simple test of bias. This test assumes that there is no bias (the null-hypothesis). Under this assumption, z-curve is fitted to all z-values, not just significant ones. Bias will produce too many just significant results, like the bar with z-values between 2 and 2.4 in Figure 1. The default range for just significant results is 2 to 2.6 (about p = .05 to .01).

Figure 2 shows that z-values between 2 and 2.4 cannot be fitted by the model. The test of Excessive Just Significance (EJS) is significant with p = .0025. This confirms that bias is present in this dataset, but it does not tell us whether it is selection bias (not reporting non-significant results) or p-hacking (analyzing data in multiple ways until a significant result is found).

The next analysis examines p-hacking. P-hacking tends to produce more just significant p-values than the actual power of the studies would predict. To test this, z-curve is fitted to the “really” significant results, z > 2.6, and the percentage of observed just significant results is compared to the prediction of the model.

Figure 3 shows the results. There are too many significant results between 2 and 2.4, but not between 2.4 and 2.6. The significance test is not significant, p = .1324. However, the selection model does not explain the excess of z-values between 2 and 2.4. We can now respecify the model and define “just significant” as 2 to 2.4.

Now the test is significant with p = .0009. But we tested twice. Did we just p-hack a p-hacking test? Not really. We can adjust alpha to take into account that we tested twice. Even with alpha = .025, the p-value of .0009 is significant.

The next test examined heterogeneity by comparing a model with a single component with a free mean and a fixed SD of 1 against (a) a model with a single component with a free mean and a free SD, and (b) a model with two components with free means and a fixed SD of 1. Neither test showed evidence of heterogeneity.

Now we face a decision problem. Assuming no bias would lead to inflated estimates of power, but using a selection model when p-hacking occurred leads to underestimation of power, especially for the estimate of power for all z-values, including non-significant ones. Both results should be reported, but I prefer to use the selection model and treat the downward bias due to p-hacking as a p-hacking penalty. Having access to the data and the z-curve program makes it possible for everybody to make their own decisions.

The final model is fitted with the “EM” algorithm and 500 bootstraps implemented in the z-curve.2.0 package. The EM algorithm is slower, but slightly superior to the density approach.

The final model confirms the presence of bias and quantifies it. The ODR is 91%, while the EDR is only 20%. Although the 95%CI is wide, it does not include the ODR. The Expected Replication Rate is 60%, with a 95%CI ranging from 44% to 74%. Chapter 3 will compare this to the actual results. Based on the EDR, it is possible to quantify the false positive risk; that is, the probability that a significant result was obtained without an effect (the null-hypothesis is true). The risk is 21%, but the 95%CI is wide and allows for 88% false positive results. This does not mean that many results are false positives, but it does mean that the evidence is weak and that many results could be false positives. Chapter 3 examines the false positive risk based on the actual replication results.
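The false positive risk reported here appears to follow Soric’s upper bound, which converts a discovery rate into a maximum share of false positives. A sketch under that assumption:

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's upper bound on the false discovery risk for a given discovery rate."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

print(round(soric_fdr(0.20), 2))  # an EDR of 20% gives ~21% risk, as reported above
```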

Although the heterogeneity test did not find evidence of heterogeneity, the local power results below the x-axis suggest that there is some heterogeneity. Non-significant results are estimated to have low power, ranging from 20% to 32%. Significant results are estimated to have modest power for z-scores of 2 to 4. The value of 80% that is recommended for a priori power analysis is only reached at z = 4, and very few studies have z-values greater than 4. Thus, one clear finding is that the studies were underpowered. To avoid false negatives in replication studies, sample sizes would have to be increased considerably for most studies.

Z-Curve.3.0 Tutorial: Introduction

Links to Additional Resources and Answers to Frequently Asked Questions

Chapters

This post is Chapter 1. The R-code for this chapter can be found on my github:
zcurve3.0/Tutorial.R.Script.Chapter1.R at main · UlrichSchimmack/zcurve3.0
(the picture for this post shows a “finger-plot”; you can make your own with the code)

Chapter 2 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the original studies.
zcurve3.0/Tutorial.R.Script.Chapter2.R at main · UlrichSchimmack/zcurve3.0

Chapter 3 shows the use of z-curve.3.0 with the Open Science Collaboration Reproducibility Project (Science, 2015) p-values of the replication studies.
zcurve3.0/Tutorial.R.Script.Chapter3.R at main · UlrichSchimmack/zcurve3.0

Chapter 4 shows how you can run simulation studies to evaluate the performance of z-curve for yourself.
zcurve3.0/Tutorial.R.Script.Chapter4.R at main · UlrichSchimmack/zcurve3.0

Chapter 5 uses the simulation from Chapter 4 to compare the performance of z-curve with p-curve, another method that estimates the average power of significant results, the quantity that z-curve uses to estimate the expected replication rate.
zcurve3.0/Tutorial.R.Script.Chapter5.R at main · UlrichSchimmack/zcurve3.0

Chapter 6 uses the simulation from Chapter 4 to compare the performance of the default z-curve method with a z-curve that assumes a normal distribution of population effect sizes. The simulation highlights the problem of making distribution assumptions. One of the strengths of z-curve is that it does not make an assumption about the distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter6.R at main · UlrichSchimmack/zcurve3.0

Chapter 7 uses the simulation from Chapter 4 to compare the performance of z-curve to a Bayesian mixture model (bacon). The aim of bacon is different, but it also fits a mixture model to a set of z-values. The simulation results show that z-curve performs better than the Bayesian mixture model.
zcurve3.0/Tutorial.R.Script.Chapter7.R at main · UlrichSchimmack/zcurve3.0

Chapter 8 uses the simulation from Chapter 4 to examine the performance of z-curve with t-values from small studies (N = 30). It introduces a new transformation method that performs better than the default method from z-curve.2.0 and it introduces the t-curve option to analyze t-values from small studies with t-distributions.
zcurve3.0/Tutorial.R.Script.Chapter8.R at main · UlrichSchimmack/zcurve3.0

Chapter 9 simulates p-hacking by combining small samples with favorable trends into a larger sample with a significant result (patchwork samples). It simulates studies with between-subject two-group designs with varying means and SDs of effect sizes and varying sample sizes. It also examines the ability of z-curve to detect p-hacking and compares the performance of the default z-curve, which does not make assumptions about the distribution of power, with a z-curve model that assumes a normal distribution of power.
zcurve3.0/Tutorial.R.Script.Chapter9.R at main · UlrichSchimmack/zcurve3.0


Brief ChatGPT Generated Summary of Key Points

What Is Z-Curve?

Z-curve is a statistical tool used in meta-analysis, especially for large sets of studies (e.g., more than 100). It can also be used with smaller sets (as few as 10 significant results), but the estimates become less precise.

There are several types of meta-analysis:

  • Direct replication: Studies that test the same hypothesis with the same methods.
    Example: Several studies testing whether aspirin lowers blood pressure.
  • Conceptual replication: Studies that test a similar hypothesis using different procedures or measures.
    Example: Different studies exploring how stress affects memory using different tasks and memory measures.

In direct replications, we expect low variability in the true effect sizes. In conceptual replications, variability is higher due to different designs.

Z-curve was primarily developed for a third type of meta-analysis: reviewing many studies that ask different questions but share a common feature—like being published in the same journal or during the same time period. In these cases, estimating an average effect size isn’t very meaningful because effects vary so much. Instead, z-curve focuses on statistical integrity, especially the concept of statistical power.

What Is Statistical Power?

I define statistical power as the probability that a study will produce a statistically significant result (usually p < .05).

To understand this, we need to review null hypothesis significance testing (NHST):

  1. Researchers test a hypothesis (like exercise increasing lifespan) by conducting a study.
  2. They calculate the effect size (e.g., exercise increases the average lifespan by 2 years) and divide it by the standard error to get a test statistic (e.g., a z-score).
  3. Higher test statistics imply that the observed result would be unlikely if the null hypothesis (that there is no effect) were true. If this probability is below the conventional criterion of 5%, the finding is interpreted as evidence of an effect.

Power is the probability of obtaining a significant result, p < .05.

Hypothetical vs. Observed Power

Textbooks often describe power in hypothetical terms. For example, before collecting data, a researcher might assume an effect size and calculate how many participants are needed for 80% power.

But z-curve does something different. It estimates the average true power of a set of studies. It is only possible to estimate average true power for sets of studies because power estimates based on a single study are typically too imprecise to be useful. Z-curve provides estimates of the average true power of a set of studies and the uncertainty in these estimates.

Populations of Studies

Brunner and Schimmack (2020) introduced an important distinction:

  • All studies ever conducted (regardless of whether results were published).
  • Only published studies, which are often biased toward significant results.

If we had access to all studies, we could simply calculate power by looking at the proportion of significant results. For example, if 50% of all studies show p < .05, then the average power is 50%.

In reality, we only see a biased sample—mostly significant results that made it into journals. This is called selection bias (or publication bias), and it can mislead us.

What Z-Curve Does

Z-curve helps us correct for this bias by:

  1. Using the p-values from published studies.
  2. Converting them to z-scores (e.g., p = .05 → z ≈ 1.96).
  3. Modeling the distribution of these z-scores to estimate:
    • The power of the studies we see,
    • The likely number of missing studies,
    • And the amount of bias.
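Step 2 of this recipe can be sketched with scipy, assuming two-sided p-values:

```python
from scipy.stats import norm

# Convert two-sided p-values to absolute z-scores
for p in (0.05, 0.005, 0.0001):
    print(p, "->", round(norm.isf(p / 2), 2))
```

For example, p = .05 maps to z ≈ 1.96, as in the summary above.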

Key Terms in Z-Curve

ODR (Observed Discovery Rate): % of studies that report significant results
EDR (Expected Discovery Rate): estimated % of significant results we’d expect if there were no selection bias
ERR (Expected Replication Rate): estimated % of significant studies that would replicate if repeated exactly
FDR (False Discovery Rate): estimated % of significant results that are false positives

Understanding the Z-Curve Plot

Figure 1. Histogram of z-scores from 1,984 significant tests. The solid red line shows the model’s estimated distribution of observed z-values. The dashed line shows what we’d expect without selection bias. The Observed Discovery Rate (ODR) is 100%, meaning all studies shown are significant. However, the Expected Discovery Rate (EDR) is only 40%, suggesting many non-significant results were omitted. The Expected Replication Rate (ERR) is also 40%, indicating that only 40% of these significant results would likely replicate. The False Discovery Rate (FDR) is estimated at 8%.

Notice how the histogram spikes just above z = 2 (i.e., just significant) and drops off below. This pattern signals selection for significance, which is unlikely to occur due to chance alone.


Homogeneity vs. Heterogeneity of Power

Sometimes all studies in a set have similar power (called homogeneity). In that case, the power of significant and non-significant studies is similar.

However, z-curve allows for heterogeneity, where studies have different power levels. This flexibility makes it better suited to real-world data than methods that assume all studies are equally powered.

When power varies, high-power studies are more likely to produce significant results. That’s why, under heterogeneity, the ERR (for significant studies) is often higher than the EDR (for all studies).


Summary of Key Concepts

  • Meta-analysis = Statistical summary of multiple studies.
  • Statistical significance = p < .05.
  • Power = Probability of finding a significant result.
  • Selection bias = Overrepresentation of significant results in the literature.
  • ODR = Observed rate of p < .05.
  • EDR = Expected rate of p < .05 without bias.
  • ERR = Estimated replication success rate of significant results.

Full Introduction

Z-curve is a statistical tool for meta-analysis of larger sets of studies (k > 100). Although it can be used with smaller sets of studies (k > 10 significant results), confidence intervals are likely to be very wide. There are also different types of meta-analysis. The core application of meta-analysis is to combine information from direct replication studies, that is, studies that test the same hypothesis (e.g., the effect of aspirin on blood pressure). The most widely used meta-analytic tools aim to estimate the average effect size for a set of studies with the same research question. A second application is to quantitatively review studies on a specific research topic. These studies are called conceptual replication studies. They test the same or related hypotheses, but with different experimental procedures (paradigms). The main difference between meta-analyses of direct and conceptual replication studies is that we would expect less variability in population effect sizes (not the estimates in specific samples) in direct replications, whereas variability is expected to be higher in conceptual replication studies with different experimental manipulations and dependent variables.

Z-curve can be applied to meta-analysis of conceptual replication studies, but it was mainly developed for a third type of meta-analysis. These meta-analyses examine sets of studies with different hypotheses and research designs. Usually, these studies share a common feature. For example, they may be published in the same journal, belong to a specific scientific discipline or sub-discipline, or a specific time period. The main question of interest here is not the average effect size that is likely to vary widely from study to study. The purpose of a z-curve analysis is to examine the credibility or statistical integrity of a set of studies. The term credibility is a broad term that covers many features of a study. Z-curve focuses on statistical power as one criterion for the credibility of a study. To use z-curve and to interpret z-curve results it is therefore important to understand the concept of statistical power. Unfortunately, statistical power is still not part of the standard education in psychology. Thus, I will provide a brief introduction to statistical power.

Statistical Power

Like many other concepts in statistics, statistical power (henceforth power, the only power that does not corrupt) is a probability. To understand power, it is necessary to understand the basics of null-hypothesis significance testing (NHST). When resources are insufficient to estimate effect sizes precisely, researchers often have to settle for the modest goal of examining whether a predicted positive effect is positive (exercise increases longevity) or a predicted negative effect is negative (aspirin lowers blood pressure). The common approach is to estimate the effect size in a sample, estimate the sampling error, compute the ratio of the two, and then compute the probability that the observed effect size, or an even bigger one, could have been obtained without an effect; that is, with a true effect size of 0. Say the effect of exercise on longevity is an extra 2 years, the sampling error is 1 year, and the test statistic is 2/1 = 2. This value corresponds to a p-value of about .05 for the test that the true effect is positive (not 2 years, but greater than 0). P-values below .05 are conventionally used to decide against the null hypothesis and to infer that the true effect size is positive if the estimate is positive, or that the true effect is negative if the estimate is negative. Now we can define power. Power is the probability of obtaining a significant result, which typically means a p-value below .05. In short,

Power is the probability of obtaining a statistically significant result.

This definition of power differs from the textbook definition of power because we need to distinguish between different types of powers or power calculations. The most common use of power calculations relies on hypothetical population effect sizes. For example, let’s say we want to conduct a study of exercise and longevity without any prior studies. Therefore, we do not know whether exercise has an effect or how big the effect is. This does not stop us from calculating power because we can just make assumptions about the effect size. Let’s say we assume the effect is two years. The main reason to compute hypothetical power is to plan sample sizes of studies. For example, we have information about the standard deviation of people’s life span and can compute power for hypothetical sample sizes. A common recommendation is to plan studies with 80% power to obtain a significant result with the correct sign.
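This kind of a priori calculation can be sketched with a simple z-test approximation (the numbers are the hypothetical ones from the example; real planning would use the exact test for the design):

```python
from scipy.stats import norm

def power_ztest(effect, se, alpha=0.05):
    """Probability of a two-sided significant result given a true effect (z-test sketch)."""
    crit = norm.isf(alpha / 2)
    z = effect / se
    return norm.sf(crit - z) + norm.cdf(-crit - z)

# Assumed 2-year effect with a standard error of 1 year
print(round(power_ztest(2, 1), 2))  # ~0.52: barely better than a coin flip

# For 80% power, the expected z-value must reach ~2.8 (1.96 + 0.84),
# i.e., the standard error would have to shrink to about 2 / 2.8 years
print(round(norm.isf(0.025) + norm.isf(0.20), 2))  # ~2.8
```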

It would be silly to compute the hypothetical power for an effect size of zero. First, we know that the probability of a significant result without a real effect is set by the researcher. When researchers use p < .05 as the rule to determine significance, the probability of obtaining a significant result without a real effect is 5%. If they use p < .01, it is 1%. No calculations are needed. Second, researchers conduct power analysis to find evidence for an effect, so it would make no sense to do the power calculation with a value of zero. This is the null hypothesis that researchers want to reject, and they want a reasonable sample size to do so.

All of this means that hypothetical power calculations assume a non-zero effect size and power is defined as the conditional probability to obtain a significant result for a specified non-zero effect size. Z-curve is used to compute a different type of power. The goal is to estimate the average true power of a set of studies. This average can be made up of a mix of studies in which the null hypothesis is true or false. Therefore, z-curve estimates are no longer conditional on a true effect. When the null hypothesis is true, power is set by the significance criterion. When there is an effect, power is a function of the size of the effect. All of the discussion of conditional probability is just needed to understand the distinction between the definition of power in hypothetical power calculations and in empirical estimates of power with z-curve. The short and simple definition of power is simply the probability of a study to produce a significant result.

Populations of Studies

Brunner and Schimmack (2020) introduced another distinction between power estimates that is important for the understanding of z-curve. One population is all studies that have been conducted, regardless of whether the results were significant. Let’s assume researchers’ computers were hooked up to the internet and, whenever they conduct a statistical analysis, the results are stored in a giant database. The database will contain millions of p-values, some above .05 and others below .05. We could now examine the science-wide average power of null hypothesis significance tests. In fact, it would be very easy to do so. Remember, power is defined as the probability of obtaining a significant result. We can therefore just compute the percentage of significant results to estimate average power. This is no different than averaging the results of 100,000 roulette games to see how often a table produces “red” or “black” as an outcome. If the table is biased and produces “red” results more often, you could win a lot of money with that knowledge. In short,

The percentage of significant results in a set of studies provides an estimate of the average power of the set of studies that was conducted.
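This claim is easy to verify in a simulation (a sketch, assuming every test statistic is a z-score with the same non-centrality):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
crit = norm.isf(0.025)            # two-sided alpha = .05
ncp = 2.0                         # shared non-central value (~52% true power)
z = ncp + rng.standard_normal(1_000_000)

# The share of significant results recovers the average power
print(round(np.mean(np.abs(z) > crit), 2))  # ~0.52
```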

We would not need a tool like z-curve if power estimation were that easy. The reason why we need z-curve is that we do not have access to all statistical tests that were conducted in science, psychology, or even a single lab. Although data sharing is becoming more common, we only see the fraction of results that are published in journal articles or preprints on the web. The published set of results is akin to the proverbial tip of the iceberg; many results remain unreported and are not available for meta-analysis. This means we only have a sample of studies.

Whenever statisticians draw conclusions about populations from samples, it is necessary to worry about sampling bias. In meta-analyses, this bias is known as publication bias, but a better term for it is selection bias. Scientific journals, especially in psychology, prefer to publish statistically significant results (exercise increases longevity) over non-significant results (exercise may or may not increase longevity). Concerns about selection bias are as old as meta-analyses, but actual meta-analyses have often ignored the risk of selection bias. Z-curve is one of the few tools that can be used to detect selection bias and quantify the amount of selection bias (the other tool is the selection model for effect size estimation).

To examine selection bias, we need a second approach to estimating average power, other than computing the percentage of significant results. The second approach is to use the exact p-values of the studies (e.g., p = .17, .05, .005) and to convert them into z-values (e.g., z = 1, 2, 2.8). These z-values are a function of the true power of a study (e.g., a study with 50% power has an expected z-value of ~2) and sampling error. Z-curve uses this information to obtain a second estimate of the average power of a set of studies. If there is no selection bias, the two estimates should be similar, especially in reasonably large sets of studies. However, the percentage of significant results (power estimate 1) is often higher than the z-curve estimate (power estimate 2). This pattern of results suggests selection for significance.

In conclusion, there are two ways to estimate the average power of a set of studies. Without selection bias, the two estimates will be similar. With selection bias, the estimate based on counting significant results will be higher than the estimate based on the exact p-values.
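The effect of selection can also be simulated directly. With low-powered studies and a filter that keeps only significant results, the count-based estimate becomes useless (a sketch with an assumed true power of ~17%):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
crit = norm.isf(0.025)                    # two-sided alpha = .05
z = 1.0 + rng.standard_normal(100_000)    # low-powered studies (~17% power)

published = z[np.abs(z) > crit]           # journals keep only significant results

print(round(np.mean(np.abs(z) > crit), 2))          # ~0.17: share significant among ALL studies
print(round(np.mean(np.abs(published) > crit), 2))  # 1.0: "ODR" in the published record
```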

Figure 1 illustrates the extreme scenario that the true power of studies was just 40%, but selection bias filtered out all non-significant results.


Figure 1. Histogram of z-scores from 1,984 significant tests (based on a simulation of 5,000 studies with 40% power). The solid red line represents the z-curve fit to the distribution of observed z-values. The dashed red line shows the expected distribution without selection bias. The vertical red line shows the significance criterion, p < .05 (two-sided, z ~ 2). ODR = Observed Discovery Rate, EDR = Expected Discovery Rate, ERR = Expected Replication Rate. FDR = False Positive Risk, not relevant for the Introduction.


The figure shows a z-curve plot. Understanding this plot is important for the use of z-curve. First, the plot is a histogram of absolute z-values. Absolute z-values are used because in field-wide meta-analyses the sign has no meaning. In one study, researchers predicted a negative result (aspirin decreases blood pressure) and in another study they predicted a positive result (exercise increases longevity). What matters is that the significant result was used to reject the null hypothesis in either direction. Z-values above 6 are not shown because they imply nearly 100% power. The critical range is z-scores between 2 (p = .05, just significant) and 4 (~ p = .0001).

The z-curve plot makes it easy to spot selection for significance: there are many studies with just significant results (z > 2) and no studies with just non-significant results, which are often called marginally significant results because publications use them to reject the null hypothesis with a relaxed criterion. A plot like this cannot be produced by sampling error.

In a z-curve plot, the percentage of significant results is called the observed discovery rate. Discovery is a term used in statistics for a significant result. It does not mean a breaking-news discovery; it just means p < .05. The ODR is 100% because all results are significant. Taken at face value, this would imply that all studies tested a true hypothesis with 100% power. However, we know that this is not the case. Z-curve uses the distribution of significant z-scores to estimate power, but there are two populations of studies to consider. One population is all studies, including the missing non-significant results. I will explain later how z-curve estimates power. Here it is only important that the estimate is 40%. This estimate is called the expected discovery rate. That is, if we could get access to all missing studies, we would see that only 40% of the studies were significant. Expected therefore means without selection bias and with open access to all studies. The difference between the ODR and EDR quantifies the amount of selection bias. Here, selection bias inflates the ODR from 40% to 100%.

It is now time to introduce another population of studies. This is the population of studies with significant results. We do not have to assume that all of these studies were published. We just assume that the published studies were not selected based on their p-values. This is a common assumption in selection models. We will see later how changing this assumption can change results.

It is well known that selection introduces bias in averages. Selection for significance selects studies that had positive sampling error that pushed their z-scores above 2, while the expected z-score without sampling error is only 1.7, not significant on its own. Thus, a simple power calculation for the significant results would overestimate power. Z-curve corrects for this bias and produces an unbiased estimate of the average power of the population of studies with significant results. This estimate of power after selection for significance is called the expected replication rate (ERR). The reason is that the average power of the significant results predicts the percentage of significant results if the studies with significant results were replicated exactly, including the same sample sizes. The outcome of this hypothetical replication project would be 40% significant results. The decrease from 100% to 40% is explained by selection and regression to the mean. A study with an expected z-value of 1.7 that was pushed to 2.1 by sampling error is unlikely to benefit from the same sampling error again and produce a significant result a second time.
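The 40% replication outcome follows from regression to the mean and can be reproduced in a few lines (a sketch, assuming every study has an expected z-value of 1.7 and a one-sided selection rule):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
crit = norm.isf(0.025)
ncp = 1.7                                   # expected z-value without sampling error

original = ncp + rng.standard_normal(1_000_000)
selected = original > crit                  # selection for significance

# Exact replications of the selected studies: new sampling error, same ncp
replication = ncp + rng.standard_normal(selected.sum())
print(round(np.mean(replication > crit), 2))  # ~0.40
```

The replication rate equals the true power of the selected studies, not the inflated power implied by their observed z-values.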

At the bottom of the z-curve 3.0 plot, you see estimates of local power. These are average power estimates for ranges of z-values. The default is to use steps of z = 0.5. In this example, the magnitude of the observed z-values does not matter: z-values between 0 and 0.5 are estimated to have 40% power, just like z-values between 5.5 and 6. This happens when all studies have the same power. When studies differ in power, local power increases with the z-values because studies with higher power are more likely to produce larger z-values.
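A small sketch can illustrate why local power rises with z only under heterogeneity. The two-group mixture below is an assumption for the example (illustrative Python, not z-curve's actual estimation); it computes the average true power of the studies falling into each z-bin of width 0.5.

```python
import math
import numpy as np

rng = np.random.default_rng()

def power_of(mean_z, crit=1.96):
    # One-sided power: P(observed z > crit) given the expected z-value.
    return 0.5 * math.erfc((crit - mean_z) / math.sqrt(2))

# Assumed heterogeneous mixture: half low-power (~7%), half high-power (~80%).
n = 200_000
mean_z = rng.choice([0.5, 2.8], size=n)
z = rng.normal(mean_z, 1.0)
true_power = np.where(mean_z == 0.5, power_of(0.5), power_of(2.8))

# Average true power within z-bins of width 0.5, like the local power row.
for lo in np.arange(0.0, 3.0, 0.5):
    in_bin = (z >= lo) & (z < lo + 0.5)
    print(f"z in [{lo:.1f}, {lo + 0.5:.1f}): mean power = {true_power[in_bin].mean():.2f}")
```

The printed averages rise from bin to bin because high-powered studies dominate the larger z-values. With a homogeneous population (a single expected z-value), every bin would show the same power, which is the flat pattern described above.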

When all studies have the same power, power is said to be homogeneous. When studies have different levels of power, power is heterogeneous. Homogeneity, or only small heterogeneity, in power implies that it is easy to infer the power of studies with non-significant results from studies with significant results, because power is more or less the same for all studies. Some selection models, like p-curve, assume homogeneity. Under this assumption, it is not necessary to distinguish populations of studies with or without significant results: if the true power is the same for all studies, it is also the same for all subsets of studies. This is different for z-curve. Z-curve allows for heterogeneity in power, and z-curve 3.0 provides a test of heterogeneity. If there is heterogeneity in power, the ERR will be higher than the EDR because studies with higher power are more likely to produce a significant result (Brunner & Schimmack, 2020).
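The relation between the EDR and the ERR under homogeneity versus heterogeneity can likewise be checked directly. In the sketch below (illustrative Python; the two-group mixture is an assumption for the example), the "true" EDR is the discovery rate among all simulated studies and the "true" ERR is the average power of the studies that reached significance.

```python
import math
import numpy as np

rng = np.random.default_rng()

def power_of(mean_z, crit=1.96):
    # One-sided power: P(observed z > crit) given the expected z-value.
    return 0.5 * math.erfc((crit - mean_z) / math.sqrt(2))

n = 200_000

# Heterogeneous case: half the studies have ~7% power, half ~80% power.
mean_z = rng.choice([0.5, 2.8], size=n)
z = rng.normal(mean_z, 1.0)
sig = z > 1.96
true_power = np.where(mean_z == 0.5, power_of(0.5), power_of(2.8))

edr_true = sig.mean()              # average power of ALL studies
err_true = true_power[sig].mean()  # average power of the SIGNIFICANT studies
print(f"Heterogeneous: EDR-analog = {edr_true:.2f}, ERR-analog = {err_true:.2f}")

# Homogeneous case: a single expected z-value of 1.7 -> ERR equals EDR (~.40).
z_h = rng.normal(1.7, 1.0, size=n)
edr_h = (z_h > 1.96).mean()
err_h = power_of(1.7)
print(f"Homogeneous:   EDR-analog = {edr_h:.2f}, ERR-analog = {err_h:.2f}")
```

In the heterogeneous case the ERR-analog clearly exceeds the EDR-analog, because selection for significance over-samples the high-powered group; in the homogeneous case the two rates coincide.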

To conclude, this introduction covered the basic statistical concepts that are needed to conduct z-curve analyses and to interpret the results correctly. The key constructs are:

Meta-Analysis: the statistical analysis of results from multiple studies
Null Hypothesis Significance Testing (NHST): the practice of testing hypotheses by comparing p-values to a significance criterion (alpha)
Statistical Significance: p < .05 (alpha)
(Statistical) Power: the probability of obtaining a significant result
Conditional Power: the probability of obtaining a significant result with a true effect
Populations of Studies: A set of studies with a common characteristic
Set of all studies: studies with non-significant and significant results
Selection Bias: An overrepresentation of significant results in a set of studies
(Sub)Set of studies with significant results: Subset of studies with p < .05
Observed Discovery Rate (ODR): the percentage of significant results in a set of studies
Expected Discovery Rate (EDR): the z-curve estimate of the discovery rate based on z-values
Expected Replication Rate (ERR): the z-curve estimate of average power for the subset of significant results