How Bad is P-Curve Really and Why Should We Care?

P-curve was introduced a little over a decade ago by Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons (2014), the same team that later launched the DataColada blog. It is a selection-model approach designed specifically to examine the evidential value of published findings when non-significant results are missing from the record and publication bias inflates estimates of power that ignore selection.


The method’s goal and its historical context

Its statistical goal is to test the null hypothesis that all significant results are false positives. While methodologists had warned about this possibility (Rosenthal, 1979), it was considered unlikely that large sets of studies could be published without real effects. However, the DataColada team showed that it can be relatively easy to produce significant results without real effects when the data are p-hacked (Simmons, Nelson, & Simonsohn, 2011, Psychological Science, “False-Positive Psychology”). Awareness of inflated type-I error rates and replication failures raised concerns that most results might be false positives (Ioannidis, 2005).


Applications and Limitations

Over the past decade, p-curve has been applied in numerous meta-analyses, and the typical conclusion is that the analyzed literature shows evidential value. However, this conclusion has a critical limitation: rejecting the null hypothesis that all results are false positives does not reveal how many results are false positives, how large the true effects are, or how much reported effect sizes are inflated by publication bias. The latest version of p-curve adds an estimate of “power” to provide quantitative information about the amount of evidential value in a set of studies. This blog post examines the controversy surrounding this parameter of the p-curve model.


Scope of the Discussion

To be clear, the developers also introduced a version of p-curve for effect-size estimation, but this procedure has been used rarely and performs worse than alternative bias-correcting methods when credible nonsignificant evidence is available (see Carter et al., 2019). Consequently, the present discussion focuses on p-curve as a test of evidential value, as implemented in the public p-curve app, rather than as an estimator of effect magnitude.


The Current Debate

Morey and Davis-Stober (2025) published a formal critique in the Journal of the American Statistical Association (JASA) (see my earlier post, Rindex.08.08.25). Uri Simonsohn (2025) responded in a post on the DataColada blog (#129).

The key issue is how p-curve performs when the power of studies varies across studies (i.e., heterogeneity in power). Morey and Davis-Stober present a simulation with a true mean power of 66%, for which p-curve returns an estimate of 87%, a difference of 21 percentage points. Simonsohn shows simulations in which the bias is never larger than 5 percentage points.


Simulation Hacking

The controversy illustrates a broader methodological issue that might be called simulation hacking. Just as empirical researchers can obtain desired results through selective analyses (p-hacking), methodologists can shape conclusions by emphasizing simulation conditions where a method performs particularly well or poorly. This does not mean that the chosen scenarios are unrealistic; rather, it highlights that statistical procedures often perform differently across contexts. A method may be robust and informative for some purposes yet unreliable for others, depending on which assumptions the simulations accentuate.


Simulating Field-Wide Heterogeneity

Figure 1: Distribution of Effect Sizes in Morey and Davis-Stober’s Simulation

Morey and Davis-Stober (2025) simulated a distribution of true effect sizes that is shown in their Figure 1. This distribution is broadly consistent with the average effect sizes reported in psychology meta-analyses (Richard et al., 2003). Such a distribution can be used to simulate p-values from studies testing a wide variety of hypotheses and research designs that aim to estimate the typical power of studies in psychology (e.g., Cohen, 1962; Schimmack, 2020; Soto & Schimmack, 2024). These conditions generate extreme heterogeneity in statistical power across studies. Morey and Davis-Stober’s analysis suggests that under such heterogeneity, p-curve will produce inflated estimates of average power.

A concrete example is provided by the Reproducibility Project (Open Science Collaboration, 2015). These data are especially informative because the outcomes of the replication studies offer an independent benchmark of the original studies’ power to produce significant results without selection bias. The observed replication rate implies an average true power of less than 40%. Yet Schimmack (2025) analyzed the p-values of the original studies and obtained a p-curve estimate of power of 91%, 95% CI = 86% to 94%.

If the replication outcomes were unknown, this p-curve result would incorrectly suggest that the high proportion of significant findings in psychology journals (Sterling et al., 1995) reflects genuinely high study power rather than publication bias or p-hacking. In conclusion, a tool that was developed in response to the replication crisis to reveal p-hacking would falsely suggest that power is high and p-hacking is rare.


Simulating Meta-Analyses of P-Hacked Literatures

Simonsohn (2025) simulated studies with low power that never exceeds 80%. Examples like this can be found in meta-analyses of p-hacked studies. For example, a recent p-curve analysis of 825 terror-management studies yielded a power estimate of only 25%, 95% CI = 21% to 29%. This finding implies that exact replications of these studies would produce at most about 30% significant results, a rate similar to the success rate in actual replication studies (Open Science Collaboration, 2015). An anecdote tells of a social psychologist who prided himself on a success rate of 1 out of 3 studies and compared it to baseball, where a .300 batting average is excellent.

The problem here is not that p-curve estimates are biased. Rather, the problem is that they can easily be misinterpreted if heterogeneity in power is ignored. After all, p-curve does reject the null hypothesis that all studies are false positives. But assuming that all studies have the same power also implies that there are no false positive results, contrary to Simmons et al.’s (2011) suspicion that false positives are common. P-curve simply does not provide information about false positives unless all significant results are false positives: the power estimate could be an average of false positives and true positives with high power.

Stay Calm: Use Z-Curve

There is no need to fight over p-curve because we have a better method that works with and without heterogeneity, called z-curve (Bartoš & Schimmack, 2022; Brunner & Schimmack, 2020). When we developed z-curve, we compared it against alternative models. We presented all simulations, even those where p-curve performed a bit better with homogeneous data. The simulations showed that both methods have only a small bias when heterogeneity is small, but p-curve has a large bias when heterogeneity is large. So, we can simply use z-curve for all data.

Here is a simple example that shows how z-curve is superior to p-curve, even if p-curve estimates are only slightly biased. The simulation uses 50% false positives and 50% true positives with 80% power. It is easy to see that we would expect .50 * .05 + .50 * .80 = .025 + .40 = .425, that is, 42.5% significant results. This is the expected replication rate if the studies were replicated exactly without selection bias. It is called power in p-curve, but that term ignores that real data may contain false positives.
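The arithmetic above is easy to verify with a quick Monte Carlo sketch (the function name and parameters are mine, purely for illustration):

```python
import random

def simulate_significance_rate(n_studies=100_000, prop_false=0.5,
                               power=0.80, alpha=0.05, seed=1):
    # Half the studies test a null effect (significant with probability alpha),
    # the other half test a true effect with 80% power.
    random.seed(seed)
    significant = 0
    for _ in range(n_studies):
        if random.random() < prop_false:
            significant += random.random() < alpha   # false positive
        else:
            significant += random.random() < power   # true positive
    return significant / n_studies

rate = simulate_significance_rate()
# expected: .50 * .05 + .50 * .80 = .425
```

With 100,000 simulated studies, the simulated rate lands within about a percentage point of the analytic value of 42.5%.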

Figure 2: p-curve plot with power estimate

Consistent with Simonsohn’s claims, the bias in the p-curve estimate is small (p-curve estimate: 44% vs. true parameter: 42.5%), but p-curve does not tell us whether all studies have about 40% power or whether this is an average of studies that vary in power or even include false positive results.

Z-curve’s estimate of the expected replication rate (ERR) is accurate (42%). More important, it also recognizes that the data are heterogeneous. A simple way to see this is that it estimates a lower discovery rate for all studies, including non-significant results that are not reported. A discrepancy between EDR and ERR indicates heterogeneity because studies with higher power have a higher chance of being in the set of significant results.
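The reason a discrepancy between EDR and ERR signals heterogeneity can be made concrete: a study with power p enters the significant set with probability p, so the average power after selection is a power-weighted average. A small illustrative sketch of this selection effect (not the z-curve estimator itself):

```python
def power_before_and_after_selection(powers):
    # average power of all studies attempted (EDR analogue)
    before = sum(powers) / len(powers)
    # average power among studies that reached significance (ERR analogue):
    # each study is selected with probability equal to its own power
    after = sum(p * p for p in powers) / sum(powers)
    return before, after
```

With homogeneous power (e.g., all studies at 50%) both values are identical; with a mix of 5% and 95% power, the average rises from 50% before selection to about 90% after selection, which is why the significant results look much stronger than the research line that produced them.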

Z-curve also estimates the expected discovery rate for the full range of z-values, including non-significant results that are not reported (see the red dotted line). The EDR of 11% is incompatible with the observed discovery rate of 100% (only significant results are published). Even the upper limit of the CI is only 18% (about 5 studies for each significant result). The p-curve power estimate cannot be used to evaluate publication bias, although p-curve is often falsely used as a test of publication bias.

Finally, the EDR can be used to estimate the false positive risk with a formula by Soric (1989). We know the true percentage is 50%. The z-curve estimate is only 45%, but the 95%CI around this estimate is wide. Most troubling, the 1,000 studies do not rule out the possibility that all studies are false positives (the 95%CI includes 100%). This is very different from the inference we may draw from the p-curve estimate of 42% power that does not suggest a high rate of false positive results.
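Soric’s (1989) bound can be written in a couple of lines. This is my reading of the formula as it is used with z-curve’s EDR; treat it as a sketch:

```python
def soric_fdr(edr, alpha=0.05):
    # Soric's (1989) upper bound on the false discovery risk:
    # with a discovery rate of edr at significance level alpha,
    # at most this fraction of discoveries can be false positives.
    return (1 / edr - 1) * (alpha / (1 - alpha))
```

Plugging in an EDR of 11% gives a maximum false discovery risk of roughly 43%, in line with the estimate reported above (small differences reflect rounding of the EDR).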

Z-curve also provides additional information about the expected discovery rate (EDR) for different ranges of observed z-values (see the percentages below the x-axis of the z-curve plot). Results that are just significant (e.g., z = 2 to 2.5) are likely to include many false positives; in this range, the expected discovery rate is only about 27%.

By contrast, studies with larger z-values (e.g., z > 4) are almost certainly based on true effects and have an expected replication probability of around 80%. Z-curve slightly overestimates replicability for these high z-values, but the main point is that discovery rates are expected to change dramatically due to heterogeneity in the probability of obtaining significant results.


Conclusion

This blog post showed how silly it is to fight over p-curve with carefully selected simulation scenarios. P-curve makes the unrealistic assumption that studies are homogeneous. Z-curve avoids this assumption, models heterogeneity, and provides more information about the data than p-curve can. So, researchers can just use z-curve, and the performance of p-curve is no longer relevant. It is a bit like testing the assumption of equal variances in t-tests: we can simply use a t-test that avoids this assumption.

It is clear why Simonsohn’s DataColada post neither mentions a method that replaced p-curve several years ago nor allows comments that would alert readers to alternative methods. It is less clear why Morey and Davis-Stober criticize a method that is obsolete without mentioning that their criticisms have been addressed by a better method. But then, who understands the childish games of academics that produce publications, but not knowledge.

Unlike DataColada, my blog allows comments, and I welcome comments by DataColada, Morey, Davis-Stober, or anybody else.


The Mythical Marriage of Fisher and Neyman-Pearson

Preface

This post grew out of a long discussion with ChatGPT about Gerd Gigerenzer’s treatment of the history of statistics and its influence on psychology in his book The Empire of Chance (1989).

I actually found this book by chance, because ChatGPT recommended it during a literature search. Psychology now has an overwhelmingly journal-based culture, where articles appear online as PDFs and are rarely accompanied by physical books. I am old enough to remember browsing the shelves of real libraries—especially the magnificent stacks at the University of Illinois and the Roberts Library in Toronto—but I stopped doing so about fifteen years ago. Younger colleagues may never know that quiet pleasure.

So, it is not surprising that few psychologists have actually read The Empire of Chance. Fortunately, I was able to access it through my University of Toronto credentials. For most readers, however, it remains locked behind a paywall.

To explore Gigerenzer’s arguments more closely, I uploaded the relevant chapters to ChatGPT (since they are not freely available) and discussed the content in light of my broader research on the history of power, significance testing, and replicability.

This post summarizes our shared understanding of how statistical thinking entered psychology, and why we concluded that Gigerenzer’s famous claim that null-hypothesis significance testing (NHST) is a hybrid of Fisher and Neyman-Pearson is inaccurate. It isn’t a hybrid at all. It’s pure Fisher.

Neyman and Pearson’s framework never gained traction. Today, Neyman’s invention of confidence intervals dominates sound statistical inference because confidence intervals avoid the problems of Fisher’s significance testing without the difficulties of implementing the Neyman-Pearson approach. So, we moved from Fisher to Neyman, and the joint Neyman-Pearson framework was never relevant in psychologists’ use of statistics.

Introduction

For decades psychologists have been told that the way they analyze data—null-hypothesis significance testing—is a hybrid of two rival statistical philosophies: Fisher’s significance test and the Neyman-Pearson decision framework.

Gigerenzer popularized this story in The Empire of Chance (1989), arguing that textbooks merged the two systems and gave the illusion of harmony. It’s a neat narrative—but it doesn’t survive close inspection.


1 · Fisher’s significance test

1. Make a prediction or explore whether two variables are related.
2. Collect data and compute a p-value assuming no relation (H₀).
3. If p is small enough, reject H₀ and claim support for the expected directional effect.

As Fisher wrote in 1935, “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (The Design of Experiments, p. 16).

This deceptively simple procedure made inference a one-sided game: we seek “disproof” of H₀, not testing of a specific H₁.
In practice, rejecting H₀ is treated as confirming our theory—verification dressed up as falsification.
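The asymmetry is easy to see in code. Here is a minimal, illustrative z-test version of the Fisherian routine (the function name and data are hypothetical, and a real analysis would use a t-distribution for small samples):

```python
import math
from statistics import NormalDist, mean, stdev

def fisher_style_test(sample, mu0=0.0, alpha=0.05):
    # two-sided p-value under H0: mu == mu0; no alternative H1 is specified
    n = len(sample)
    z = (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # the only possible positive claim is "reject H0"; failing to reject
    # licenses no conclusion about any specific alternative
    return p, p < alpha
```

Note that the return value offers only one verdict: H₀ rejected or not. Nothing in the procedure can falsify a substantive alternative hypothesis.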


2 · The Neyman-Pearson alternative

Neyman and Pearson proposed a symmetric system of two hypotheses, H₀ and H₁, each with defined long-run error rates.

  • H₀ can be rejected, but H₁ can also be rejected.
  • To do so we must specify a concrete alternative, e.g., d = 0.5, and design the study with known α and β.
  • A result can therefore falsify a risky prediction (rejecting d = .8 implies the effect is smaller than “large”).
  • If neither hypothesis can be rejected, we collect more data and test again.

In this framework, power and Type II error are not afterthoughts—they’re the price of claiming evidence.
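In contrast to the Fisherian routine, the Neyman-Pearson logic forces a design decision before data collection. A sketch of the standard two-group sample-size calculation against a concrete H₁ (normal approximation; the exact t-based answer is slightly larger):

```python
import math
from statistics import NormalDist

def n_per_group(d=0.5, alpha=0.05, beta=0.20):
    # smallest n per group so that a two-sided test at level alpha
    # has power 1 - beta against the specific alternative Cohen's d
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(1 - beta)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)
```

For d = 0.5 with α = .05 and β = .20, this gives roughly 63 participants per group. The point is that α and β are both fixed in advance against a named alternative, which is exactly the step psychology never adopted.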


3 · Why it never took root in psychology

Psychology kept Fisher’s asymmetry. Researchers learned to celebrate significant results and ignore non-significant ones. Gigerenzer claimed textbooks resolved the dispute by fusing both schools into a “hybrid model.” But the evidence tells a different story.


4 · Why the “hybrid” is a myth

1 · Fixed thresholds were Fisherian conveniences.
Before computers, tables listed critical values for .05, .01, and .001. Using them was a practical shortcut, not an adoption of Neyman-Pearson error control.
Reporting “p < .05” or adding ** for p < .01 continued Fisher’s graded-evidence tradition.

2 · Type II errors were rhetorical, not operational.
Textbooks mentioned them vaguely—“the probability of an error if H₀ is false”—but never linked them to a specific H₁ such as d = .5. β was seldom calculated or used.

3 · Power was rarely used for design or inference.
Even after Cohen (1962) called for power analysis, psychologists mostly ignored power or treated it only as planning advice for achieving significance, not as a way to quantify type-II errors in inferences that reject a specific H₁.

4 · In practice, nothing changed.
Studies were published when p < .05 and forgotten when p > .05. Journal success rates were over 90%, reflecting a one-sided testing culture, not a balanced decision framework.


5 · The broader context

Other social sciences followed different paths. Economists and sociologists, working with large samples and directly measurable variables, emphasized estimation and precision—effect sizes, standard errors, and confidence intervals. They had little interest in either Fisher’s or Neyman-Pearson’s philosophies, although interpretation of results was also influenced by significance thresholds.

Ironically, Neyman’s own (1937) invention of the confidence interval would have solved psychology’s dilemma: a CI simultaneously rejects extreme H₀ and H₁ values without pre-specifying them. Gigerenzer does not mention the modern hybrid of significance testing that uses values of 0 inside or outside the confidence interval to replace Fisher’s significance test.


6 · Conclusion

The so-called hybrid of Fisher and Neyman-Pearson is a myth.

Psychology adopted Fisher’s one-sided test with a conventional publishing threshold of p < .05 and never implemented the symmetrical logic of Neyman-Pearson decisions.

Even Cohen’s power analysis was absorbed into the same framework—another tool for ensuring significance, not for falsifying theoretical claims.

What Gigerenzer described as a marriage was never consummated.

Psychology has lived for nearly a century with Fisher alone, and is now replacing it with Neyman’s confidence intervals.

Neyman-Pearson’s marriage never produced any children.


References

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krickeberg, K. (1989). The empire of chance: How probability changed science and everyday life. Cambridge University Press.

Gigerenzer, G. (1993). The superego, the ego, and the id of statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, 231, 289–337.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105–110. https://doi.org/10.1037/h0031322


A Better P-Curve: The Aggregated Cauchy Association Test with PP-Values

Main Points

  1. Purpose of p-curve.
    The p-curve method was introduced as a test of the null hypothesis that all significant results are false positives (that is, that H₀ is true for all tests).
  2. Use in psychology.
    P-curve became popular in psychology as a way to demonstrate that a body of research has evidential value—in other words, that not all significant results are false positives.
  3. Criticisms.
    The method has been criticized on several grounds, including unsound statistical assumptions and inadmissible decision rules (Morey & Davis-Stober, 2025).
  4. Earlier related work.
    It has gone largely unrecognized that similar approaches for combining truncated p-values were developed much earlier in genomics (e.g., the Truncated Product Method, TPM; Zaykin et al., 2002).
  5. A modern alternative.
    A newer and more widely used approach is the Aggregated Cauchy Association Test (ACAT). Although ACAT does not assume truncation, truncated p-values can be analyzed by selecting p < α and dividing them by α to obtain pp-values that follow a uniform(0, 1) null distribution.
  6. Advantages over p-curve.
    This pp + ACAT approach addresses many of the statistical problems identified by Morey and Davis-Stober (2025), including inadmissibility, discontinuity, and sensitivity to values near α, while retaining the same logical test of the global null.
  7. Remaining limitations.
    Like p-curve, ACAT tests the hypothesis that all significant results are false positives, but it does not quantify the strength of evidence (e.g., average power) or capture heterogeneity among studies. For this reason, z-curve remains the preferred method for evaluating evidential value in a set of significant results.
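Under the assumptions in points 5 and 6, the pp + ACAT test takes only a few lines. This is my own sketch of the idea, not code from Liu et al. (2019):

```python
import math

def acat_pp(p_values, alpha=0.05):
    # keep significant results and rescale them to pp-values,
    # which are uniform(0, 1) under the global null
    pp = [p / alpha for p in p_values if p < alpha]
    if not pp:
        raise ValueError("no significant p-values to combine")
    # Cauchy combination (equal weights): each tan term is standard Cauchy
    # under the null, and so is their mean
    t = sum(math.tan((0.5 - u) * math.pi) for u in pp) / len(pp)
    return 0.5 - math.atan(t) / math.pi   # combined p for the global null
```

A set of clearly significant results (e.g., p = .001, .002, .003 at α = .05) yields a small combined p, rejecting the hypothesis that all of them are false positives; a single p-value sitting at exactly α/2 yields a combined p of .5, as it should under the null.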

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, Article MP.2018.874. https://doi.org/10.15626/MP.2018.874

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, Article MP.2021.2720. https://doi.org/10.15626/MP.2021.2720

Morey, R. D., & Davis-Stober, C. P. (2025). On the poor statistical properties of the p-curve. The American Statistician, in press. (See also: Schimmack, U. (2025, August 8). Review of “On the poor statistical properties of the p-curve meta-analytic procedure.” Replication Index. https://replicationindex.com/2025/08/08/review-of-on-the-poor-statistical-properties-of-the-p-curve-meta-analytic-procedure)

Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H., & Weir, B. S. (2002). Truncated product method for combining p-values. Genetic Epidemiology, 22(2), 170–185. https://doi.org/10.1002/gepi.0042

Liu, Y., Chen, S., Li, Z., Morrison, A. C., Boerwinkle, E., & Lin, X. (2019). ACAT: A fast and powerful p-value combination method for rare-variant analysis in sequencing studies. American Journal of Human Genetics, 104(3), 410–421. https://doi.org/10.1016/j.ajhg.2019.01.002

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242


The Power of Post-Hoc Power: The Glucose Theory of Willpower is Dead

It has been claimed that the psychological literature is filled with zombie theories—walking dead that are not supported by evidence, no longer believed by insiders or even by the original proponents, but that live on in mindless citations and textbooks forever. Sometimes, though, they do die in silence, without a funeral or obituary. One example of a dead theory in psychology is the glucose theory of willpower.

While the inventor of this theory (and the underlying phenomenon), Roy F. Baumeister, still clings to his broader theory of willpower despite two large replication failures, even he has walked away from the idea that willpower depends on blood-glucose levels.


1. Gailliot’s Original Glucose Studies (2007–2009)

Baumeister and Gailliot’s studies (e.g., Gailliot et al., 2007, Journal of Personality and Social Psychology; Gailliot & Baumeister, 2007, Psychological Science) claimed that exerting self-control reduces blood-glucose levels and that restoring glucose—even by drinking a sugary beverage—replenishes willpower and improves performance on subsequent self-control tasks.

The reported effects were dramatic. Across a series of small-sample experiments, the authors found large, consistently significant results, suggesting a robust physiological mechanism underlying self-control. The simplicity of the claim—that willpower literally runs on sugar—made the theory intuitively appealing and easy to test. These findings immediately inspired a wave of replications and extensions, many of which were conceptual replications using similar small-sample, between-subjects designs.


2. Schimmack’s 2012 “Incredibility Index” Analysis

In 2012, Ulrich Schimmack applied his Incredibility Index (IC-index)—a meta-analytic tool that tests whether the reported proportion of significant results is credible given the observed power of the individual studies—to Gailliot’s published results.

The results were striking. The distribution of p-values in Gailliot’s papers was too good to be true. The success rate (about 100%) was incompatible with the small sample sizes, and even with the inflated effect-size estimates, the estimated average power of these studies was far too low to justify only significant results.
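The logic of the index can be sketched as a simple binomial test (an illustration of the idea, not Schimmack’s exact implementation):

```python
from math import comb

def incredibility(k_significant, n_studies, mean_power):
    # probability of observing k_significant or more significant results
    # if every study succeeds with probability mean_power;
    # a tiny value means the reported success rate is "too good to be true"
    return sum(comb(n_studies, j) * mean_power**j * (1 - mean_power)**(n_studies - j)
               for j in range(k_significant, n_studies + 1))
```

For example, ten out of ten significant results from studies with 50% power have a probability of about .001; reporting such a streak without mentioning failed studies is itself evidence of selection.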

During the review process, Roy F. Baumeister openly admitted that the published studies were selected from a larger set that included many null results. He justified this practice by claiming that this was simply what everyone did—an argument that, while true at the time, highlighted how pervasive questionable research practices had become in psychology. At that point, it was still unclear whether these practices merely inflated real effects or had created an entirely spurious phenomenon.


3. The Fallout: A Futile Wave of Follow-Up Research

Yet the glucose theory continued to attract attention—precisely because those early, inflated findings appeared compelling. For several years, researchers treated it as a promising physiological explanation for self-control. However, the replication crisis made it possible to publish replication failures, and several articles reported that they could not find effects of glucose on willpower.

From 2012 to 2018, dozens of researchers tried to replicate the glucose–willpower effect and found mostly null or inconsistent results. Eventually, large-scale meta-analyses—Vadillo, Gold, and Osman (2016) and Lange and Eggert (2014)—confirmed the suspicion: the literature was biased, the true effect size was near zero, and the original findings likely reflected p-hacking and selective publication.


4. The Current Status (Post-2020): The Glucose Model Is Dead

Today’s consensus—including Baumeister’s own 2024 review—effectively concedes that the glucose theory of willpower is dead, despite the initially strong-looking evidence for it. That evidence only appeared strong but was in fact illusory, based on unscientific practices that inflated effect sizes and suppressed null results.

Even the broader theory of ego-depletion has come under heavy criticism, as it was developed and supported using the same questionable practices. Two large replication studies failed to reproduce the basic ego-depletion effect and produced effect-size estimates close to zero. Baumeister continues to defend the broader theory, so it cannot yet be declared dead, but the glucose model has vanished.


5. Why This Matters for Meta-Science

The glucose theory of willpower is more than just a failed idea—it is a vivid example of how questionable research practices can create the illusion of discovery and lead to years of wasted effort. Schimmack’s 2012 incredibility analysis exposed the statistical impossibility of Gailliot’s findings long before the replication failures confirmed it.

6. Postscript

The future will show how many other walking-dead theories and ghostly phenomena, created in laboratories through dark research practices, will eventually die and join the glucose theory in the graveyard of psychological ideas that lack empirical support. To make this happen, psychology needs more courageous scientists—modern-day Dr. Van Helsings—who are willing to put a wooden stake through the heart of theories that deserve to die.

Happy Halloween

Shining a Light on Dark Practices in Peer-Review

Pre-publication peer review is the biggest scam in the history of science. Researchers have to pass evaluation by “peers” before an editor (often a mindless vote-counter of the opinions shared in reviews) decides whether the article deserves to be published. Arbitrary limits on the number of publications give reviewers the power to censor by writing extremely negative reviews and recommending rejection. This faulty system of quality control is maintained by publishers to resell publicly funded research to publicly funded universities at fat profit margins. In short, we would be better off publishing everything and letting open post-publication reviews decide what is good or not. Until then, researchers have to battle with anonymous reviewers in a Kafkaesque, closed, and hidden process, at the mercy of dark forces that decide their fate.

I have encountered this many times myself, but right now I am writing to help researchers who encounter Reviewer 2 when they use z-curve to examine the credibility of research in their field. Interestingly, I am never asked to review z-curve articles, although I am the expert on this method, but Reviewer 2 is asked to comment on the method and is allowed to make outrageously false claims. If this happens to you, please feel free to contact me to write a rebuttal, or just point the editor to this blog post.

You can also ask AI to evaluate the quality of Reviewer 2’s arguments and my rebuttals. I did so myself and ChatGPT and Perplexity concluded that my counterarguments are scientifically accurate and valid, but don’t trust me. Trust is good, but fact checking is better.

Reviewer 2

1. The authors primarily conceptualize replicability in terms of retrospective (or post-hoc) average power (also known as the “expected discovery rate” / EDR).

Rebuttal:

This statement is false and shows the reviewer’s limited knowledge of the method they criticize. Z-curve estimates two percentages. The first is the percentage of significant results that would be expected if the studies used in the meta-analysis were reproduced exactly and analyzed exactly the same way with the same sample sizes. This is the expected replication rate (ERR). The second is the percentage of significant results that would be expected among all studies in this new replication set, including the non-significant results that will also be produced but may or may not be reported. This is the expected discovery rate (EDR; discovery = p < .05). The reviewer confuses ERR and EDR.

2. Average power is a meta-analytic analogue of single study post hoc power. Single study post hoc power has been greatly lampooned for many decades now (Hoenig & Heisey, 2001; Yuan & Maxwell, 2005). For example, Greenland (2012) writes that post hoc power computed from completed studies is: “Irrelevan[t]: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.” In addition, average power is not relevant to the replicability of actual prospective replication studies. As McShane, Bockenholt, and Hansen (2020) write: “Average power is relevant to replicability if and only if replication is defined in terms of statistical significance within the classical frequentist repeated sampling framework. As this framework is both purely hypothetical and ontologically impossible, average power is not relevant to the replicability of actual prospective replication studies.”

Rebuttal:

All of these comments are irrelevant and rest on confusion about the term power. The classic definition of power is the probability of obtaining a significant result given a hypothetical alternative hypothesis. This definition is irrelevant for estimates of the ERR and EDR, which are determined by the true population effect sizes of the studies (and sampling error), not by hypothetical values that are no longer relevant when actual data are available.

The criticism of post-hoc power is also irrelevant because it concerns the interpretation of results from a single study, not a meta-analysis of many studies.

Finally, McShane et al.'s article makes two mistakes. First, it uses the term power for empirical estimates, although power is defined in terms of hypothetical values. Second, it relied on sets of 30 studies to claim that estimates are imprecise, but precision increases with the number of studies. The present analysis is based on over 100 studies, and the precision of the estimates is clearly specified with 95% confidence intervals. Thus, the uncertainty of the results can and should be evaluated from the actual results, not on the basis of an article that did not examine z-curve estimates.

3. Pek et al. (2022) also note ontological concerns with average power. Pek et al. (2024) further note that (as per the present authors' approach) "using power for evaluating completed studies can be counterproductive."

Rebuttal

Pek et al.'s criticism is about studies that compute post-hoc power based on the definition of power as a hypothetical construct. This criticism does not apply to z-curve, which estimates expected values based on true population effect sizes, not statistical power as defined by Pek et al. Pek et al. also did not discuss z-curve as a method for estimating expected discovery rates or expected replication rates. The article is therefore irrelevant to the evaluation of z-curve.

4. While I have thus far focused on the primary manner in which the authors conceptualize replicability (i.e., average power / EDR), exactly the same concerns apply to the secondary manner (i.e., the “expected replication rate” / ERR).

Rebuttal  

The same rebuttal holds for the ERR. It is not an estimate of average power as defined by Pek et al., because it estimates the true probability of significant results in exact replication studies, whereas Pek et al. define power as a hypothetical construct. Estimating the ERR is not wrong; calling it power is. The terms EDR and ERR therefore make clear that these estimates are not estimates of average power in the classic sense of statistical power. This criticism thus does not address z-curve estimates or their validity.

5. Rosenthal was a pioneer studying replication in psychology. Drawing on his work dating from the 1960s, Rosenthal (1990) dismissed evaluations of replicability that are dichotomous and based on significance testing as "the traditional, not very useful view of replication" and advocated evaluations of replicability that are continuous and based on effect sizes as "the newer, more useful view of replication." The authors' approach in this paper is dichotomous and based on significance testing and thus falls squarely within what Rosenthal, thirty-five years ago, already termed "the traditional, not very useful view of replication."

Rebuttal

Rosenthal made contributions to effect size meta-analysis. They are useful and important when researchers want to combine results from several close or direct replications to estimate the population effect size. The main goal of this article is different. Science-wide estimates of EDR and ERR can provide useful information for the interpretation of individual studies that lack the multiple replications needed for a meta-analysis. Moreover, they can provide information about the typical amount of publication bias in a literature and inform the planning of future studies. In short, effect-size meta-analysis is important, but so is knowing the amount of publication bias, replicability, and the false positive risk in a field of studies. Effect size meta-analyses do not provide this information.

Rosenthal was also responsible for a faulty way to assess publication bias in meta-analysis (fail-safe N) that suggested publication bias is not a big problem. Z-curve, however, can estimate the actual amount of publication bias in a literature and has revealed massive publication bias and a high false positive risk in literatures with hundreds of studies. For example, z-curve showed that Nobel Laureate Daniel Kahneman had picked priming studies for his bestseller "Thinking, Fast and Slow" that had a false positive risk of 100%. He openly distanced himself from the researchers who had published these results and were unwilling to back up their claims with actual replication studies. In this example, the average effect size of these different studies was not important. What mattered was that the studies failed to provide credible evidence that social priming works.

6. “It is therefore not surprising that a common finding among replication projects is that unbiased replication studies with larger sample sizes produce much smaller effect sizes. For instance, the ### replication project found that 88% of the replication effect sizes were severely inflated in comparison to the original effect sizes, with a median percentage decrease of 75%. As can be seen, the ### replication project takes a continuous quantitative view based on effect sizes, reporting that the median decrease in the effect size estimates was 75% and going on to characterize the full distribution of effect size differentials in Figures 1 and 2 of that paper. I do not find the present authors’ retrospective and dichotomous approach based on significance testing to be an advance over the ### replication project’s prospective and continuous approach based on effect sizes. Indeed, I view it as retrograde.”

Rebuttal

Reviewer 2 does not stop for one second to explain why effect size estimates shrank by about 75%. The z-curve analysis shows why: the original studies reported inflated effect size estimates because studies with large sampling error require large estimates to reach significance. The actual replication results cannot show that this is the reason, but the z-curve analysis of the original studies can, because it estimates how replicable these studies are in the hypothetical scenario that they are exactly replicated with a new sample. The argument also ignores that effect size estimates are rarely used to interpret results. Most of the time, the key claim is a rejection of the null hypothesis in a specific direction. This conclusion is not altered by smaller effect sizes, but it is altered when the result is no longer significant: the original conclusion no longer holds.

7. Even for those who prefer a dichotomous approach based on significance testing, when such is applied to the sports science replication project, we get a result similar to the present authors’ result (see middle of page 12 of their manuscript). Therefore, in a very important sense, the present authors’ result is already known (or at least cannot be said to be novel).

Rebuttal

This comment shows once more the reviewer's lack of understanding of science and a lack of awareness of the methodological discussion about the importance of replication studies, which by definition lack novelty. Ironically, they applaud a replication project and then criticize a replication study for being unoriginal. The proper comparison of the actual replications and the z-curve analysis is this: both projects used different methods on different sets of studies and produced consistent results. The novel finding is that, in this literature, both estimates converge on the same conclusion. When two different methods with different data show consistent results, it provides evidence that the results are not driven by sampling error (e.g., actual replication studies picked studies with easy and cheap designs) or methodological biases (e.g., replication studies produce weaker effects because the replication researchers are not experts in that field). In short, consistent results provide valuable information. Novelty is important for original studies, not for meta-analyses that assess how many of the novel original findings are real and how many may be false positives.

8. The authors use the forensic Z-curve meta-analytic procedure of Brunner & Schimmack (2020) and Bartos & Schimmack (2022). On page 3 of their manuscript, they note that they could use the forensic P-curve meta-analytic procedure of Simonsohn, Nelson, and Simmons instead. In a forthcoming Journal of the American Statistical Association paper, Morey and Davis-Stober provide a formal analysis that proves that the P-curve has poor statistical properties. For example, they prove that the P-curve produces inconsistent estimates of average power / EDR. One might question the relevance of this to the Z-curve and thus the present manuscript. I quote the final paragraph of Morey and Davis-Stober:

“As a final point, we suggest that meta-scientists be more skeptical of procedures like the P-curve in the meta-scientific literature. Papers introducing them are often light on statistical exposition, using metaphors [and] a few simulations to make sweeping arguments. Simulation is a powerful tool and can help build intuition, but it is not a substitute for formal analysis. Simulation may provide hints of problems with a procedure, but only if the simulator’s formal knowledge helps guide the choice of simulations. A simulator might quit after running a few simulations that tell them what they think is true while problems remain uncovered. Given the implications of poor forensic procedures for science, all such procedures demand deeper formal scrutiny.”

This forthcoming paper is extremely relevant to the present manuscript because the very paragraph above could be written about the Z-curve.

Rebuttal

In a legal trial, this witness would be held in contempt. They are simply lying. Brunner and Schimmack (2020) directly compared p-curve and z-curve and showed that p-curve fails when data are heterogeneous as they typically are and as they are in this article (heterogeneity:  ERR > EDR, homogeneity: ERR = EDR). Schimmack and Brunner have also written several subsequent criticisms of p-curve. Morey and Davis-Stober’s article adds to this criticism and the p-curve authors are not defending their method against these criticisms. So, yes, p-curve was an attempt to estimate the true power of a set of studies, but it failed.

It is ridiculous to imply that we can take any criticism of p-curve and apply it to a fundamentally different method. P-curve was not evaluated by simulation studies. Z-curve has been evaluated with hundreds of simulation studies and performs well with typical data sets, including data like those in this article. The convergence between the results of the actual replication project and the z-curve predictions, which the Reviewer used to claim a "lack of novelty," is also relevant here. If z-curve were flawed, why does it produce estimates that are validated by actual replication outcomes?

9. Turning back to this manuscript and its use of the Z-curve, in short, we at present know next to nothing about the statistical properties of the Z-curve (just as we knew next to nothing about the statistical properties of the P-curve until Morey and Davis-Stober came along). The statistical properties of the Z-curve may be as poor or worse than those of the P-curve. Or they may be solid. We simply cannot say. Morey and Davis-Stober write: “Given the stated purpose of the P-curve—evaluating the trustworthiness of scientific literatures—the stakes are too high to use tests with such poor, or poorly-understood, properties.” The same applies to the Z-curve which has the same stated purpose. As a consequence, I remain very skeptical of any use of the Z-curve until its properties have been investigated formally and shown not to be wanting—especially given the very high stakes involved.

Rebuttal

I had an email discussion with Davis-Stober, and he was not aware of z-curve and knows nothing about it. He simply does not think it is useful to estimate publication bias, but that is his personal opinion, not a criticism of a method that estimates it.

10. You refer to these four quantities as “parameters” but they are not parameters. The word parameter has a formal definition within the context of a statistical model and these do not qualify. These are outputs or estimands but not parameters.

No Rebuttal

That is correct. EDR and ERR are estimates of population parameters, not parameters themselves. Estimand is a fancy new word that few psychologists use; the word estimates is good enough. ODR, EDR, ERR, and FDR are estimates of population parameters. Correcting this mistake does not change anything substantial about the results.

11. You assert (arguably rather blithely) that the Z-curve’s independence assumption is met in your analysis because only one p-value per study is included in the analysis. This is of course not necessarily true. If, for example, the 269 studies share authors or sets of authors, that could induce dependence. There are of course many additional sources of possible dependence. One simply cannot say.

Rebuttal

This is simply false. The independence assumption is about the sampling errors of studies, and each new sample has a new sampling error. If all studies used z-tests, the sampling errors follow a standard normal distribution with a standard deviation of 1. When studies are heterogeneous, there is additional variation due to real differences in the non-centrality parameters (the locations of the normal distributions of z-values) that describe the sampling distributions, but this is irrelevant for z-curve because it makes no assumptions about that distribution. Some studies from one author may be close to z = 0 and those of other authors may be close to 3. That is heterogeneity, not dependence in sampling errors. Dependence of sampling errors only occurs for some analyses based on the same dataset (e.g., correlated dependent variables).
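
The point can be checked with a toy simulation (illustrative values only): z-values that cluster by author or lab are heterogeneous, but as long as each study draws a fresh sample, the sampling errors remain uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical "labs": one runs weak studies (non-centrality 0), the
# other strong ones (non-centrality 3). Each study uses a fresh sample.
lab_a = rng.normal(0.0, 1.0, 50_000)   # z-values from lab A
lab_b = rng.normal(3.0, 1.0, 50_000)   # z-values from lab B

# The z-values cluster by lab (heterogeneity), but the sampling errors,
# i.e., deviations from each lab's non-centrality, are independent.
err_a, err_b = lab_a - 0.0, lab_b - 3.0
print(np.corrcoef(err_a, err_b)[0, 1])  # close to 0
```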

12. The authors discuss many subjective choices or value judgments as if they were objective. An example that recurs throughout the manuscript is the discussion and use of alpha = 0.05 and power = 0.80. As is well known, any choice of alpha and power reflects a particular tradeoff between the relative costs of Type I errors versus Type II errors. Except in very narrow circumstances where these relative costs can be objectively quantified (e.g., industrial quality control), these relative costs reflect a particular subjective utility (or loss) function. This subjective function will in turn vary by context or even by different people working within the same context (Neyman, 1977). This is why some have made calls for researchers to “justify their alpha” and power in light of their subjective preferences and idiosyncratic research contexts (see, for example, Lakens et al., 2018). It would be helpful if the authors discussed a range of possible (alpha, power) pairs. Alternatively, if they believe (alpha = 0.05, power = 0.80) are objectively justified in their setting, please state that and argue in favor of it. This comment applies more broadly to other quantities that the authors tend to suggest are objective (e.g., the percentage of studies with “statistically significant” results, the replication rate, etc.): either recognize the subjectivity involved or justify the values of these quantities that you believe are objectively optimal.

Rebuttal

I do not see how this is related to z-curve. Blaming the authors of this article for the mindless use of alpha = .05, a convention in place since Fisher published his first book with tables that allowed researchers to claim significance at that level, is just another strange and unhinged comment by this Reviewer, who revealed nothing but willful ignorance, except for the comment about parameters.

The Schimmack-Pek Controversy

🔹 Core Issue

The controversy centers on whether it is legitimate to estimate the average statistical power of completed studies—that is, to use published test statistics to infer how often those studies would produce significant results if replicated.

  • Schimmack’s position: Average power can and should be estimated empirically from published results.
  • Pek’s position: Power is a hypothetical construct used for planning future studies, not something that can be meaningfully estimated post hoc.

🔹 Schimmack’s Position (Brunner & Schimmack, 2020; Schimmack, 2025)

  1. Two Concepts of Power
    • Hypothetical power: The probability of significance based on an assumed true effect size before data collection.
    • True power: The actual long-run probability that studies in a literature produce significant results, given their real (not assumed) effect sizes and sample sizes.
      Schimmack argues both concepts are legitimate—one for planning, one for evaluating.
  2. Empirical Estimation is Possible
    • Using methods like z-curve, one can reconstruct the distribution of significant test statistics (z-values, t-values) to estimate:
      • Expected Discovery Rate (EDR): expected proportion of significant results after accounting for selection bias.
      • Expected Replication Rate (ERR): probability that a significant finding would be significant again if replicated.
        These correspond to estimates of average true power.
  3. Purpose of Estimation
    • Estimating average power reveals the credibility of a research area: if the observed success rate (e.g., 90%) far exceeds the estimated true success rate (e.g., 30%), the field likely suffers from publication bias or p-hacking.
    • Average power is thus an index of evidential value and reproducibility, not a design tool.
  4. Rebuttal to Semantic Objection
    • Even if “power” was historically defined for hypothetical design contexts, that’s a semantic convention, not a logical limitation.
    • Redefining or replacing the term (e.g., “expected discovery rate”) does not change the underlying empirical reality that studies have a certain probability of success given their true effects.

🔹 Pek’s Position (Pek, Hoisington-Shaw, & Wegener, 2024, Psychological Methods)

  1. Power Is Hypothetical
    • By definition, power is the probability of rejecting H₀ given a true effect size and sample size in a planned design.
    • Once data are collected, the “true effect” is unknown, and the observed result no longer provides information about power.
  2. Post-hoc Power Is Misleading
    • Power computed from an observed effect size is mathematically redundant with the p-value.
    • Therefore, post-hoc power analysis adds no new information—it simply recasts the p-value in another form (the so-called “power-p equivalence” argument).
  3. Meta-Analytic Power Estimation Is Ontologically Flawed
    • Using the same term (“power”) to describe retrospective estimates confuses the conceptual role of power (a design tool) with empirical inference about data.
    • Pek argues that such redefinitions create an “ontological error”—blurring what power is (a pre-data probability) versus what z-curve estimates (a property of observed distributions).
  4. Proper Role of Power
    • Power analysis should be reserved for planning new studies to achieve a desired level of sensitivity, not for evaluating past research.
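
The "power-p equivalence" argument for a single study can be made concrete: for a two-sided z-test, post hoc power computed from the observed effect is a deterministic, monotone function of the p-value alone. A minimal sketch (the function and values below are illustrative, not taken from Pek et al.):

```python
from scipy.stats import norm

def posthoc_power(p_two_sided, alpha=0.05):
    """Post hoc power of a two-sided z-test when the observed effect is
    plugged in as the 'true' effect. The only input is the p-value, so
    post hoc power adds no information beyond p itself."""
    z_obs = norm.isf(p_two_sided / 2)   # observed |z| implied by the p-value
    z_crit = norm.isf(alpha / 2)        # two-sided critical value (~1.96)
    return norm.sf(z_crit - z_obs) + norm.cdf(-z_crit - z_obs)

# A result exactly at p = .05 maps to post hoc power of about .50, and
# smaller p-values map one-to-one to higher post hoc power.
for p in (0.05, 0.01, 0.001):
    print(p, round(posthoc_power(p), 3))
```

This equivalence is the basis of Pek's single-study objection; Schimmack's reply is that z-curve aggregates over many studies and corrects for selection, which a single-study transformation of p cannot do.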

🔹 Schimmack’s Counterarguments (2025 Blog & Responses)

  1. Misplaced Formalism
    • The “ontological” objection is purely semantic: words often have multiple legitimate meanings depending on context (e.g., “force” in physics vs. conversation).
    • Cohen himself used power in both planning and evaluative contexts (e.g., Cohen, 1962; Sedlmeier & Gigerenzer, 1989).
  2. Empirical Track Record
    • Dozens of meta-analyses since the 1960s have reported “average power” of published studies—this tradition predates Pek’s definitional restriction.
    • Methods like z-curve extend that logic by correcting for selection bias and estimating actual discovery probabilities.
  3. Conceptual Utility Over Semantics
    • Regardless of what it’s called, the estimated probability that a significant result would replicate is an empirically meaningful and policy-relevant measure.
    • The debate over the label “power” is a distraction from the substantive goal: improving credibility and reproducibility.
  4. Meta-Science vs. Design Science
    • Power as used in z-curve belongs to meta-science—the empirical study of how scientists actually behave—rather than the formal Neyman-Pearson design framework.
    • Rejecting post-hoc estimation because it violates a textbook definition misses the meta-scientific purpose entirely.

🔹 Broader Implications

  • Definition of power. Schimmack: can refer to the true long-run success probability of real studies. Pek: only a hypothetical design probability.
  • Use of observed data. Schimmack: valid and necessary for empirical evaluation. Pek: invalid; tautological with p-values.
  • Role of z-curve. Schimmack: a meta-scientific estimator of true discovery and replication rates. Pek: misuses the “power” concept.
  • Philosophy of science. Schimmack: empirical realism (definitions should follow observable reality). Pek: conceptual essentialism (definitions must follow formal theory).
  • Goal. Schimmack: diagnose publication bias and credibility. Pek: preserve the terminological purity of statistical theory.

🔹 Summary Statement

The Schimmack–Pek controversy is ultimately about the meaning and use of statistical power.

  • Pek argues that power belongs exclusively to the design phase and cannot describe completed studies.
  • Schimmack argues that psychology needs empirical tools to assess its actual performance and that average power—or equivalently, expected discovery/replication rates—provides exactly that.

In short:

Pek defends the textbook definition of power for a single planned study; Schimmack invented a method to estimate the average power of completed studies.


Psycho-Science: Unscientific Statisticians Enable Bad Research Practices


Psychologists want to be scientists so badly that they have started rebranding themselves as psychological scientists. We now have journals like Psychological Science and departments renamed from Psychology to Psychological and Brain Sciences. It is an odd development, considering that psychology already means the study of the mind and behavior (APA Dictionary of Psychology). For decades, psychologists were content to call themselves psychologists, just as biologists are content to be biologists. But somewhere along the way, some began to worry that “psychologist” sounded too much like “astrologist.” So they added “science” to their name, hoping that a new label might make it true.

Of course, calling yourself a scientist does not make you one—any more than drawing a salary from a university or holding a PhD does. To study something scientifically requires following the basic rules of science: form falsifiable theories, test them empirically, and revise or abandon them when the data show that you are wrong. Unfortunately, many psychologists have been trained to believe that they can be scientists without ever risking that outcome.

When Every Hypothesis Is True

Awareness that something is wrong with psychological research is not new. In 1959, Sterling discovered that more than 90 percent of published studies in psychology supported the authors’ hypotheses. He repeated the finding in 1995, and it was still true in the 2010s (Schimmack, 2020). Sterling et al. (1995) already suggested that this high success rate is too good to be true. Graduate students quickly learn that publishing depends on getting significant results, and everyone knows it. The studies that do not “work” simply disappear. This is publication bias, and it undermines the very foundation of science.

The replication crisis made the problem visible. In the 2010s, the Open Science Collaboration (2015) tried to replicate 100 published results. Only about 25 percent of the social-psychology findings and 50 percent of the cognitive-psychology findings held up. The most straightforward explanation is that the original studies had a low probability of producing a significant result, and only the lucky ones were published. Luck alone, however, cannot explain the remarkably high success rates; psychologists also used statistical tricks to inflate effect sizes to reach significance.
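
The selection mechanism is easy to simulate (the numbers below are illustrative assumptions, not estimates from the Open Science Collaboration data): when every study has 25 percent power but only significant results are published, the published literature shows a 100 percent success rate and inflated effect sizes.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Non-centrality chosen so each study has 25% power for a one-sided
# z > 1.96 criterion (an illustrative assumption).
ncp = 1.96 - norm.isf(0.25)            # about 1.29
z = rng.normal(ncp, 1.0, 100_000)      # test statistics of all attempts
published = z > 1.96                   # only "successes" get published

print("share of attempts that succeed:", round(published.mean(), 2))
print("true effect (z metric):       ", round(ncp, 2))
print("published effect (mean z):    ", round(z[published].mean(), 2))
```

Only a quarter of the attempts succeed, yet every published result is significant, and the average published z-value is roughly twice the true non-centrality, which is exactly the kind of inflation replication projects later expose.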

Scientific Doping

John, Loewenstein, and Prelec (2012) likened questionable research practices to doping in sports: running multiple analyses, stopping data collection once p < .05, and hiding null results all inflate apparent success rates. Z-curve is simply a doping test for science. The difference is that, unlike in sports, scientific doping is still legal. Nobody has lost their job for concealing null findings or for collecting data until the numbers “worked.” When I once compared a famous psychologist to Lance Armstrong, I was threatened with a lawsuit, and I had to clarify that there is a distinction between banned substances in sports and legal p-hacking in psychology. Whereas the public assumes that scientists follow a code of honesty, the insider secret is that honest reporting of results is career-ending because only successful studies are published. Every psychological researcher knows it, but most like to hide this from the general public and their undergraduate students.

A Doping Test for Science

Together with Jerry Brunner—a psychologist-turned-statistician who left the field when he realized it was not functioning as a science—I developed z-curve, a method that estimates the true success rate in original and replication studies based on the statistical evidence in published studies (e.g., t– or F-values) (Brunner & Schimmack, 2020). Later, Bartoš and Schimmack (2022) extended the method to quantify the amount of publication bias in psychology journals.

When I applied z-curve to 678 statistical results from social-psychology journals (Motyl et al., 2017), the findings were sobering. The published success rate was 90 percent, but the estimated true success rate was only 19 percent (95 % CI = 6–36 %). Even under the most generous assumptions, the gap between the observed success rate and the true success rate is enormous. That discrepancy is not opinion; it is meta-science—the empirical study of psychologists’ behavior in their laboratories revealed by their published results. I guess that makes me a meta-psychological scientist.

Why the Pushback?

It is easy to see why empirical psychologists dislike these results—nobody enjoys learning that an entire discipline has been built on shaky foundations. What is harder to understand is the resistance from statistical methodologists, whose careers do not depend on producing significant empirical results.

In particular, Pek, Hoisington-Shaw, and Wegener (2024) appear to have made it their mission to fight against the estimation of true success rates. Some of their arguments are pure semantics (Schimmack, 2025). First, Pek insists on the definition of statistical power as a purely hypothetical construct based on hypothetical population effect sizes. Then, she criticizes everyone from Cohen (1962) to our own work for using results from actual studies to estimate true power, because power is defined as hypothetical. This ignores 60 years of meta-analyses of the actual power of studies, but the psycho-statisticians who decide what psychologists get to read do not see the problem with this argument. If power is defined as a hypothetical construct, it cannot be estimated with actual results. True, but then we simply need another term for the quantity we estimate, which is what we did when we created z-curve 2.0. Z-curve does not estimate average power. It estimates expected discovery and expected replication rates. Unlike hypothetical power, these estimates are influenced by the true population effect sizes of studies, not by hypothetical values used in classic power analysis.

We do not need the term power to state that psychological journals have 90% observed success rates when the expected success rates that correct for publication bias are often below 50%. Defining power in a way that makes it impossible to apply it to actual studies does not address the empirical finding that success rates in psychology journals are inflated by publication bias. This bias undermines the credibility of claims that psychology is a science. Serious methodologists who want to improve psychology need to address the problem, not define it away with word games.

The Moral

For a long time, it was possible for psychologists to pretend that publication bias is not a problem and to ignore criticisms that success rates are incredibly high (Sterling, 1959). However, the replication crisis has shown that entire literatures can be made up from nothing. While actual replication studies are hard, z-curve makes it easy to show how implausible 90 % success rates really are. However, many psychologists do not want a doping test that holds them accountable and welcome criticism of doping tests, even if they rest on silly word games. Decades of criticism without reforms have shown that psychology is unable to fix itself—a hallmark of a real pseudoscience that uses statistical rituals to pretend to be scientific when it is not (Gigerenzer, 2004).

Psychology can rename itself as often as it likes—psychological science, psychological and brain sciences—but as long as it denies empirical evidence, protects illusions of success, and hides behind semantic arguments, it will remain what it has long been: a discipline that talks like a science but acts like a cult. Some progress has been made toward improving standards, but most of these improvements are voluntary and limited to some areas of psychology; many areas have not made even these small changes.

Psychology needs an intervention. Stakeholders like funding agencies and undergraduate students have to hold psychological researchers accountable and ensure that they act in accordance with the rules of science. This means publishing all results, even those that fail to confirm or that undermine a researcher’s theory. It also means that falsification of other researchers’ claims is desirable and should be encouraged. The 90% success rate has to come down for psychology to be taken seriously as a science. A scientific doping test is useful because it provides a clear goal for the outcome of the intervention. To get psychology clean, we have to show that it no longer uses scientific doping, and z-curve can track the progress toward that goal.


References

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication success and publication bias with a mixture model. Psychological Methods, 27(3), 433–449. https://doi.org/10.1037/met0000475

Brunner, J., & Schimmack, U. (2020). Estimating replicability with z-curve. PsyArXiv Preprint. https://doi.org/10.31234/osf.io/9rhyz

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Perspectives on Psychological Science, 12(4), 613–617. https://doi.org/10.1177/1745691617692103

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2024). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000577

Schimmack, U. (2020). The replication crisis: A z-curve analysis of social-psychology journals. Replication Index Blog. https://replicationindex.com/2020/01/04/replicability-crisis/

Schimmack, U. (2025). Reply to Pek, Hoisington-Shaw, and Wegener (2024): Defending the estimation of true success rates. Replication Index Blog. https://replicationindex.com

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34. https://doi.org/10.1080/01621459.1959.10501497

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. American Psychologist, 50(11), 1086–1089. https://doi.org/10.1037/0003-066X.50.11.1086


The Journal Psychological Methods Decides that Psychological Research Is Not Underpowered

For over sixty years, psychologists have been told that their studies are underpowered. Starting with Cohen (1962) and repeated by Sedlmeier and Gigerenzer (1989), Maxwell (2004), Button et al. (2013), and countless meta-analyses, the message has been consistent: our typical studies lack the statistical sensitivity to detect true effects.

This “power failure” has been a cornerstone of the open science movement and a standard explanation for the replication crisis.

But according to a recent Psychological Methods editorial decision, that entire literature rests on a fundamental ontological error.


Apparently, there is no power failure—because, by definition, completed studies do not have power.


The New Logic

In the journal’s interpretation (based on Pek, Hoisington-Shaw, & Wegener, 2024):

  • Statistical power is a property of a test, not of a study.
  • It is defined only before data are collected, for a hypothetical infinite series of replications with a fixed true effect size.
  • Once data exist, the test’s long-run property no longer applies; therefore, it makes no sense to speak of a study’s power—past or present.
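The bullet points above restate the Neyman–Pearson textbook definition. A minimal simulation makes the “long-run” part concrete: power is just the relative frequency of rejections across many hypothetical replications of the same design (the function name and the one-sample z-test setup are mine, chosen for simplicity):

```python
import math
import random

def long_run_power(delta, n, crit_z=1.96, reps=200_000, seed=1):
    """Power as a long-run frequency: the fraction of hypothetical
    replications of a one-sample z-test (sigma = 1, two-sided alpha = .05)
    that reject H0 when the true effect is delta."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        # sample mean of n draws from N(delta, 1)
        xbar = rng.gauss(delta, 1 / math.sqrt(n))
        if abs(xbar) * math.sqrt(n) > crit_z:
            rejections += 1
    return rejections / reps

# A design property, fixed before any data are collected:
# with delta = .5 and n = 25 the analytic power is about .705.
print(long_run_power(delta=0.5, n=25))
```

On this definition a completed study indeed “has no power,” but nothing prevents us from estimating the long-run rate that a design had, which is exactly what Cohen (1962) did.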

Hence, Cohen (1962) didn’t actually discover that psychology was underpowered.
He merely performed a few algebraic exercises about hypothetical tests.
The claim that real studies were underpowered is, under this logic, a category mistake.


The Consequences of Redefinition

If this view is taken seriously:

  • Cohen (1962, 1988), Sedlmeier & Gigerenzer (1989), Rossi (1990), Maxwell (2004), and Button et al. (2013) all committed the same “ontological error.”
  • There can be no such thing as observed or average power, since power is not a property of results.
  • And since no real study can ever be “underpowered,” there can be no “power failure” to explain replication failures.

Replication failures must therefore be due to something else—perhaps the planets, bad luck, or metaphysical indeterminacy.


The Irony

Of course, the same journal publishes meta-analyses that estimate “average power” and discuss its implications for replicability. It also accepts simulations in which average power determines expected replication rates. But when that same concept is applied to actual psychological research, it suddenly becomes “ontologically incoherent.” In other words, power matters—except when it matters.


The Bigger Picture

Sarcasm aside, this position would erase two of the few robust empirical generalizations about psychology: that typical studies have a low probability of detecting true effects, and that publication bias explains the 90% success rates in psychology journals (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017).

That claim—whether estimated via Cohen’s early surveys, Sedlmeier & Gigerenzer’s analyses, or modern bias-corrected methods like z-curve—has accurately predicted the replication crisis.

To declare the concept meaningless is not theoretical progress; it’s conceptual retreat.
It protects the purity of Neyman–Pearson logic at the cost of empirical relevance.
If taken literally, the new Psychological Methods stance means that “power” applies only to imaginary studies—and that real studies, by definition, can never fail.


Closing Line

So congratulations, psychology.
After six decades of self-criticism, we can finally declare victory:
our research is not underpowered—because power no longer exists.


Why Most Published Clinical Trials Are Not False (Ioannidis & Trikalinos, 2007)

This blog post was written in collaboration with ChatGPT5



John Ioannidis became world-famous for his 2005 essay, Why Most Published Research Findings Are False. That paper used a set of hypothetical assumptions—low power, low prior probabilities, and selective reporting—to argue that the majority of published results must be false positives. The title was rhetorically brilliant, but the argument was theoretical, not empirical.

Only two years later, Ioannidis co-authored a real data analysis that quietly contradicted his earlier claim.


1. From theory to data

In 2007, Ioannidis and Thomas Trikalinos published An Exploratory Test for an Excess of Significant Findings in Clinical Trials. They examined large meta-analyses of clinical trials, comparing the number of reported significant results to the number expected based on estimated statistical power. Their results revealed low power—around 30% on average—but not an excess of significant findings.
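The logic of the test is simple enough to sketch. If each study has an estimated power p, the expected number of significant results is the sum of those powers, and the observed count can be compared with that expectation. The normal approximation below is my simplification; the paper uses binomial and chi-square variants:

```python
import math

def excess_significance_test(powers, n_significant):
    """Sketch of an excess-significance comparison: observed significant
    results vs. the count expected from each study's estimated power,
    using a normal approximation to a sum of Bernoulli trials."""
    expected = sum(powers)
    variance = sum(p * (1 - p) for p in powers)
    z = (n_significant - expected) / math.sqrt(variance)
    return expected, z

# e.g., 10 trials with ~30% power each and 4 significant results
expected, z = excess_significance_test([0.3] * 10, 4)
print(round(expected, 2), round(z, 2))  # prints: 3.0 0.69 (no excess)
```

A large positive z would signal more significant results than the studies’ power can explain; here the observed count is close to expectation.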


2. Low power ≠ high false-positive risk

Low power increases sampling error within a single study, but it does not automatically mean that half the published results are false. As Soric (1989) showed, even with 30% power and α = .05, the maximum false discovery rate cannot exceed 13%, far lower than the rate Ioannidis claimed in his 2005 article.


3. Small publication bias

The Clinical Trials paper found that observed success rates were only slightly higher than expected power. That implies small publication bias and relatively little inflation of effect-size estimates. Unlike psychology or social science—where success rates approach 90 %—clinical trials appeared statistically honest.


4. Replicable evidence

Most of the meta-analyses Ioannidis & Trikalinos reviewed show clear, replicated effects that rule out the null hypothesis. When multiple independent low-power studies all point in the same direction, the probability that all are false positives becomes vanishingly small.


5. Later confirmation: Jager & Leek (2014)

Jager and Leek analyzed thousands of p-values from top medical journals and estimated a false-positive risk of about 14 % for individual clinical trials—remarkably consistent with the 2007 findings and with Soric’s theoretical upper bound. Schimmack & Bartos (2023) replicated this estimate using a bias-corrected z-curve approach.


6. Ioannidis’s response

Despite this convergence, Ioannidis rejected Jager & Leek’s conclusions in a 2014 Biostatistics commentary, arguing their model was “overly optimistic.”
He did not mention that his own 2007 results implied the same low false-positive risk.
Instead, he continued to promote the notion that more than half of published findings are false—an idea that captured headlines but not empirical reality.


7. The irony

Ioannidis became a global authority on “research unreliability” and a professor at Stanford largely because of a provocative title, not because of evidence that his 2005 hypothesis was true. Ironically, his own empirical work two years later provided the best evidence against his famous claim.


Take-Home Message

Don’t trust fame. Don’t trust titles. Trust facts.

Even celebrated professors at elite universities can be wrong — sometimes dramatically so.
In today’s capitalist science, researchers are often rewarded for selling results, not for verifying them. Their papers can function as marketing — even when they’re about “bias” and “meta-science.”

So don’t take grand claims at face value, whether they come from experimental psychologists or meta-scientists who claim to expose everyone else’s errors. Always fact-check — ideally with multiple sources, and yes, even with multiple AIs. If independent analyses converge, you can start to trust the pattern.

Trust is good; Fact-Checking is better.


Cleaning Up Psychological Science

One of the most robust and replicable findings in meta-psychology is that psychologists publish over 90% significant results with under 50% power (Cohen, 1962; Sterling, 1959). There is also clear evidence from replication studies that some results are highly replicable (Stroop effect), and others are false positives (many social priming effects, ego depletion, etc.).

This heterogeneity in credibility means that we do not know which results can be trusted and which ones should be removed from the scientific record. Cleaning up this mess is a daunting task that requires a Herculean effort. The question is who is going to clean up all this bullshit.

Hercules cleaning the Augean stables

I have developed the statistical tools that can do the job, but I cannot apply them to the thousands of articles that published millions of statistical results. Fortunately, the superhuman abilities of AI may provide a solution. AI can read and code articles in a fraction of the time a human coder needs. Training AI to do the job may make it possible to remove all the shit that has been published in psychology journals over the years.

The task is particularly easy for articles that report many studies and even more statistical results. These so-called multiple-study articles make it easy to detect p-hacking because it is increasingly unlikely to obtain significant results again and again without real effects. With 50% power, each study is a coin flip (heads = significant), so the probability of 10 out of 10 significant results is .5^10 ≈ .001.
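The arithmetic behind this multiple-study test is a plain binomial calculation (the helper function is mine, for illustration):

```python
from math import comb

def prob_at_least(k, n, power):
    """Binomial probability of at least k significant results in n
    independent studies that each have the given power."""
    return sum(comb(n, j) * power**j * (1 - power)**(n - j)
               for j in range(k, n + 1))

print(prob_at_least(10, 10, 0.5))  # .5**10 = 0.0009765625
```

The same function also covers the less extreme cases, e.g. the probability of 9 or more significant results out of 10 at 50% power.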

My lab is working on training AI to code articles. Running the bias tests on these data is a cakewalk for AI. To illustrate the capabilities of this approach, I report the results for an article I encountered while training the AI: a JPSP article with over 1,000 citations by authors who either did not care about credibility, were gullible enough to assume that p < .05 means H0 is false, or simply assumed that affiliation with prestigious universities is a valid cue for quality.

DOI: 10.1037/0022-3514.86.2.205

For this article, I instructed the AI that follow-up contrasts are more important than the interaction when the theory predicts a cross-over interaction with significant opposite effects. We still had some disagreement about the choice of DVs, but that would not really alter the results because they are all “just significant.”

Here is the automatically generated report on this article.

This means that there is massive publication bias, a high false positive risk, and inflated effect size estimates.

Conclusion: Bullshit that needs to be removed from the scientific record.