When Every Hypothesis Is True
Psychologists want to be scientists so badly that they have started rebranding themselves as psychological scientists. We now have journals like Psychological Science and departments renamed from Psychology to Psychological and Brain Sciences. It is an odd development, considering that psychology already means the study of the mind and behavior (APA Dictionary of Psychology). For decades, psychologists were content to call themselves psychologists, just as biologists are content to be biologists. But somewhere along the way, some began to worry that “psychologist” sounded too much like “astrologist.” So they added “science” to their name, hoping that a new label might make it true.
Of course, calling yourself a scientist does not make you one—any more than drawing a salary from a university or holding a PhD does. To study something scientifically requires following the basic rules of science: form falsifiable theories, test them empirically, and revise or abandon them when the data show that you are wrong. Unfortunately, many psychologists have been trained to believe that they can be scientists without ever risking that outcome.
When Every Hypothesis Is True
Awareness that something is wrong with psychological research is not new. In 1959, Sterling reported that more than 90 percent of published studies in psychology supported the authors’ hypotheses. The finding was replicated in 1995 (Sterling, Rosenbaum, & Weinkam, 1995) and was still true in the 2010s (Schimmack, 2020). Sterling et al. (1995) already suggested that this success rate is too good to be true. Graduate students quickly learn that publication depends on getting significant results, and everyone knows it: the studies that do not “work” simply disappear. This is publication bias, and it undermines the very foundation of science.
The replication crisis made the problem visible. In the 2010s, the Open Science Collaboration (2015) attempted to replicate 100 published results. Only about 25 percent of the social-psychology findings and 50 percent of the cognitive-psychology findings held up. The most straightforward explanation is that the original studies had a low probability of producing a significant result and only the lucky ones were published. Luck alone, however, cannot explain the remarkably high success rates: researchers also used statistical tricks that inflate observed effect sizes until results cross the significance threshold.
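A toy simulation makes the arithmetic concrete. This is not the z-curve method itself, and the effect size, sample size, and number of studies are invented for illustration; the point is only that selecting significant results produces a near-perfect published success rate from underpowered studies, while exact replications succeed at roughly the rate of the true power.

```python
# Toy simulation of publication bias: many underpowered two-group experiments
# are run, but only the significant ones are "published". The published record
# shows a 100% success rate by construction, while exact replications of the
# published findings succeed at roughly the true power of the original studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group, true_d = 10_000, 25, 0.35   # assumed values for illustration

def run_study():
    """Run one two-sample experiment and return the two-sided p-value."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    return stats.ttest_ind(treatment, control).pvalue

p_original = np.array([run_study() for _ in range(n_studies)])
published = p_original < .05                        # the file drawer swallows the rest

# One exact replication attempt for each published (significant) finding.
p_replication = np.array([run_study() for _ in range(published.sum())])

print(f"True discovery rate across all studies run: {published.mean():.0%}")
print("Success rate in the published record:       100% (by selection)")
print(f"Replication success rate:                   {(p_replication < .05).mean():.0%}")
```

In this homogeneous toy world the replication rate simply equals the true power; with real, heterogeneous studies the published record over-represents the better-powered ones, which is exactly the distinction z-curve exploits.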
Scientific Doping
John, Loewenstein, and Prelec (2012) likened questionable research practices to doping in sports: running multiple analyses, stopping data collection as soon as p drops below .05, and hiding null results all inflate apparent success rates. Z-curve is simply a doping test for science. The difference is that, unlike in sports, scientific doping is still legal. Nobody has lost their job for concealing null findings or for collecting data until the numbers “worked.” When I once compared a famous psychologist to Lance Armstrong, I was threatened with a lawsuit and had to clarify that there is a distinction between banned substances in sports and legal p-hacking in psychology. The public assumes that scientists follow a code of honesty; the insider secret is that honest reporting of results is career-ending because only successful studies are published. Every psychological researcher knows it, but most prefer to hide it from the general public and their undergraduate students.
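To show how one of these practices works, here is a minimal sketch of optional stopping under the null hypothesis. The batch size, maximum sample, and number of simulated experiments are arbitrary choices for illustration.

```python
# Optional stopping under the null: there is no true effect, yet peeking at the
# data after every batch of participants and stopping as soon as p < .05 pushes
# the false-positive rate well above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, batch, max_n = 5_000, 10, 100        # assumed values for illustration

false_positives = 0
for _ in range(n_experiments):
    a = rng.normal(size=max_n)                      # null is true: both groups identical
    b = rng.normal(size=max_n)
    for n in range(batch, max_n + 1, batch):        # peek after every 10 participants per group
        if stats.ttest_ind(a[:n], b[:n]).pvalue < .05:
            false_positives += 1                    # stop and declare a "discovery"
            break

print(f"Nominal alpha: 5%; actual false-positive rate: {false_positives / n_experiments:.0%}")
```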
A Doping Test for Science
Together with Jerry Brunner—a psychologist-turned-statistician who left the field when he realized it was not functioning as a science—I developed z-curve, a method that estimates the true success rate in original and replication studies based on the statistical evidence in published studies (e.g., t- or F-values) (Brunner & Schimmack, 2020). Later, Bartoš and Schimmack (2022) extended the method to quantify the amount of publication bias in psychology journals.
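To give a sense of what the method consumes, here is a sketch of the input step only. The test statistics and degrees of freedom are invented for illustration; the actual estimation then fits a mixture model to the selected z-scores, as described in Brunner and Schimmack (2020) and Bartoš and Schimmack (2022).

```python
# Input step of a z-curve style analysis: reported t and F(1, df) statistics are
# converted to two-sided p-values and then to absolute z-scores; only results
# that cleared the significance filter (|z| > 1.96) enter the estimation of the
# expected discovery and replication rates.
import numpy as np
from scipy import stats

# (statistic type, value, degrees of freedom) as they might be coded from articles
reported = [("t", 2.45, 38), ("F", 5.10, (1, 52)), ("t", 1.99, 120), ("F", 9.80, (1, 25))]

def to_z(kind, value, df):
    """Convert a reported t or F(1, df) statistic to |z| via its two-sided p-value."""
    if kind == "t":
        p = 2 * stats.t.sf(abs(value), df)
    else:                                   # an F with one numerator df is a squared t
        p = stats.f.sf(value, *df)
    return stats.norm.isf(p / 2)            # map the two-sided p-value back to |z|

z_values = np.array([to_z(*r) for r in reported])
significant = z_values[z_values > stats.norm.isf(0.025)]    # 1.96 cutoff
print(np.round(z_values, 2), f"-> {len(significant)} significant z-scores for estimation")
```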
When I applied z-curve to 678 statistical results from social-psychology journals (Motyl et al., 2017), the findings were sobering. The published success rate was 90 percent, but the estimated true success rate was only 19 percent (95% CI = 6–36%). Even under the most generous assumptions, the gap between the observed success rate and the true success rate is enormous. That discrepancy is not opinion; it is meta-science—the empirical study of psychologists’ behavior in their laboratories revealed by their published results. I guess that makes me a meta-psychological scientist.
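A back-of-the-envelope calculation, which is not part of the z-curve output and takes the 19 percent point estimate at face value while ignoring p-hacking, shows what this gap implies about the unseen studies:

$$\frac{0.90 \times 678}{0.19} \;\approx\; 3{,}200 \ \text{attempted tests}.$$

Publishing roughly 610 significant results at a true discovery rate of 19 percent implies on the order of 3,200 attempts, so roughly 2,600 non-significant results were either left in file drawers or turned into “successes” by questionable practices.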
Why the Pushback?
It is easy to see why empirical psychologists dislike these results—nobody enjoys learning that an entire discipline has been built on shaky foundations. What is harder to understand is the resistance from statistical methodologists, whose careers do not depend on producing significant empirical results.
In particular, Pek, Hoisington-Shaw, and Wegener (2024) appear to have made it their mission to fight against the estimation of true success rates. Some of their arguments are pure semantics (Schimmack, 2025). First, Pek insists that statistical power is a purely hypothetical construct, defined in terms of hypothetical population effect sizes. She then criticizes everyone from Cohen (1962) onward, including our own work, for using results from actual studies to estimate true power, because power is defined as hypothetical. This ignores 60 years of meta-analyses of the actual power of published studies, but the psycho-statisticians who decide what psychologists get to read do not see the problem with the argument. If power is defined as a hypothetical construct, it cannot be estimated from actual results. True, but then the simple solution is to use another term, which is exactly what we did when we created z-curve 2.0. Z-curve does not estimate average power; it estimates expected discovery and expected replication rates. Unlike hypothetical power, these estimates depend on the true population effect sizes of the studies that were actually run, not on the hypothetical effect sizes used in classic power analysis.
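In simplified form, a sketch following Bartoš and Schimmack (2022): for K studies that were actually conducted, with unknown true power values $\pi_1, \dots, \pi_K$ against $\alpha = .05$,

$$\mathrm{EDR} = \frac{1}{K}\sum_{k=1}^{K}\pi_k, \qquad \mathrm{ERR} = \frac{\sum_{k=1}^{K}\pi_k^{2}}{\sum_{k=1}^{K}\pi_k}.$$

The expected discovery rate is the average power of all studies that were run, significant or not; the expected replication rate weights each study by its chance of producing the significant result that gets published, which is why it is typically higher than the EDR. Both quantities are properties of the studies researchers actually ran, not of hypothetical design scenarios.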
We do not need the term power to state that psychology journals report 90% observed success rates while the expected success rates, corrected for publication bias, are often below 50%. Defining power in a way that makes it impossible to apply to actual studies does not address the empirical finding that success rates in psychology journals are inflated by publication bias. This bias undermines the credibility of claims that psychology is a science. Serious methodologists who want to improve psychology need to address the problem, not define it away with word games.
The Moral
For a long time, psychologists could pretend that publication bias is not a problem and ignore criticism that success rates are incredibly high (Sterling, 1959). The replication crisis, however, has shown that entire literatures can be made up out of nothing. Actual replication studies are hard, but z-curve makes it easy to show how implausible 90% success rates really are. Yet many psychologists do not want a doping test that holds them accountable, and they welcome criticism of such tests even when that criticism rests on silly word games. Decades of criticism without reform have shown that psychology is unable to fix itself, a hallmark of a pseudoscience that uses statistical rituals to pretend to be scientific when it is not (Gigerenzer, 2004).
Psychology can rename itself as often as it likes (psychological science, psychological and brain sciences), but as long as it denies empirical evidence, protects illusions of success, and hides behind semantic arguments, it will remain what it has long been: a discipline that talks like a science but acts like a cult. Some progress has been made toward better standards, but most reforms are voluntary and confined to a few areas of psychology; many areas have not even made these small changes.
Psychology needs an intervention. Stakeholders such as funding agencies and undergraduate students have to hold psychological researchers accountable and ensure that they act in accordance with the rules of science. This means publishing results even when they fail to confirm, or outright contradict, a researcher’s theory. It also means that falsification of other researchers’ claims is desirable and should be encouraged. The 90% success rate has to come down before psychology can be taken seriously as a science. A scientific doping test is useful because it provides a clear goal for the outcome of the intervention: to get psychology clean, we have to show that it no longer uses scientific doping, and z-curve can track progress toward that goal.
References
Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication success and publication bias with a mixture model. Psychological Methods, 27(3), 433–449. https://doi.org/10.1037/met0000475
Brunner, J., & Schimmack, U. (2020). Estimating replicability with z-curve. PsyArXiv Preprint. https://doi.org/10.31234/osf.io/9rhyz
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Perspectives on Psychological Science, 12(4), 613–617. https://doi.org/10.1177/1745691617692103
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2024). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000577
Schimmack, U. (2020). The replication crisis: A z-curve analysis of social-psychology journals. Replication Index Blog. https://replicationindex.com/2020/01/04/replicability-crisis/
Schimmack, U. (2025). Reply to Pek, Hoisington-Shaw, and Wegener (2024): Defending the estimation of true success rates. Replication Index Blog. https://replicationindex.com
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34. https://doi.org/10.1080/01621459.1959.10501497
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. American Psychologist, 50(11), 1086–1089. https://doi.org/10.1037/0003-066X.50.11.1086