All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible f to distinguish studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results. View all posts by Ulrich Schimmack →

Making Mountains Out of Molehills: Illusory “Huge” Sex Differences

November 7, 2025UncategorizedUlrich Schimmack

When you hear claims that “men and women differ a lot”, see effect-size numbers such as D > 2, and variables that correctly distinguish men and women with over 90% accuracy — it is worth asking what is being measured and how.

In many cases, researchers have taken many small sex differences (e.g., on spatial ability, aggression, interests) and combined them into a single composite that distinguishes men vs. women. That composite may show a large mean difference — but this does not mean there is a single giant biological cause that explains all of it.

Why this matters:

Each individual trait difference may have a modest effect size (say d ~ 0.2-0.4).
Aggregating correlated traits boosts the composite’s reliability and amplifies the mean difference.
A large composite difference is useful for classification (distinguishing male vs. female) but does not support the claim of a unified biological process underlying all those traits.
Because biological sex can already be determined with perfect accuracy, building a model to predict sex from traits is largely redundant from a causal perspective. What matters is how much sex explains trait variation, not how well traits predict sex.
When students or media see a big number, they often infer a big innate difference; but this reverses the direction of causality. We are using differences in masturbation and use of pornography to predict whether somebody is a man or a woman, rather than examining how much sex differences cause variation in behaviors.

Critics have long argued that psychological sex/gender differences are, on average, small and that many claims of large or unified differences collapse under closer scrutiny (Fine, 2005; Hyde, 2014). Meta-research shows that for most psychological and cognitive domains the differences are small (Zell et al., 2023; Szymanski & Henning, 2022). The largest and most consistent sex differences are observed for height (d ≈ 1.5), pornography use (d ≈ 1.5), and sex drive (d ≈ 1.0).

Reversing the direction of analysis creates another misunderstanding. When many dimensional traits are used to predict whether someone has XX or XY chromosomes, there is no remaining variability once we correctly classify more than 90 percent of people. But in the opposite direction—even d = 1.5 implies considerable unexplained variation within each group. Some men rarely watch pornography and some women do; some women have higher sex drives than many men. This within-group variation is psychologically meaningful but ignored by analyses that treat variability in predictors as error variance when predicting a dichotomous outcome (male = XY / penis; female = XX / vagina).

Conclusion

Combining many small effects into one large number is not the same as discovering a deep, singular sex-difference mechanism. There is no scientific purpose in creating a statistical predictor of sex when sex is directly observable. The only reason to compute such values is rhetorical; to make biological effects on variation in personality and other traits appear larger and more coherent than they really are.

References

Archer, J. (2019). The reality and evolutionary significance of human psychological sex differences. Biological Reviews, 94(4), 1381–1415. https://doi.org/10.1111/brv.12507

Eliot, L., Ahmed, A., Khan, H., & Patel, J. (2021). Dump the “dimorphism”: Comprehensive synthesis of human brain studies reveals few male–female differences beyond size. Neuroscience & Biobehavioral Reviews, 125, 667–697. https://doi.org/10.1016/j.neubiorev.2021.03.013

Fine, C. (2005). The gender similarities hypothesis. American Psychologist, 60(6), 581–592. https://doi.org/10.1037/0003-066X.60.6.581

Hyde, J. S. (2014). Gender similarities and differences. Annual Review of Psychology, 65, 373–398. https://doi.org/10.1146/annurev-psych-010213-115057

Joel, D., & Fausto-Sterling, A. (2016). Beyond sex differences: New approaches for thinking about variation in human behavior. Philosophical Transactions of the Royal Society B: Biological Sciences, 371(1688), 20150451. https://doi.org/10.1098/rstb.2015.0451

Lippa, R. A. (2010). Gender differences in personality and interests: When, where, and why? Social and Personality Psychology Compass, 4(11), 1098–1110. https://doi.org/10.1111/j.1751-9004.2010.00320.x

Su, R., Rounds, J., & Armstrong, P. I. (2009). Men and things, women and people: A meta-analysis of sex differences in interests. Psychological Bulletin, 135(6), 859–884. https://doi.org/10.1037/a0017364

Szymanski, D. M., & Henning, S. L. (2022). Are many sex/gender differences really power differences? PNAS Nexus, 3(2), pgae025. https://doi.org/10.1093/pnasnexus/pgae025

Zell, E., Strickhouser, J. E., Sedikides, C., & Alicke, M. D. (2023). The gender similarities hypothesis 2.0: Meta-analytic synthesis of psychological gender differences across the life span. Psychological Bulletin, 149(2), 109–137. https://doi.org/10.1037/bul0000380

The Ideology versus the Science of Evolved Sex Differences

November 6, 2025UncategorizedUlrich Schimmack

1. Introduction: Competing Stories About Gender

Debates about sex differences often swing between extremes. One narrative, familiar from strands of radical feminism, portrays masculinity as dangerous—a legacy of male violence and domination. The opposite story, popularized by Roy F. Baumeister’s Is There Anything Good About Men? (2010), recasts men as civilization’s heroic builders, unfairly maligned by modern culture. Both stories appeal to emotion and morality more than data.

This essay contrasts Baumeister’s narrative with the actual empirical evidence about evolution and sex (Evolution and Sex Differences in 2025). Unlike dramatic claims that men and women are fundamentally different (“Women are from Venus, Men Are from Mars”), scientific evidence shows that men and women evolved together with shared goals to maximize adaptive fitness. There are likely biological differences related to genetic variation in the sex-chromosomes (XX vs. XY), but even for traits that are strongly influenced by these genes, men and women are not fundamentally different.

2. Baumeister’s Core Thesis

Baumeister’s book advances a provocative claim: cultures “flourish by exploiting men.”
He argues that throughout history men have been socially conditioned—and biologically predisposed—to take greater risks, work harder, and sacrifice themselves for collective benefit.

In his telling, male dominance in politics, science, and business reflects expendability and service, not privilege.

He describes men as driven by status and competition, while women, protected and valued for reproduction, focus on relationships and security.

The argument is moral as much as evolutionary. Baumeister insists he speaks “as a scientist,” yet the book only mentions data that support his ideology. The story drives the data, not the data shape the theory. Data are only used when they verify a claim, never to falsify one—a hallmark of pseudoscience, as Karl Popper argued that genuine science advances by subjecting its theories to potential falsification.

He rarely quantifies differences or cites effect sizes, and he dismisses feminism and patriarchy as conspiracy theories. Instead, he offers anecdotes about male teachers, childbirth, and marital infidelity as evidence of “how the world works.”

3. What Empirical Science Shows

The cumulative evidence from behavioral genetics, developmental endocrinology, and cross-cultural psychology paints a more complex picture (Schimmack, 2025).

1. Magnitude of differences. An undisputed evolved sex difference is the height difference between men and women. The standardized effect size is about 1.5 standard deviations. While this number is abstract, it can serve as a benchmark for potentially evolved sex differences. Most psychological sex differences are small to moderate in size (average d ≈ 0.3–0.5). Distributions overlap substantially—typically more than 70%.

2. Outdated evolutionary theories also ignore that most traits are influenced by genes on the 22 pairs of autosomes that are mixed during reproduction and do not allow for biological sex differences. Any biological differences like those in height are rooted in the fact that men have a Y-chromosome and only one X-chromosome. For example, red-green color blindness is recessive on the X-chromosome and more common in men because the expression of this gene is more likely if only one X-chromosome is present.

3. Claims about achievement are especially fragile. First, sex differences in achievement related traits (Conscientiousness) are very small and tend to favor women, and once women are given a chance to compete they are doing as well as men. Baumeister, in psychology, should know that because the sex-ratio in psychology departments has shifted dramatically since the 1950s when gender-biases made it difficult for women in academia.

In short, scientific evidence shows that men and women as probabilistically different yet fundamentally similar; two overlapping variations of one cooperative species. Baumeister may not realize this because we all suffer from consensus bias; that is, we overestimate how many people are like us: Baumeister may overestimate how many men are like him.

4. Ideological Versus Scientific Reasoning

Baumeister’s reasoning resembles moral storytelling: good men, misunderstood by society, suffer for others. Science, by contrast, treats sex differences as empirical questions about magnitude, mechanism, and context. Men are not good or bad, but evolutionary theory explains why men are more likely to be bad people: rapists, murderers than women. This is one of the strongest sex differences that have been scientifically documented (Archer, 2019). They exist because small differences in mean levels of aggression and selfishness can produce large differences in the extremes of a trait. Toxic masculinity is real, but it is limited to a small number of toxic males.

5. Scientifically False Claims

The book makes many scientifically false claims that are ideologically motivated and risk normalizing or excusing abusive behavior.

1. “Research has suggested that most women have said ‘no’ when they meant ‘yes’ at least occasionally, which introduces a further element of confusion to even the most well-intentioned young man.”

Truth: Baumeister misrepresents the original study (Muehlenhard & Hollabaugh, 1988), which found that 39 percent of college women reported ever engaging in token resistance—not “most.” Later research shows this behavior is rare, context-dependent, and declining with improved sexual-education and consent norms (Humphreys, 2004). In contrast, sexual aggression is one of the largest documented sex differences: men are far more likely to be offenders and women to be victims (Archer, 2019). Baumeister’s framing inverts this reality.

2. Baumeister: “women are plenty aggressive—if anything, more violent than men.”

Truth: A meta-analysis of heterosexual partner aggression finds d ≈ –0.05 for act frequency, meaning women report slightly more minor acts—but men cause far more serious injuries (Archer, 2000). Across all forms of violence, the difference reverses dramatically: men commit the vast majority of homicides and serious assaults worldwide (Archer, 2019). Baumeister’s claim ignores the scale and severity of male violence and misrepresents the empirical record.

3. Baumeister “From the unfeeling perspective of the system, it could be worth it to restrict female access to education.” (p. 209)

Truth: Every cross-national dataset shows the opposite: female education increases social stability, child survival, and economic growth (UNESCO, 2019; World Bank, 2020). There is no conceivable “systemic advantage” to restricting women’s education—historically or evolutionarily. This statement is not only unsupported but directly contradicted by global evidence.

4. Baumeister: “After witnessing childbirth, many men find their wives sexually disgusting and thus cheat.” (pp. 246–247)

Truth: No scientific data link childbirth observation to marital infidelity. Longitudinal studies show that relationship satisfaction and communication, not childbirth disgust, predict sexual desire and fidelity (Lawson & Mullett, 2018). Baumeister’s anecdote pathologizes normal experiences of fatherhood without evidence.

5. Baumeister: There was and is no oppression of women; patriarchy is a conspiracy theory.

Truth: “Patriarchy” in social science refers to structural male advantage, not a secret male conspiracy. Historical and economic research documents centuries of legal, educational, and occupational exclusion of women (Goldin, 1990; England, 2010). Dismissing these constraints as myth denies overwhelming empirical documentation.

6. Baumeister: “Men are exploited by society; progress depends on male expendability.”

Truth: Men historically faced higher mortality in war and dangerous work, but these risks were tightly linked to male political and economic power. Men had the benefit of minimal investment in their reproductive success, while leaving women with the risk and costs of childbirth and child rearing. Baumeister’s framing ignores male exploitation by males, not women.

6. Ideological Consequences

Research confirms that exposure to Baumeister’s own Sexual Economics Theory—which portrays sex as a female resource traded for male investment—can shape social attitudes.
Fetterolf & Rudman (2016) found that participants who viewed a video based on this theory endorsed more adversarial beliefs about heterosexual relationships, even after reading feminist rebuttals. This shows that ideas presented as neutral “science” can increase cynicism and hostility between the sexes.

Moreover, the book’s framing has been widely circulated in manosphere communities and cited on forums linked to misogynistic radicalization. In these contexts, Baumeister’s evolutionary language becomes moral ammunition, used to rationalize resentment toward women.
Such diffusion illustrates how ideological narratives dressed as science can travel far beyond academia.

7. Why Scientific Caution Matters

Scientific reasoning differs from ideological rhetoric in three ways:

Falsifiability. Claims must be open to disconfirmation; Baumeister’s narrative is not.
Updating. Science revises itself when evidence changes; ideology repeats itself even when data contradict it.
Value neutrality. Science describes what is, not what ought to be. Moralizing about gender—positive or negative—distorts understanding.

In modern personality and evolutionary psychology, the consensus is clear:
Men and women evolved under shared pressures for cooperation, mutual dependency, and parental investment, not perpetual conflict or one-sided exploitation.

8. Conclusion I: Men and Women Evolved on Earth

Baumeister’s Is There Anything Good About Men? invites sympathy for men but mistakes ideological comfort for scientific truth. By glorifying masculine extremes and dismissing opposing evidence, it replaces inquiry with mythmaking.

The scientific picture that emerges from decades of research is subtler and more interesting.
Sex differences are real yet modest, biologically rooted yet culturally flexible. Both sexes show extraordinary variability, and both contributed to the survival of our species. Men and women did not evolve on separate planets; they evolved together, on Earth, as cooperative partners in a shared evolutionary story.

9. Conclusion II: Baumeister Lacks Scientific Credibility

Baumeister’s research record reveals a consistent pattern of selective evidence use—choosing studies that support his claims while ignoring or concealing results that do not.
His once-famous ego-depletion hypothesis—the idea that self-control operates like a limited resource—was based on publication-biased evidence.

Re-analyses of his own data show that the average effect size is close to zero once unpublished or failed studies are included (Schimmack, 2014, 2016, 2018, 2019, 2025). Meta-scientific investigations further document that his lab withheld null results, giving a misleading impression of robust support.

Baumeister himself admitted this practice in a personal email communication quoted by Schimmack:

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.”

This admission confirms that his work exemplified the publication-bias culture that triggered psychology’s credibility crisis. Rather than using data to test hypotheses, Baumeister routinely used them to confirm preconceived beliefs—the same confirmatory pattern visible in Is There Anything Good About Men?

Scientific integrity requires falsifiability, transparency, and full reporting.
When these norms are ignored, claims cease to be scientific, even if they borrow the language of science. Authors who present untested opinions as empirical conclusions engage in narrative persuasion rather than data-driven inquiry—a form of writing closer to literature than to science.

Freedom of speech entitles Baumeister to publish ideological opinions, even offensive ones.
But academic freedom is different: it protects the search for truth through open, verifiable evidence. Baumeister’s gender arguments, like his ego-depletion studies, fail that test.
They are expressions of belief, not findings of science. The actual evidence shows not only that men and women are far more similar than his book suggests, but also that Baumeister’s own practices demonstrate a departure from scientific standards.

Key References

Archer, J. (2000). Sex differences in aggression between heterosexual partners: A meta-analytic review. Psychological Bulletin, 126(5), 651–680.
Archer, J. (2019). The reality and evolutionary significance of human psychological sex differences. Biological Reviews, 94(4), 1381–1415. https://doi.org/10.1111/brv.12507
Baumeister, R. F. (2010). Is There Anything Good About Men? Oxford University Press.
Popper, K. R. (1959). The logic of scientific discovery. London: Hutchinson. (Original work published 1934)
Schimmack, U. (2014). Roy Baumeister’s R-Index – Replicability-Index

Why Uri Simonsohn is a Jerk

November 5, 2025UncategorizedUlrich Schimmack

Science is like an iceberg. The published record is only a fraction of the things that university -paid academics do. Some time ago, Brian Nosek dreamed about a scientific utopia of open science that would make the workings of academia more transparent, but all we got was preprints and some badges – that are apparently rolled back. He never was interested in an open discussion about the IAT or the ethics of Project Implicit.

Other famous open science academics are also less open than you may think. Datacolada benefited from open data to find evidence of fraud. As there is no law to share data, the fraudsters must kick themselves for being so foolish to share them and lose millions and reputation.

Uri Simonsohn, of Datacolada with a few fraud-scalps on his belts, is also less not eager to embrace all aspects of open science. The Datacolada blog doesn’t even have a comment section that allows sharing of alternative viewpoints or – oh my god – criticism of the work. Worse than an old-school journal with peer-review that would even allow some critical comments to be published to maintain the image of being science. DataColada: Not Open – Not Science.

I have fully embraced open science. My blog has comment sections and I have even revised errors that people have pointed out in my blog posts. Living in utopia, I have also shared emails that the authors wanted to hide like my exchange with Uri about the poor performance of p-curve when data are heterogeneous) or anonymous peer-reviews that show how bias rather than scientific criteria dictate what gets shared in journals.

This post was motivated by a search in my email inbox for a link to a book I had ordered.

Instead, I found an email exchange from 2015 between Greg Francis and Uri Simonsohn about a DataColada post about the Test of Excessive Significance, used by Greg Francis to reveal publication bias and similar to the incredibility index (Schimmack, 2012).

[24] P-curve vs. Excessive Significance Test – Data Colada

Uri writes with typical intellectual humility “In this post I use data from the Many-Labs replication project to contrast the (pointless) inferences one arrives at using the Excessive Significant Test, with the (critically important) inferences one arrives at with p-curve.”
Translation: I am great, others do dumb shit.

What is the problem with testing for publication bias? Well, we know that the null-hypothesis of no bias is false. So, we do not need a test that there is no bias.

What makes p-curve great. It tests the null-hypothesis that all results are false positives and in 2014 Uri probably believed that this is common and we need p-curve to reveal those data. However, 10 years later p-curve has mostly rejected the silly and uninformative null-hypothesis that all results are false positives, p < .05.

Ten years later, nobody should care about TEST or p-curve results that test extreme and unlikely hypotheses. Instead, we should quantify how much publication bias there is and how much evidence against the null-hypothesis data contain. Neither TES nor p-curve do this well, and we developed such a method (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). So, I am great and Uri sucks, but you don’t have to take my word for it. He never bothered to engage in constructive criticism or try to show that Jerry Brunner’s critique of p-curve is false, although the comment section allows for it. He is not interested in open science because he cannot win a scientific argument. Hence, no comments are allowed when he defends p-curve with silly simulation results.

Anyhow, for the sake of open science, here is the email exchange from 2014 that I found.

—————————————————————————-

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talkeda bout ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy, but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES. I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct). The details are

Pooled d: 0.598
Observed number of significant studies: 31 Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context. The “about” section of Data Colada states::

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.

Greg Francis

Professor of Psychological Sciences
Purdue University

What a jerk!

Greg

On 27 Jun 2014, at 04:59 pm, Simonsohn, Uri <uws@wharton.upenn.edu> wrote:

Hi Greg,

Thanks for your email.
The policy is to contact authors whose research we discuss. I do not discuss research you conducted, so I did not contact you.

One could extend the policy to contacting everybody whose work is related to the post, but that would be impractical, I would have needed to contact Kahneman, Klein et at, Ioannidins & Trikalinos, Uli and you, and presumably the people whose work you have analyzed via EST and perhaps even the OSF people. Or perhaps extend the policy to contacting anybody who is likely to disagree with the post. Similarly impractical.

Looking at the comments you sent via email, note how you don’t need to refer to any paper you have written to make your arguments, they are based exclusively on new analyses I run on data you had never analyzed before. That indicates to me the post is separate from your past work.

When I wrote a post about Bayesian analysis (http://datacolada.org/2014/01/13/13-posterior-hacking/) , I did not contact Bayesian statisticians like Kruschke or EJ. As in this case, I was talking about statistical tools they use, but not about analyses they have run, so our policy did not require me to contact them either. When we have written about replications we have not contacted Nosek.
When I wrote about ceiling effects in one replication paper, I did not contact authors of other papers that may also have a ceiling effect, or other people who have talkeda bout ceiling effects in that paper, I only contacted the authors whose work I was directly discussing.

Now, if I write a post about analyses EJ runs, or a replication that Nosek does, then of course we will contact them.
If I write a post about your use of the EST in this or that paper, then of course I will contact you.

You may disagree with the policy, but I thought it would be fair to share the rationale with you.

Thanks again,

Uri

—–Original Message—–
From: Gregory Francis [mailto:gfrancis@purdue.edu]
Sent: Friday, June 27, 2014 8:51 AM
To: Simonsohn, Uri
Cc: <uli.schimmack@utoronto.ca> Schimmack; Leif Nelson; Simmons, Joseph
Subject: Data Colada

Hi Uri,

I saw your Data Colada posting on the P-curve vs. the excessive significance test (http://datacolada.org/2014/06/27/24-p-curve-vs-excessive-significance-test/ ). I really don’t understand the motivation for this posting, and I think you misrepresented the TES (Test for Excess Significance- Ioannidis’ term).

In particular, you conclude that the inference from the TES is pointless because we know there are 5 studies not reported. Indeed, if you know some relevant studies were not reported (since you removed them!) then you are correct that there is no reason to run the TES. I would suggest that the more interesting test for this set of data would be to include the 5 non-significant studies (since they were actually published). Running the TES then gives 0.9699841 (I quickly modified your code to include all published studies; I am pretty sure this correct). The details are

Pooled d: 0.598
Observed number of significant studies: 31 Expected number of significant studies: 31.08
Chi-square: 0.0014159
p: 0.96998

So, the TES would not claim that there is anything amiss with the full set of 36 reported studies.

I also object to your argument that nobody publishes “all” findings. Taken broadly enough, the statement is true, but somewhat silly and naive. What the TES considers is whether the stated theoretical claims are consistent with the reported findings. For example, in the TES analysis of all 36 studies, the theoretical claims (a fixed effect size of d=0.598) is consistent with the reported frequency of rejecting the null. On the other hand, if we take just the 31 significant experiments, then the theoretical claim (a fixed effect size of d=0.629) is not consistent with the reported frequency of rejecting the null. One need not report all studies for consistency to hold, and if there are valid methodological reasons to not publish some studies then they should not be published. I have explained this to you many times, so I get the feeling you are being deliberately obtuse on this issue, which is a shame because you are confusing people and, in the long-run, undermining your own credibility.

I also think your post is misleading in a broader context. The “about” section of Data Colada states::

“When discussing research by other authors we contact them before posting; we ask for suggestions to improve the post, and invite them to comment within the original blog post.”

Readers of your blog who believe you take the policy seriously should infer that Uli and I were shown a draft, asked for feedback, and given an opportunity to comment, which is not true. It is too late for you to follow parts one and two of your policy, but you can fix the third: allow Uli (if he wishes) and me to write a follow-up post on Data Colada that explains our views of the TES and p-curve analyses.

Greg Francis

Professor of Psychological Sciences
Purdue University

How Bad is P-Curve Really and Why Should We Care?

October 28, 2025UncategorizedUlrich Schimmack

P-curve was introduced a little over a decade ago by Uri Simonsohn, Leif D. Nelson, and Joseph P. Simmons (2014); the same team that later launched the DataColada blog. It is a selection-model approach designed specifically for examining the evidential value of published findings when non-significant results are missing and publication bias inflates estimates of power that ignore selection bias.

The method’s goal and its historical context

Its statistical goal is instead to test the null hypothesis that all significant results are false positives. While methodologists warned about this possibility (Rosenthal, 1979), it was considered unlikely that large sets of studies could be published without real effects. However, the DataColada team showed that it can be relatively easy to produce significant results without real effects when the data are p-hacked (Simmons, Nelson, & Simonsohn (2011, Psychological Science, “False-Positive Psychology”). Awareness of inflated type-I error rates and replication failures raised concerns that most results might be false positives (Ioannidis, 2005).

Applications and Limitations

Over the past decade, p-curve has been applied in numerous meta-analyses, and the typical conclusion is that the analyzed literature shows evidential value. However, this conclusion has a critical limitation: rejecting the null hypothesis that all results are false positives does not reveal how many results are false positives, how large the true effects are, or how much reported effect sizes are inflated by publication bias. The latest version of p-curve provides an estimate of “power” to provide quantitative information about the amount of evidential value in a set of studies. This blog post examined the controversy surrounding this parameter of the p-curve model.

Scope of the Discussion

To be clear, the developers also introduced a version of p-curve for effect-size estimation, but this procedure has been used rarely and performs worse than alternative bias-correcting methods when credible nonsignificant evidence is available (see Carter et al., 2019). Consequently, the present discussion focuses on p-curve as a test of evidential value, as implemented in the public p-curve app, rather than as an estimator of effect magnitude.

The Current Debate

Morey and Davis-Stober (2025) published a formal critique in the Journal of the American Statistical Association (JASA) (see my earlier post, Rindex.08.08.25). Uri Simonsohn (2025) responded in a post on the DataColada blog (#129).

The key issue is how p-curve performs when the power of studies varies across studies (i.e., heterogeneity in power). Morey and Davis-Stober present a simulation with true mean power of 66 percent, yet p-curve returns an estimate of 87%, a 21-percentage point difference. Simonsohn shows simulations where bias is never larger than 5%.

Simulation Hacking

The controversy illustrates a broader methodological issue that might be called simulation hacking. Just as empirical researchers can obtain desired results through selective analyses (p-hacking), methodologists can shape conclusions by emphasizing simulation conditions where a method performs particularly well or poorly. This does not mean that the chosen scenarios are unrealistic; rather, it highlights that statistical procedures often perform differently across contexts. A method may be robust and informative for some purposes yet unreliable for others, depending on which assumptions the simulations accentuate.

Simulating Field Wide Heterogeneity

Figure 1: Distribution of Effect Sizes in Morey and Davis-Stober’s Simulation

Morey and Davis-Stober (2025) simulated a distribution of true effect sizes that is shown in their Figure 1. This distribution is broadly consistent with the average effect sizes reported in psychology meta-analyses (Richard et al., 2003). Such a distribution can be used to simulate p-values from studies testing a wide variety of hypotheses and research designs that aim to estimate the typical power of studies in psychology (e.g., Cohen, 1962; Schimmack, 2020; Soto & Schimmack, 2024). These conditions generate extreme heterogeneity in statistical power across studies. Morey and Davis-Stober’s analysis suggests that under such heterogeneity, p-curve will produce inflated estimates of average power.

A concrete example is provided by the Reproducibility Project (Open Science Collaboration, 2015). These data are especially informative because the outcomes of the replication studies offer an independent benchmark of the original studies’ power to produce significant results without selection bias. The observed replication rate implies an average true power of less than 40%. Schimmack (2025) analyzed the p-values of the original studies and obtained a p-curve estimate of power of 91%, 95% CI = 86% to 94% (Schimmack, 2025).

If the replication outcomes were unknown, this p-curve result would incorrectly suggest that the high proportion of significant findings in psychology journals (Sterling et al., 1995) reflects genuinely high study power rather than publication bias or p-hacking. In conclusion, a tool that was developed in response to the replication crisis to reveal p-hacking would falsely suggest that power is high and p-hacking is rare.

Simulating Meta-Analyss of P-Hacked Literatures

Simonsohn (2025) simulated studies with low power that never exceeds 80%. Examples like this can be found in meta-analysis of p-hacked studies. For example, a recent p-curve analysis of 825 terror-management studies yielded a power estimate of only 25%, 95%CI = 21% to 29%. This finding implies that exact replications of these studies would produce at most 30% significant results; a rate that is similar to the success rate in actual replication studies (Open Science Collaboration, 2015). An anecdote tells about a social psychologist who prided himself on a success rate of 1 out of 3 studies and compared it baseball, where a 30% batting average is excellent.

The problem here is not that p-curve estimates are biased. Rather, the problem is that they can be easily misinterpreted, if heterogeneity in power is ignored. After all, p-curve does reject the null-hypothesis that all studies are false positives. Assuming that all studies have the same power also implies that there are no false positive results; contrary to Simmons et al.’s (2011) suspicion that false positives are common. P-curve simply does not provide information about false positives unless all significant results are false. The power estimate could be an average of false positives and true positives with high power.

Stay Calm: Use Z-Curve

There is no need to fight over p-curve because we have a better method that works with and without heterogeneity called z-curve (Bartos & Schimmack, 2022; Brunner & Schimmack, 2020). When we developed z-curve, we compared it against alternative models. We presented all simulations, even those where p-curve performed a bit better with homogenous data. The simulation showed that both methods have only a small bias when heterogeneity is small, but p-curve has a large bias when heterogeneity is large. So, we can simply use z-curve for all data.

Here is a simple example that shows how z-curve is superior to p-curve, even if p-curve estimates are only slightly biased. The simulation uses 50% false positives, and 50% true positives with 80% power. It is easy to see that we would expect .50 * .05 + .50 * .80 = .025 + .40 = 42.5% significant results. This is the expected replication rate if the studies were replicated exactly without selection bias. It is called power in p-curve, but that term ignores that real data may contain false positives.

Figure 2: p-curve plot with power estimate

Consistent with Simonsohn’s claims, the bias in the p-curve estimate is small (p-curve estimate: 44% vs. true parameter: 42.5%), but p-curve does not tell us whether all studies have about 40% power or whether this is an average of studies that vary in power or even include false positive results.

Z-curve’s estimate of the expected replication rate (ERR) is accurate (42%). More important, it also recognizes that the data are heterogeneous. A simple way to see this is that it estimates a lower discovery rate for all studies, including non-significant results that are not reported. A discrepancy between EDR and ERR indicates heterogeneity because studies with higher power have a higher chance of being in the set of significant results.

Z-curve also estimates the expected discovery rate for the full range of z-values, including non-significant results that are not reported (see the red dotted line). The EDR of 11% is incompatible with the observed discovery rate of 100% (only significant results are published). Even the upper limit of the CI is only 18% (about 5 studies for each significant result). The p-curve power estimate cannot be used to evaluate publication bias, although p-curve is often falsely used as a test of publication bias.

Finally, the EDR can be used to estimate the false positive risk with a formula by Soric (1989). We know the true percentage is 50%. The z-curve estimate is only 45%, but the 95%CI around this estimate is wide. Most troubling, the 1,000 studies do not rule out the possibility that all studies are false positives (the 95%CI includes 100%). This is very different from the inference we may draw from the p-curve estimate of 42% power that does not suggest a high rate of false positive results.

Z-curve also provides additional information about the expected discovery rate (EDR) for different ranges of observed z-values (see percentages below the x-axis of the z-curve plot). Results that are just significant (e.g., z = 2 to 2.5) are likely to include many false positives; in this range and the expected discovery rate is only about 27%.

By contrast, studies with larger z-values (e.g., z > 4) are almost certainly based on true effects and have an expected replication probability of around 80%. Z-curve slightly overestimate replicability for these high z-values, but the main point is that discovery rates are expected to change dramatically due to heterogeneity in the probability to obtain significant results.

Conclusion

This blog post showed how silly it is to fight over p-curve with carefully selected simulation scenarios. P-curve makes the unrealistic assumption that studies are homogenous. Z-curve avoids this assumption, models heterogeneity, and provides more information about the data than p-curve can. So, researchers can just use z-curve, and the performance of p-curve is no longer relevant. It is a bit like testing assumptions about equal variances in t-tests. We can just use a t-test that avoids this assumption.

It is clear why Simonsohn does not mention a method that replaced p-curve several years ago on the Datacolada blogpost or allows comments that would alert readers to alternative methods. It is not clear why Morey and Davis-Stober criticize a method that is obsolete and do not mention that their criticisms have been addressed by a better method. But then, who understands the childish games of academics that produce publications, but not knowledge.

Unlike Datacolada my blog allows for comments and I welcome comments by Datacolada, Morey, Davis-Stober, or anybody else.

The Mythical Marriage of Fisher and Neyman-Pearson

October 26, 2025UncategorizedUlrich Schimmack

Preface

This post grew out of a long discussion with ChatGPT about Gerd Gigerenzer’s treatment of the history of statistics and its influence on psychology in his book The Empire of Chance (1989).

I actually found this book by chance, because ChatGPT recommended it during a literature search. Psychology now has an overwhelmingly journal-based culture, where articles appear online as PDFs and are rarely accompanied by physical books. I am old enough to remember browsing the shelves of real libraries—especially the magnificent stacks at the University of Illinois and the Roberts Library in Toronto—but I stopped doing so about fifteen years ago. Younger colleagues may never know that quiet pleasure.

So, it is not surprising that few psychologists have actually read The Empire of Chance. Fortunately, I was able to access it through my University of Toronto credentials. For most readers, however, it remains locked behind a paywall.

To explore Gigerenzer’s arguments more closely, I uploaded the relevant chapters to ChatGPT (since they are not freely available) and discussed the content in light of my broader research on the history of power, significance testing, and replicability.

This post summarizes our shared understanding of how statistical thinking entered psychology, and why we concluded that Gigerenzer’s famous claim that null-hypothesis significance testing (NHST) is a hybrid of Fisher and Neyman-Pearson is inaccurate. It isn’t a hybrid at all. It’s pure Fisher.

Neyman and Pearson’s framework never gained traction. Today Neyman’s invention of confidence intervals dominates sound statistical inferences because they avoid the problems of Fisher’s significance testing without the difficulties of implementing Neyman-Pearson’s approach. So, we moved from Fischer to Neyman and forgot and Neyman-Pearson were never relevant in the use of statistics by psychologists.

Introduction

For decades psychologists have been told that the way they analyze data—null-hypothesis significance testing—is a hybrid of two rival statistical philosophies: Fisher’s significance test and the Neyman-Pearson decision framework.

Gigerenzer popularized this story in The Empire of Chance (1989), arguing that textbooks merged the two systems and gave the illusion of harmony. It’s a neat narrative—but it doesn’t survive close inspection.

1 · Fisher’s significance test

1️⃣ Make a prediction or explore whether two variables are related.
2️⃣ Collect data and compute a p-value assuming no relation (H₀).
3️⃣ If p is small enough, reject H₀ and claim support for the expected directional effect.
4️⃣ As Fisher wrote in 1935, “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (The Design of Experiments, p. 16).

This deceptively simple procedure made inference a one-sided game: we seek “disproof” of H₀, not testing of a specific H₁.
In practice, rejecting H₀ is treated as confirming our theory—verification dressed up as falsification.

2 · The Neyman-Pearson alternative

Neyman and Pearson proposed a symmetric system of two hypotheses, H₀ and H₁, each with defined long-run error rates.

H₀ can be rejected, but H₁ can also be rejected.
To do so we must specify a concrete alternative, e.g., d = 0.5, and design the study with known α and β.
A result can therefore falsify a risky prediction (reject d = .8 means the effect is smaller than “large”).
If both survive, we test again.

In this framework, power and Type II error are not afterthoughts—they’re the price of claiming evidence.

3 · Why it never took root in psychology

Psychology kept Fisher’s asymmetry. Researchers learned to celebrate significant results and ignore non-significant ones. Gigerenzer claimed textbooks resolved the dispute by fusing both schools into a “hybrid model.” But the evidence tells a different story.

4 · Why the “hybrid” is a myth

1 · Fixed thresholds were Fisherian conveniences.
Before computers, tables listed critical values for .05, .01, and .001. Using them was a practical shortcut, not an adoption of Neyman-Pearson error control.
Reporting “p < .05” or adding ** for p < .01 continued Fisher’s graded-evidence tradition.

2 · Type II errors were rhetorical, not operational.
Textbooks mentioned them vaguely—“the probability of an error if H₀ is false”—but never linked them to a specific H₁ such as d = .5. β was seldom calculated or used.

3 · Power was rarely used for design or inference.
Even after Cohen (1962) called for power analysis, psychologists mostly ignored power or treated it only as planning advice for achieving significance, not to quantify type-II errors in inferences that rejected a specific H₁

4 · In practice, nothing changed.
Studies were published when p < .05 and forgotten when p > .05. Journal success rates were over 90%, reflect a one-sided testing culture, not a balanced decision framework.

5 · The broader context

Other social sciences followed different paths. Economists and sociologists, working with large samples and directly measurable variables, emphasized estimation and precision—effect sizes, standard errors, and confidence intervals. They had little interest in either Fisher’s or Neyman-Pearson’s philosophies, although interpretation of results was also influenced by significance thresholds.

Ironically, Neyman’s own (1937) invention of the confidence interval would have solved psychology’s dilemma: a CI simultaneously rejects extreme H₀ and H₁ values without pre-specifying them. Gigerenzer does not mention the modern hybrid of significance testing that uses values of 0 inside or outside the confidence interval to replace Fisher’s significance test.

6 · Conclusion

The so-called hybrid of Fisher and Neyman-Pearson is a myth.

Psychology adopted Fisher’s one-sided test with a conventional publishing threshold of p < .05 and never implemented the symmetrical logic of Neyman-Pearson decisions.

Even Cohen’s power analysis was absorbed into the same framework—another tool for ensuring significance, not for falsifying theoretical claims.

What Gigerenzer described as a marriage was never consummated.

Psychology has lived for nearly a century with Fisher alone, and is now replacing it with Neyman’s confidence intervals.

Neyman-Pearson’s marriage never produced any children.

References

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.

Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krickeberg, K. (1989). The empire of chance: How probability changed science and everyday life. Cambridge University Press.

Gigerenzer, G. (1993). The superego, the ego, and the id of statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, 231, 289–337.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.

Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105–110. https://doi.org/10.1037/h0031322

A Better P-Curve: The Aggregated Cauchy Association Test with PP-Values

October 22, 2025UncategorizedUlrich Schimmack

Main Points

Purpose of p-curve.
The p-curve method was introduced as a test of the null hypothesis that all significant results are false positives (that is, that (H_0) is true for all tests).
Use in psychology.
P-curve became popular in psychology as a way to demonstrate that a body of research has evidential value—in other words, that not all significant results are false positives.
Criticisms.
The method has been criticized on several grounds, including unsound statistical assumptions and inadmissible decision rules (Morey & Davis-Stober, 2025).
Earlier related work.
It has gone largely unrecognized that similar approaches for combining truncated p-values were developed much earlier in genomics (e.g., the Truncated Product Method, TPM; Zaykin et al., 2002).
A modern alternative.
A newer and more widely used approach is the Aggregated Cauchy Association Test (ACAT). Although ACAT does not assume truncation, truncated p-values can be analyzed by selecting p < α and dividing them by α to obtain pp-values that follow a uniform(0, 1) null distribution.
Advantages over p-curve.
This pp + ACAT approach addresses many of the statistical problems identified by Morey and Davis-Stober (2025), including inadmissibility, discontinuity, and sensitivity to values near α, while retaining the same logical test of the global null.
Remaining limitations.
Like p-curve, ACAT tests the hypothesis that all significant results are false positives, but it does not quantify the strength of evidence (e.g., average power) or capture heterogeneity among studies. For this reason, z-curve remains the preferred method for evaluating evidential value in a set of significant results.

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, Article MP.2018.874. https://doi.org/10.15626/MP.2018.874 open.lnu.se+2CRAN+2

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication rates and discovery rates. Meta-Psychology, 6, Article MP.2021.2720. https://doi.org/10.15626/MP.2021.2720

Morey, R. D., & Davis-Stober, C. P. (2025). On the poor statistical properties of the p-curve. American Statistician, in press. (see also Title: Review of “On the Poor Statistical Properties of the P-Curve Meta-Analytic Procedure” Published: August 8, 2025 by Ulrich Schimmack on the Replication Index blog. replicationindex.com
URL: https://replicationindex.com/2025/08/08/review-of-on-the-poor-statistical-properties-of-the-p-curve-meta-analytic-procedure

Zaykin, D. V., Zhivotovsky, L. A., Westfall, P. H., & Weir, B. S. (2002). Truncated product method for combining p-values. Genetic Epidemiology, 22(2), 170–185. https://doi.org/10.1002/gepi.0042

Liu, Y., Chen, S., Li, Z., Morrison, A. C., Boerwinkle, E., & Lin, X. (2019). ACAT: A fast and powerful p-value combination method for rare-variant analysis in sequencing studies. American Journal of Human Genetics, 104(3), 410–421. https://doi.org/10.1016/j.ajhg.2019.01.002

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534–547. https://doi.org/10.1037/a0033242

The Power of Post-Hoc Power: The Glucose Theory of Willpower is Dead

October 22, 2025UncategorizedUlrich Schimmack

It has been claimed that the psychological literature is filled with zombie theories—walking dead that are not supported by evidence, no longer believed by insiders or even by the original proponents, but that live on in mindless citations and textbooks forever. Sometimes, though, they do die in silence, without a funeral or obituary. One example of a dead theory in psychology is the glucose theory of willpower.

While the inventor of this theory (and the underlying phenomenon), Roy F. Baumeister, still clings to his broader theory of willpower despite two large replication failures, even he has walked away from the idea that willpower depends on blood-glucose levels.

1. Gailliot’s Original Glucose Studies (2007–2009)

Baumeister and Gailliot’s studies (e.g., Gailliot et al., 2007, Journal of Personality and Social Psychology; Gailliot & Baumeister, 2007, Psychological Science) claimed that exerting self-control reduces blood-glucose levels and that restoring glucose—even by drinking a sugary beverage—replenishes willpower and improves performance on subsequent self-control tasks.

The reported effects were dramatic. Across a series of small-sample experiments, the authors found large, consistently significant results, suggesting a robust physiological mechanism underlying self-control. The simplicity of the claim—that willpower literally runs on sugar—made the theory intuitively appealing and easy to test. These findings immediately inspired a wave of replications and extensions, many of which were conceptual replications using similar small-sample, between-subjects designs.

2. Schimmack’s 2012 “Incredibility Index” Analysis

In 2012, Ulrich Schimmack applied his Incredibility Index (IC-index) to Gailliot’s published results—a meta-analytic tool that tests whether the reported proportion of significant results is credible given the observed power of the individual studies.

The results were striking. The distribution of p-values in Gailliot’s papers was too good to be true. The success rate (about 100%) was incompatible with the small sample sizes, and even with the inflated effect-size estimates, the estimated average power of these studies was far too low to justify only significant results.

During the review process, Roy F. Baumeister openly admitted that the published studies were selected from a larger set that included many null results. He justified this practice by claiming that this was simply what everyone did—an argument that, while true at the time, highlighted how pervasive questionable research practices had become in psychology. At that point, it was still unclear whether these practices merely inflated real effects or had created an entirely spurious phenomenon.

3. The Fallout: A Futile Wave of Follow-Up Research

Yet the glucose theory continued to attract attention—precisely because those early, inflated findings appeared compelling. For several years, researchers treated it as a promising physiological explanation for self-control. However, the replication crisis made it possible to publish replication failures, and several articles reported that they could not find effects of glucose on willpower.

From 2012 to 2018, dozens of researchers tried to replicate the glucose–willpower effect and found mostly null or inconsistent results. Eventually, large-scale meta-analyses—Vadillo, Gold, and Osman (2016) and Lange and Eggert (2014)—confirmed the suspicion: the literature was biased, the true effect size was near zero, and the original findings likely reflected p-hacking and selective publication.

4. The Current Status (Post-2020): The Glucose Model Is Dead

Today’s consensus—including Baumeister’s own 2024 review—effectively concedes that the glucose theory of willpower is dead, despite the initially strong-looking evidence for it. That evidence only appeared strong but was in fact illusory, based on unscientific practices that inflated effect sizes and suppressed null results.

Even the broader theory of ego-depletion has come under heavy criticism, as it was developed and supported using the same questionable practices. Two large replication studies failed to reproduce the basic ego-depletion effect and produced effect-size estimates close to zero. Baumeister continues to defend the broader theory, so it cannot yet be declared dead, but the glucose model has vanished.

5. Why This Matters for Meta-Science

The glucose theory of willpower is more than just a failed idea—it is a vivid example of how questionable research practices can create the illusion of discovery and lead to years of wasted effort. Schimmack’s 2012 incredibility analysis exposed the statistical impossibility of Gailliot’s findings long before the replication failures confirmed it.

6. Postscript

The future will show how many other walking-dead theories and ghostly phenomena, created in laboratories through dark research practices, will eventually die and join the glucose theory in the graveyard of psychological ideas that lack empirical support. To make this happen, psychology needs more courageous scientists—modern-day Dr. Van Helsings—who are willing to put a wooden stake through the heart of theories that deserve to die.

Happy Halloween

Shining a Light on Dark Practices in Peer-Review

October 21, 2025UncategorizedUlrich Schimmack

Pre-publication peer review is the biggest scam in the history of science. Researchers have to pass the evaluation by “peers,” before an editor (often a mindless vote-counter of opinions shared in reviews) decides whether the article deserves to be published. Arbitrary limits on the number of publications, give reviewers power to censor by writing extremely negative reviews and recommending rejection. This faulty system of quality control is maintained by publishers to resell publicly funded research to publicly funded universities with fat profit margins. In short, we would be better off to publish everything and let open post-publication reviews decide what is good or not. Until then, researchers have to battle with anonymous reviewers in a Kafkaesque closed and hidden process at the mercy of dark forces that decide their fate.

I have encountered this many times myself, but right now I am writing to help researchers who encounter Reviewer 2, when they use z-curve to examine the credibility of research in their field. Interestingly, I am never asked to review z-curve articles, although I am the expert on this method, but Reviewer 2 is asked to comment on the method and is allowed to make outrageously false claims. If this happens to you, please feel free to contact me to write a rebuttal or just point the editor to this blog post.

You can also ask AI to evaluate the quality of Reviewer 2’s arguments and my rebuttals. I did so myself and ChatGPT and Perplexity concluded that my counterarguments are scientifically accurate and valid, but don’t trust me. Trust is good, but fact checking is better.

Reviewer 2

1. The authors primarily conceptualize replicability in terms of retrospective (or post-hoc) average power (also known as the “expected discovery rate” / EDR).

Rebuttal:

This statement is false and shows limited knowledge by this reviewer about the method they criticize. Z-curve estimates two percentages: The percentage of significant results that is expected if the studies that were used in the meta-analysis were reproduced exactly and analyzed exactly the same way using the same sample size. This is the Expected REPLICATION rate (ERR). It also estimates the percentage of significant results that are expected in this new replication set of studies. This estimate includes all of the non-significant results that will also be produced, but may or may not have been reported. This is called the expected discovery rate (discovery = p < .05). The reviewer confuses ERR and EDR.0

2. Average power is a meta-analytic analogue of single study post hoc power. Single study post hoc power has been greatly lampooned for many decades now (Hoenig & Heisey, 2001; Yuan & Maxwell, 2005). For example, Greenland (2012) writes that post hoc power computed from completed studies is: “Irrelevan[t]: Power refers only to future studies done on populations that look exactly like our sample with respect to the estimates from the sample used in the power calculation; for a study as completed (observed), it is analogous to giving odds on a horse race after seeing the outcome.” In addition, average power is not relevant to the replicability of actual prospective replication studies. As McShane, Bockenholt, and Hansen (2020) write: “Average power is relevant to replicability if and only if replication is defined in terms of statistical significance within the classical frequentist repeated sampling framework. As this framework is both purely hypothetical and ontologically impossible, average power is not relevant to the replicability of actual prospective replication studies.”

Rebuttal:

All of these comments are irrelevant and rest on confusion about the term power. The classic definition of power defines power as a probability of obtaining a significant result given a hypothetical alternative hypothesis. This definition of power is irrelevant in studies that estimate the ERR and EDR that are influenced by the true population effect sizes of studies (and sampling error), not some hypothetical values that are no longer relevant when actual data are available.

The criticism of post-hoc power is also relevant because it is about the interpretation of results in a single study, not about meta-analysis of many studies.

Finally, McShane et al.’s article makes two mistakes. It uses the term power for empirical estimates, when power is defined in terms of hypothetical values. Second, the article relied on sets of 30 studies to claim that estimates are imprecise, but precision increases with the set of studies. This article had over 100 studies and the precision of the estimates is clearly specified with 95% confidence interval. Thus, the uncertainty of the results can and should be evaluated with the actual results and not based on an article that did not examine z-curve estimates.

3. Pek et al (2022) also note ontological concerns with average power. Pek et al (2024) further note that (as per the present authors’ approach) “using power for evaluating completed studies can be counterproductive.”

Rebuttal

Pek et al.’s criticism is about studies that compute post-hoc power based on the definition of power as a hypothetical construct. This criticism does not apply to z-curve estimates that estimate expected values based on true population effect sizes and not statistical power as defined by Pek et al. Pek et al. also did not discuss z-curve as a method to estimate expected discovery rates or expected replication rates. So, the article is irrelevant to the evaluation of z-curve.

4. While I have thus far focused on the primary manner in which the authors conceptualize replicability (i.e., average power / EDR), exactly the same concerns apply to the secondary manner (i.e., the “expected replication rate” / ERR).

Rebuttal

The same rebuttal holds for the ERR. It is not an estimate of average power as defined by Pek. Because it estimates the true probability of significant results in exact replication studies, whereas Pek et al. define power as a hypothetical construct. Estimating the ERR is not wrong, calling it power is. The terms EDR and ERR therefore make it clear that these estimates are not estimates of average power, in the classic sense of statistical power. So, this criticism does not address z-curve estimates and their validity.

5. Rosenthal was a pioneer studying replication in psychology. Drawing on his work dating from the 1960s, Rosenthal (1990) dismissed evaluations of replicability that are dichotomous and based on significance testing as “the traditional, not very useful view of replication” and advocated evaluations of replicability that are continuous and based on effect sizes as “the newer, more useful view of replication. The authors’ approach in this paper is dichotomous and based on significance testing and thus falls squarely in what Rosenthal thirty-five years ago today already termed “the traditional, not very useful view of replication.”

Rebuttal

Rosenthal made contributions to effect size meta-analysis. They are useful and important when researchers want to combine results from several close or direct replications to estimate the population effect size. The main in this article is different. Science-wide estimates of EDR and ERR can provide useful information for the interpretation of individual studies that lack multiple replications and can be meta-analyzed. Moreover, it can provide information about the typical amount of publication bias in a literature and provide information for the planning of future studies. In short, effect-size meta-analysis is important. So, is knowing the amount of publication bias, replicability, and the false positive risk in a field of studies. Effect size meta-analyses do not provide this information.

Rosenthal also was responsible for a faulty way to assess publication bias in meta-analysis (fail-safe N) that suggested publication bias is not a big problem in meta-analysis. Z-curve, however, can estimate the actual amount of publication bias in a literature and has shown massive publication bias and a high false positive risk in literatures with hundreds of studies. For example, z-curve showed that Noble Laureate had picked priming studies for his bestseller “Thinking: Fast and Slow” that had a false positive risk of 100%. He openly distanced himself from the researchers who had published results and were unwilling to back up their claims with actual replication studies. In this example, the average effect size of these different studies was not important. What mattered was that the studies failed to provide credible evidence that social priming works.

6. “It is therefore not surprising that a common finding among replication projects is that unbiased replication studies with larger sample sizes produce much smaller effect sizes. For instance, the ### replication project found that 88% of the replication effect sizes were severely inflated in comparison to the original effect sizes, with a median percentage decrease of 75%. As can be seen, the ### replication project takes a continuous quantitative view based on effect sizes, reporting that the median decrease in the effect size estimates was 75% and going on to characterize the full distribution of effect size differentials in Figures 1 and 2 of that paper. I do not find the present authors’ retrospective and dichotomous approach based on significance testing to be an advance over the ### replication project’s prospective and continuous approach based on effect sizes. Indeed, I view it as retrograde.

Rebuttal

Reviewer 2 does not stop for one second to explain why effect sizes estimates shrank by about 80%. The z-curve analysis shows why; the orginal studies reported inflated effect sizes estimates because studies with large sampling error require large estimates to get significant results. The actual replication results cannot show that this is the reason, but the z-curve analysis of original studies can because it estimates how replicable these studies are in the hypothetical scenario that they are exactly replicated with a new sample. The argument also ignores that effect size estimates are rarely used to interpret results. Most of the time, the key claim is a rejection of the null-hypothesis in a specific direction. This conclusion is not altered by lower effect sizes, but it is altered when the result is no longer significant. The original conclusion no longer holds.

7. Even for those who prefer a dichotomous approach based on significance testing, when such is applied to the sports science replication project, we get a result similar to the present authors’ result (see middle of page 12 of their manuscript). Therefore, in a very important sense, the present authors’ result is already known (or at least cannot be said to be novel).

This comment by the reviewer shows once more their lack of understanding of science and a lack of awareness of the methodological discussion about the importance of replication studies that by definition lack novelty. Ironically, they applaud a replication project and then criticize a replication study for being unoriginal. The proper comparison of the actual replications and the z-curve analysis is this: Both projects used different methods on different sets of studies and produced consistent results. The novel finding is that in this literature both estimates converge on the same conclusion. When two different methods with different data show consistent results, it provides evidence that the results are not driven by sampling error (e.g., actual replication studies picked studies with easy and cheap designs) or methodological biases (e.g., replication studies produce weaker effects because the replication researchers are not experts in that field). In short, consistent results provide valuable information. Novelty is important for original studies, not for meta-analyses that assess how many of the novel original findings are actually findings and how many may be false positive ones.

8. The authors’ use the forensic Z-curve meta-analytic procedure of Brunner & Schimmack (2020) and Bartos & Schimmack (2022). On page 3 of their manuscript, they note that they could use the forensic P-curve meta-analytic procedure of Simonsohn, Nelson, and Simmons instead. In a forthcoming Journal of the American Statistical Association paper, Morey and Davis-Stober provide a formal analysis that proves that the P-curve has poor statistical properties. For example, they prove that the P-curve produces inconsistent estimates of average power / EDR. One might question the relevance of this to the Z-curve and thus the present manuscript. I quote the final paragraph of Morey and Davis-Stober:

“As a final point, we suggest that meta-scientists be more skeptical of procedures like the P-curve in the meta-scientific literature. Papers introducing them are often light on statistical exposition, using metaphors [and] a few simulations to make sweeping arguments. Simulation is a powerful tool and can help build intuition, but it is not a substitute for formal analysis. Simulation may provide hints of problems with a procedure, but only if the simulator’s formal knowledge helps guide the choice of simulations. A simulator might quit after running a few simulations that tell them what they think is true while problems remain uncovered. Given the implications of poor forensic procedures for science, all such procedures demand deeper formal scrutiny.”

This forthcoming paper is extremely relevant to the present manuscript because the very paragraph above could be written about the Z-curve.

Rebuttal

In a legal trial, this witness would be held in contempt. They are simply lying. Brunner and Schimmack (2020) directly compared p-curve and z-curve and showed that p-curve fails when data are heterogeneous as they typically are and as they are in this article (heterogeneity: ERR > EDR, homogeneity: ERR = EDR). Schimmack and Brunner have also written several subsequent criticisms of p-curve. Morey and Davis-Stober’s article adds to this criticism and the p-curve authors are not defending their method against these criticisms. So, yes, p-curve was an attempt to estimate the true power of a set of studies, but it failed.

It is ridiculous to imply that we can just use any criticism of p-curve and apply it to a fundamentally different method. P-cure was not evaluated by simulation studies. Z-curve has been evaluated with hundreds of simulation studies and performs well with typical data sets, including data like the one in this article. The convergence between results from the actual replication project and z-curve predictions that the Reviewer used to claim “lack of novelty” is also relevant here. If z-curve were flawed, why does it produce estimates that are validated with actual replication outcomes?

9. Turning back to this manuscript and its use of the Z-curve, in short, we at present know next to nothing about the statistical properties of the Z-curve (just as we knew next to nothing about the statistical properties of the P-curve until Morey and Davis-Stober came along). The statistical properties of the Z-curve may be as poor or worse than those of the P-curve. Or they may be solid. We simply cannot say. Morey and Davis-Stober write: “Given the stated purpose of the P-curve—evaluating the trustworthiness of scientific literatures—the stakes are too high to use tests with such poor, or poorly-understood, properties.” The same applies to the Z-curve which has the same stated purpose. As a consequence, I remain very skeptical of any use of the Z-curve until its properties have been investigated formally and shown not to be wanting—especially given the very high stakes involved.

Rebuttal

I had an email discussion with Davis-Stober and he was not aware of z-curve and does not know anything about z-curve. He simply does not think it is useful to estimate publication bias, but that is his personal opinion, and not a criticism of a method that estimates it.

10. You refer to these four quantities as “parameters” but they are not parameters. The word parameter has a formal definition within the context of a statistical model and these do not qualify. These are outputs or estimands but not parameters.

No Rebuttal

That is correct. EDR and ERR are estimates of population parameters not parameters themselves. Estimands is a new fancy word that few psychologists use. The word estimates is good enough. ODR, EDR ERR and FDR are estimates of population parameters. Correcting this mistake does not change anything substantial about the results.

11 You assert (arguably rather blithely) that the Z-curve’s independence assumption is met in your analysis because only one p-value per study is included in the analysis. This is of course not necessarily true. If, for example, the 269 studies share authors or sets of authors, that could induce dependence. There are of course many additional sources of possible dependence. One simply cannot say.

Rebuttal

This is simply false. The independence assumption is about the sampling error of studies, and each new sample has a new sampling error. If all studies used z-tests and had the same effect size and sample size, we expect an average sampling error of 1. When studies are heterogeneous, there is additional variation due to real differences in the non-centrality parameters (the location of the normal distribution on the x-axis of z-values) that describes the sampling distribution, but that is irrelevant for z-curve because it makes no assumptions about that distribution. Some studies from one author may be close to z = 0 and those of other authors may be close to 3. That is heterogeneity, not dependence in sampling errors. Dependence of sampling errors only occurs for some analysis based on the same dataset (e.g., correlated dependent variables).

12. The authors discuss many subjective choices or value judgments as if they were objective. An example that recurs throughout the manuscript is the discussion and use of alpha = 0.05 and power = 0.80. As is well known, any choice of alpha and power reflects a particular tradeoff between the relative costs of Type I errors versus Type II errors. Except in very narrow circumstance where these relative costs can be objectively quantified (e.g. industrial quality control), these relative costs reflect a particular subjective utility (or loss) function. This subjective function will in turn vary by context or even by different people working within the same context (Neyman, 1977). This is why some have made calls for researchers to “justify their alpha” and power in light of their subjective preferences and idiosyncratic research contexts (see, for example, Lakens et al, 2018). It would be helpful if the authors discussed a range of possible (alpha, power) pairs. Alternatively, if they believe (alpha = 0.05, power = 0.80) are objectively justified in their setting, please state that and argue in favor of it. This comment applies more broadly to other quantities that the authors tend to suggest are objective (e.g., the percentage of studies with “statistically significant” results, the replication rate, etc.): either recognize the subjectivity involved or justify the values of these quantities that you believe are objectively optimal.

Rebuttal

I do not see how this is related to z-curve, but to blame the authors of this article for the mindless use of alpha = .05 that has been in place since Fisher published his first book with tables that allowed researchers to claim significance at that level is just another strange and unhinged comment by this Reviewer that revealed nothing but willful ignorance; except for the comment about parameters.

The Schimmack-Pek Controversy

October 21, 2025UncategorizedUlrich Schimmack

🔹 Core Issue

The controversy centers on whether it is legitimate to estimate the average statistical power of completed studies—that is, to use published test statistics to infer how often those studies would produce significant results if replicated.

Schimmack’s position: Average power can and should be estimated empirically from published results.
Pek’s position: Power is a hypothetical construct used for planning future studies, not something that can be meaningfully estimated post hoc.

🔹 Schimmack’s Position (Brunner & Schimmack, 2020; Schimmack, 2025)

Two Concepts of Power
- Hypothetical power: The probability of significance based on an assumed true effect size before data collection.
- True power: The actual long-run probability that studies in a literature produce significant results, given their real (not assumed) effect sizes and sample sizes.
  Schimmack argues both concepts are legitimate—one for planning, one for evaluating.
Empirical Estimation is Possible
- Using methods like z-curve, one can reconstruct the distribution of significant test statistics (z-values, t-values) to estimate:
  - Expected Discovery Rate (EDR): expected proportion of significant results after accounting for selection bias.
  - Expected Replication Rate (ERR): probability that a significant finding would be significant again if replicated.
    These correspond to estimates of average true power.
Purpose of Estimation
- Estimating average power reveals the credibility of a research area: if the observed success rate (e.g., 90%) far exceeds the estimated true success rate (e.g., 30%), the field likely suffers from publication bias or p-hacking.
- Average power is thus an index of evidential value and reproducibility, not a design tool.
Rebuttal to Semantic Objection
- Even if “power” was historically defined for hypothetical design contexts, that’s a semantic convention, not a logical limitation.
- Redefining or replacing the term (e.g., “expected discovery rate”) does not change the underlying empirical reality that studies have a certain probability of success given their true effects.

🔹 Pek’s Position (Pek, Hoisington-Shaw, & Wegener, 2024, Psychological Methods)

Power Is Hypothetical
- By definition, power is the probability of rejecting H₀ given a true effect size and sample size in a planned design.
- Once data are collected, the “true effect” is unknown, and the observed result no longer provides information about power.
Post-hoc Power Is Misleading
- Power computed from an observed effect size is mathematically redundant with the p-value.
- Therefore, post-hoc power analysis adds no new information—it simply recasts the p-value in another form (the so-called “power-p equivalence” argument).
Meta-Analytic Power Estimation Is Ontologically Flawed
- Using the same term (“power”) to describe retrospective estimates confuses the conceptual role of power (a design tool) with empirical inference about data.
- Pek argues that such redefinitions create an “ontological error”—blurring what power is (a pre-data probability) versus what z-curve estimates (a property of observed distributions).
Proper Role of Power
- Power analysis should be reserved for planning new studies to achieve a desired level of sensitivity, not for evaluating past research.

🔹 Schimmack’s Counterarguments (2025 Blog & Responses)

Misplaced Formalism
- The “ontological” objection is purely semantic: words often have multiple legitimate meanings depending on context (e.g., “force” in physics vs. conversation).
- Cohen himself used power in both planning and evaluative contexts (e.g., Cohen, 1962; Sedlmeier & Gigerenzer, 1989).
Empirical Track Record
- Dozens of meta-analyses since the 1960s have reported “average power” of published studies—this tradition predates Pek’s definitional restriction.
- Methods like z-curve extend that logic by correcting for selection bias and estimating actual discovery probabilities.
Conceptual Utility Over Semantics
- Regardless of what it’s called, the estimated probability that a significant result would replicate is an empirically meaningful and policy-relevant measure.
- The debate over the label “power” is a distraction from the substantive goal: improving credibility and reproducibility.
Meta-Science vs. Design Science
- Power as used in z-curve belongs to meta-science—the empirical study of how scientists actually behave—rather than the formal Neyman-Pearson design framework.
- Rejecting post-hoc estimation because it violates a textbook definition misses the meta-scientific purpose entirely.

🔹 Broader Implications

Issue	Schimmack’s View	Pek’s View
Definition of power	Can refer to true long-run success probability of real studies	Only a hypothetical design probability
Use of observed data	Valid and necessary for empirical evaluation	Invalid; tautological with p-values
Role of z-curve	A meta-scientific estimator of true discovery/replication rates	Misuses “power” concept
Philosophy of science	Empirical realism: definitions should follow observable reality	Conceptual essentialism: definitions must follow formal theory
Goal	Diagnose publication bias and credibility	Preserve terminological purity of statistical theory

🔹 Summary Statement

The Schimmack–Pek controversy is ultimately about the meaning and use of statistical power.

Pek argues that power belongs exclusively to the design phase and cannot describe completed studies.
Schimmack argues that psychology needs empirical tools to assess its actual performance and that average power—or equivalently, expected discovery/replication rates—provides exactly that.

In short:

Pek defends a definition of power of a single study; Schimmack invented a method to estimate average power.

Psycho-Science: Unscientific Statisticians Enable Bad Research Practices

October 21, 2025UncategorizedUlrich Schimmack

When Every Hypothesis Is True

Psychologists want to be scientists so badly that they have started rebranding themselves as psychological scientists. We now have journals like Psychological Science and departments renamed from Psychology to Psychological and Brain Sciences. It is an odd development, considering that psychology already means the study of the mind and behavior (APA Dictionary of Psychology). For decades, psychologists were content to call themselves psychologists, just as biologists are content to be biologists. But somewhere along the way, some began to worry that “psychologist” sounded too much like “astrologist.” So they added “science” to their name, hoping that a new label might make it true.

Of course, calling yourself a scientist does not make you one—any more than drawing a salary from a university or holding a PhD does. To study something scientifically requires following the basic rules of science: form falsifiable theories, test them empirically, and revise or abandon them when the data show that you are wrong. Unfortunately, many psychologists have been trained to believe that they can be scientists without ever risking that outcome.

When Every Hypothesis Is True

Awareness that something is wrong with psychological research is not new. In 1959, Sterling discovered that more than 90 percent of published studies in psychology supported the authors’ hypotheses. He repeated the finding in 1995, and it was still true in the 2010s (Schimmack, 2020). Sterling et al. (1995) already suggested that this high success rate is too good to be true. Graduate students quickly learn that publishing depends on getting significant results, and everyone knows it. The studies that do not “work” simply disappear. This is publication bias, and it undermines the very foundation of science.

The replication crisis made the problem visible. In the 2010s, the Open Science Collaboration (2015) tried to replicate 100 published results. Only about 25 percent of the social-psychology findings and 50 percent of the cognitive-psychology findings held up. The most straightforward explanation was that the original studies had a low probability of producing a significant result, but only the lucky ones were published. Luck alone, however, cannot explain the remarkably high success rates. Psychologists also used statistical tricks to inflate effect sizes to reach significance.

Scientific Doping

John, Loewenstein, and Prelec (2012) coined the term scientific doping for the set of behaviors that inflate apparent success rates—running multiple analyses, stopping data collection once p < .05, and hiding null results. Z-curve is simply a doping test for science. The difference is that, unlike in sports, scientific doping is still legal. Nobody has lost their job for concealing null findings or for collecting data until the numbers “worked.” When I once compared a famous psychologist to Lance Armstrong, I was threatened with a lawsuit, and I had to clarify that there is a distinction between banned substances in sports and legal p-hacking in psychology. Whereas the public assumes that scientists follow a code of honesty, the insider secret is that honest reporting of results is career-ending because only successful studies are published. Every psychological researcher knows it, but most like to hide this from the general public or their undergraduate students.

A Doping Test for Science

Together with Jerry Brunner—a psychologist-turned-statistician who left the field when he realized it was not functioning as a science—I developed z-curve, a method that estimates the true success rate in original and replication studies based on the statistical evidence in published studies (e.g., t– or F-values) (Brunner & Schimmack, 2020). Later, Bartoš and Schimmack (2022) extended the method to quantify the amount of publication bias in psychology journals.

When I applied z-curve to 678 statistical results from social-psychology journals (Motyl et al., 2017), the findings were sobering. The published success rate was 90 percent, but the estimated true success rate was only 19 percent (95 % CI = 6–36 %). Even under the most generous assumptions, the gap between the observed success rate and the true success rate is enormous. That discrepancy is not opinion; it is meta-science—the empirical study of psychologists’ behavior in their laboratories revealed by their published results. I guess that makes me a meta-psychological scientist.

Why the Pushback?

It is easy to see why empirical psychologists dislike these results—nobody enjoys learning that an entire discipline has been built on shaky foundations. What is harder to understand is the resistance from statistical methodologists, whose careers do not depend on producing significant empirical results.

In particular, Pek, Hoisington-Shaw, and Wegener (2024) appear to have made it their mission to fight against the estimation of true success rates. Some of the arguments are pure semantics (Schimmack, 2025). First, Pek insists on the definition of statistical power as a purely hypothetical construct based on hypothetical population effect sizes. Then, she criticizes anybody from Cohen (1962) to our own work for using results from actual studies for estimating true power because power is defined as hypothetical. This ignores 60 years of meta-analysis of the actual power of studies, but psycho-statisticians who decide what psychologists get to read do not see the problem with this argument. If power is defined as a hypothetical construct, it cannot be estimated with actual results. True—but we just have to change the definition of power and then we can estimate true power. A simple solution is to just use another term, which we already did when we created z-curve 2.0. Z-curve does not estimate average power. It estimates expected discovery and expected replication rates. Unlike hypothetical power, these estimates are influenced by the true population effect sizes of studies, not some hypothetical ones used for classic power estimation.

We do not need the term power to state that psychological journals have 90 % observed success rates when the expected success rates that correct for publication bias are often below 50 %, and defining power in a way that makes it impossible to apply it to actual studies does not address the empirical finding that success rates in psychology journals are inflated by publication bias. This bias undermines the credibility of claims that psychology is a science. Serious methodologists who want to improve psychology need to address the problem, not define it away with word games.

The Moral

For a long time, it was possible for psychologists to pretend that publication bias is not a problem and to ignore criticisms that success rates are incredibly high (Sterling, 1959). However, the replication crisis has shown that entire literatures can be made up from nothing. While actual replication studies are hard, z-curve makes it easy to show how implausible 90 % success rates really are. However, many psychologists do not want a doping test that holds them accountable and welcome criticism of doping tests, even if they rest on silly word games. Decades of criticism without reforms have shown that psychology is unable to fix itself—a hallmark of a real pseudoscience that uses statistical rituals to pretend to be scientific when it is not (Gigerenzer, 2004).

Psychology can rename itself as often as it likes—psychological science, psychological and brain sciences—but as long as it denies empirical evidence, protects illusions of success, and hides behind semantic arguments, it will remain what it has long been: a discipline that talks like a science but acts like a cult. While some progress has been made toward improving standards, most of these improvements are voluntary and limited to some areas of psychology. Despite some improvements, many areas of psychology have not even made these small changes.

Psychology needs an intervention. Stakeholders like funding agencies and undergraduate students have to hold psychological researchers accountable and ensure that they act in accordance with the rules of science. This means publishing results, even if they fail to confirm or even undermine a researcher’s theory. This also means that falsification of other researchers’ claims is desirable and should be encouraged. The success rate of 90 % has to come down to be taken seriously as a science. A scientific doping test is useful because it provides a clear goal for the outcome of the intervention. To get psychology clean, we have to show that it no longer uses scientific doping, and z-curve can track the progress toward that goal.

References

Bartoš, F., & Schimmack, U. (2022). Z-curve 2.0: Estimating replication success and publication bias with a mixture model. Psychological Methods, 27(3), 433–449. https://doi.org/10.1037/met0000475

Brunner, J., & Schimmack, U. (2020). Estimating replicability with z-curve. PsyArXiv Preprint. https://doi.org/10.31234/osf.io/9rhyz

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186

Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J. P., Sun, J., Washburn, A. N., Wong, K. M., Yantis, C., & Skitka, L. J. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Perspectives on Psychological Science, 12(4), 613–617. https://doi.org/10.1177/1745691617692103

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716

Pek, J., Hoisington-Shaw, K. J., & Wegener, D. T. (2024). Uses of uncertain statistical power: Designing future studies, not evaluating completed studies. Psychological Methods. Advance online publication. https://doi.org/10.1037/met0000577

Schimmack, U. (2020). The replication crisis: A z-curve analysis of social-psychology journals. Replication Index Blog. https://replicationindex.com/2020/01/04/replicability-crisis/

Schimmack, U. (2025). Reply to Pek, Hoisington-Shaw, and Wegener (2024): Defending the estimation of true success rates. Replication Index Blog. https://replicationindex.com

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. Journal of the American Statistical Association, 54(285), 30–34. https://doi.org/10.1080/01621459.1959.10501497

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. American Psychologist, 50(11), 1086–1089. https://doi.org/10.1037/0003-066X.50.11.1086