Preface
This post grew out of a long discussion with ChatGPT about Gerd Gigerenzer’s treatment of the history of statistics and its influence on psychology in his co-authored book The Empire of Chance (1989).
I actually found this book by chance, because ChatGPT recommended it during a literature search. Psychology now has an overwhelmingly journal-based culture, where articles appear online as PDFs and are rarely accompanied by physical books. I am old enough to remember browsing the shelves of real libraries—especially the magnificent stacks at the University of Illinois and the Robarts Library in Toronto—but I stopped doing so about fifteen years ago. Younger colleagues may never know that quiet pleasure.
So, it is not surprising that few psychologists have actually read The Empire of Chance. Fortunately, I was able to access it through my University of Toronto credentials. For most readers, however, it remains locked behind a paywall.
To explore Gigerenzer’s arguments more closely, I uploaded the relevant chapters to ChatGPT (since they are not freely available) and discussed the content in light of my broader research on the history of power, significance testing, and replicability.
This post summarizes our shared understanding of how statistical thinking entered psychology, and why we concluded that Gigerenzer’s famous claim that null-hypothesis significance testing (NHST) is a hybrid of Fisher and Neyman-Pearson is inaccurate. It isn’t a hybrid at all. It’s pure Fisher.
Neyman and Pearson’s framework never gained traction. Today, Neyman’s invention of confidence intervals dominates sound statistical inference because it avoids the problems of Fisher’s significance testing without the difficulties of implementing the Neyman-Pearson approach. So, we moved from Fisher to Neyman, and Neyman-Pearson was never really relevant to the way psychologists use statistics.
Introduction
For decades psychologists have been told that the way they analyze data—null-hypothesis significance testing—is a hybrid of two rival statistical philosophies: Fisher’s significance test and the Neyman-Pearson decision framework.
Gigerenzer popularized this story in The Empire of Chance (1989), arguing that textbooks merged the two systems and gave the illusion of harmony. It’s a neat narrative—but it doesn’t survive close inspection.
1 · Fisher’s significance test
1️⃣ Make a prediction or explore whether two variables are related.
2️⃣ Collect data and compute a p-value assuming no relation (H₀).
3️⃣ If p is small enough, reject H₀ and claim support for the expected directional effect.
4️⃣ As Fisher wrote in 1935, “every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis” (The Design of Experiments, p. 16).
This deceptively simple procedure made inference a one-sided game: we seek “disproof” of H₀, not testing of a specific H₁.
In practice, rejecting H₀ is treated as confirming our theory—verification dressed up as falsification.
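Fisher’s asymmetric recipe can be sketched in a few lines. This is a minimal illustration, assuming a one-sided one-sample z-test with known SD; the function name and numbers are mine, not Fisher’s:

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def fisher_test(mean, mu0, sd, n, alpha=0.05):
    """Fisher-style significance test: only H0 (mu = mu0) is specified."""
    z = (mean - mu0) / (sd / n ** 0.5)
    p = 1 - Z.cdf(z)  # one-sided p-value, computed assuming H0 is true
    # Small p -> "reject H0"; large p licenses no conclusion at all.
    return p, p < alpha

p, reject = fisher_test(mean=0.3, mu0=0.0, sd=1.0, n=100)  # p ~ .001, reject
```

Note the asymmetry: no alternative hypothesis appears anywhere in the code, so a non-significant result says nothing about whether H₀ is true.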
2 · The Neyman-Pearson alternative
Neyman and Pearson proposed a symmetric system of two hypotheses, H₀ and H₁, each with defined long-run error rates.
- H₀ can be rejected, but H₁ can also be rejected.
- To do so we must specify a concrete alternative, e.g., d = 0.5, and design the study with known α and β.
- A result can therefore falsify a risky prediction (rejecting d = .8 means the effect is smaller than “large”).
- If both survive, we test again.
In this framework, power and Type II error are not afterthoughts—they’re the price of claiming evidence.
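The contrast with Fisher can be made concrete: once a specific alternative such as d = 0.5 is fixed, both α and β are computable before any data are collected. A minimal sketch for a one-sided z-test (my own illustration, not a formula from the 1933 paper):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_against(d, n, alpha=0.05):
    """Power of a one-sided z-test against a specific alternative, Cohen's d."""
    z_crit = Z.inv_cdf(1 - alpha)  # rejection threshold, fixed under H0
    # Under H1 (effect = d), the test statistic is shifted by d * sqrt(n).
    return 1 - Z.cdf(z_crit - d * n ** 0.5)

power = power_against(d=0.5, n=50)  # ~ .97
beta = 1 - power                    # Type II error rate, ~ .03, known in advance
```

Because β is attached to a concrete H₁, a non-significant result in a high-powered study carries evidential weight against d = 0.5—exactly the symmetry that Fisher’s procedure lacks.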
3 · Why it never took root in psychology
Psychology kept Fisher’s asymmetry. Researchers learned to celebrate significant results and ignore non-significant ones. Gigerenzer claimed textbooks resolved the dispute by fusing both schools into a “hybrid model.” But the evidence tells a different story.
4 · Why the “hybrid” is a myth
1 · Fixed thresholds were Fisherian conveniences.
Before computers, tables listed critical values for .05, .01, and .001. Using them was a practical shortcut, not an adoption of Neyman-Pearson error control.
Reporting “p < .05” or adding ** for p < .01 continued Fisher’s graded-evidence tradition.
2 · Type II errors were rhetorical, not operational.
Textbooks mentioned them vaguely—“the probability of an error if H₀ is false”—but never linked them to a specific H₁ such as d = .5. β was seldom calculated or used.
3 · Power was rarely used for design or inference.
Even after Cohen (1962) called for power analysis, psychologists mostly ignored power or treated it only as planning advice for achieving significance, not as a way to quantify Type II errors in inferences that reject a specific H₁.
4 · In practice, nothing changed.
Studies were published when p < .05 and forgotten when p > .05. Journal success rates were over 90%, reflecting a one-sided testing culture, not a balanced decision framework.
5 · The broader context
Other social sciences followed different paths. Economists and sociologists, working with large samples and directly measurable variables, emphasized estimation and precision—effect sizes, standard errors, and confidence intervals. They had little interest in either Fisher’s or Neyman-Pearson’s philosophies, although interpretation of results was also influenced by significance thresholds.
Ironically, Neyman’s own (1937) invention of the confidence interval would have solved psychology’s dilemma: a CI simultaneously rejects extreme H₀ and H₁ values without pre-specifying them. Gigerenzer does not mention the modern hybrid of significance testing that checks whether 0 falls inside or outside the confidence interval as a replacement for Fisher’s significance test.
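The confidence-interval logic can be sketched directly. A minimal illustration, assuming a one-sample design with known SD (the numbers are made up): a single 95% CI rejects every parameter value lying outside it, whether it is framed as an H₀ or an H₁.

```python
from statistics import NormalDist

def ci95(mean, sd, n):
    """95% confidence interval for the mean, SD assumed known."""
    half = NormalDist().inv_cdf(0.975) * sd / n ** 0.5
    return mean - half, mean + half

lo, hi = ci95(mean=0.45, sd=1.0, n=100)  # roughly (0.25, 0.65)
# d = 0  lies below lo -> reject the nil hypothesis
# d = .8 lies above hi -> reject a "large effect" hypothesis
```

No hypothesis had to be pre-specified: the same interval rules out the nil hypothesis and the large-effect hypothesis at once.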
6 · Conclusion
The so-called hybrid of Fisher and Neyman-Pearson is a myth.
Psychology adopted Fisher’s one-sided test with a conventional publishing threshold of p < .05 and never implemented the symmetrical logic of Neyman-Pearson decisions.
Even Cohen’s power analysis was absorbed into the same framework—another tool for ensuring significance, not for falsifying theoretical claims.
What Gigerenzer described as a marriage was never consummated.
Psychology has lived for nearly a century with Fisher alone, and is now replacing his significance test with Neyman’s confidence intervals.
Neyman-Pearson’s marriage never produced any children.
References
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186
Fisher, R. A. (1935). The design of experiments. Edinburgh: Oliver & Boyd.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krickeberg, K. (1989). The empire of chance: How probability changed science and everyday life. Cambridge University Press.
Gigerenzer, G. (1993). The superego, the ego, and the id of statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311–339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33(5), 587–606. https://doi.org/10.1016/j.socec.2004.09.033
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, 231, 289–337.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, 236, 333–380.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76(2), 105–110. https://doi.org/10.1037/h0031322
I also heard this story in my grad stats classes, but we never really dove into it. Which specific parts did Gigerenzer think were actually combined in practice? My impression is that psychology has been Fisher-leaning with NP parts layered on descriptively, not prescriptively. For example, one point ChatGPT made is that CIs (adopted from NP) in a Fisherian framework are descriptive (precision about a point estimate), whereas in NP, they guide decisions when tied to specific, predefined h0 or h1. So psychology used Fisher as an interpretive lens rather than thinking critically about their differences.
Also, what would a genuine hybrid approach look like in practice? In talking with ChatGPT about this, it is suggesting that Fisher is good for discovery while NP is good for confirmation. But I’m not convinced of that either; I’m not sure the decision frameworks are compatible. After all, for both traditions the test statistics are the same but their claims differ: Fisher emphasized an asymmetric framework (reject/fail to reject H0) whereas NP emphasized a symmetrical framework (reject/accept H0; reject/accept H1, while predefining alpha and beta). And would we even want a hybrid approach?
1. Gigerenzer claims that the use of a strict alpha criterion of .05 is NP, while Fisher advocated for a gradual approach.
I would say that is partially true, but cut-off values of .05 and .01 were already in use because they appeared in Fisher’s book and were easy to use and justify.
2. We do not want a hybrid because Fisher has nothing to offer and has ruined generations of psychologists. The solution is not a hybrid, but moving away from Fisher to effect size estimation with quantification of uncertainty (with or without priors).
I took it that the N-P part in modern NHST comes from the explicit stating of the alternative hypothesis (though you rightly point out that this is usually just complement of null rather than a point hypothesis). You’re also right that NHST is usually taken to be asymmetric, thus all that fuss about “absence of evidence =/= evidence of absence”, which was not what N-P intended. But isn’t this part of Gigerenzer’s point? That modern NHST is an *inconsistent* hybrid of Fisher and N-P, in that it contains elements that would be unacceptable to both Fisher and N-P.
The question is where the hybrid comes in. What is NP and not Fisher? What do you think?
I reread Gigerenzer (2004) [https://sciences.ucf.edu/biology/d4lab/wp-content/uploads/sites/23/2023/01/Gigerenzer-2004-Mindless-Statistics.pdf] and was a little surprised that he conceived of the “null ritual” as *not* involving alternative hypotheses. I am surprised because in psych departments NHST is taught with specific reference to alternative hypotheses (though it’s almost always composite). So actually, I would disagree with Gigerenzer that the NP part of the hybrid comes from (just) binary yes-no decisions but also the introduction of alternative hypotheses (which was not found in Fisher).