Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. The Journal of Abnormal and Social Psychology, 65(3), 145–153. https://doi.org/10.1037/h0045186
This is a short summary of the key findings of Cohen’s (1962) seminal study of the actual power of studies in psychology, a study that should have triggered a methods revolution 60 years ago but did not. Reading the key sections of this important article is worth your time.
Here’s a summary of the key points from Jacob Cohen’s 1962 classic article The Statistical Power of Abnormal-Social Psychological Research:
Background and Motivation
- Psychological research emphasized Type I error (false positives) and statistical significance while largely neglecting Type II error (false negatives) and statistical power.
- Sample sizes were typically set by tradition, convenience, or negotiation—not by rational power analysis.
- This neglect meant that many studies had little chance of detecting real effects.
Study Goals
- Draw attention to the importance of power for researchers, editors, and sponsors.
- Provide conventions and tables to facilitate power analysis.
- Assess the actual power of published psychological research through a literature survey.
Method
- Cohen examined 78 articles published in the Journal of Abnormal and Social Psychology (1960–61).
- 70 articles contained statistical tests relevant to hypotheses.
- He created conventional benchmarks for effect size:
- Small, medium, large, defined in standardized terms for various tests (t-tests, F-tests, chi-square, correlations, proportions, etc.).
- Example: a medium difference between two means was set at 0.5 SD.
- Power was calculated for nondirectional (two-tailed) tests at α = .05.
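These benchmarks make Cohen-style calculations easy to reproduce with modern tools. Here is a minimal sketch in Python, assuming an independent-samples t-test with two equal groups of 34 (N = 68, roughly the mean maximum sample size Cohen reported), two-tailed α = .05, and the now-standard d = .2/.5/.8 conventions (Cohen’s 1962 definitions differed slightly); the numbers are illustrative, not Cohen’s tabled values:

```python
# Minimal Cohen-style power calculation: independent-samples t-test,
# two equal groups of 34 (N = 68), two-tailed alpha = .05.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    power = power_calc.power(effect_size=d, nobs1=34, alpha=0.05,
                             alternative="two-sided")
    print(f"{label:>6} (d = {d}): power = {power:.2f}")
# small ~0.13, medium ~0.52, large ~0.90 -- the same low range as the
# averages Cohen reported after pooling many tests and sample sizes.
```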
Results
- Average power across the 70 studies to detect:
- Small effects: ~0.18 (about 1 in 5 chance).
- Medium effects: ~0.48 (roughly a coin flip).
- Large effects: ~0.83 (good chance, but still not guaranteed).
- None of the studies achieved power ≥ .95 (the symmetrical equivalent of Type I error = .05) for small or medium effects; only 23% did so for large effects.
- Published studies looked “successful” mainly due to publication bias—failed studies were unlikely to appear in print.
Implications
- The typical abnormal-social psychology study had too small a sample size (mean maximum N ≈ 68).
- This led to:
- A high rate of false negatives (missed real effects).
- Premature abandonment of promising research lines.
- An overrepresentation of positive findings in the literature.
- Cohen argued that larger samples and routine power analysis should become standard practice.
Legacy
- Introduced the now-standard small/medium/large effect size conventions.
- One of the earliest systematic critiques of psychology’s reliance on low-powered studies.
- Helped set the stage for later discussions on replication, publication bias, and research credibility.
👉 In short: Cohen (1962) showed that most published abnormal-social psychology research in the early 1960s was severely underpowered, especially for detecting anything but large effects. He called for larger sample sizes, formal power analyses, and conventional effect size benchmarks to improve research quality.
Follow-Up Questions
1. Has this article led to an increase in power in psychological research?
Great question. The short answer is: not immediately. Cohen’s (1962) article is now seen as a landmark, but for decades its message had little practical impact on how psychologists designed studies.
Immediate Aftermath (1960s–1980s)
- Cohen’s review was noticed, but it did not change practice in the short run.
- Significance testing remained the dominant framework; power analysis was rarely taught or reported.
- Cohen himself revisited the issue in his 1988 book Statistical Power Analysis for the Behavioral Sciences (2nd ed.), noting that very little had improved since 1962.
- Sedlmeier & Gigerenzer (1989) did a follow-up survey of psychology journals and concluded that power was still as low as in Cohen’s review.
Replication Crisis Era (2010s onward)
- The replication crisis (e.g., Open Science Collaboration, 2015) revealed widespread failures to replicate published findings.
- Low power was identified as a central cause of irreproducibility.
- This revived attention to Cohen’s warnings—his 1962 and 1988 works were widely cited as prophetic.
- Meta-research (e.g., Button et al., 2013 in neuroscience; Szucs & Ioannidis, 2017 in psychology) confirmed that average power remained ~20–40%, especially for small-to-medium effects.
Current Trends (2020s)
- Some increase in awareness:
- Journals, funders, and preregistration platforms increasingly require power analyses.
- Larger sample sizes are becoming more common in high-profile replication projects.
- But in practice:
- Many psychology studies are still underpowered, especially in social/personality psychology.
- Incentive structures (publish-or-perish, novelty bias) continue to favor small-N, flashy studies over careful, well-powered designs.
- Methods like z-curve, p-curve, and meta-science simulations (which you’ve been working with!) now make it possible to estimate field-wide power from published results, and these consistently show average power below Cohen’s recommended standards.
Bottom Line
- Cohen’s (1962) article did not directly lead to an increase in power in psychology in the decades that followed.
- Its influence was delayed: it became central only when the replication crisis forced the field to confront reproducibility issues.
- Even today, psychology still struggles with underpowered studies—so Cohen’s critique remains highly relevant.
2. Why do you think that the article did not change research practices?
This is one of the most interesting questions about Cohen’s 1962 article. Historians and methodologists generally agree that it had very little immediate effect on research practices. There are several reasons for that:
1. Incentive Structures in Psychology
- Researchers were (and still are) rewarded for publishing statistically significant results, not for designing well-powered studies.
- Journals rarely required power analyses; significance testing dominated.
- Small samples were cheaper, faster, and more publishable—even if underpowered.
Result: Even if researchers knew about power, they had little incentive to act on it.
2. Educational Norms
- Graduate training and statistics textbooks emphasized p-values and Type I error, while Type II error and power were treated as afterthoughts.
- Cohen himself noted this imbalance: textbooks introduced power briefly, then ignored it.
- Most researchers in the 1960s–80s were never taught how (or why) to calculate power.
3. Lack of Accessible Tools
- In the 1960s, computing resources were scarce.
- Power calculations required special tables or hand calculations, making them burdensome.
- Cohen (1962) provided effect size conventions and tables, but without user-friendly software, most researchers didn’t adopt them.
- Only much later (with programs like G*Power in the 1990s) did power analysis become technically easy.
4. Misconceptions About “Significance”
- Many psychologists assumed that if a result was “statistically significant,” then power was irrelevant.
- Others believed that if their study “worked,” power must have been adequate.
- This confusion meant researchers didn’t perceive low power as a threat to validity.
5. Publication Bias Masked the Problem
- Because only “successful” studies were published, the literature appeared full of significant effects.
- This gave the illusion that studies must be powerful—contradicting Cohen’s warning.
- Researchers rarely saw the large number of underpowered null findings that never made it into print.
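A toy simulation makes this masking concrete. The inputs are illustrative assumptions (a true effect of d = 0.3 studied with n = 20 per group), not figures from Cohen’s survey:

```python
# Toy demonstration of publication bias: many underpowered studies of a
# real but modest effect, with only significant results "published".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1962)
true_d, n, n_studies = 0.3, 20, 10_000

g1 = rng.normal(0.0, 1.0, size=(n_studies, n))      # control groups
g2 = rng.normal(true_d, 1.0, size=(n_studies, n))   # treatment groups
t, p = stats.ttest_ind(g2, g1, axis=1)
d_obs = (g2.mean(axis=1) - g1.mean(axis=1)) / np.sqrt(
    (g2.var(axis=1, ddof=1) + g1.var(axis=1, ddof=1)) / 2)

published = p < .05
print(f"share significant ('published'): {published.mean():.2f}")        # ~0.15
print(f"mean observed d, all studies   : {d_obs.mean():.2f}")            # ~0.30
print(f"mean observed d, published only: {d_obs[published].mean():.2f}") # ~0.80
```

Only about one study in seven “works,” yet a reader who sees only the published record encounters a literature full of effects that look more than twice their true size.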
6. Cohen’s Framing
- Cohen presented his conventions as provisional standards, but did not push them as hard rules.
- His style was more diagnostic than prescriptive—raising awareness, but not building institutional pressure for change.
- Only later, in his 1988 book, did Cohen emphasize the urgent need for reform.
7. Timing and Culture of the Field
- In the 1960s–80s, psychology was dominated by behaviorism and later cognitive psychology, both focused on experimental ingenuity rather than methodological rigor.
- Concerns about replication, bias, and research credibility were not mainstream until the replication crisis decades later.
Bottom Line
The article didn’t change practice because:
- Incentives rewarded significance, not power.
- Education ignored power.
- Tools made it hard to calculate.
- Biases in publication created a false sense of success.
- Cultural timing meant the field wasn’t ready to prioritize rigor over discovery.
In short, Cohen planted the seed, but the soil of psychology wasn’t fertile enough for it to grow until the replication crisis made his warnings unavoidable.
3. Has Cohen’s claim of low statistical power been challenged, or is it widely accepted? For example, one could argue that effect sizes are actually larger than those assumed by Cohen, which would justify the high success rates.
Excellent, you’ve zeroed in on one of the core debates about Cohen’s (1962) findings and their legacy. Here’s the landscape:
1. Cohen’s Original Claim
- Cohen found that most studies in Journal of Abnormal and Social Psychology (1960–61) had very low power (~.18 for small, ~.48 for medium, ~.83 for large effects).
- He argued this meant many real effects were being missed, leading to wasted research and false negatives.
2. Challenges to Cohen
Yes, there have been arguments against taking his numbers at face value:
a) Effect sizes might actually be larger
- Critics suggest that Cohen’s “small/medium/large” conventions may underestimate the true effect sizes in psychology.
- If effects in the field are typically larger than Cohen’s “medium” benchmark (d = .5), then actual power would be higher.
- For instance, some areas (like cognitive psychology or psychophysics) do report very large, robust effects.
b) Publication bias inflates apparent success
- Others flipped the logic: since so many published studies were significant, power must have been adequate all along.
- But Cohen himself warned this was misleading: success rates reflect selective publication, not true underlying power.
c) Context-specific power
- Some researchers argued Cohen’s survey of one journal, over two years, may not generalize to all of psychology.
- Certain subfields (e.g., experimental cognitive psychology) tend to have higher-powered designs, while social/personality studies are often weaker.
3. Replications of Cohen’s Diagnosis
Most follow-up studies confirmed his basic point:
- Sedlmeier & Gigerenzer (1989): Found that average power in psychology remained just as low decades later.
- Maxwell (2004): Argued that low power was one of psychology’s most persistent problems.
- Button et al. (2013, neuroscience): Average power only ~20%.
- Szucs & Ioannidis (2017, psychology): Average power across thousands of studies was ~36%, in the same low range as Cohen’s original estimates.
4. Current Consensus
- Widely accepted: The field as a whole has been underpowered, particularly in social/personality psychology.
- Effect size debate remains: Some argue Cohen’s “small/medium/large” cutoffs don’t reflect real psychological phenomena.
- Example: A correlation of r = .20 (Cohen’s “small-to-medium”) may in fact be socially important but is hard to detect with small samples.
- Nuanced view today:
- In some areas (e.g., Stroop effect, visual perception), effects are large and easy to detect → power is fine.
- In others (social/personality, clinical interventions), effects are small-to-medium → power is chronically low.
5. Bottom Line
- Cohen’s claim of low power has stood the test of time.
- The challenge has mostly been about whether his conventions underestimated effect sizes—but large-scale meta-analyses have shown that in many areas, effects are indeed small-to-medium, meaning his diagnosis was correct.
- The modern view is that Cohen was right overall, though effect size distributions vary by subfield.
4. Are there any methodological advances that build on Cohen’s seminal work to provide better estimates of power in psychological research that do not rely on assumed effect sizes?
Yes — and this gets to the heart of why Cohen’s (1962, 1988) work is considered both foundational and limited. He gave psychology a language (effect size, small/medium/large) and a framework (power analysis), but he assumed you had to specify the effect size in advance. In reality, that’s often the hardest part.
Since then, several methodological advances have tried to estimate power empirically, without relying on arbitrary effect size assumptions:
1. Post-hoc and Retrospective Power (Cohen himself discouraged this)
- Early attempts calculated power from the observed effect size in a published study.
- Problem: highly unstable and circular — significant results yield “high” power, non-significant results yield “low” power.
- Cohen (1988) called this practice “illogical” when based only on observed effects.
- Conclusion: Not a real solution.
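The circularity is easy to make explicit. For a two-sided z-test, “observed power” is a deterministic function of the p-value alone, so it cannot add information beyond the p-value itself. A minimal sketch (assuming a z-test for simplicity):

```python
# "Observed power": plug the observed effect back into the power formula.
# For a two-sided z-test this reduces to a pure function of the p-value.
from scipy import stats

def observed_power(p, alpha=0.05):
    z_obs = stats.norm.isf(p / 2)       # z-score implied by the p-value
    z_crit = stats.norm.isf(alpha / 2)  # significance threshold (1.96)
    return stats.norm.sf(z_crit - z_obs) + stats.norm.cdf(-z_crit - z_obs)

for p in (0.05, 0.01, 0.20):
    print(f"p = {p:.2f} -> observed power = {observed_power(p):.2f}")
# p = .05 -> 0.50, p = .01 -> 0.73, p = .20 -> 0.25: a result that just
# reaches p = .05 always has "observed power" of 50%, whatever the truth.
```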
2. Meta-Analytic Power Estimation
- Instead of guessing effect sizes, meta-analyses combine past studies to estimate the distribution of effect sizes, which can then be used for more realistic power calculations.
- Examples:
- Maxwell (2004): called for cumulative effect-size databases to guide power analysis.
- But meta-analyses are themselves biased (publication bias, p-hacking).
3. Bias-Correcting Meta-Science Tools
These methods directly estimate average power (and related quantities) from the distribution of published test statistics — no need to assume effect sizes.
- p-curve (Simonsohn, Nelson, & Simmons, 2014): Examines the distribution of significant p-values. If studies are truly powered, there should be “right-skew” (more very small p’s than just-below-.05 p’s). Produces an estimate of average power across a set of studies (a toy illustration of this right-skew signal follows this list).
- p-uniform (van Assen, van Aert, & Wicherts, 2015): Similar idea, but uses the uniform distribution property of p-values under H0. Allows effect-size and power estimation corrected for selection.
- z-curve (Schimmack, 2016; Bartoš & Schimmack, 2021): Models the entire distribution of z-values (test statistics), both significant and nonsignificant (when available), to estimate:
  - Observed discovery rate (ODR)
  - Expected discovery rate (EDR)
  - False discovery risk
  - Average power, directly from published evidence.
- RoBMA (Bartoš et al., 2021): A Bayesian model-averaging extension of p-curve/p-uniform/z-curve, correcting for publication bias and heterogeneity simultaneously.
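Here is the toy simulation of the right-skew signal promised above (a simple two-sided z-test with illustrative noncentrality values, not data from the cited papers). Under the null, significant p-values are uniform between 0 and .05; under a well-powered alternative, they pile up near zero:

```python
# Toy illustration of p-curve's right-skew diagnostic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def significant_pvalues(delta, n_studies=100_000):
    """Two-sided z-test p-values, keeping only those below .05."""
    z = rng.normal(delta, 1.0, n_studies)   # observed test statistics
    p = 2 * stats.norm.sf(np.abs(z))
    return p[p < .05]

for label, delta in [("null (power = .05)", 0.0), ("powered (~80%)", 2.8)]:
    p = significant_pvalues(delta)
    print(f"{label:>20}: {(p < .025).mean():.2f} of significant p's < .025")
# null: ~0.50 (flat p-curve); powered: ~0.89 (right-skewed p-curve)
```

Actual p-curve, p-uniform, and z-curve analyses model these distributions formally rather than splitting at .025, but this is the raw signal they exploit.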
4. False Discovery Risk Metrics (related to power)
- Sorić’s formula (1989): links discovery rates to the maximum false discovery rate (FDR).
- Ioannidis (2005, “Why Most Published Findings Are False”): reframed power as a determinant of false discovery probability in fields with many small studies.
These don’t estimate power directly, but they make power observable in practice by connecting it to replicability and error rates.
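For concreteness, the version of Sorić’s bound commonly used in this literature says that with discovery rate DR and significance level α, the false discovery rate can be at most (1/DR - 1) * α/(1 - α). A small sketch:

```python
# Soric's (1989) upper bound on the false discovery rate, given the
# rate at which a literature produces significant results.
def soric_max_fdr(discovery_rate, alpha=0.05):
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

for dr in (0.05, 0.20, 0.36, 0.60):
    print(f"discovery rate {dr:.2f} -> max FDR {soric_max_fdr(dr):.2f}")
# 0.05 -> 1.00, 0.20 -> 0.21, 0.36 -> 0.09, 0.60 -> 0.04
```

A field whose true discovery rate is 36% can contain at most ~9% false discoveries, while a discovery rate no better than α means every “discovery” could be a false positive.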
5. Large-Scale Replication Projects
- Projects like the Open Science Collaboration (2015), Many Labs, and Registered Replication Reports provide empirical estimates of effect sizes by repeating published studies at scale.
- These projects show that many original studies were underpowered, confirming Cohen’s diagnosis without having to assume effect sizes.
6. Automated Power Estimation from the Literature
- Tools like z-curve (which you’ve used) and related approaches essentially reverse-engineer power from published test statistics, producing field-wide or topic-wide estimates of “realized” power.
- This is probably the most direct methodological advance that fulfills Cohen’s vision without his limitation (the need to specify an assumed effect size).
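A stripped-down sketch of that reverse-engineering logic: treat published significant z-values as draws from a normal distribution truncated at the significance threshold, and recover the noncentrality (and hence average power) by maximum likelihood. Real z-curve fits a mixture of components to handle heterogeneous power; this homogeneous, one-component toy version with simulated inputs only shows the core idea:

```python
# Toy "z-curve": estimate average power from significant z-values only.
import numpy as np
from scipy import optimize, stats

Z_CRIT = stats.norm.isf(0.025)  # two-sided .05 threshold (1.96)

def power_of(delta):
    return stats.norm.sf(Z_CRIT - delta) + stats.norm.cdf(-Z_CRIT - delta)

def neg_loglik(delta, z_sig):
    # density of |z| for z ~ N(delta, 1), truncated to the significant region
    dens = stats.norm.pdf(z_sig - delta) + stats.norm.pdf(z_sig + delta)
    return -np.sum(np.log(dens / power_of(delta)))

# simulated literature: true power ~0.35, but only significant results seen
rng = np.random.default_rng(3)
z_all = np.abs(rng.normal(1.57, 1.0, 5_000))
z_sig = z_all[z_all > Z_CRIT]

fit = optimize.minimize_scalar(neg_loglik, bounds=(0.0, 6.0),
                               args=(z_sig,), method="bounded")
print(f"estimated average power: {power_of(fit.x):.2f}")  # ~0.35
```

Despite seeing only the significant results, the model recovers the average power of all the studies that were run, which is exactly the quantity Cohen could only approximate with assumed effect sizes.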
✅ Bottom Line
Cohen’s work forced psychology to talk about power.
Modern advances like p-curve, p-uniform, z-curve, and RoBMA made it possible to estimate actual power empirically from the published record, instead of relying on assumed effect sizes. These methods are now central to meta-science and are widely used to diagnose the credibility of research literatures.
5. Which of these methods has been used to estimate the power of psychological research using large representative sets of studies like Cohen’s (1962) seminal study?
1. Direct Replications of Cohen’s Survey
- Sedlmeier & Gigerenzer (1989): Surveyed 10 psychology journals from 1960–1984.
- Found no improvement in average power since Cohen.
- Median power: ~.37 for medium effects (no better than Cohen’s ~.48, and if anything slightly lower).
- Maxwell (2004): Re-analyzed psychological literature, confirming that low power was still endemic.
These studies stayed close to Cohen’s method — manually coding studies, assuming effect sizes, and applying power tables.
2. p-curve
- Simonsohn, Nelson, & Simmons (2014): Introduced p-curve and demonstrated it on sets of published studies.
- Early applications used bundles of psychological experiments, showing that many literatures had low evidential value (i.e., low true power).
- But most p-curve analyses have been domain-specific (ego depletion, priming, etc.), not broad surveys like Cohen’s.
3. p-uniform
- van Assen, van Aert, & Wicherts (2015): Developed the method and applied it to collections of psychology studies.
- Like p-curve, mostly applied to targeted meta-analyses rather than huge representative samples.
4. z-curve
- Schimmack (2016 onward): Explicitly framed z-curve as a modern analogue to Cohen’s 1962 study.
- Used large samples of psychological articles (hundreds to thousands of test statistics).
- Reported average discovery rates of 30–40% and average power for published significant results around 60%.
- Schimmack’s “Replicability Index” and later z-curve 2.0 / 3.0 papers directly compared field-wide replicability to Cohen’s original findings.
- This is the closest true successor to Cohen’s survey — using automated test statistic distributions rather than effect-size assumptions.
5. RoBMA
- Bartoš et al. (2021): Combined approaches (p-curve, selection models, Bayesian averaging).
- Applied to psychology meta-analyses, not usually to whole journals.
- More powerful for specific literatures than sweeping field-level audits.
6. Large-Scale Meta-Science Surveys
- Szucs & Ioannidis (2017): Analyzed thousands of articles from cognitive neuroscience and psychology journals (2011–2014).
- Estimated median power for medium effects at ~36%.
- Essentially a mega-scale Cohen-style survey with modern computing.
- Open Science Collaboration (2015): Replication study in social/cognitive psychology.
- Directly showed realized replication rates ~36% — consistent with low original power.
- Button et al. (2013, neuroscience): Broader field but widely cited in psychology. Estimated median power ~20%.
✅ Bottom Line
- Cohen (1962): manual, small journal survey.
- Sedlmeier & Gigerenzer (1989); Maxwell (2004): larger manual surveys, same conclusion.
- Szucs & Ioannidis (2017): massive database approach, confirmed Cohen’s diagnosis.
- z-curve analyses (Schimmack, 2016–2021): the closest conceptual heir, using test-statistic distributions to estimate actual discovery rates and power without assuming effect sizes.
So — while p-curve and RoBMA are powerful tools, the field-wide, Cohen-style power estimates have mainly come from manual coding studies (1960s–2000s), meta-science big data surveys (2010s), and z-curve analyses (2016 onward).