The Origin of Statistical Power

The concept of statistical power is nearly 100 years old, but few applied researchers know the history of this construct or that its meaning has changed over time. Most applied researchers only know that high power helps to get a statistically significant result that can be published (good), but also requires large sample sizes (bad) unless you can use efficient repeated-measures designs (something lost on applied researchers who confuse sample size with sampling error).

Fortunately, it is getting easier and faster to learn new things. In the old days, researching the history of power analysis would have required trips to the library, finding relevant books, and struggling with old writing and Greek formulas. Nowadays, you can just ask ChatGPT or some other artificial intelligence and you get the answer in a minute. The question is, why should you care? Well, you should care because the concept of power made sense when it was introduced by Neyman and Pearson, but it makes little sense for applied researchers who only want to get significant results. “Are you kidding?” The main reason modern researchers do not use the original concept of power is that power was defined as the probability of not making a type-II error, and a type-II error means that researchers accept the null-hypothesis when it is false. What? Are you out of your mind? Every student learns in intro (applied) statistics that it is wrong to accept the null-hypothesis when a result is not significant. Yes, but students are not told that this is only true for one type of statistics, the one developed by an arrogant and racist statistician, Fisher, who popularized p-values 100 years ago and successfully sold them to applied researchers, whereas better statistical methods were ignored. To understand why, you need to understand the alternative approach that was introduced by Neyman and Pearson, which does not use p-values.

I asked ChatGPT to give me an example of the classic Neyman-Pearson approach, and I think it is simple enough to illustrate the main point of the concept of power in this approach to drawing inferences from data.


Example Setup

Suppose we want to test whether a new teaching method improves math scores compared to the standard method.

  • Null hypothesis (H₀): μ = 75 (mean test score = 75)
  • Alternative hypothesis (H₁): μ = 80 (mean test score = 80)

We assume the population standard deviation is known: σ = 10.
This implies that the standardized effect size is medium, Cohen’s d = (80-75)/10 = 0.5

We choose:

  • Significance level: α = 0.05 (controls Type I error)
  • Sample size: n = 25 students

Step 1. Define the Test Statistic

The standard error is σ/√n = 10/√25 = 2.

We use a z-test: Z = (Mean − 75) / 2
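
The arithmetic of the setup and Step 1 can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names (mu0, mu1, sigma, n) and the helper z_statistic are illustrative labels and not part of the original example.

```python
import math

# Values taken from the example setup
mu0 = 75.0    # mean under H0
mu1 = 80.0    # mean under H1
sigma = 10.0  # known population standard deviation
n = 25        # sample size

# Standardized effect size (Cohen's d)
d = (mu1 - mu0) / sigma

# Standard error of the sample mean: sigma divided by the square root of n
se = sigma / math.sqrt(n)


def z_statistic(sample_mean):
    """z-test statistic for a sample mean under H0 (hypothetical helper)."""
    return (sample_mean - mu0) / se


print(d)                  # 0.5
print(se)                 # 2.0
print(z_statistic(77.0))  # 1.0
```

Note that the standard error shrinks with the square root of the sample size, not with the sample size itself.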


Step 2. Critical Value for α

For a one-sided test at α = 0.05, the critical z-value to reject H₀ is 1.645.

So, reject H₀ if: Mean > 75 + 1.645 × 2 = 78.29

So far, this is in line with Fisher because we would get a p-value below .05 (one-sided) if the mean in the sample is above 78.29.
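
The critical value and the cutoff for the sample mean can be reproduced as follows. This is a sketch that assumes Python 3.8+ (for statistics.NormalDist); the names z_crit and cutoff are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
mu0, se = 75.0, 2.0  # H0 mean and standard error from Step 1

# One-sided critical z-value: the 95th percentile of the standard normal
z_crit = NormalDist().inv_cdf(1 - alpha)

# Same cutoff expressed on the scale of the sample mean
cutoff = mu0 + z_crit * se

print(round(z_crit, 3))  # 1.645
print(round(cutoff, 2))  # 78.29
```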


Step 3. Type II Error (β)

We compute the probability of failing to reject H₀ when the true mean is 80:
Z = (78.29 − 80)/2 = −0.855

The probability of observing a sample mean ≤ 78.29 is: P(Z≤−0.855) = 0.196

So the Type II error rate is β = 0.196.
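
The type-II error rate is just the area of the H₁ sampling distribution that falls below the cutoff. A minimal sketch, using the rounded cutoff of 78.29 from Step 2 (variable names are illustrative):

```python
from statistics import NormalDist

mu1, se = 80.0, 2.0  # H1 mean and standard error
cutoff = 78.29       # rounded critical sample mean from Step 2

# Standardize the cutoff relative to the H1 distribution
z_beta = (cutoff - mu1) / se

# Probability of a sample mean at or below the cutoff when H1 is true
beta = NormalDist().cdf(z_beta)

print(round(z_beta, 3))  # -0.855
print(round(beta, 3))    # 0.196
```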


Step 4. Power

Power is the complement: Power = 1 − β = 0.804

This means that if the true mean is 80, we have about an 80% chance of correctly rejecting H₀.
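
The long-run meaning of this number can be illustrated with a small simulation: draw many samples of 25 scores from a population with a true mean of 80 and count how often the sample mean exceeds the cutoff. This is a rough sketch; the seed and the number of simulated studies are arbitrary choices, and the proportion of rejections should land close to 0.80 rather than match it exactly.

```python
import random
from statistics import fmean

random.seed(1)  # arbitrary seed, only to make the run repeatable

mu1, sigma, n = 80.0, 10.0, 25  # true mean under H1, SD, sample size
cutoff = 78.29                  # critical sample mean from Step 2
n_studies = 20_000              # number of simulated studies (arbitrary)

rejections = 0
for _ in range(n_studies):
    # One simulated study: 25 test scores drawn from the H1 population
    scores = [random.gauss(mu1, sigma) for _ in range(n)]
    if fmean(scores) > cutoff:
        rejections += 1

# Proportion of studies that correctly reject H0
simulated_power = rejections / n_studies
print(round(simulated_power, 2))
```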


Interpretation in Neyman–Pearson Terms

  • α (Type I error) is fixed in advance (5%).
  • By choosing sample size and test criteria, we also control β (here, ~20%).
  • The resulting power (80%) shows how sensitive the test is to detect a true effect of μ = 80.
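
To see how the choice of sample size controls β when α is fixed, the calculation from Steps 1 to 4 can be wrapped in a function. This is a sketch under the assumptions of the example (one-sided z-test, known σ); the function name power_one_sided_z and the sample sizes in the loop are hypothetical choices for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 against H1: mu = mu1 > mu0."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    # Critical sample mean that keeps the type-I error rate at alpha
    cutoff = mu0 + std_normal.inv_cdf(1 - alpha) * se
    # Type-II error: probability of landing below the cutoff when H1 is true
    beta = std_normal.cdf((cutoff - mu1) / se)
    return 1 - beta


# Power rises with n while alpha stays fixed at .05
for n in (10, 25, 50, 100):
    print(n, round(power_one_sided_z(75, 80, 10, n), 3))
```

With n = 25 the function returns about 0.80, the value from Step 4.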

Interpretation of a (Non-Significant) Result with a mean below the critical value, 78.29

Suppose our test compares

  • H0: True Mean = 75
  • H1: True Mean = 80
    with α=.05, n=25, and a critical cutoff value at Observed Mean = 78.29

We observe a sample mean of 77.0. How should this be interpreted?


1. Neyman–Pearson Interpretation

  • Decision rule: Since 77.0 < 78.29, the result falls in the acceptance region of H0.
  • Conclusion: Accept H0
  • Error risk caveat: If the true mean were really 80, there is a Type II error probability of about 20%. Thus, this decision carries that long-run risk of being incorrect.

2. Modern NHST (Null-Hypothesis Significance Testing) Interpretation

  • Decision rule: Since 77.0 < 78.29, the result is not statistically significant at α=.05
  • Conclusion: Fail to reject H0.
  • Error risk caveat: No explicit β is invoked; we simply say the data do not provide enough evidence to conclude the mean differs from 75.
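
The two interpretations use the same numbers and differ only in the label attached to the outcome. A minimal sketch for the observed mean of 77.0 (the variable names and decision strings are illustrative):

```python
from statistics import NormalDist

mu0, se, alpha = 75.0, 2.0, 0.05
observed_mean = 77.0

std_normal = NormalDist()
cutoff = mu0 + std_normal.inv_cdf(1 - alpha) * se  # about 78.29
z = (observed_mean - mu0) / se
p_value = 1 - std_normal.cdf(z)                    # one-sided p-value

# Neyman-Pearson: a binary decision between H0 and H1
np_decision = "reject H0" if observed_mean > cutoff else "accept H0"

# NHST: a non-significant result is treated as inconclusive
nhst_decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(round(z, 2))        # 1.0
print(round(p_value, 3))  # 0.159
print(np_decision)        # accept H0
print(nhst_decision)      # fail to reject H0
```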

And here you have it. You are taught that a non-significant result should not be used to accept the null-hypothesis, but this only follows from one approach to statistical inference, an approach that does not have any use for the concept of power. Power is important when researchers are willing to test their theories and to publish results that do not support them. Then we care about the power of a test to detect the predicted effect. If a test has 99% power and still does not support a prediction, there may be something wrong with the theory that made the prediction. Power without the willingness to make type-II errors is not power.

And now you also know why applied researchers use Fisher and not NP statistics. Fisher lets them get non-significant results without drawing the conclusion that their theory is wrong or that their new therapy method does not work, at least not better than existing ones. Instead, they can find a bogus reason why the study did not work, do a new one, and hope that this one will be significant (with or without a real effect).

The choice of Fisher’s approach also explains why psychology journals publish 90% significant results (Sterling, 1959). When non-significant results are considered inconclusive, editors have a simple reason for rejection: we do not publish inconclusive results. However, using NP there is no reason to prefer studies that reject H0 over studies that accept H0. If one researcher claims support for a theory with p < .05, another researcher could replicate the study and show that the replication study accepts H0. This is even more embarrassing for the first researcher if the power of the replication study is 99%, so that the risk of a type-II error is only 1%, much lower than the 5% risk of a type-I error in the original study. Why should a journal reject a conclusive finding that an effect does not exist and that the original result was probably a type-I error? Publishing such findings would correct false results, and the hallmark of science is that it corrects itself. With Fisher, even a study with 99% power and a non-significant result will be considered inconclusive and rejected, because everybody has been brainwashed to believe that p > .05 means a study did not produce any interesting results. This means psychology is not self-correcting and it is not a science.
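
To put a number on such a replication study, here is a rough sketch of the sample size that would give 99% power. It borrows the values of the teaching-method example (H0: 75, H1: 80, σ = 10, one-sided α = .05) purely as an assumption for illustration; a real replication would need its own effect size and design.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Assumed values, borrowed from the teaching-method example
mu0, mu1, sigma = 75.0, 80.0, 10.0
alpha, target_power = 0.05, 0.99

std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha)     # critical value for alpha
z_power = std_normal.inv_cdf(target_power)  # quantile for the target power

# Smallest n for a one-sided z-test: ((z_alpha + z_power) * sigma / delta)^2
n_required = ceil(((z_alpha + z_power) * sigma / (mu1 - mu0)) ** 2)

# Check the resulting type-II error rate at that sample size
se = sigma / sqrt(n_required)
cutoff = mu0 + z_alpha * se
beta = std_normal.cdf((cutoff - mu1) / se)

print(n_required)      # 64
print(round(beta, 3))  # 0.009
```

Under these assumptions, 64 participants instead of 25 would be enough to push the type-II error risk below the 5% type-I error risk of the original study.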

I am not the first to point out how stupid the statistics ritual with p-values is (Gigerenzer, 2004, “Mindless Statistics”), but others have not pointed out the reason for the use of a statistical method that has been criticized in hundreds of articles and that has better alternatives. The answer is simple.

If you were a scientist and your goal were to demonstrate that your theory is right with empirical studies, would you want a statistical method that can show you that you are wrong or a method that never shows you that you are wrong? Exactly! The toothless pseudo-science approach advocated by Fisher serves researchers’ self-interest, and they are in charge of deciding the rules of the game called science, even if it is not science.

So, pressure to improve “soft sciences” like psychology has to come from the outside. Consumers of psychological “findings” who often pay for this research with their tax dollars have to hold researchers accountable, and granting agencies have to stop rewarding researchers who have incredible success rates of 90% that only show that they do not report their “inconclusive” results. The first step is to teach undergraduate students that there are several ways to draw inferences from data, and to warn them that “inconclusive p-values” exist only to protect the fragile ego of researchers.

In conclusion, statistical power was invented as a powerful statistical tool to balance the risk of two errors: the error of accepting a false hypothesis (e.g., subliminal messages help people to stop smoking) and the error of rejecting a true hypothesis (e.g., studying increases test performance). Without the risk of publishing results that disconfirm a prediction, the concept of power is meaningless and has only created a new ritual in which power is used to plan studies with hypothetical effect sizes, but not to interpret published results.

3 thoughts on “The Origin of Statistical Power”

  1. This is fantastic and follows nicely from your previous post about the origins of power (https://replicationindex.com/2025/08/15/from-power-curves-to-z-curves/). If I’m understanding correctly, the implication of the N-P definition of power is that power is unconditional on whether H0 is true or not. This allows you to include results where the true effect size = 0. I think you made this point as well in your recent commentary pre-print.

    Fast forward to this blog post, it sounds like what you are saying is an extension of this idea. Which is that a non-significant p value, under N-P’s definition, means that we accept it as a falsification of our theory while understanding the risk that we may have missed a true effect. However, in Fisher’s interpretation, a nonsignificant p value essentially means you keep trying until you get it right.

    And is it correct to suggest that Fisher’s interpretation of p-values also aligns with what you mentioned about Cohen’s (1988) definition of power, where power is conditional only on true effects? Both frameworks seem to assume that there is a true effect to be found. For Fisher, p > .05 means you didn’t try hard enough. For Cohen (1988), power is ONLY the ability to detect an effect if effect size > 0. As you are suggesting, both create a perfect storm that undermines the intended self-correcting nature of science.

    1. And the perfect storm was created by the idea that scientists could control themselves and other scientists. Instead, they created a set of rules that benefited them, allowing them to make claims with scientific authority supported by p-values that actually have no significance. It is time to regulate science, like we regulate all other commercial activities (e.g., professional sports).

      1. I agree. Scientific authority without scientific evidence is just posturing at best and dangerous at worst.

        However, quantitative psychologists are few and far between, especially among psychological scientists. Aside from quant psych scientists, who else is qualified enough to properly regulate science if not academic peers?

        Genuinely curious what more safeguards we can put in place. For example, I remember when the APA started requiring confidence intervals and effect size reporting to make the stats more transparent. Obviously, this does not address the selection bias your work illuminates, but could changing APA guidelines address these?

        I think reproducibility measures, like sharing full code and raw datasets, can at least allow academic peers to evaluate the believability of a study, at least after it’s published. Pre-registration can help, too, as long as the planned methodology and analyses are stated clearly.
