A recent article with the title “Is the Power Threshold of 0.8 Applicable to Surgical Science? Empowering the Underpowered Study” is being discussed on social media (e.g., Gelman blog).

Neither the authors nor the critics appear to be familiar with the statistical concept of power under discussion.

The article mentions Jacob Cohen as the pioneer of power analysis only to argue that his recommendation that studies should have 80% power is not applicable to surgical science.

They apparently didn’t read the rest of Cohen’s book on power analysis or any other textbook about statistical power.

Let’s first define statistical power. Statistical power is the long-run proportion of statistically significant results one can expect, given a study’s sample size, the population effect size, and the criterion for statistical significance.
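Under the usual normal approximation, this long-run rate can be computed directly. A minimal sketch for a two-sided, two-group comparison; the effect size d = 0.5 and n = 64 per group are illustrative assumptions (Cohen's "medium" effect), not values taken from the article:

```python
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test.

    d: standardized mean difference (Cohen's d) in the population
    n_per_group: sample size per group
    Normal approximation; the tiny lower-tail rejection region is ignored.
    """
    ncp = d * (n_per_group / 2) ** 0.5   # expected value of the test statistic
    z_crit = norm.ppf(1 - alpha / 2)     # critical value for two-sided alpha
    return norm.sf(z_crit - ncp)         # P(significant result | d, n, alpha)

# d = 0.5 with 64 patients per arm lands close to the conventional 80% power
print(round(approx_power(0.5, 64), 2))
```

This makes the definition concrete: power is not a property of a single study but the expected hit rate over many studies run under the same conditions.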

Given this definition of power, we can ask whether an 80% success rate is too high and what success rate would be more applicable in studies with small sample sizes. Assuming that sample sizes are fixed by the low frequency of events and effect sizes are not under the control of a researcher, we might simply have to accept that power is only 50% or only 20%. There is nothing we can do about it.

What are the implications of conducting significance tests with 20% power? 80% of studies of a real difference will produce a type-II error; that is, the test fails to reject the null-hypothesis (e.g., that two surgical treatments are equally effective) even though it is false (one surgical procedure is better than the other). Is it desirable to have an error rate of 80% in surgery studies? This is what the article seems to imply, but it is unlikely that the authors would actually agree with this, unless they are insane.
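This long-run consequence is easy to check by simulation. A sketch, using an assumed scenario (d = 0.35, n = 20 per group) chosen only because it yields roughly 20% power at alpha = .05:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
d, n, alpha, n_studies = 0.35, 20, 0.05, 10_000

significant = 0
for _ in range(n_studies):
    treatment = rng.normal(d, 1.0, n)   # true effect: d standard deviations
    control = rng.normal(0.0, 1.0, n)   # no effect in the control group
    if ttest_ind(treatment, control).pvalue < alpha:
        significant += 1

power = significant / n_studies
print(f"power ~ {power:.2f}, type-II error rate ~ {1 - power:.2f}")
```

With these values, roughly four out of five studies of a genuinely better procedure end in a non-significant result.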

So, what the authors are really trying to say is probably something like “some data are better than no data and we should be able to report results even if they are based on small samples.” The authors might be surprised that many online trolls would agree with them, while they vehemently disagree with the claim that we can empower studies with small samples by increasing the type-II error rate.

What Cohen really said was that researchers should balance the type-I error risk (concluding one surgical procedure is better than the other when both are actually approximately equally effective) against the type-II error risk (the reverse error: concluding there is no difference when one procedure is actually better).

To balance error probabilities, researchers should set the criterion for statistical significance according to the risk of drawing false conclusions (Lakens et al., 2018). In small samples with modest effect sizes, a reasonable balance of type-I and type-II errors is achieved by increasing the type-I risk from the standard criterion of alpha = .05 to, say, alpha = .20, or if necessary even to alpha = .50.
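Under the same assumed small-sample scenario (d = 0.35, n = 20 per group; illustrative values, not from the article), raising alpha trades a higher type-I risk for a substantially lower type-II risk:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
d, n, n_studies = 0.35, 20, 10_000

# p-values from many simulated studies of a real effect of size d
pvalues = np.array([
    ttest_ind(rng.normal(d, 1.0, n), rng.normal(0.0, 1.0, n)).pvalue
    for _ in range(n_studies)
])

for alpha in (0.05, 0.20, 0.50):
    power = np.mean(pvalues < alpha)
    print(f"alpha = {alpha:.2f}: power ~ {power:.2f}, type-II ~ {1 - power:.2f}")
```

The same studies that fail about 80% of the time at alpha = .05 succeed far more often at alpha = .20, which is the sense in which adjusting alpha "empowers" a small study.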

With sample size and effect size fixed, changing alpha is the only way to empower small studies to produce significant results. Somehow the eight authors, the reviewers, and the editor of the target article missed this basic fact about statistical power.

In conclusion, the article is another example that applied researchers receive poor training in statistics and that the concept of statistical power is poorly understood. Jacob Cohen made an invaluable contribution to statistics by popularizing Neyman-Pearson’s extension of null-hypothesis testing, which takes type-II error probabilities into account. However, his work is not finished, and it is time for statistics textbooks and introductory statistics courses to teach statistical power properly so that mistakes like those in this article do not happen again. Nobody should think that it is desirable to run studies with less than 50% power (Tversky & Kahneman, 1971). Setting alpha to 5% even when this gives a study a high chance of producing a type-II error is insane and may even be considered unethical, especially in surgery, where a better procedure may save lives.