One of the greatest meta-psychologists was Jacob Cohen. He was concerned about the risk that psychologists waste resources on studies that have a low probability of providing evidence for a true hypothesis. Following Neyman and Pearson, this error is called a type-II error. It can also be called a false negative result.
Psychologists typically rely on null-hypothesis testing to provide evidence for their predictions. They set the criterion for a statistically significant result at 5%. This means that there is only a 5% probability of obtaining a significant result when there is no real effect. This is called a type-I error or a false positive result. In this approach, a type-II error occurs when a prediction is true (e.g., a treatment is effective), but the p-value is above .05.
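To make the two error types concrete, here is a minimal simulation sketch (my own illustration, not taken from any of the cited articles). It assumes a two-sample t-test, 20 participants per group, and a true effect of d = 0.4; all of these numbers are arbitrary choices for illustration.

```python
# Illustrative simulation of type-I and type-II errors (assumed parameters).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, d = 10_000, 20, 0.4   # simulations, n per group, assumed true effect

false_positives = 0   # significant results when the null hypothesis is true
false_negatives = 0   # non-significant results when the effect is real

for _ in range(n_sims):
    # Null is true: both groups come from the same distribution.
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < .05:
        false_positives += 1
    # Alternative is true: the second group is shifted by d.
    a, b = rng.normal(0, 1, n), rng.normal(d, 1, n)
    if stats.ttest_ind(a, b).pvalue >= .05:
        false_negatives += 1

print(f"Type-I error rate:  {false_positives / n_sims:.2f}")  # close to .05
print(f"Type-II error rate: {false_negatives / n_sims:.2f}")  # much higher
```

Under these assumptions, the type-I error rate stays near the nominal 5%, while the type-II error rate is far higher. That asymmetry is exactly what Cohen worried about.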
Cohen (1962) warned psychologists that many studies have a high risk of producing false negative results, especially when effect sizes are small. Even for effect sizes around the average effect size in psychological research, the risk of a false negative result was about 50%. Follow-up studies showed that this situation did not change in the following decades (Sedlmeier & Gigerenzer, 1989).
One might assume that psychologists simply have little control over the false negative risk in their studies. However, that is not true. A simple way to decrease the false negative risk is to increase sample sizes. Thus, one has to wonder why psychologists did not increase sample sizes in response to evidence that they were conducting studies with a high risk of producing false negative results.
Imagine a gambler who can play two slot machines. One has a 50% chance of winning, the other an 80% chance of winning. Which machine would you pick? The answer is obvious. The situation for a researcher is a bit different. First, they have to pay more (invest more resources in larger samples) to play the machine with the higher odds of winning (i.e., avoiding a false negative result). Second, they do not know the actual odds of winning. They merely know that the odds of winning are higher when they invest more resources. Cohen (1988) tried to help researchers make decisions that reduce the false negative risk without paying too much for larger sample sizes. It took about 50 years, but power analyses have finally become more popular in psychology over the past decade.
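For readers who have never run one, here is a sketch of the kind of calculation Cohen popularized, using statsmodels. The assumed effect size of d = 0.5 and the 80% power target are illustrative choices, not values from Pek et al. or from any particular study.

```python
# Sample size needed for 80% power to detect an assumed effect of d = 0.5
# with a two-sided independent-samples t-test at alpha = .05.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.80, alternative='two-sided')
print(f"Required n per group: {n_per_group:.0f}")  # about 64 per group
```

Because the required sample size grows roughly with the inverse square of the effect size, halving the assumed effect roughly quadruples the cost of "playing the higher odds."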
While better control of false negative results may seem desirable to all, a recent peer-reviewed article by Pek, Pitt, and Wegener (2024) suggests that power analyses are useless. They claim in the title that “Uncertainty limits the use of power analysis.” In the article, they ask “Isn’t use of power better than not using power at all?” and their answer is not a simple “yes” (p. 11). They say it is also not a simple “No,” but they provide no examples where power analysis is better than drawing a random number from a hat to determine the sample size of a study. In fact, they go on to state that “we recommend that researchers place limited confidence when using power to design experiments, or not use it at all as a direct justification for determining N” (p. 11). If that does not mean “power analysis is useless,” they do a very good job of hiding the benefits of power analysis.
It is remarkable that such a harsh criticism of Cohen’s approach can be published in a leading psychology journal without even mentioning Cohen’s work. It is also remarkable that the authors never mention false negative results or type-II errors, although power is defined as the probability of avoiding a type-II error (beta = type-II error risk, power = 1 – beta). So, we do not know what Pek et al.’s (2024) suggestion for researchers is when they get a non-significant result. Maybe somebody should write to them: “Hey, I just did a study with a randomly generated sample size and got a non-significant result. What now?”
Two other giants in the history of psychology wrote an article in 1971 about the problem with small samples that often have a high false negative risk (Kahneman & Tversky, 1971). They wrote “We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis.” (p. 110).
What Pek et al. do not tell their readers is the real reason why psychologists ignored false negative results and continued to use small samples. The reason is not that they rarely have false negative results. The reason is that they invest relatively few resources in each study so that they can conduct many studies, or many tests within a study, to get at least one significant result. The non-significant results are simply discarded. This is known as the use of questionable research practices because researchers do not disclose all of their results. It increases the risk that the published results are false positives. If researchers test 20 false hypotheses, they can expect to get one p-value below .05. If they do not disclose that they ran 20 tests, readers cannot see that the one significant result was expected by chance alone.
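The arithmetic behind that expectation is simple; the following sketch just spells it out for 20 tests of false hypotheses at the conventional 5% criterion.

```python
# Expected number of false positives, and the chance of at least one,
# when 20 false hypotheses are each tested at alpha = .05.
n_tests, alpha = 20, 0.05
expected_hits = n_tests * alpha                 # = 1.0
p_at_least_one = 1 - (1 - alpha) ** n_tests     # ≈ 0.64
print(f"Expected false positives: {expected_hits:.1f}")
print(f"P(at least one p < .05):  {p_at_least_one:.2f}")
```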
Pek et al. also do not tell readers why power analysis has become more popular in the past decade. The reason is that a high rate of false negative results makes significant results less informative. Imagine that researchers test 100 true hypotheses and 100 false hypotheses. The 100 false hypotheses are expected to produce 5 significant results. This is implied by the use of the 5% criterion. If the 100 true hypotheses are tested with only 10% power, we get 10 true findings and 90 false negative results. If only the significant results are published, there are 15 published results: 5 are false findings (a medicine that does not work) and 10 are true findings (a medicine that works). This means one third of published findings are false. Cohen recommended planning studies with 80% power. This would mean we get 80 true findings and 5 false findings. As a result, only 6% of the published results are false. Would you still believe Pek et al. (2024) that power analyses are useless, or would you rather wonder whether the average power in psychology is closer to 10% or 80%?
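The example above can be reproduced with a few lines of code; the 100/100 split of true and false hypotheses is, of course, a stylized assumption.

```python
# Share of false findings among published (significant) results when only
# significant results are published, following the example in the text.
def false_discovery_share(power, n_true=100, n_false=100, alpha=0.05):
    true_positives = power * n_true     # true hypotheses that reach p < .05
    false_positives = alpha * n_false   # false hypotheses that reach p < .05
    return false_positives / (true_positives + false_positives)

print(f"10% power: {false_discovery_share(0.10):.0%} of published results are false")  # ~33%
print(f"80% power: {false_discovery_share(0.80):.0%} of published results are false")  # ~6%
```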
The key argument in Pek et al.’s article is that power is a hypothetical construct because we do not know whether the predicted effect is small, medium, or large. First of all, this is not true of all sciences. Some sciences have theories that make quantitative predictions. Even psychologists may have some idea whether they are testing a weak, moderate, or strong effect. More importantly, we do not even need to know the true effect size. We can conduct a power analysis based on an effect size that is theoretically interesting. For example, the question whether money buys happiness or not is a silly question. A more interesting question is how much happiness money can buy. Let’s say that money is only important for a theory of happiness if the correlation between money and happiness is at least r = .1, what Cohen calls a small effect. Power analysis not only helps us to determine a reasonable sample size to look for this correlation, it can also make non-significant results informative. For example, if we power a study to have a 95% chance of producing a significant result when the correlation is r = .1, and we obtain a non-significant result, the risk that we missed a true correlation of at least .1 (a false negative) is less than 5%. We may therefore be willing to accept the hypothesis that the correlation is less than .1 and conclude that money has a negligible influence on happiness (by the way, the true correlations tend to be between .1 and .3). This is valuable information that can only be obtained by considering the risk of a false negative finding. Finding a non-significant result in a study with N = 20 people does not warrant the conclusion that money does not matter much for happiness because the false negative risk is too high. Pek et al. (2024) ignore all of this useful information that power analyses can provide, even if there is great uncertainty about the true power of a study.
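As a sketch of how this works in practice, the required sample size for the r = .1 example can be approximated with the Fisher z transformation; the 95% power target and two-sided alpha of .05 follow the example in the text, and the helper function is my own illustration.

```python
# Approximate sample size needed to detect a correlation of r = .1
# with 95% power at a two-sided alpha of .05 (Fisher z approximation).
import numpy as np
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.95):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return int(np.ceil(((z_alpha + z_beta) / np.arctanh(r)) ** 2 + 3))

print(n_for_correlation(0.1))  # roughly 1,300 participants
```

Under these assumptions, a non-significant result in a sample of this size licenses the conclusion that any correlation is smaller than .1, whereas a non-significant result with N = 20 licenses no such conclusion.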
Moreover, researchers can easily track the power of their studies by keeping track of their success record. If a researcher conducts 20 statistical tests and finds only 4 significant results, the average power is only about 20%. According to Kahneman and Tversky (1971), any serious researcher would have to wonder whether they are just testing a lot of false hypotheses or producing a lot of false negative results. No serious researcher should just continue doing what they are doing, publish the 4 significant results, and call it a day. However, that is what social psychologists like Pek’s co-author Duane Wegener have been doing for decades, while ignoring power analyses. This has led to the replication crisis in social psychology, which has uncovered many false findings. At least Nobel Laureate Daniel Kahneman had the humility to recognize his mistake: “What the blog gets absolutely right is that I placed too much faith in underpowered studies” (Kahneman, 2017).
Kahneman (2017) also points out that we need science to make new discoveries and to correct false beliefs, but that science can only serve this function when all relevant results are published. That was not the case in social psychology. Non-significant results were ignored and only significant results that confirmed even the most implausible predictions were published. This bias is evident in the high percentage of significant results in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). With success rates of 90%, honest reporting would imply that psychologists only test true hypotheses with high power. Ironically, this would mean that psychologists do not need power analysis because they miraculously never obtain false negative results. The real reason for the 90% success rate is rather different. A replication project found only 25% significant results in replication studies of social psychology (Open Science Collaboration, 2015), suggesting that most studies fall well below the 50% criterion for serious researchers (Kahneman & Tversky, 1971). These are well-known facts that Pek et al. (2024) and the editor who published their article simply ignore, hiding them from readers who are not familiar with the history of power analysis.
Finally, Pek et al.’s (2024) concern about uncertainty regarding the true power of a study is irrelevant for the usefulness of power analyses. The true power of a study is less important than the truthful reporting of results. Uncertainty about true power implies that even researchers who conduct power analyses will sometimes conduct studies that produce false negative results. First, Cohen’s recommendation to aim for 80% power implies that 20% of tests of a true hypothesis will produce false negative results. Second, power analyses can overestimate true power, and the false negative risk can then be even greater than 20%. This is not a problem if the results are published and combined with other evidence that can correct false negative results. This is what researchers in medicine do. There, studies have only about 30% power on average, but non-significant results are reported and meta-analyses can reduce the risk of false conclusions. Thus, the biggest threat to psychology as a science is uncertainty about the honest reporting of results, not uncertainty about true power. The advantage of conducting power analyses is that honest and credible evidence is more likely when researchers conduct a few studies with high power than many studies with low power. This is what Cohen meant when he said “less is more, except for sample size.”
In conclusion, if you are new to statistical power and its role in psychological science, I recommend reading Cohen (1988, 1992) and ignoring Pek et al.’s (2024) useless article. A simple truth about power is that the percentage of significant results in a set of studies is an estimate of the mean power of those studies. If you see a set of studies with over 90% significant results, you have to ask yourself: did these studies really test only true hypotheses with high power, or did researchers not report studies that failed to support their claims (Schimmack, 2012)? I trust you to come to the right conclusion, but you can also use power calculations to test for the presence of selection bias, as sketched below. A fuller treatment of that test is a story for another day.
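As a teaser for that story, here is a minimal sketch of the kind of consistency check implied by Schimmack (2012): if the success rate estimates mean power, a near-perfect success record becomes wildly improbable when power is modest. The numbers (20 studies, 19 significant, 25% assumed mean power) are purely illustrative.

```python
# Probability of observing at least 19 significant results in 20 studies
# if the mean power of those studies were only 25% (illustrative numbers).
from scipy.stats import binom

n_studies, n_significant, assumed_mean_power = 20, 19, 0.25
p = binom.sf(n_significant - 1, n_studies, assumed_mean_power)
print(f"P(>= {n_significant}/{n_studies} significant | mean power = "
      f"{assumed_mean_power:.0%}) = {p:.1e}")  # vanishingly small
```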