On the Definition of Statistical Power

D1: In plain English, statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down (first hit on Google)

D2: The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. (Wikipedia)

D3: The probability of not committing a Type II error is called the power of a hypothesis test. (Stat Trek)

The concept of statistical power arose from Neyman and Pearson’s approach to statistical inferences. Neyman and Pearson distinguished between two types of errors that could occur when a researcher draws conclusions about a population from observations in a sample. The first error (type-I error) is to infer a systematic relationship (in tests of causality this is an effect) when no relationship (no effect) exists. This error is also known as a false-positive as in a pregnancy test that shows a positive result (pregnant) when a women is not pregnant. The second error (type-II error) is to fail to detect a systematic relationship that actually exists. This error is also known as a false negative as when a pregnancy shows a negative result (not pregnant) when a woman is actually pregnant.

Ideally researchers would never make type-I or type-II errors, but it is inevitable that researchers will make both types of mistakes. However, researchers have some control over the probability of making these two mistakes. Statistical power is simply the probability of not making a type-II mistake; that is to avoid negative results when effects are present.

Many definitions of statistical power imply that the probability of avoiding a type-II error is equivalent to the long-run frequency of statistical significant results because statistical significance is used to decide whether an effect is present or not. By definition statistically non-significant results are negative results when an effect exists in the population. However, it does not automatically follow that all significant results are positive results when an effect is present. Significant results and positive results are only identical in one-sided hypotheses tests. For example, if the hypothesis is that men are taller than women and a one-sided statistical tests is used only significant results that show a greater mean for men than for women will be significant. A study that shows a large difference in the opposite direction would not produce a significant result no matter how large the difference is.

The equivalence between significant results and positive results no longer holds in the more commonly used two-tailed tests of statistical significance. In this case, the relationship in the population is either positive or negative. It cannot be both positive or negative. Only significant results that also show the correct direction of the effect (either as predicted by a correct prediction or as demonstrated by consistency with the majority of other significant results) are positive results. Significant results in the other direction are false positive results in that they show a false effect, which becomes only visible in a two-tailed test when the sign of the effect is taken into account.

How important is the distinction between the rate of positive results and the rate of significant results in a two-tailed test? Actually it is not very important. The largest number of false positive results is obtained when no effect exists at all. If the 5% significance criterion is used, no more than 5% of tests will produce false positive results. It will also become apparent after some time that there is no effect because half the studies will show a positive effect and the other half will show a negative effect. The inconsistency in the sign of the effect shows that significant results are not caused by a systematic relation. As the power of a test increases, more and more significant results will have the correct sign and fewer and fewer results will be false positives. The picture on top shows an example with 13% power. As can be seen most of this percentage comes from the fat right tail of the blue distribution. However, a small portion comes from the left tail that is more extreme than the criterion for significance (the green line).

For a study with 50% power to produce a true positive result (a significant result with the correct sign) is 50%. The probability of a false-positive result (a significant result with the wrong sign) is 0 to the second decimal, but not exactly zero (~0.05%). In other words, even in studies with modes power, false positive results have a negligible effect. A much bigger concern is that 50% of results are expected to be false negative results.

In conclusion, the sign of an effect matters. Two-tailed significance testing ignores the sign of an effect. Power is the long-run probability of obtaining a significant result with the correct sign. This probability is identical to the probability of a statistically significant result in a one-tailed test. It is not identical to the probability of a statistically significant results in a two-tailed test, but for practical purposes the difference is negligible. Nevertheless, it is probably most accurate to use a definition that is equally applicable to one-tailed and two-tailed tests.

D4: Statistical power is the probability of drawing the correct conclusion from a statistically significant result when an effect is present. If the effect is positive, the correct inference is that a positive effect exists. If an effect is negative, the correct inference is that a negative effect exists. When the inference is that the effect is negative (positive), but the effect is positive (negative), a statistically significant result does not count towards the power of a statistical test.

This definition differs from other definitions of power because it distinguishes between true positive and false positive results. Other definitions of power treat all non-negative results (false positive and true positive) as equivalent.

Replicability-Index

Improving the replicability of empirical research

On the Definition of Statistical Power

Like this:

1 thought on “On the Definition of Statistical Power”

Leave a ReplyCancel reply

Share this:

Like this:

1 thought on “On the Definition of Statistical Power”

Leave a ReplyCancel reply

Discover more from Replicability-Index