For decades, psychologists have ignored statistics because the only knowledge required was that p-values below .05 can be published and p-values above .05 cannot. Hence, psychologists used statistics programs to hunt for significant results without understanding the meaning of statistical significance.
Since 2011, psychologists have increasingly recognized that publishing only significant results is a problem (cf. Sterling, 1959). However, psychologists are confused about what to do instead. Many do not even know how to interpret p-values or what p < .05 means, as reflected in repetitive posts on social media suggesting that p-values or significance testing provide no information.
First, it does not require a degree in math to understand what p < .05 means. The criterion value alpha = .05 sets an upper limit on false positive results. For directional hypotheses, this means that no more than 5% of all hypothesis tests can produce a significant result, p < .05, when the population effect is zero or in the opposite direction from the effect suggested by the sample means (or the sign of the correlation in the sample).
That is, if a significant correlation in a sample is positive, the probability that the correlation in the population is zero or negative is at most 5%. Some readers will jump up and say that this statement ignores the prior probability of hypotheses being true or false. Please calm down and take a seat. The claim is not that the probability is exactly 5%. The exact probability is unknown. What is known is that the maximum probability is 5%: it could be less, but it cannot be more.
This is quite obvious when we look at probabilities in terms of long-run frequencies. The maximum probability of false positive results is reached when all hypotheses that are being tested are FALSE. In this case, there are zero true positives (TRUE & SIG) and five false positives (FALSE & SIG). Thus, the relative frequency of false positives among all tests (k = 100) is 5/100 = .05.
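This long-run frequency is easy to see in a quick simulation sketch (not part of the original table; it assumes only that p-values are uniformly distributed when the null hypothesis is true):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# When every tested hypothesis is FALSE (i.e., every null is true),
# p-values are uniformly distributed, so the long-run share of
# "significant" results converges to alpha = .05.
k = 1_000_000
p_values = rng.uniform(0.0, 1.0, size=k)
false_positive_rate = np.mean(p_values < 0.05)
print(false_positive_rate)  # close to 0.05
```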
Importantly, this is not an empirical observation. The maximum probability of false positive results is set by alpha and holds in the limit under the assumption that the statistical tests were conducted properly.
It is also important to note that using a long-run frequency as a probability estimate assumes that we have no additional information about the study. For example, if we knew that 19 previous studies tested the same hypothesis and produced non-significant results, the probability of a false positive would be much higher. As I am not concerned with the probabilities of single studies, but with the risk of false discoveries in sets of studies, the controversy between Bayesians and Frequentists is irrelevant here. Even with prior knowledge about hypotheses being true or false, we cannot expect more than 5% false positive results with alpha = .05.
A valid criticism of treating p < .05 as an important finding is that we are not interested in the percentage of false positives among ALL statistical tests. We are more interested in the percentage of false discoveries; that is, how many of the significant results could be false positives?
It would be easy to answer this question if all hypothesis tests were published (Sterling, 1959). In this case, we would have information about the total number of significant results as a proportion of all statistical tests. In the table above, we see that we have only 5 significant results out of 100 attempts. This is not very reassuring, because we would expect 5 significant results by chance alone. Thus, the risk that the significant results are false discoveries is 5/5 = 100%.
It is interesting to examine scenarios with more discoveries. The next example shows 10 discoveries. As some of these discoveries are true discoveries, the percentage of false positives has to be less than 5%. I used trial and error to find the maximum number of false positives.
The maximum percentage of false hypotheses is 94.7%, which produces 4.7% false positives (94.7% × .05). The remaining 5.3% are tests of true hypotheses, which contribute 5.3% significant results with 100% power. It is easy to see that this is a maximum because power cannot exceed 100%. Stated differently, the type-II error probability (TRUE & NS) cannot be less than zero. The false discovery risk is 4.7 / (4.7 + 5.3) = 4.7 / 10 = 47%. Again, this is not an estimate of the actual percentage of false discoveries, which is unknown. It is an estimate of the maximum percentage of false discoveries given the observation that 10 out of 100 hypothesis tests were significant.
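The trial-and-error search can be sketched in a few lines of Python (variable names mine): scan the share of false hypotheses and keep the largest value that is still compatible with a 10% discovery rate when true hypotheses are tested with 100% power.

```python
alpha = 0.05
discovery_rate = 0.10

# A false hypothesis yields a significant result with probability alpha;
# a true hypothesis yields one with probability at most 1 (100% power).
# Keep the largest share of false hypotheses f that can still produce
# the observed discovery rate.
feasible = [
    f / 10_000 for f in range(10_001)
    if alpha * (f / 10_000) + (1 - f / 10_000) >= discovery_rate
]
best_f = max(feasible)
false_positive_share = alpha * best_f

print(round(best_f * 100, 1))                           # 94.7 (% false hypotheses)
print(round(false_positive_share / discovery_rate, 2))  # 0.47 false discovery risk
```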
The false discovery risk decreases quickly when more significant results are observed.
With 20 significant results, the false discovery risk is 4.2/20 = 21%.
The following table shows the relationship between percent of significant results (discoveries) and the false discovery risk.
| Discoveries | False Discovery Risk |
|---|---|
| 5% | 100% |
| 10% | 47% |
| 20% | 21% |
| 30% | 12% |
| 40% | 8% |
| 50% | 5% |
The table suggests that researchers should aim for discovery rates (percentage of significant results) of 50% or more to keep the false discovery risk below 5%.
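The table follows from a closed-form version of the same worst-case logic: solving alpha·f + (1 − f) = D for the share of false hypotheses f gives f = (1 − D) / (1 − alpha), and the maximum false discovery risk is alpha·f / D. A small sketch (function name mine):

```python
def max_fdr(discovery_rate, alpha=0.05):
    """Maximum false discovery risk for a given discovery rate.

    Worst case: false hypotheses are significant with probability alpha,
    true hypotheses with probability 1 (100% power), so the share of
    false hypotheses f solves alpha*f + (1 - f) = discovery_rate.
    """
    f = (1 - discovery_rate) / (1 - alpha)
    return alpha * f / discovery_rate

for d in (0.05, 0.10, 0.20, 0.30, 0.40, 0.50):
    print(f"{d:.0%} discoveries -> max false discovery risk {max_fdr(d):.0%}")
```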
Estimating the False Discovery Risk in Psychology
The previous section showed that it is easy to estimate the maximum false discovery risk. The only problem with applying this approach is that the discovery rate in psychology is largely unknown, because psychologists publish only significant results and provide no information about the number of attempts that were made to obtain them (Sterling, 1959).
Brunner and Schimmack (2018) developed z-curve, a statistical approach for estimating the percentage of missing non-significant results based on the test statistics of published significant results. Following Rosenthal (1979), these missing studies are called the file drawer. Applying z-curve to focal tests of eminent social psychologists yields an estimate of 5 file-drawer studies for every published significant result. This implies a discovery rate of 17% (1 / (5 + 1)). Looking up the false discovery risk in the table suggests that up to 30% of published results could be false positives (the estimate of 55% in the figure below is based on a different definition of a false positive).
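The arithmetic behind this lookup can be sketched directly; applying the worst-case formula to the exact 17% discovery rate, rather than a coarse table lookup, gives a slightly tighter bound than 30%:

```python
alpha = 0.05

# z-curve estimate: 5 unpublished file-drawer studies per published
# significant result implies a discovery rate of 1 / (5 + 1).
file_drawer_per_hit = 5
discovery_rate = 1 / (file_drawer_per_hit + 1)

# Worst-case false discovery risk, as in the table above.
f = (1 - discovery_rate) / (1 - alpha)
max_fdr = alpha * f / discovery_rate

print(f"discovery rate {discovery_rate:.0%}, max false discovery risk {max_fdr:.0%}")
```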
This blog post explains what p-values mean and how they can be interpreted as the maximum long-run probability of obtaining a false positive result. However, it is important not to confuse the percentage of false positives with the false discovery risk. One percentage is FP / k (false positives among all tests); the other is FP / (FP + TP) (false positives among significant results).
The blog post also shows how we can estimate the maximum false discovery risk based on alpha and the discovery rate (i.e., the percentage of significant results). This is much more meaningful, but this information is typically not available. As Sterling (1959) pointed out, if only significant results are published, they become meaningless because the false discovery risk is unknown. Thus, psychologists must start reporting the number of attempts they made to make their empirical results meaningful.
As long as psychology journals publish only discoveries, statistical estimation is the only way to assess the false discovery risk in psychology. I presented one example of how the discovery rate can be estimated and what implications this estimate has for the false discovery risk in social psychology, where the false discovery risk is estimated to be 30%.
It is important to realize that false discoveries are defined here as mistakes about the sign or direction of an effect. Results with trivial effect sizes in the right direction are considered true positives. Thus, even a false discovery risk of 30% does not mean that 70% of all published results have practical significance, nor does it mean that 70% of published results can be replicated. Brunner and Schimmack are working on an alternative method that would treat extremely low-powered studies with true effects as false discoveries. This method produced an estimate of 55% false discoveries for eminent social psychologists.
The estimate of 30% false discoveries for social psychology suggests that Ioannidis (2005) was wrong when he claimed that most published results are wrong. His claims were based on hypothetical scenarios that are unrealistic. The present estimates are based on actual data and suggest that false discovery risks are less than 50%. Of course, even a false discovery risk of 30% is unacceptably high, but Ioannidis made a strong claim about false results without empirical support. I showed how this claim can be tested, and I presented data that suggest it is wrong. Moreover, social psychology is the worst discipline in psychology, so estimates for other areas of psychology are likely to be lower. This would mean that most published results in psychology are not false in the sense of reporting a false positive result.
More important is the demonstration that we do not need to make assumptions about the prior probability of hypotheses being right or wrong to make claims about false discovery risks. All we need is alpha and the discovery rate.
I am sure nothing I said is original in statistics, but it is original in the context of the endless debates about p-values and their interpretation in the social sciences. Psychology does not need new statistics; it needs credible information about discovery rates, and for that we need to end the culture of reporting only significant results.