Originally published January 31, 2020
Revised December 27, 2020
Psychologists, social scientists, and medical researchers often conduct empirical studies with the goal of demonstrating an effect (e.g., a drug is effective). They do so by rejecting the null-hypothesis that there is no effect when a test statistic falls into a region of improbable test-statistics, p < .05. This is called null-hypothesis significance testing (NHST).
The utility of NHST has been a topic of debate. One of the oldest criticisms of NHST is that the null-hypothesis is likely to be false most of the time (Lykken, 1968). As a result, demonstrating a significant result adds little information, while failing to do so because studies have low power creates false information and confusion.
The debate shifted in the 2000s, when the opinion emerged that most published significant results are false (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). In response, there have been some attempts to estimate the actual number of false positive results (Jager & Leek, 2014). However, there has been surprisingly little progress towards this goal.
One problem for empirical tests of the false discovery rate is that the null-hypothesis is an abstraction. Just like it is impossible to say the number of points that make up the letter X, it is impossible to count null-hypotheses because the true population effect size is always unknown (Zhao, 2011, JASA).
An article by Soric (1989, JASA) provides a simple solution to this problem. Although this article was influential in stimulating methods for genome-wide association studies (Benjamini & Hochberg, 1995; over 40,000 citations), the article itself has garnered fewer than 100 citations. Yet, it provides a simple and attractive way to examine how often researchers may be obtaining significant results when the null-hypothesis is true. Rather than trying to estimate the actual false discovery rate, the method estimates the maximum false discovery rate. If a literature has a low maximum false discovery rate, readers can be assured that most significant results are true positives.
The method is simple because researchers do not have to determine whether a specific finding was a true or false positive result. Rather, the maximum false discovery rate can be computed from the actual discovery rate (i.e., the percentage of significant results for all tests).
The logic of Soric’s (1989) approach is illustrated in Table 1.
To maximize the false discovery rate, we make the simplifying assumption that all tests of true hypotheses (i.e., the null-hypothesis is false) are conducted with 100% power (i.e., all tests of true hypotheses produce a significant result). In Table 1, this leads to 60 significant results for 60 true hypotheses. The percentage of significant results for false hypotheses (i.e., the null-hypothesis is true) is given by the significance criterion, which is set at the typical level of 5%. This means that for every 20 tests, there are 19 non-significant results and one false positive result. In Table 1 this leads to 40 false positive results for 800 tests.
In this example, the discovery rate is (40 + 60)/860 = 11.6%. Out of these 100 discoveries, 60 are true discoveries and 40 are false discoveries. Thus, the false discovery rate is 40/100 = 40%.
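The arithmetic of this worst-case scenario can be retraced in a few lines. The following is a minimal sketch that uses only the numbers given in the text (60 true hypotheses, 800 false hypotheses, a 5% significance criterion, and 100% power); the variable names are chosen here for illustration.

```python
# Worst-case scenario described for Table 1 (all inputs taken from the text).
ALPHA = 0.05   # significance criterion
POWER = 1.0    # assumed power for tests of true hypotheses (maximizes the FDR)

n_true_hypotheses = 60     # tests where the null-hypothesis is false
n_false_hypotheses = 800   # tests where the null-hypothesis is true

true_discoveries = POWER * n_true_hypotheses     # significant, effect exists
false_discoveries = ALPHA * n_false_hypotheses   # significant, no effect

total_tests = n_true_hypotheses + n_false_hypotheses
discoveries = true_discoveries + false_discoveries

discovery_rate = discoveries / total_tests
false_discovery_rate = false_discoveries / discoveries

print(f"Discovery rate:       {discovery_rate:.1%}")
print(f"False discovery rate: {false_discovery_rate:.1%}")
```

Lowering `POWER` below 1 while holding the discovery rate constant would require more true hypotheses to produce the same number of discoveries, which is why the 100% power assumption yields the maximum false discovery rate.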
Soric’s (1989) insight makes it easy to examine empirically whether a literature tests many false hypotheses, using a simple formula to compute the maximum false discovery rate from the observed discovery rate; that is, the percentage of significant results. All we need to do is count and use simple math to obtain valuable information about the false discovery rate.
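The formula itself follows from the two assumptions of the worst-case scenario. The derivation below is a sketch in notation chosen here for illustration: DR is the discovery rate, \alpha is the significance criterion, and \pi_0 is the proportion of tests for which the null-hypothesis is true.

```latex
% Notation is an assumption of this sketch, not taken from Soric (1989).
\begin{align*}
\mathrm{DR} &= (1 - \pi_0)\cdot 1 + \pi_0\,\alpha
  && \text{(true hypotheses tested with 100\% power)} \\
\pi_0 &= \frac{1 - \mathrm{DR}}{1 - \alpha}
  && \text{(solve for } \pi_0\text{)} \\
\mathrm{FDR}_{\max} &= \frac{\pi_0\,\alpha}{\mathrm{DR}}
  = \left(\frac{1}{\mathrm{DR}} - 1\right)\frac{\alpha}{1 - \alpha}
\end{align*}
```

With the example above, DR = 100/860 and \alpha = .05 give (8.6 - 1) x (.05/.95) = 7.6/19 = 40%, which matches the false discovery rate obtained by counting.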
However, a major problem with Soric’s approach is that the observed discovery rate in a literature may be misleading because journals are more likely to publish significant results than non-significant results. This is known as publication bias or the file-drawer problem (Rosenthal, 1979). In some sciences, publication bias is a big problem. Sterling (1959; also Sterling et al., 1995) found that the observed discovery rate in psychology is over 90%. Rather than suggesting that psychologists never test false hypotheses, this suggests that publication bias is particularly strong in psychology (Fanelli, 2010). Using these inflated discovery rates to estimate the maximum false discovery rate (FDR) would severely underestimate the actual risk of false positive results.
Recently, Bartoš and Schimmack (2020) developed a statistical model that can correct for publication bias and produce a bias-corrected estimate of the discovery rate. This is called the expected discovery rate. A comparison of the observed discovery rate (ODR) and the expected discovery rate (EDR) can be used to assess the presence and extent of publication bias. In addition, the EDR can be used to compute Soric’s maximum false discovery rate when publication bias is present and inflates the ODR.
To demonstrate this approach, I use test-statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian Mixture Model to estimate that about 40% of results published in this journal are false positive results. Using Soric’s formula in reverse shows that this estimate implies that cognitive psychologists test only 10% true hypotheses (Table 3; 72/172 = 42%). This is close to Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, and Johannesson’s (2015) estimate of only 9% true hypotheses in cognitive psychology.
These estimates are implausible because Soric’s method yields very different results when it is applied to the Open Science Collaboration (2015) project, which conducted actual replication studies and found that 50% of published significant results in cognitive psychology could be replicated; that is, produced a significant result again in the replication study. As there was no publication bias in the replication studies, the ODR of 50% can be used to compute the maximum false discovery rate, which is only 5%. This is much lower than the estimate obtained with Gronau et al.’s (2017) mixture model.
I used an R-script to automatically extract test-statistics from articles that were published in Psychonomic Bulletin and Review from 2000 to 2010. I limited the analysis to this period because concerns about replicability and false positives might have changed research practices after 2010. The program extracted 13,571 test statistics.
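The R-script is not reproduced here. As a rough illustration only, the following Python sketch shows how APA-style t- and F-statistics could be pulled out of article text with regular expressions. The patterns, the function name `extract_test_statistics`, and the sample sentence are assumptions made for this example, not the script used for the analysis; real article text would need more patterns (e.g., z-tests, chi-square tests) and cleanup of PDF artifacts such as non-ASCII minus signs.

```python
import re

# Hypothetical patterns for reports such as "t(28) = 2.31" or "F(1, 58) = 4.20".
T_PATTERN = re.compile(
    r"\bt\s*\(\s*(\d+(?:\.\d+)?)\s*\)\s*=\s*(-?\d*\.?\d+)"
)
F_PATTERN = re.compile(
    r"\bF\s*\(\s*(\d+)\s*,\s*(\d+(?:\.\d+)?)\s*\)\s*=\s*(\d*\.?\d+)"
)


def extract_test_statistics(text):
    """Return a list of (test type, degrees of freedom, value) tuples."""
    results = []
    for match in T_PATTERN.finditer(text):
        df = (float(match.group(1)),)
        results.append(("t", df, float(match.group(2))))
    for match in F_PATTERN.finditer(text):
        df = (float(match.group(1)), float(match.group(2)))
        results.append(("F", df, float(match.group(3))))
    return results


# Made-up sentence, used only to show the output format.
sample = "The effect was significant, t(28) = 2.31, p = .03, and F(1, 58) = 4.20."
for test_type, df, value in extract_test_statistics(sample):
    print(test_type, df, value)
```

In a full pipeline, each extracted statistic would then be converted into a two-sided p-value and an absolute z-score before modeling; that step is omitted here because it requires distribution functions outside the Python standard library.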
Figure 1 shows clear evidence of selection bias. The observed discovery rate (ODR) of 70% is much higher than the expected discovery rate (EDR) of 35%, and the 95% CI of the EDR, 25% to 53%, does not include the ODR. As a result, the ODR produces an inflated estimate of the actual discovery rate and cannot be used to compute the maximum false discovery rate.
However, even with the much lower EDR of 35%, the maximum false discovery rate is only 10%. Even with the lower bound of the confidence interval for the EDR of 25%, the maximum FDR is only 16%.
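These values can be checked with a small helper. The sketch below assumes Soric’s bound in the form (1/DR - 1) x alpha/(1 - alpha) with alpha = .05; the function name `soric_max_fdr` is chosen here for illustration. Because the bound decreases as the discovery rate increases, the lower limit of the EDR confidence interval yields the upper limit for the maximum FDR, and vice versa.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) maximum false discovery rate for a given discovery rate."""
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in the interval (0, 1]")
    bound = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    return min(1.0, bound)  # a rate cannot exceed 100%


# Estimates reported in the text for 2000-2010.
estimates = {
    "ODR (inflated by selection)": 0.70,
    "EDR point estimate": 0.35,
    "EDR lower limit": 0.25,
    "EDR upper limit": 0.53,
}

for label, rate in estimates.items():
    print(f"{label:<28} {rate:.0%} -> maximum FDR {soric_max_fdr(rate):.1%}")
```

The ODR row illustrates the point made above: plugging in the inflated rate of 70% would suggest a noticeably smaller maximum FDR than the bias-corrected estimate supports.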
Figure 2 shows the results for a replication with test statistics from 2011 to 2019. Although changes in research practices could have produced different results, the results are unchanged. The ODR is 69% vs. 70%; the EDR is 38% vs. 35% and the point estimate of the maximum FDR is 9% vs. 10%. This close replication also implies that research practices in cognitive psychology have not changed over the past decade.
The maximum FDR estimate of 10% confirms the results based on the replication rate in a small set of actual replication studies (OSC, 2015) with a much larger sample of test statistics. The results also show that Gronau et al.’s mixture model produces dramatically inflated estimates of the false discovery rate (see also Brunner & Schimmack, 2019, for a detailed discussion of their flawed model).
In contrast to cognitive psychology, social psychology has seen more replication failures. The OSC project estimated a discovery rate of only 25%. Even this low rate would imply that a maximum of 16% of discoveries in social psychology are false positives. A z-curve analysis of a representative sample of 678 focal tests in social psychology produced an estimated discovery rate of 19% with a 95%CI ranging from 6% to 36% (Schimmack, 2020). The point estimate implies a maximum FDR of 22%, but the lower limit of the confidence interval allows for a maximum FDR of 82%. Thus, social psychology may be a literature where most published results are false. However, the replication crisis in social psychology should not be generalized to other disciplines.
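To see when a literature could consist mostly of false positives, Soric’s bound can be solved for the discovery rate. The sketch below assumes the bound (1/DR - 1) x alpha/(1 - alpha) with alpha = .05; both function names are chosen here for illustration.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Maximum false discovery rate implied by a discovery rate."""
    return min(1.0, (1 / discovery_rate - 1) * alpha / (1 - alpha))


def discovery_rate_for_max_fdr(max_fdr, alpha=0.05):
    """Discovery rate at which Soric's bound equals max_fdr (inverse formula)."""
    return 1 / (1 + max_fdr * (1 - alpha) / alpha)


# Discovery rates reported in the text for social psychology.
for label, rate in [("OSC replication rate", 0.25),
                    ("EDR point estimate", 0.19),
                    ("EDR lower limit", 0.06)]:
    print(f"{label:<22} {rate:.0%} -> maximum FDR {soric_max_fdr(rate):.0%}")

# Below this discovery rate, more than half of all discoveries could be false.
threshold = discovery_rate_for_max_fdr(0.50)
print(f"Maximum FDR exceeds 50% when the discovery rate is below {threshold:.1%}")
```

Under these assumptions the threshold is about 9.5%. The point estimate of 19% lies well above it, whereas the lower limit of 6% lies below it, which is why only the lower limit of the confidence interval allows for a majority of false discoveries.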
Numerous articles have made claims that false discoveries are rampant (Dreber et al., 2015; Gronau et al., 2017; Ioannidis, 2005; Simmons et al., 2011). However, these articles did not provide empirical data to support their claim. In contrast, empirical studies of the false discovery risk usually show much lower rates of false discoveries (Jager & Leek, 2014), but this finding has been dismissed (Ioannidis, 2014) or ignored (Gronau et al., 2017). Here I used a simpler approach to estimate the maximum false discovery rate and showed that most significant results in cognitive psychology are true discoveries. I hope that this demonstration revives attempts to estimate the science-wise false discovery rate (Jager & Leek, 2014) rather than relying on hypothetical scenarios or models that reflect researchers’ prior beliefs that may not match actual data (Gronau et al., 2017; Ioannidis, 2005).
Bartoš, F., & Schimmack, U. (2020, January 10). Z-Curve.2.0: Estimating Replication Rates and Discovery Rates. https://doi.org/10.31234/osf.io/urgtn
Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. https://doi.org/10.1073/pnas.1516179112
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLOS ONE, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068
Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from H₀. Journal of Experimental Psychology: General, 146(9), 1223–1233. https://doi.org/10.1037/xge0000324
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Ioannidis, J. P. A. (2014). Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics, 15(1), 28-36.
Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1-12.
Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, Pt.1), 151–159. https://doi.org/10.1037/h0026141
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1–8.
Schimmack, U. (2019). The Bayesian Mixture Model is fundamentally flawed. https://replicationindex.com/2019/04/01/the-bayesian-mixture-model-is-fundamentally-flawed/
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366.
Soric, B. (1989). Statistical “Discoveries” and Effect-Size Estimation. Journal of the American Statistical Association, 84(406), 608-610. doi:10.2307/2289950
Zhao, Y. (2011). Posterior Probability of Discovery and Expected Rate of Discovery for Multiple Hypothesis Testing and High Throughput Assays. Journal of the American Statistical Association, 106, 984-996, DOI: 10.1198/jasa.2011.tm09737