Estimating the False Positive Risk in Psychological Science

Abstract: At most one-quarter of published significant results in psychology journals are false positive results. This is surprising news after a decade of false positive paranoia. However, the low false positive rate is not a cause for celebration. It mainly reflects the low prior probability that the nil-hypothesis is true (Cohen, 1994). To produce meaningful results, psychologists need to maintain low false positive risks when they test stronger hypotheses that specify a minimum effect size.

Introduction

Like many other sciences, psychological science relies on null-hypothesis significance testing as the main statistical approach to draw inferences from data. This approach dates back to Fisher’s first manual instructing empirical researchers how to conduct statistical analyses. If the observed test statistic produces a p-value below .05, the null-hypothesis can be rejected in favor of the alternative hypothesis that the population effect size is not zero. Many criticisms of this statistical approach have failed to change research practices.

Cohen (1994) wrote a sarcastic article about NHST with the title “The Earth is round, p < .05.” In this article, Cohen made the bold claim “my work on power analysis has led me to realize that the nil-hypothesis is always false.” In other words, population effect sizes are unlikely to be exactly zero. Thus, rejecting the nil-hypothesis with a p-value below .05 only tells us something we already know. Moreover, when sample sizes are small, we often end up with p-values greater than .05 that do not allow us to reject a false null-hypothesis. I cite this article only to point out that in the 1990s, meta-psychologists were concerned with low statistical power because it produces many false negative results. In contrast, significant results were considered to be true positive findings. Although often meaningless (e.g., the amount of explained variance is greater than zero), they were not wrong.

Since then, psychology has encountered a radical shift in concerns about false positive results (i.e., significant p-values when the nil-hypothesis is true). I conducted an informal survey on social media. Only 23.7% of Twitter respondents echoed Cohen’s view that false positive results are rare (less than 25%). The majority (52.6%) of respondents assumed that more than half of all published significant results are false positives.

The results were a bit different for the poll in the Psychological Methods Discussion Group on Facebook. Here the majority opted for 25 to 50 percent false positive results.

The shift from the 1990s to the 2020s can be explained by the replication crisis in social psychology that has attracted a lot of attention and has been generalized to all areas of psychology (Open Science Collaboration, 2015). Arguably, the most influential article that contributed to concerns about false positive results in psychology is Simmons, Nelson, and Simonsohn’s (2011) article titled “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant,” which has been cited 3,203 times. The key contribution of this article was to show that the questionable research practices that psychologists use to obtain p-values below .05 (e.g., testing multiple dependent variables) can increase the risk of a false positive result from 5% to over 60%. Moreover, anonymous surveys suggested that researchers often engage in these practices (John et al., 2012). However, even massive use of QRPs will not produce a massive amount of false positive results if most null-hypotheses are false. In this case, QRPs will inflate the effect size estimates (which nobody pays attention to, anyway), but the rate of false positive results will remain low because most tested hypotheses are true.
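
For readers who want to see the mechanism in action, a minimal simulation (not Simmons et al.’s original code) illustrates one of these practices: measuring several correlated dependent variables and reporting whichever one reaches significance. The sample size, the correlation of r = .5, and the number of dependent variables are illustrative assumptions; it is the combination of several such practices that pushes the risk above 60%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def false_positive_rate(n_dvs, n_per_group=20, n_sims=10_000, alpha=0.05):
    """Share of simulated two-group studies with at least one p < alpha,
    even though the true effect on every dependent variable is zero."""
    # dependent variables correlate at r = .5 (illustrative assumption)
    cov = np.full((n_dvs, n_dvs), 0.5) + np.eye(n_dvs) * 0.5
    hits = 0
    for _ in range(n_sims):
        group1 = rng.multivariate_normal(np.zeros(n_dvs), cov, n_per_group)
        group2 = rng.multivariate_normal(np.zeros(n_dvs), cov, n_per_group)
        p_values = [stats.ttest_ind(group1[:, j], group2[:, j]).pvalue
                    for j in range(n_dvs)]
        # QRP: report whichever dependent variable "worked"
        hits += min(p_values) < alpha
    return hits / n_sims

print(false_positive_rate(n_dvs=1))  # close to the nominal .05
print(false_positive_rate(n_dvs=3))  # clearly above .05
```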

Some meta-scientists have argued that researchers are much more likely to test false hypotheses (e.g., the Earth is flat) than Cohen envisioned. Ioannidis (2005) famously declared that most published research findings are false. He based this claim on hypothetical scenarios that produce more than 50% false positive results when 90% of studies test a true null-hypothesis. This assumption is a near complete reversal of Cohen’s assumption that the effect size is nearly always different from zero. The problem is that the actual ratio of true to false hypotheses is unknown. Thus, estimates of false positive rates are essentially projective tests of gullibility and cynicism.

To provide psychologists with scientific information about the false positive risk in their science, we need a scientific method that can estimate the false discovery risk based on actual data rather than hypothetical scenarios. There have been several attempts to do so. So far, the most prominent study was Leek and Jager’s (2014) estimate of the false discovery rate in medicine. They obtained an estimate of 14%. Simulation studies showed some problems with their estimation model, but the superior z-curve method replicated the original result with a false discovery risk of 13%. This result is much more in line with Cohen’s view that most null-hypotheses are false (typically effect sizes are not zero) than with Ioannidis’s claim that the null-hypothesis is true in 90% of all significance tests.

In psychology, the focus has been on replication rates. The shocking finding was that only 25% of significant results in social psychology could be replicated in an honest and unbiased attempt to reproduce the original study (Open Science Collaboration, 2015). This low replication rate leaves ample room for false positive results, but it is unclear how many of the non-significant results were caused by a true null-hypothesis and how many were caused by low statistical power to detect an effect size greater than zero. Thus, this project provides no information about the false positive risk in psychological science.

Another noteworthy project used a representative sample of test results in social psychology journals (Motyl et al., 2017). This project produced over 1,000 p-values that were examined using a number of statistical tools available at that time. The key result was that there was clear evidence of publication bias. That is, focal hypothesis tests nearly always rejected the null-hypothesis, a finding that has been observed since the beginning of social psychology (Sterling, 1959). However, the actual power of studies to do so was much lower, a finding that is consistent with Cohen’s (1962) seminal analysis of power. The results themselves provided no information about the false positive risk, but this valuable dataset could be analyzed with statistical tools that estimate the false discovery risk (Schimmack, 2021). Unfortunately, the number of significant p-values was too small to produce an informative estimate of the false discovery risk (k = 678; 95%CI = .09 to .82).

Results

A decade after the “False Positive Psychology” article rocked psychological science, it remains unclear to what extent false positive results contribute to replication failures in psychology. To answer this question, we report the results of a z-curve analysis of 1,857 significant p-values that were obtained from hand-coding a representative sample of studies published between 2009 and 2014. The years 2013 and 2014 were included to incorporate Motyl et al.’s data. All other coding efforts focused on the years 2009 and 2010, before concerns about replication failures could have changed research practices. In marked contrast to previous initiatives, the aim was to cover all areas of psychology. To obtain a broad range of disciplines in psychology, a list of 120 journals was compiled (Schimmack, 2021). These journals are the top journals of their disciplines with high impact factors. Students had some freedom in picking journals of their choice. For each journal, articles were selected based on a fixed sampling scheme to code articles 1, 3, 6, and 10 for every set of 10 articles (1, 3, 6, 10, 11, 13, …). The project is ongoing and the results reported below should be considered preliminary. Yet, they provide the first estimate of the false discovery risk in psychological science.
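
For transparency, the selection rule can be written out explicitly. The sketch below only illustrates the sampling scheme described above; the function name is mine.

```python
def coded_article_numbers(n_articles):
    """Within every block of ten articles, code the 1st, 3rd, 6th, and 10th
    (i.e., articles 1, 3, 6, 10, 11, 13, 16, 20, 21, 23, ...)."""
    positions = (1, 3, 6, 10)
    return [block * 10 + pos
            for block in range((n_articles + 9) // 10)
            for pos in positions
            if block * 10 + pos <= n_articles]

print(coded_article_numbers(25))  # [1, 3, 6, 10, 11, 13, 16, 20, 21, 23]
```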

The results replicate previous findings that focal statistical tests are selected because they reject the null-hypothesis. Eighty-one percent of all tests had a p-value below .05. When marginally significant results are included as well, the observed discovery rate increases to 90%. However, the statistical power of studies does not warrant such high success rates. The z-curve estimate of mean power before selection for significance is only 31%; 95%CI = 19% to 37%. This statistic is called the expected discovery rate (EDR) because mean power is equivalent to the long-run percentage of significant results. Based on an insight by Soric (1989), we can use the EDR to quantify the maximum percentage of results that can be false positives, using the formula FDR = (1/EDR – 1)*(alpha/(1 – alpha)). The EDR point estimate of 31% corresponds to a false discovery risk of 12%, with a 95%CI ranging from 8% to 28%. It is important to distinguish between the risk and the rate of false positives. Soric’s method assumes that true alternative hypotheses are tested with 100% power, which is an unrealistic assumption. When power is lower, the false positive rate will be lower than the false positive risk. Thus, we can conclude from these results that it is unlikely that more than 25% of published significant results in psychology journals are false positive results.
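
Soric’s bound is simple enough to verify directly. The sketch below plugs the EDR point estimate reported above into the formula; the confidence interval reported above is taken from the z-curve output and is not reproduced by this plug-in calculation.

```python
def soric_fdr(edr, alpha=0.05):
    """Soric's (1989) maximum false discovery risk implied by the
    expected discovery rate: FDR = (1/EDR - 1) * alpha / (1 - alpha)."""
    return (1 / edr - 1) * alpha / (1 - alpha)

# EDR point estimate reported above
print(round(soric_fdr(0.31), 2))  # 0.12
```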

One concern about these results is that the number of test statistics differed across journals and that Motyl et al.’s large set of results from social psychology could have biased the results. We therefore also analyzed the data by journal and then computed the mean FDR and its 95%CI. This approach produced an even lower FDR estimate of 11%, 95%CI = 9% to 15%.
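
The by-journal aggregation can be sketched as follows. The journal-level EDR values below are made up for illustration, and the percentile bootstrap of the mean is only one reasonable way to obtain a confidence interval; it is not necessarily the exact procedure behind the estimate reported above.

```python
import numpy as np

rng = np.random.default_rng(0)

def soric_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * alpha / (1 - alpha)

# hypothetical journal-level EDR estimates standing in for the coded journals
journal_edr = np.array([0.25, 0.40, 0.33, 0.28, 0.45, 0.30, 0.36, 0.22])
journal_fdr = soric_fdr(journal_edr)

# percentile bootstrap of the mean FDR across journals
boot_means = [rng.choice(journal_fdr, size=journal_fdr.size).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean FDR = {journal_fdr.mean():.2f}, 95%CI = [{low:.2f}, {high:.2f}]")
```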

While an FDR of less than 25% may seem like good news in a field that is suffering from false positive paranoia, it is still too high to ensure that published results can be trusted. Fortunately, there is a simple solution to this problem because Soric’s formula shows that the false discovery risk depends on alpha. Lowering alpha to .01 is sufficient to produce a false discovery risk below 5%. Although this seems like a small adjustment, it results in the loss of 37% of significant results with p-values between .05 and .01. This recommendation is consistent with two papers that have argued against the blind use of Fisher’s alpha level of .05 (Benjamin et al., 2017; Lakens et al., 2018). The cost of lowering alpha further to .005 would be the loss of another 10% of significant findings (ODR = 47%).
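
To see how the false discovery risk scales with alpha, the formula can simply be re-evaluated at stricter thresholds. The sketch below holds the EDR estimate of 31% fixed purely for illustration; in a full analysis the discovery rate would be re-estimated at the stricter threshold.

```python
def soric_fdr(edr, alpha):
    """Soric's maximum false discovery risk for a given EDR and alpha."""
    return (1 / edr - 1) * alpha / (1 - alpha)

# EDR held at the 31% estimate from above, for illustration only
for alpha in (0.05, 0.01, 0.005):
    print(f"alpha = {alpha}: maximum FDR = {soric_fdr(0.31, alpha):.3f}")
# alpha = 0.05  -> 0.117
# alpha = 0.01  -> 0.022
# alpha = 0.005 -> 0.011
```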

Limitations and Future Directions

No study is perfect. As many women know, the first time is rarely the best time (Higgins et al., 2010). Similarly, this study has some limitations that need to be addressed in future studies.

The main limitation of this study is that the coded statistical tests may not be representative of psychological science. However, the random sampling from journals and the selection of a broad range of journals suggests that sampling bias has a relatively small effect on the results. A more serious problem is that there is likely to be heterogeneity across disciplines or even journals within disciplines. Larger samples are needed to test those moderator effects.

Another problem is that z-curve estimates of the EDR and FDR make assumptions about the selection process that may differ from the actual selection process. The best way to address this problem is to promote open science practices that reduce the selective publishing of statistically significant results.

Eventually, it will be necessary to conduct empirical tests with a representative sample of results published in psychology, akin to the reproducibility project (Open Science Collaboration, 2015). As a first step, studies can be replicated with the original sample sizes. Results that are successfully replicated do not require further investigation. Replication failures need to be followed up with studies that can provide evidence for the null-hypothesis using equivalence testing with a minimum effect size that would be relevant (Lakens, Scheel, & Isager, 2018). This is the only way to estimate the false positive risk by means of replication studies.
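
As an illustration of such a follow-up analysis, the sketch below implements the two one-sided tests (TOST) procedure for a two-sample design with a smallest effect size of interest of d = .10. The bound and sample sizes are arbitrary choices for illustration; this is not the software described by Lakens, Scheel, and Isager (2018).

```python
import numpy as np
from scipy import stats

def tost_two_sample(x, y, d_min=0.1):
    """Two one-sided tests (TOST) for equivalence of two independent groups.
    Returns the larger one-sided p-value; a value below alpha supports the
    claim that the effect lies within +/- d_min standard deviations."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    sp = np.sqrt(((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / df)
    se = sp * np.sqrt(1 / nx + 1 / ny)
    diff = np.mean(x) - np.mean(y)
    bound = d_min * sp  # equivalence bound on the raw mean-difference scale
    p_lower = stats.t.sf((diff + bound) / se, df)   # H0: difference <= -bound
    p_upper = stats.t.cdf((diff - bound) / se, df)  # H0: difference >= +bound
    return max(p_lower, p_upper)

rng = np.random.default_rng(42)
treatment = rng.normal(0.0, 1.0, 1000)
control = rng.normal(0.0, 1.0, 1000)  # no true effect in this replication
print(tost_two_sample(treatment, control))  # p < .05 would support equivalence
```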

Implications: What Would Cohen Say

The finding that most published results are not false may sound like good news for psychology. However, Cohen would merely point out that a low rate of false positive results simply reflects the fact that the nil-hypothesis is rarely true. If some hypotheses were true and others were false, NHST (without QRPs) could be used to distinguish between them. However, if most effect sizes are greater than zero, not much is learned from statistical significance. The problem is not p-values or dichotomous thinking. The problem is that nobody tests the risky hypothesis that an effect size has at least a minimum size and decides in favor of the null-hypothesis when the data show that the population effect size is not exactly zero but practically meaningless (e.g., experimental ego-depletion effects are less than 1/10th of a standard deviation). Even specifying H0 as r < .05 or d < .01 would lower the discovery rates and increase the false discovery risk, while increasing the value of a statistically significant result.
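
In practice, this proposal amounts to a minimum-effect test: the null-hypothesis states that the effect does not exceed some trivial size, and only effects beyond that bound count as discoveries. A minimal sketch for a two-sample design, again using 1/10th of a standard deviation as the arbitrary minimum, might look like this.

```python
import numpy as np
from scipy import stats

def minimum_effect_test(x, y, d_min=0.1):
    """Test H0: d <= d_min against H1: d > d_min for two independent groups.
    Under the boundary of H0, the ordinary t statistic follows a noncentral
    t distribution, which provides the p-value."""
    nx, ny = len(x), len(y)
    t_obs = stats.ttest_ind(x, y).statistic
    df = nx + ny - 2
    nc = d_min * np.sqrt(nx * ny / (nx + ny))  # noncentrality at d = d_min
    return stats.nct.sf(t_obs, df, nc)

rng = np.random.default_rng(7)
treatment = rng.normal(0.5, 1.0, 200)  # hypothetical effect of half a standard deviation
control = rng.normal(0.0, 1.0, 200)
print(minimum_effect_test(treatment, control))  # small p: effect exceeds the trivial range
```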

Cohen’s clear distinction between the null-hypothesis and the nil-hypothesis made it clear that nil-hypothesis testing is a ritual with little scientific value, while null-hypothesis testing is needed to advance psychological science. The past decade has been a distraction by suggesting that nil-hypothesis testing is meaningful, but only if open science practices are used to prevent false positive results. However, open science practices do not change the fundamental problem of nil-hypothesis testing that Cohen and others identified more than two decades ago. It is often said that science is self-correcting, but psychologists have not corrected the way they formulate their hypotheses. If psychology wants to be a science, psychologists need to specify hypotheses that are worthy of empirical falsification. I am getting too old and cynical (much like my hero Cohen in the 1990s) to believe in change in my lifetime, but I can write this message in a bottle and hope that one day a new generation may find it and do something with it.
