Category Archives: False Discovery Rate

Estimating the False Discovery Risk of Psychology Science

Abstract

Since 2011, the credibility of psychological science has been in doubt. A major concern is that questionable research practices could have produced many false positive results, and it has been suggested that most published results are false. Here we present an empirical estimate of the false discovery risk using a z-curve analysis of randomly selected p-values from a broad range of journals that span most disciplines in psychology. The results suggest that no more than a quarter of published results could be false positives. We also show that the false positive risk can be reduced to less than 5% by using alpha = .01 as the criterion for statistical significance. This remedy can restore confidence in the direction of published effects. However, published effect sizes cannot be trusted because the z-curve analysis shows clear evidence of selection for significance that inflates effect size estimates.

Introduction

Several events in the early 2010s led to a credibility crisis in psychology. When journals selectively publish only statistically significant results, statistical significance loses its, well, significance. Every published focal hypothesis test will be statistically significant, and it is unclear which of these results are true positives and which are false positives.

A key article that contributed to the credibility crisis was Simmons, Nelson, and Simonsohn's (2011) article "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant."

The title made a bold statement: it is easy to obtain statistically significant results even when the null-hypothesis is true. This led to concerns that many, if not most, published results are indeed false positives. Many meta-psychological articles quoted Simmons et al.'s (2011) article to suggest that there is a high risk, or even a high rate, of false positive results in the psychological literature, including my own 2012 article.

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

The Appendix lists citations from influential meta-psychological articles that imply a high false positive risk in the psychological literature. Only one article suggested that fears about high false positive rates may be unwarranted (Stroebe & Strack, 2014). In contrast, other articles have suggested that false positive rates might be as high as 50% or more (Szucs & Ioannidis, 2017).

There have been two noteworthy attempts at estimating the false discovery rate in psychology. Szucs and Ioannidis (2017) automatically extracted p-values from five psychology journals and estimated the average power of the extracted t-tests. They then used this power estimate in combination with the assumption that psychologists test one true, non-zero effect for every 13 true null-hypotheses to suggest that the false discovery rate in psychology exceeds 50%. The problem with this estimate is that it relies on the questionable assumption that psychologists test only a very small percentage of true hypotheses.

The other article tried to estimate the false positive rate based on 70 of the 100 studies that were replicated in the Open Science Collaboration project (Open Science Collaboration, 2015). The statistical model estimated that psychologists test 93 true null-hypotheses for every 7 true effects (true positives), and that true effects are tested with 75% power (Johnson et al., 2017). This yields a false positive rate of about 50%. The main problem with this study is the reliance on a small, unrepresentative sample of studies that focused heavily on experimental social psychology, a field that triggered concerns about the credibility of psychology in general (Schimmack, 2020). Another problem is that point estimates based on a small sample are unreliable.

To provide new and better information about the false positive risk in psychology, we conducted a new investigation that addresses three limitations of the previous studies. First, we used hand-coding of focal hypothesis tests, rather than automatic extraction of all test-statistics. Second, we sampled from a broad range of journals that cover all areas of psychology rather than focusing narrowly on experimental psychology. Third, we used a validated method to estimate the false discovery risk based on an estimate of the expected discovery rate (Bartos & Schimmack, 2021). In short, the false discovery risk decreases as a monotonic function of the discovery rate, that is, the percentage of p-values below .05 (Soric, 1989).
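
To make this bound concrete, here is a minimal R sketch of Soric's formula; the function name and the example discovery rates are ours, chosen purely for illustration:

```r
# Soric's (1989) upper bound on the false discovery rate:
# if all true hypotheses were tested with perfect power, a discovery rate
# of dr at significance criterion alpha allows at most this share of
# false discoveries among the significant results.
soric_max_fdr <- function(dr, alpha = .05) {
  (1 / dr - 1) * alpha / (1 - alpha)
}

# Higher discovery rates cap the false discovery risk at lower values.
round(soric_max_fdr(c(.80, .40, .20, .10, .05)), 2)
#> 0.01 0.08 0.21 0.47 1.00
```

A field with many discoveries simply cannot have a high false discovery risk, whereas a discovery rate close to 5% is consistent with a literature made up entirely of false positives.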

Z-curve relies on the observation that false positives and true positives produce different distributions of p-values. To fit a model to distributions of significant p-values, z-curve transforms p-values into absolute z-scores. We illustrate z-curve with two simulation studies. The first simulation is based on Simmons et al.'s (2011) scenario in which the combination of four questionable research practices inflates the false positive risk from 5% to 60%. In our simulation, we assumed an equal number of true null-hypotheses (effect size d = 0) and true hypotheses with small to moderate effect sizes (d = .2 to .5). The use of questionable research practices also increases the chances of getting a significant result for true hypotheses. In our simulation, the probability of obtaining significance with a true H0 was 58%, whereas the probability of obtaining significance with a true H1 was 93%. Given the 1:1 ratio of H0 and H1 that were tested, this yields a false discovery rate of 39%.
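
This false discovery rate follows directly from the two significance probabilities and the 1:1 mixing ratio; a minimal R check (the variable names are ours):

```r
# Probabilities of a significant result after p-hacking, as in the simulation
p_sig_h0 <- .58   # true null hypotheses
p_sig_h1 <- .93   # true effects (d = .2 to .5)

# With a 1:1 ratio of H0 and H1, the false discovery rate is the share of
# significant results that come from true null hypotheses.
(.5 * p_sig_h0) / (.5 * p_sig_h0 + .5 * p_sig_h1)  # ~0.38, i.e., roughly 39%
```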

Figure 1 shows that questionable research practices produce a steeply declining z-curve. Based on this shape, z-curve estimates an expected discovery rate of 5%, with a 95%CI ranging from 5% to 10%. This translates into an estimate of the false discovery risk of 100%, with a 95%CI ranging from 46% to 100% (Soric, 1989). The reason why z-curve provides a conservative estimate of the false discovery risk is that p-hacking changes the shape of the distribution in a way that produces even more z-values just above 1.96 than mere selection for significance would produce. In other words, p-hacking destroys evidential value even when true hypotheses are being tested. It is not necessary to simulate scenarios in which even more true null-hypotheses are being tested because this would make the z-curve even steeper. Thus, Figure 1 provides a prediction for our z-curve analyses of actual data if psychologists rely heavily on Simmons et al.'s recipe to produce significant results.

Figure 2 is based on a simulation of Johnson et al.'s (2017) scenario with a 9% discovery rate (about 9 significant results for every 100 hypothesis tests), a false discovery rate of 50%, and power to detect true effects of 75%. Johnson et al. did not assume or model p-hacking.
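
A small simulation illustrates where the shape of Figure 2 comes from. We use our reading of this scenario (93 true nulls and 7 true effects per 100 tests, 75% power for the true effects); the non-centrality value is implied by the stated power:

```r
set.seed(1)
n_tests <- 1e5

# 93 true nulls and 7 true effects per 100 tests (Johnson et al.'s estimates)
is_true_effect <- runif(n_tests) < .07

# Non-centrality implied by 75% power at alpha = .05 (two-sided)
ncp <- qnorm(.975) + qnorm(.75)   # ~2.63

z   <- abs(rnorm(n_tests, mean = ifelse(is_true_effect, ncp, 0)))
sig <- z > qnorm(.975)

mean(sig)                    # discovery rate, ~.10
mean(!is_true_effect[sig])   # false discovery rate, close to the 50% in the scenario
mean(z[sig] > 3)             # a notable share of significant z-values exceeds 3
```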

The z-curve for this scenario also shows a steep decline that can be attributed to the high percentage of false positive results. However, there is also a notable tail with z-values greater than 3 that reflects the influence of true hypotheses tested with adequate power. In this scenario, the expected discovery rate is higher than in the first scenario, with a 95%CI ranging from 7% to 20%. This translates into a 95%CI for the false discovery risk ranging from 21% to 71% (Soric, 1989). This interval contains the true value of 50%, although the point estimate, 34%, underestimates the true value. Thus, we recommend using the upper limit of the 95%CI as an estimate of the maximum false discovery rate that is consistent with the data.

We now turn to real data. Figure 3 shows a z-curve analysis of Kühberger, Fritz, and Scherndl's (2014) data. The authors conducted an audit of psychological research by randomly sampling 1,000 English-language articles published in the year 2007 that were listed in PsycINFO. This audit produced 344 significant p-values that could be subjected to a z-curve analysis. The results differ notably from the simulated scenarios. The expected discovery rate is higher and implies a much smaller false discovery risk of only 9%. However, due to the small set of studies, the confidence interval is wide and allows for nearly 50% false positive results.

To produce a larger set of test-statistics, my students and I have hand-coded over 1,000 randomly selected articles from a broad range of journals (Schimmack, 2021). These data were combined with Motyl et al.'s (2017) coding of social psychology journals. The time period spans the years 2008 to 2014, with a focus on the years 2009 and 2010. This dataset produced 1,715 significant p-values. The estimated false discovery risk is similar to the estimate for Kühberger et al.'s (2014) data. Although the point estimate of the false discovery risk is a bit higher, 12%, the upper bound of the 95%CI is lower because the confidence interval is tighter.

Given the similarity of the results, we combined the two datasets to obtain an even more precise estimate of the false discovery risk based on 2,059 significant p-values. However, the upper limit of the 95%CI decreased only slightly from 30% to 26%.

The most important conclusion from these findings is that concerns about false positive results have been based on exaggerated assumptions about their prevalence in psychology journals. The present results suggest that at most a quarter of published results are false positives and that actual z-curves are very different from those implied by the influential simulation studies of Simmons et al. (2011). Our empirical results show no evidence that massive p-hacking is a common practice.

However, a false discovery risk of up to 25% is still unacceptably high. Fortunately, there is an easy solution to this problem because the false discovery rate depends on the significance threshold. Based on their pessimistic estimates, Johnson et al. (2017) suggested lowering alpha to .005 or even .001. However, these stringent criteria would render most published results statistically non-significant. We suggest lowering alpha to .01. Figure 6 shows the rationale for this recommendation by fitting z-curve with alpha = .01 (i.e., the red vertical line that represents the significance criterion is moved from 1.96 to 2.58).
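
For reference, the two cutoffs are simply the two-sided critical values of the standard normal distribution:

```r
qnorm(1 - .05 / 2)  # 1.96, the critical z-value for alpha = .05
qnorm(1 - .01 / 2)  # 2.58, the critical z-value for alpha = .01
```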

Lowering alpha to .01 lowers the percentage of significant results from 83% (not counting marginally significant, p < .1, results) to 53%. Thus, the expected discovery rate decreases, but the more stringent criterion for significance lowers the false discovery risk to 4%, and even the upper limit of the 95%CI is just 4%.

It is likely that discovery rates vary across journals and disciplines (Schimmack, 2021). In the future, it may be possible to make more specific recommendations for different disciplines or journals based on their discovery rates. Journals that publish riskier hypothesis tests or studies with modest power would need a more stringent significance criterion to maintain an acceptable false discovery risk.

An alpha level of .01 is also supported by Simmons et al.'s (2011) simulation studies of p-hacking. Massive p-hacking that inflates the false positive risk from 5% to 61% produces only 22% false positives with alpha = .01. Milder forms of p-hacking produce only an 8% probability of obtaining a p-value below .01 when the null-hypothesis is true. Ideally, open science practices like pre-registration will curb the use of questionable practices in the future. Increasing sample sizes will also help to lower the false positive risk. A z-curve analysis of new studies can be used to estimate the current false discovery risk and may suggest that even the traditional alpha level of .05 is able to maintain a false discovery risk below 5%.

While the present results may be considered good news relative to the scenario that most published results cannot be trusted, they do not change the fact that some areas of psychology have a replication crisis (Open Science Collaboration, 2015). The z-curve results show clear evidence of selection for significance, which leads to inflated effect size estimates. Studies suggest that effect sizes are often inflated by more than 100% (Open Science Collaboration, 2015). Thus, published effect size estimates cannot be trusted even if p-values below .01 show the correct sign of an effect. The present results also imply that effect size meta-analyses that did not correct for publication bias produce inflated effect size estimates. For these reasons, many meta-analyses have to be reexamined with statistical tools that correct for publication bias.

Appendix

“Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals” (Button et al., 2013; 3,316 citations).

“In a theoretical analysis, Ioannidis estimated that publishing and analytic practices make it likely that more than half of research results are false and therefore irreproducible” (Open Science Collaboration, 2015, aac4716-1)

“There is increasing concern that most current published research findings are false. (Ioannidis, 2005, abstract)” (Cumming, 2014, p. 7, 1,633 citations).

“In a recent article, Simmons, Nelson, and Simonsohn (2011) showed how, due to the misuse of statistical tools, significant results could easily turn out to be false positives (i.e., effects considered significant whereas the null hypothesis is actually true). (Leys et al., 2013, p. 765, 1,406 citations)

“During data analysis it can be difficult for researchers to recognize P-hacking or data dredging because confirmation and hindsight biases can encourage the acceptance of outcomes that fit expectations or desires as appropriate, and the rejection of outcomes that do not as the result of suboptimal designs or analyses. Hypotheses may emerge that fit the data and are then reported without indication or recognition of their post hoc origin. This, unfortunately, is not scientific discovery, but self-deception. Uncontrolled, it can dramatically increase the false discovery rate” (Munafò et al., 2017, p. 2, 1,010 citations)

Just how dramatic these effects can be was demonstrated by Simmons, Nelson, and Simonsohn (2011) in a series of experiments and simulations that showed how greatly QRPs increase the likelihood of finding support for a false hypothesis. (John et al., 2012, p. 524, 877 citations).

“Simonsohn’s simulations have shown that changes in a few data-analysis decisions can increase the false-positive rate in a single study to 60%” (Nuzzo, 2014, 799 citations).

“the publication of an important article in Psychological Science showing how easily researchers can, in the absence of any real effects, nonetheless obtain statistically significant differences through various questionable research practices (QRPs) such as exploring multiple dependent variables or covariates and only reporting these when they yield significant results (Simmons, Nelson, & Simonsohn, 2011)” (Pashler & Wagenmakers, 2012, p. 528, 736 citations)

“Even seemingly conservative levels of p-hacking make it easy for researchers to find statistically significant support for nonexistent effects. Indeed, p-hacking can allow researchers to get most studies to reveal significant relationships between truly unrelated variables (Simmons et al., 2011).” (Simonsohn, Nelson, & Simmons, 2014, p. 534, 656 citations)

“Recent years have seen intense interest in the reproducibility of scientific results and the degree to which some problematic, but common, research practices may be responsible for high rates of false findings in the scientific literature, particularly within psychology but also more generally” (Poldrack et al., 2017, p. 115, 475 citations)

“especially in an environment in which multiple comparisons or researcher dfs (Simmons, Nelson, & Simonsohn, 2011) make it easy for researchers to find large and statistically significant effects that could arise from noise alone” (Gelman & Carlin,

“In an influential recent study, Simmons and colleagues demonstrated that even a moderate amount of flexibility in analysis choice—for example, selecting from among two DVs or optionally including covariates in a regression analysis—could easily produce false-positive rates in excess of 60%, a figure they convincingly argue is probably a conservative estimate (Simmons et al., 2011).” (Yarkoni & Westfall, 2017, p. 1103, 457 citations)

“In the face of human biases and the vested interest of the experimenter, such freedom of analysis provides access to a Pandora’s box of tricks that can be used to achieve any desired result (e.g., John et al., 2012; Simmons, Nelson, & Simonsohn, 2011)” (Wagenmakers et al., 2012, p. 633, 425 citations)

“Simmons et al. (2011) illustrated how easy it is to inflate Type I error rates when researchers employ hidden degrees of freedom in their analyses and design of studies (e.g., selecting the most desirable outcomes, letting the sample size depend on results of significance tests).” (Bakker et al., 2012, p. 545, 394 citations).

“Psychologists have recently become increasingly concerned about the likely overabundance of false positive results in the scientific literature. For example, Simmons, Nelson, and Simonsohn (2011) state that “In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not” (p. 1359)” (Maxwell, Lau, & Howard, 2015, p. 487,

“Moreover, the highest impact journals famously tend to favor highly surprising results; this makes it easy to see how the proportion of false positive findings could be even higher in such journals.” (Pashler & Harris, 2012, p. 532, 373 citations)

“There is increasing concern that many published results are false positives [1,2] (but see [3]).” (Head et al., 2015, p. 1, 356 citations)

“Quantifying p-hacking is important because publication of false positives hinders scientific progress” (Head et al., 2015, p. 2, 356 citations).

“To be sure, methodological discussions are important for any discipline, and both fraud and dubious research procedures are damaging to the image of any field and potentially undermine confidence in the validity of social psychological research findings. Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology.” (Stroebe & Strack, 2014, p. 60, 291 citations)

“Assuming a realistic range of prior probabilities for null hypotheses, false report probability is likely to exceed 50% for the whole literature” (Szucs & Ioannidis, 2017, p. 1, 269 citations)

“Notably, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias” (Szucs & Ioannidis, 2017, p. 12, 269 citations)

“In all, the combination of low power, selective reporting, and other biases and errors that have been well documented suggest that high FRP can be expected in cognitive neuroscience and psychology. For example, if we consider the recent estimate of 13:1 H0:H1 odds [30], then FRP exceeds 50% even in the absence of bias.” (Szucs & Ioannidis, 2017, p. 15, 269 citations)

“Many prominent researchers believe that as much as half of the scientific literature—not only in medicine, but also in psychology and other fields—may be wrong [11,13–15]” (Smaldino & McElreath, 2016, p. 2, 251 citations).

“Researchers can use questionable research practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result” (Schimmack, 2012, p. 552, 248 citations)

“A more recent article compellingly demonstrated how flexibility in data collection, analysis, and reporting can dramatically increase false-positive rates (Simmons, Nelson, & Simonsohn, 2011).” (Dick et al., 2015, p. 43, 208 citations)

“In 2011, we wrote “False-Positive Psychology” (Simmons et al. 2011), an article reporting the surprisingly severe consequences of selectively reporting data and analyses, a practice that we later called p-hacking. In that article, we showed that conducting multiple analyses on the same data set and then reporting only the one(s) that obtained statistical significance (e.g., analyzing multiple measures but reporting only one) can dramatically increase the likelihood of publishing a false-positive finding. Independently and nearly simultaneously, John et al. (2012) documented that a large fraction of psychological researchers admitted engaging in precisely the forms of p-hacking that we had considered. Identifying these realities—that researchers engage in p-hacking and that p-hacking makes it trivially easy to accumulate significant evidence for a false hypothesis—opened psychologists’ eyes to the fact that many published findings, and even whole literatures, could be false positive.” (Nelson, Simmons, & Simonsohn, 2018, 204 citations).

“As Simmons et al. (2011) concluded—reflecting broadly on the state of the discipline—“it is unacceptably easy to publish ‘statistically significant’ evidence consistent with any hypothesis” (p. 1359)” (Earp & Trafimow, 2015, p. 4, 200 citations)

“The second, related set of events was the publication of articles by a series of authors (Ioannidis 2005, Kerr 1998, Simmons et al. 2011, Vul et al. 2009) criticizing questionable research practices (QRPs) that result in grossly inflated false positive error rates in the psychological literature” (Shrout & Rodgers, 2018, p. 489, 195 citations).

“Let us add a new dimension, which was brought up in a seminal publication of Simmons, Nelson & Simonsohn (2011). They stated that researchers actually have so much flexibility in deciding how to analyse their data that this flexibility allows them to coax statistically significant results from nearly any data set” (Forstmeier, Wagenmakers, & Parker, 2017, p. 1945, 173 citations)

“Publication bias (Ioannidis, 2005) and flexibility during data analyses (Simmons, Nelson, & Simonsohn, 2011) create a situation in which false positives are easy to publish, whereas contradictory null findings do not reach scientific journals (but see Nosek & Lakens, in press)” (Lakens & Evers, 2014, p. 278, 139 citations)

“Recent reports hold that allegedly common research practices allow psychologists to support just about any conclusion (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011).” (Koole & Lakens, 2012, p. 608, 139 citations)

“Researchers then may be tempted to write up and concoct papers around the significant results and send them to journals for publication. This outcome selection seems to be widespread practice in psychology [12], which implies a lot of false positive results in the literature and a massive overestimation of ES, especially in meta-analyses” (

“Researcher df, or researchers’ behavior directed at obtaining statistically significant results (Simonsohn, Nelson, & Simmons, 2013), which is also known as p-hacking or questionable research practices in the context of null hypothesis significance testing (e.g., O’Boyle, Banks, & Gonzalez-Mulé, 2014), results in a higher frequency of studies with false positives (Simmons et al., 2011) and inflates genuine effects (Bakker et al., 2012).” (van Assen, van Aert, & Wicherts, p. 294, 133 citations)

“The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Much of the debate has concerned habits (such as “p-hacking” and the file-drawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011).” (Vadillo, Konstantinidis, & Shanks, p. 87, 131 citations)

“Simmons, Nelson, and Simonsohn (2011) showed that researchers without scruples can nearly always find a p < .05 in a data set if they set their minds to it.” (Crandall & Sherman, 2014, p. 96, 114 citations)

Personalized P-Values for Social/Personality Psychologists

Last update 8/25/2021
(expanded to 410 social/personality psychologists; included Dan Ariely)

Introduction

Since Fisher invented null-hypothesis significance testing, researchers have used p < .05 as a statistical criterion to interpret results as discoveries worthy of discussion (i.e., evidence that the null-hypothesis is false). Once published, these results are often treated as real findings even though alpha does not control the risk of false discoveries.

Statisticians have warned against exclusive reliance on p < .05, but nearly 100 years after Fisher popularized this approach, it is still the most common way to interpret data. The main reason is that many attempts to improve on this practice have failed. The main problem is that a single statistical result is difficult to interpret. However, when individual results are interpreted in the context of other results, they become more informative. Based on the distribution of p-values, it is possible to estimate the maximum false discovery rate (Bartos & Schimmack, 2020; Jager & Leek, 2014). This approach can be applied to the p-values published by individual authors to adjust the significance criterion (alpha) so that the risk of false discoveries stays at a reasonable level, FDR < .05.
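
To sketch the logic (this is not the actual z-curve estimation, which also corrects for selection for significance), one can compute an observed discovery rate from a set of p-values and plug it into Soric's bound; the p-values below are made up for illustration:

```r
# Hypothetical p-values from one author's focal tests (illustration only)
p <- c(.001, .004, .012, .020, .031, .043, .048, .060, .140, .350)
alpha <- .05

odr   <- mean(p < alpha)                       # observed discovery rate: .70
soric <- (1 / odr - 1) * alpha / (1 - alpha)   # maximum FDR implied by the ODR: ~.02

# z-curve itself works with absolute z-scores rather than raw p-values
z <- qnorm(1 - p / 2)
round(c(odr = odr, max_fdr = soric), 2)
```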

Researchers who mainly test true hypotheses with high power have a high discovery rate (many p-values below .05) and a low false discovery rate (FDR < .05). Figure 1 shows an example of a researcher (David Matsumoto) who followed this strategy (for a detailed description of z-curve plots, see Schimmack, 2021).

We see that out of the 317 test-statistics retrieved from his articles, 246 were significant with alpha = .05. This is an observed discovery rate of 78%. We also see that this discovery rate closely matches the estimated discovery rate based on the distribution of the significant p-values, p < .05. The EDR is 79%. With an EDR of 79%, the maximum false discovery rate is only 1%. However, the 95%CI is wide and the lower bound of the CI for the EDR, 27%, allows for 14% false discoveries.

When the ODR matches the EDR, there is no evidence of publication bias. In this case, we can improve the estimates by fitting all p-values, including the non-significant ones. With a tighter CI for the EDR, we see that the 95%CI for the maximum FDR ranges from 1% to 3%. Thus, we can be confident that no more than 5% of the significant results with alpha = .05 are false discoveries. Readers can therefore continue to use alpha = .05 to look for interesting discoveries in Matsumoto's articles.

Figure 3 shows the results for a different type of researcher who took a risk and studied weak effect sizes with small samples. This produces many non-significant results that are often not published. The selection for significance inflates the observed discovery rate, but the z-curve plot and the comparison with the EDR shows the influence of publication bias. Here the ODR is similar to Figure 1, but the EDR is only 11%. An EDR of 11% translates into a large maximum false discovery rate of 41%. In addition, the 95%CI of the EDR includes 5%, which means the risk of false positives could be as high as 100%. In this case, using alpha = .05 to interpret results as discoveries is very risky. Clearly, p < .05 means something very different when reading an article by David Matsumoto or Shelly Chaiken.

Rather than dismissing all of Chaiken’s results, we can try to lower alpha to reduce the false discovery rate. If we set alpha = .01, the FDR is 15%. If we set alpha = .005, the FDR is 8%. To get the FDR below 5%, we need to set alpha to .001.

A uniform criterion of FDR < 5% is applied to all researchers in the rankings below. For some this means no adjustment to the traditional criterion. For others, alpha is lowered to .01, and for a few even lower than that.

The rankings below are based on automatically extracted test-statistics from 40 journals (List of journals). The results should be interpreted with caution and treated as preliminary. They depend on the specific set of journals that were searched, the way results are being reported, and many other factors. The data are available (data.drop) and researchers can exclude articles or add articles and run their own analyses using the z-curve package in R (https://replicationindex.com/2020/01/10/z-curve-2-0/).

I am also happy to receive feedback about coding errors. I also recommend hand-coding articles to adjust alpha based on focal hypothesis tests. This typically lowers the EDR and increases the FDR. For example, the automated method produced an EDR of 31% for Bargh, whereas hand-coding of focal tests produced an EDR of 12% (Bargh-Audit).

And here are the rankings. The results are fully automated and I was not able to cover up the fact that I placed only #188 out of 400 in the rankings. In another post, I will explain how researchers can move up in the rankings. Of course, one way to move up in the rankings is to increase statistical power in future studies. The rankings will be updated again when the 2021 data are available.

Despite their preliminary nature, I am confident that the results provide valuable information. Until now, all p-values below .05 have been treated as if they are equally informative. The rankings here show that this is not the case. While p = .02 can be informative for one researcher, p = .002 may still entail a high false discovery risk for another researcher.

Good science requires not only open and objective reporting of new data; it also requires unbiased review of the literature. However, there are no rules and regulations regarding citations, and many authors cherry-pick citations that are consistent with their claims. Even when studies have failed to replicate, original studies are cited without citing the replication failures. In some cases, authors even cite original articles that have been retracted. Fortunately, it is easy to spot these acts of unscientific behavior. Here I am starting a project to list examples of bad scientific behaviors. Hopefully, more scientists will take the time to hold their colleagues accountable for ethical behavior in citations. They can even do so by posting anonymously on the PubPeer comment site.

Rank Name Tests ODR EDR ERR FDR Alpha
1Robert A. Emmons538789901.05
2Allison L. Skinner2295981851.05
3David Matsumoto3788379851.05
4Linda J. Skitka5326875822.05
5Todd K. Shackelford3057775822.05
6Jonathan B. Freeman2745975812.05
7Virgil Zeigler-Hill5157274812.05
8Arthur A. Stone3107573812.05
9David P. Schmitt2077871772.05
10Emily A. Impett5497770762.05
11Paula Bressan628270762.05
12Kurt Gray4877969812.05
13Michael E. McCullough3346969782.05
14Kipling D. Williams8437569772.05
15John M. Zelenski1567169762.05
16Amy J. C. Cuddy2128368782.05
17Elke U. Weber3126968770.05
18Hilary B. Bergsieker4396768742.05
19Cameron Anderson6527167743.05
20Rachael E. Jack2497066803.05
21Jamil Zaki4307866763.05
22A. Janet Tomiyama767865763.05
23Benjamin R. Karney3925665733.05
24Phoebe C. Ellsworth6057465723.05
25Jim Sidanius4876965723.05
26Amelie Mummendey4617065723.05
27Carol D. Ryff2808464763.05
28Juliane Degner4356364713.05
29Steven J. Heine5977863773.05
30David M. Amodio5846663703.05
31Thomas N Bradbury3986163693.05
32Elaine Fox4727962783.05
33Miles Hewstone14277062733.05
34Linda R. Tropp3446561803.05
35Rainer Greifeneder9447561773.05
36Klaus Fiedler19507761743.05
37Jesse Graham3777060763.05
38Richard W. Robins2707660704.05
39Simine Vazire1376660644.05
40On Amir2676759884.05
41Edward P. Lemay2898759814.05
42William B. Swann Jr.10707859804.05
43Margaret S. Clark5057559774.05
44Bernhard Leidner7246459654.05
45B. Keith Payne8797158764.05
46Ximena B. Arriaga2846658694.05
47Joris Lammers7286958694.05
48Patricia G. Devine6067158674.05
49Rainer Reisenzein2016557694.05
50Barbara A. Mellers2878056784.05
51Joris Lammers7056956694.05
52Jean M. Twenge3817256594.05
53Nicholas Epley15047455724.05
54Kaiping Peng5667754754.05
55Krishna Savani6387153695.05
56Leslie Ashburn-Nardo1098052835.05
57Lee Jussim2268052715.05
58Richard M. Ryan9987852695.05
59Ethan Kross6146652675.05
60Edward L. Deci2847952635.05
61Roger Giner-Sorolla6638151805.05
62Bertram F. Malle4227351755.05
63George A. Bonanno4797251705.05
64Jens B. Asendorpf2537451695.05
65Samuel D. Gosling1085851625.05
66Tessa V. West6917151595.05
67Paul Rozin4497850845.05
68Joachim I. Krueger4367850815.05
69Sheena S. Iyengar2076350805.05
70James J. Gross11047250775.05
71Mark Rubin3066850755.05
72Pieter Van Dessel5787050755.05
73Shinobu Kitayama9837650715.05
74Matthew J. Hornsey16567450715.05
75Janice R. Kelly3667550705.05
76Antonio L. Freitas2477950645.05
77Paul K. Piff1667750635.05
78Mina Cikara3927149805.05
79Beate Seibt3797249626.01
80Ludwin E. Molina1636949615.05
81Bertram Gawronski18037248766.01
82Penelope Lockwood4587148706.01
83Edward R. Hirt10428148656.01
84Matthew D. Lieberman3987247806.01
85John T. Cacioppo4387647696.01
86Agneta H. Fischer9527547696.01
87Leaf van Boven7117247676.01
88Stephanie A. Fryberg2486247666.01
89Daniel M. Wegner6027647656.01
90Anne E. Wilson7857147646.01
91Rainer Banse4027846726.01
92Alice H. Eagly3307546716.01
93Jeanne L. Tsai12417346676.01
94Jennifer S. Lerner1818046616.01
95Andrea L. Meltzer5495245726.01
96R. Chris Fraley6427045727.01
97Constantine Sedikides25667145706.01
98Paul Slovic3777445706.01
99Dacher Keltner12337245646.01
100Brian A. Nosek8166844817.01
101George Loewenstein7527144727.01
102Ursula Hess7747844717.01
103Jason P. Mitchell6007343737.01
104Jessica L. Tracy6327443717.01
105Charles M. Judd10547643687.01
106S. Alexander Haslam11987243647.01
107Mark Schaller5657343617.01
108Susan T. Fiske9117842747.01
109Lisa Feldman Barrett6446942707.01
110Jolanda Jetten19567342677.01
111Mario Mikulincer9018942647.01
112Bernadette Park9737742647.01
113Paul A. M. Van Lange10927042637.01
114Wendi L. Gardner7986742637.01
115Will M. Gervais1106942597.01
116Jordan B. Peterson2666041797.01
117Philip E. Tetlock5497941737.01
118Amanda B. Diekman4388341707.01
119Daniel H. J. Wigboldus4927641678.01
120Michael Inzlicht6866641638.01
121Naomi Ellemers23887441638.01
122Phillip Atiba Goff2996841627.01
123Stacey Sinclair3277041578.01
124Francesca Gino25217540698.01
125Michael I. Norton11367140698.01
126David J. Hauser1567440688.01
127Elizabeth Page-Gould4115740668.01
128Tiffany A. Ito3498040648.01
129Richard E. Petty27716940648.01
130Tim Wildschut13747340648.01
131Norbert Schwarz13377240638.01
132Veronika Job3627040638.01
133Wendy Wood4627540628.01
134Minah H. Jung1568339838.01
135Marcel Zeelenberg8687639798.01
136Tobias Greitemeyer17377239678.01
137Jason E. Plaks5827039678.01
138Carol S. Dweck10287039638.01
139Christian S. Crandall3627539598.01
140Harry T. Reis9986938749.01
141Vanessa K. Bohns4207738748.01
142Jerry Suls4137138688.01
143Eric D. Knowles3846838648.01
144C. Nathan DeWall13367338639.01
145Clayton R. Critcher6978238639.01
146John F. Dovidio20196938629.01
147Joshua Correll5496138629.01
148Abigail A. Scholer5565838629.01
149Chris Janiszewski1078138589.01
150Herbert Bless5867338579.01
151Mahzarin R. Banaji8807337789.01
152Rolf Reber2806437729.01
153Kevin N. Ochsner4067937709.01
154Mark J. Brandt2777037709.01
155Geoff MacDonald4066737679.01
156Mara Mather10387837679.01
157Antony S. R. Manstead16567237629.01
158Lorne Campbell4336737619.01
159Sanford E. DeVoe2367137619.01
160Ayelet Fishbach14167837599.01
161Fritz Strack6077537569.01
162Jeff T. Larsen18174366710.01
163Nyla R. Branscombe12767036659.01
164Yaacov Schul4116136649.01
165D. S. Moskowitz34187436639.01
166Pablo Brinol13566736629.01
167Todd B. Kashdan3777336619.01
168Barbara L. Fredrickson2877236619.01
169Duane T. Wegener9807736609.01
170Joanne V. Wood10937436609.01
171Daniel A. Effron4846636609.01
172Niall Bolger3766736589.01
173Craig A. Anderson4677636559.01
174Michael Harris Bond37873358410.01
175Glenn Adams27071357310.01
176Daniel M. Bernstein40473357010.01
177C. Miguel Brendl12176356810.01
178Azim F. Sharif18374356810.01
179Emily Balcetis59969356810.01
180Eva Walther49382356610.01
181Michael D. Robinson138878356610.01
182Igor Grossmann20364356610.01
183Diana I. Tamir15662356210.01
184Samuel L. Gaertner32175356110.01
185John T. Jost79470356110.01
186Eric L. Uhlmann45767356110.01
187Nalini Ambady125662355610.01
188Daphna Oyserman44655355410.01
189Victoria M. Esses29575355310.01
190Linda J. Levine49574347810.01
191Wiebke Bleidorn9963347410.01
192Thomas Gilovich119380346910.01
193Alexander J. Rothman13369346510.01
194Francis J. Flynn37872346310.01
195Paula M. Niedenthal52269346110.01
196Ozlem Ayduk54962345910.01
197Paul Ekman8870345510.01
198Alison Ledgerwood21475345410.01
199Christopher R. Agnew32575337610.01
200Michelle N. Shiota24260336311.01
201Malte Friese50161335711.01
202Kerry Kawakami48768335610.01
203Danu Anthony Stinson49477335411.01
204Jennifer A. Richeson83167335211.01
205Margo J. Monteith77376327711.01
206Ulrich Schimmack31875326311.01
207Mark Snyder56272326311.01
208Michele J. Gelfand36576326311.01
209Russell H. Fazio109469326111.01
210Eric van Dijk23867326011.01
211Tom Meyvis37777326011.01
212Eli J. Finkel139262325711.01
213Robert B. Cialdini37972325611.01
214Jonathan W. Kunstman43066325311.01
215Delroy L. Paulhus12177318212.01
216Yuen J. Huo13274318011.01
217Gerd Bohner51371317011.01
218Christopher K. Hsee68975316311.01
219Vivian Zayas25171316012.01
220John A. Bargh65172315512.01
221Tom Pyszczynski94869315412.01
222Roy F. Baumeister244269315212.01
223E. Ashby Plant83177315111.01
224Kathleen D. Vohs94468315112.01
225Jamie Arndt131869315012.01
226Anthony G. Greenwald35772308312.01
227Nicholas O. Rule129468307513.01
228Lauren J. Human44759307012.01
229Jennifer Crocker51568306712.01
230Dale T. Miller52171306412.01
231Thomas W. Schubert35370306012.01
232Joseph A. Vandello49473306012.01
233W. Keith Campbell52870305812.01
234Arthur Aron30765305612.01
235Pamela K. Smith14966305212.01
236Aaron C. Kay132070305112.01
237Steven W. Gangestad19863304113.005
238Eliot R. Smith44579297313.01
239Nir Halevy26268297213.01
240E. Allan Lind37082297213.01
241Richard E. Nisbett31973296913.01
242Hazel Rose Markus67476296813.01
243Emanuele Castano44569296513.01
244Dirk Wentura83065296413.01
245Boris Egloff27481295813.01
246Monica Biernat81377295713.01
247Gordon B. Moskowitz37472295713.01
248Russell Spears228673295513.01
249Jeff Greenberg135877295413.01
250Caryl E. Rusbult21860295413.01
251Naomi I. Eisenberger17974287914.01
252Brent W. Roberts56272287714.01
253Yoav Bar-Anan52575287613.01
254Eddie Harmon-Jones73873287014.01
255Matthew Feinberg29577286914.01
256Roland Neumann25877286713.01
257Eugene M. Caruso82275286413.01
258Ulrich Kuehnen82275286413.01
259Elizabeth W. Dunn39575286414.01
260Jeffry A. Simpson69774285513.01
261Sander L. Koole76765285214.01
262Richard J. Davidson38064285114.01
263Shelly L. Gable36464285014.01
264Adam D. Galinsky215470284913.01
265Grainne M. Fitzsimons58568284914.01
266Geoffrey J. Leonardelli29068284814.005
267Joshua Aronson18385284614.005
268Henk Aarts100367284514.005
269Vanessa K. Bohns42276277415.01
270Jan De Houwer197270277214.01
271Dan Ariely60070276914.01
272Charles Stangor18581276815.01
273Karl Christoph Klauer80167276514.01
274Mario Gollwitzer50058276214.01
275Jennifer S. Beer8056275414.01
276Eldar Shafir10778275114.01
277Guido H. E. Gendolla42276274714.005
278Klaus R. Scherer46783267815.01
279William G. Graziano53271266615.01
280Galen V. Bodenhausen58574266115.01
281Sonja Lyubomirsky53071265915.01
282Kai Sassenberg87271265615.01
283Kristin Laurin64863265115.01
284Claude M. Steele43473264215.005
285David G. Rand39270258115.01
286Paul Bloom50272257916.01
287Kerri L. Johnson53276257615.01
288Batja Mesquita41671257316.01
289Rebecca J. Schlegel26167257115.01
290Phillip R. Shaver56681257116.01
291David Dunning81874257016.01
292Laurie A. Rudman48272256816.01
293David A. Lishner10565256316.01
294Mark J. Landau95078254516.005
295Ronald S. Friedman18379254416.005
296Joel Cooper25772253916.005
297Alison L. Chasteen22368246916.01
298Jeff Galak31373246817.01
299Steven J. Sherman88874246216.01
300Shigehiro Oishi110964246117.01
301Thomas Mussweiler60470244317.005
302Mark W. Baldwin24772244117.005
303Evan P. Apfelbaum25662244117.005
304Nurit Shnabel56476237818.01
305Klaus Rothermund73871237618.01
306Felicia Pratto41073237518.01
307Jonathan Haidt36876237317.01
308Roland Imhoff36574237318.01
309Jeffrey W Sherman99268237117.01
310Jennifer L. Eberhardt20271236218.005
311Bernard A. Nijstad69371235218.005
312Brandon J. Schmeichel65266234517.005
313Sam J. Maglio32572234217.005
314David M. Buss46182228019.01
315Yoel Inbar28067227119.01
316Serena Chen86572226719.005
317Spike W. S. Lee14568226419.005
318Marilynn B. Brewer31475226218.005
319Michael Ross116470226218.005
320Dieter Frey153868225818.005
321G. Daniel Lassiter18982225519.01
322Sean M. McCrea58473225419.005
323Wendy Berry Mendes96568224419.005
324Paul W. Eastwick58365216919.005
325Kees van den Bos115084216920.005
326Maya Tamir134280216419.005
327Joseph P. Forgas88883215919.005
328Michaela Wanke36274215919.005
329Dolores Albarracin54066215620.005
330Elizabeth Levy Paluck3184215520.005
331Vanessa LoBue29968207621.01
332Christopher J. Armitage16062207321.005
333Elizabeth A. Phelps68678207221.005
334Jay J. van Bavel43764207121.005
335David A. Pizarro22771206921.005
336Andrew J. Elliot101881206721.005
337William A. Cunningham23876206422.005
338Laura D. Scherer21269206421.01
339Kentaro Fujita45869206221.005
340Geoffrey L. Cohen159068205021.005
341Ana Guinote37876204721.005
342Tanya L. Chartrand42467203321.001
343Selin Kesebir32866197322.005
344Vincent Y. Yzerbyt141273197322.01
345James K. McNulty104756196523.005
346Robert S. Wyer87182196322.005
347Travis Proulx17463196222.005
348Peter M. Gollwitzer130364195822.005
349Nilanjana Dasgupta38376195222.005
350Jamie L. Goldenberg56877195022.01
351Richard P. Eibach75369194723.001
352Gerald L. Clore45674194522.001
353James M. Tyler13087187424.005
354Roland Deutsch36578187124.005
355Ed Diener49864186824.005
356Kennon M. Sheldon69874186623.005
357Wilhelm Hofmann62467186623.005
358Laura L. Carstensen72377186424.005
359Toni Schmader54669186124.005
360Frank D. Fincham73469185924.005
361David K. Sherman112861185724.005
362Lisa K. Libby41865185424.005
363Chen-Bo Zhong32768184925.005
364Stefan C. Schmukle11462177126.005
365Michel Tuan Pham24686176825.005
366Leandre R. Fabrigar63270176726.005
367Neal J. Roese36864176525.005
368Carey K. Morewedge63376176526.005
369Timothy D. Wilson79865176326.005
370Brad J. Bushman89774176225.005
371Ara Norenzayan22572176125.005
372Benoit Monin63565175625.005
373Michael W. Kraus61772175526.005
374Ad van Knippenberg68372175526.001
375E. Tory. Higgins186868175425.001
376Ap Dijksterhuis75068175426.005
377Joseph Cesario14662174526.001
378Simone Schnall27062173126.001
379Joshua M. Ackerman38053167013.01
380Melissa J. Ferguson116372166927.005
381Laura A. King39176166829.005
382Daniel T. Gilbert72465166527.005
383Charles S. Carver15482166428.005
384Leif D. Nelson40974166428.005
385David DeSteno20183165728.005
386Sandra L. Murray69760165528.001
387Heejung S. Kim85859165529.001
388Mark P. Zanna65964164828.001
389Nira Liberman130475156531.005
390Gun R. Semin15979156429.005
391Tal Eyal43962156229.005
392Nathaniel M Lambert45666155930.001
393Angela L. Duckworth12261155530.005
394Dana R. Carney20060155330.001
395Garriy Shteynberg16854153130.005
396Lee Ross34977146331.001
397Arie W. Kruglanski122878145833.001
398Ziva Kunda21767145631.001
399Shelley E. Taylor42769145231.001
400Jon K. Maner104065145232.001
401Gabriele Oettingen104761144933.001
402Nicole L. Mead24070144633.01
403Gregory M. Walton58769144433.001
404Michael A. Olson34665136335.001
405Fiona Lee22167135834.001
406Melody M. Chao23757135836.001
407Adam L. Alter31478135436.001
408Sarah E. Hill50978135234.001
409Jaime L. Kurtz9155133837.001
410Michael A. Zarate12052133136.001
411Jennifer K. Bosson65976126440.001
412Daniel M. Oppenheimer19880126037.001
413Deborah A. Prentice8980125738.001
414Yaacov Trope127773125738.001
415Oscar Ybarra30563125540.001
416William von Hippel39865124840.001
417Steven J. Spencer54167124438.001
418Martie G. Haselton18673115443.001
419Shelly Chaiken36074115244.001
420Susan M. Andersen36174114843.001
421Dov Cohen64168114441.001
422Mark Muraven49652114441.001
423Ian McGregor40966114041.001
424Hans Ijzerman2145694651.001
425Linda M. Isbell1156494150.001
426Cheryl J. Wakslak2787383559.001

Ioannidis is Wrong Most of the Time

John P. A. Ioannidis is a rock star in the world of science (wikipedia).

By traditional standards of science, he is one of the most prolific and influential scientists alive. He has published over 1,000 articles that have been cited over 100,000 times.

He is best known for the title of his article “Why most published research findings are false” that has been cited nearly 5,000 times. The irony of this title is that it may also apply to Ioannidis, especially because there is a trade-off between quality and quantity in publishing.

Fact Checking Ioannidis

The title of Ioannidis’s article implies a factual statement: “Most published results ARE false.” However, the actual article does not contain empirical data to support this claim. Rather, Ioannidis presents some hypothetical scenarios that show under what conditions published results MAY BE false.

To produce mostly false findings, a literature has to meet two conditions.

First, it has to test mostly false hypotheses.
Second, it has to test hypotheses in studies with low statistical power, that is, a low probability of producing true positive results.

To give a simple example, imagine a field that tests only 10% true hypotheses with just 20% power. As power determines the percentage of true hypotheses that produce a significant result, only 2 of the 10 true hypotheses will be significant. Meanwhile, the alpha criterion of 5% implies that 5% of the false hypotheses will also produce a significant result; that is, about 4.5 of the 90 false hypotheses will be significant. As a result, there will be more than twice as many false positives (4.5 per 100 tests) as true positives (2 per 100 tests).
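
The arithmetic of this example can be checked in a few lines of R (the variable names are ours):

```r
n_tests  <- 100   # hypothetical field: 100 hypothesis tests
true_hyp <- 10    # 10% true hypotheses
power    <- .20   # 20% power
alpha    <- .05

true_pos  <- true_hyp * power               # 2 significant true effects
false_pos <- (n_tests - true_hyp) * alpha   # 4.5 significant false positives

false_pos / (true_pos + false_pos)          # false discovery rate, ~0.69
```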

These relatively simple calculations were well known by 2005 (Soric, 1989). Why, then, did Ioannidis's article have such a big impact? The answer is that Ioannidis convinced many people that his hypothetical examples are realistic and describe most areas of science.

2020 has shown that Ioannidis's claim does not apply to all areas of science. With amazing speed, bio-tech companies were able to produce not just one but several highly effective vaccines. Clearly some sciences are making real progress. On the other hand, other areas of science suggest that Ioannidis's claims were accurate. For example, the whole literature on single-gene variations as predictors of human behavior has produced mostly false claims. Social psychology has a replication crisis in which only 25% of published results could be replicated (OSC, 2015).

Aside from this sporadic and anecdotal evidence, it remains unclear how many false results are published in science as a whole. The reason is that it is impossible to quantify the number of false positive results in science. Fortunately, it is not necessary to know the actual rate of false positives to test Ioannidis’s prediction that most published results are false positives. All we need to know is the discovery rate of a field (Soric, 1989). The discovery rate makes it possible to quantify the maximum percentage of false positive discoveries. If the maximum false discovery rate is well below 50%, we can reject Ioannidis’s hypothesis that most published results are false.

The empirical problem is that the observed discovery rate in a field may be inflated by publication bias. It is therefore necessary to estimate the amount of publication bias and, if bias is present, to correct the discovery rate.

In 2005, Ioannidis and Trikalinos (2005) developed their own test for publication bias, but this test had a number of shortcomings. First, it could be biased in heterogeneous literatures. Second, it required effect sizes to compute power. Third, it only provided information about the presence of publication bias and did not quantify it. Fourth, it did not provide bias-corrected estimates of the true discovery rate.

When the replication crisis became apparent in psychology, I started to develop new bias tests that address these limitations (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012). The newest tool, called z-curve.2.0 (and yes, there is an app for that), overcomes all of the limitations of Ioannidis's approach. Most important, it makes it possible to compute a bias-corrected discovery rate that is called the expected discovery rate. The expected discovery rate can be used to examine and quantify publication bias by comparing it to the observed discovery rate. Moreover, the expected discovery rate can be used to compute the maximum false discovery rate.

The Data

The data were compiled by Simon Schwab from the Cochrane database (https://www.cochrane.org/) that covers results from thousands of clinical trials. The data are publicly available (https://osf.io/xjv9g/) under a CC-By Attribution 4.0 International license (“Re-estimating 400,000 treatment effects from intervention studies in the Cochrane Database of Systematic Reviews”; (see also van Zwet, Schwab, & Senn, 2020).

Studies often report results for several outcomes. I selected only results for the primary outcome. It is often suggested that researchers switch outcomes to produce significant results. Thus, primary outcomes are the most likely to show evidence of publication bias, while secondary outcomes might even be biased to show more negative results for the same reason. The choice of primary outcomes also ensures that the test statistics are statistically independent because they are based on independent samples.

Results

I first fitted the default model to the data. The default model assumes that publication bias is present and only uses statistically significant results to fit the model. Z-curve.2.0 uses a finite mixture model to approximate the observed distribution of z-scores with a limited number of non-centrality parameters. After finding optimal weights for the components, power can be computed as the weighted average of the implied power of the components (Bartos & Schimmack, 2020). Bootstrapping is used to compute 95% confidence intervals that have been shown to have good coverage in simulation studies (Bartos & Schimmack, 2020).

The main finding with the default model is that the model (grey curve) fits the observed distribution of z-scores very well in the range of significant results. However, z-curve has problems extrapolating from significant results to the distribution of non-significant results. In this case, the model (grey curve) underestimates the amount of non-significant results. Thus, there is no evidence of publication bias. This is seen in a comparison of the observed and expected discovery rates. The observed discovery rate of 26% is lower than the expected discovery rate of 38%.

When there is no evidence of publication bias, there is no reason to fit the model only to the significant results. Rather, the model can be fitted to the full distribution of all test statistics. The results are shown in Figure 2.

The key finding for this blog post is that the estimated discovery rate of 27% closely matches the observed discovery rate of 26%. Thus, there is no evidence of publication bias. In this case, simply counting the percentage of significant results provides a valid estimate of the discovery rate in clinical trials. Roughly one-quarter of trials end up with a positive result. The new question is how many of these results might be false positives.

To maximize the rate of false positives, we have to assume that true positives were obtained with maximum power (Soric, 1989). In this scenario, we could get as many as 14% false positive results (about 4 of the 27 discoveries per 100 studies).

Even if we use the upper limit of the 95% confidence interval, we only get 19% false positives. Moreover, it is clear that Soric's (1989) scenario overestimates the false discovery rate because it is unlikely that all tests of true hypotheses have 100% power.
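
The 14% figure is Soric's worst-case bound evaluated at the estimated discovery rate; a quick check in R:

```r
dr    <- .27  # discovery rate estimated from the Cochrane primary outcomes
alpha <- .05

# Soric's (1989) worst case: all true hypotheses tested with 100% power
(1 / dr - 1) * alpha / (1 - alpha)  # maximum false discovery rate, ~0.14
```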

In short, an empirical test of Ioannidis's hypothesis that most published results in science are false shows that this claim is at best a wild overgeneralization. It is not true for clinical trials in medicine. In fact, the real problem is that many clinical trials may be underpowered to detect clinically relevant effects. This can be seen in the estimated replication rate of 61%, which is the mean power of studies with significant results. This estimate of power includes false positives with 5% power. If we assume that 14% of the significant results are false positives, the conditional power based on a true discovery is estimated to be 70% (.14 * .05 + .86 * .70 ≈ .61).

With information about power, we can modify Soric's worst case scenario and change power from 100% to 70%. This has only a small influence on the false discovery rate, which decreases to 11% (about 3 of 27 discoveries). However, the rate of false negatives increases from 0 to 14% (about 10 of 74 non-significant results). This also means that there are now three times as many false negatives as false positives (10 vs. 3).

Even this scenario overestimates the power of studies that produced false negative results because, when power is heterogeneous, the power of studies with significant results is higher than the power of studies that produced non-significant results (Brunner & Schimmack, 2020). In the worst case scenario, the null-hypothesis may rarely be true and the power of studies with non-significant results could be as low as 14.5%. To explain: if we redid all of the studies, we would expect 61% of the significant studies to produce a significant result again, yielding 16.5% significant results. We also expect the discovery rate to be 27% again. Thus, the remaining 73% of studies have to make up the difference between 27% and 16.5%, which is 10.5 percentage points. For 73 studies to produce 10.5 significant results, the studies have to have 14.5% power: 27 = 27 * .61 + 73 * .145.
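
These numbers can be verified with a little arithmetic; the R sketch below uses the reported discovery rate and replication rate (our variable names):

```r
dr  <- .27   # estimated discovery rate in the Cochrane data
err <- .61   # estimated replication rate (mean power of significant results)

# Conditional power of true discoveries, given ~14% false positives:
.14 * .05 + .86 * .70            # ~0.61, matching the replication rate

# Worst case: on exact replication the discovery rate stays at 27%,
# so non-significant studies must supply what significant ones do not.
shortfall <- dr - dr * err       # 27% - 16.5% = 10.5 percentage points
shortfall / (1 - dr)             # implied power of non-significant studies, ~0.145
```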

In short, while Ioannidis predicted that most published results are false positives, it is much more likely that most published results are false negatives. This problem is of course not new. To make conclusions about effectiveness of treatments, medical researchers usually do not rely on a single clinical trial. Rather results of several studies are combined in a meta-analysis. As long as there is no publication bias, meta-analyses of original studies can boost power and reduce the risk of false negative results. It is therefore encouraging that the present results suggest that there is relatively little publication bias in these studies. Additional analyses for subgroups of studies can be conducted, but are beyond the main point of this blog post.

Conclusion

Ioannidis wrote an influential article that used hypothetical scenarios to make the prediction that most published results are false positives. Although this article is often cited as if it contained evidence to support this claim, the article contained no empirical evidence. Surprisingly, there also have been few attempts to test Ioannidis’s claim empirically. Probably the main reason is that nobody knew how to test it. Here I showed a way to test Ioannidis’s claim and I presented clear empirical evidence that contradicts this claim in Ioannidis’s own field of science, namely medicine.

The main feature that distinguishes science and fiction is not that science is always right. Rather, science is superior because proper use of the scientific method allows for science to correct itself, when better data become available. In 2005, Ioannidis had no data and no statistical method to prove his claim. Fifteen years later, we have good data and a scientific method to test his claim. It is time for science to correct itself and to stop making unfounded claims that science is more often wrong than right.

The danger of not trusting science has been on display this year, when millions of Americans ignored good scientific evidence, leading to the unnecessary deaths of many US Americans. So far, 330,000 US Americans are estimated to have died of Covid-19. In a comparable country, Canada, 14,000 people have died so far. To adjust for population, we can compare the number of deaths per million, which is about 1,000 in the USA and 400 in Canada. The unscientific approach to the pandemic in the US may explain some of this discrepancy. Along with the development of vaccines, it is clear that science is not always wrong and can save lives. Ioannidis (2005) made unfounded claims that success stories are the exception rather than the norm. At least in medicine, intervention studies show real successes more often than false ones.

The Covid-19 pandemic provides another example of Ioannidis using off-the-cuff calculations to make big claims without evidence. In a popular article titled “A fiasco in the making?”, he speculated that the Covid-19 virus might be less deadly than the flu and suggested that policies to curb the spread of the virus were irrational.

As the evidence accumulated, it became clear that the Covid-19 virus claims many more lives than the flu, despite the policies that Ioannidis considered irrational. Scientific estimates suggest that Covid-19 is 5 to 10 times more deadly than the flu (BNN), not less deadly as Ioannidis implied. Once more, Ioannidis's quick, unempirical claims were contradicted by hard evidence. It is not clear how many of his other 1,000-plus articles are equally questionable.

To conclude, Ioannidis should be the last one to be surprised that several of his claims are wrong. Why should he be better than other scientists? The question is only how he deals with this information. For science, however, it is not important whether individual scientists correct themselves. Science corrects itself by replacing old, false information with better information. The open question is what science does with false and misleading information that is highly cited.

If YouTube can remove a video with Ioannidis’s false claims about Covid-19 (WP), maybe PLOS Medicine can retract an article with the false claim that “most published results in science are false”.


The attention-grabbing title is simply misleading because nothing in the article supports the claim. Moreover, actual empirical data contradict the claim, at least in some domains. Most claims in science are not false, and in a world of growing science skepticism, spreading false claims about science may be just as deadly as spreading false claims about Covid-19.

If we learned anything from 2020, it is that science and democracy are not perfect, but a lot better than superstition and demagogy.

I wish you all a happier 2021.

Soric’s Maximum False Discovery Rate

Originally published January 31, 2020
Revised December 27, 2020

Psychologists, social scientists, and medical researchers often conduct empirical studies with the goal of demonstrating an effect (e.g., that a drug is effective). They do so by rejecting the null-hypothesis that there is no effect when a test statistic falls into a region of values that is improbable if the null-hypothesis is true (p < .05). This approach is called null-hypothesis significance testing (NHST).
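For readers who want to see the mechanics, here is a minimal R illustration of an NHST decision; the data and group labels are hypothetical:

```r
# Minimal NHST illustration with hypothetical data: reject the null-hypothesis
# of no effect when the p-value of a two-sample t-test falls below .05.
set.seed(1)
treatment <- rnorm(50, mean = 0.5)   # hypothetical treatment group
control   <- rnorm(50, mean = 0.0)   # hypothetical control group
result <- t.test(treatment, control)
result$p.value < .05                 # TRUE would count as a significant "discovery"
```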

The utility of NHST has long been a topic of debate. One of the oldest criticisms of NHST is that the null-hypothesis is likely to be false most of the time (Lykken, 1968). If so, demonstrating a significant result adds little information, while failing to demonstrate one because studies have low power creates false information and confusion.

This changed in the 2000s, when the opinion emerged that most published significant results are false (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). In response, there have been some attempts to estimate the actual number of false positive results (Jager & Leek, 2013). However, there has been surprisingly little progress towards this goal.

One problem for empirical tests of the false discovery rate is that the null-hypothesis is an abstraction. Just as it is impossible to say how many points make up the letter X, it is impossible to count true null-hypotheses because the true population effect size is always unknown (Zhao, 2011, JASA).

An article by Soric (1989, JASA) provides a simple solution to this problem. Although this article was influential in stimulating methods for genome-wide association studies (e.g., Benjamini & Hochberg, 1995, with over 40,000 citations), the article itself has garnered fewer than 100 citations. Yet it provides a simple and attractive way to examine how often researchers may be obtaining significant results when the null-hypothesis is true. Rather than trying to estimate the actual false discovery rate, the method estimates the maximum false discovery rate. If a literature has a low maximum false discovery rate, readers can be assured that most significant results are true positives.

The method is simple because researchers do not have to determine whether a specific finding was a true or false positive result. Rather, the maximum false discovery rate can be computed from the actual discovery rate (i.e., the percentage of significant results for all tests).

The logic of Soric’s (1989) approach is illustrated in Table 1.

        NS     SIG    Total
TRUE     0      60       60
FALSE  760      40      800
Total  760     100      860

Table 1

To maximize the false discovery rate, we make the simplifying assumption that all tests of true hypotheses (i.e., hypotheses for which the null-hypothesis is false) are conducted with 100% power; that is, every test of a true hypothesis produces a significant result. In Table 1, this yields 60 significant results for 60 true hypotheses. The percentage of significant results for false hypotheses (i.e., the null-hypothesis is true) is given by the significance criterion, which is set at the typical level of 5%. This means that, on average, every 20 tests produce 19 non-significant results and one false positive result. In Table 1, this yields 40 false positive results for 800 tests.

In this example, the discovery rate is (40 + 60)/860 = 11.6%. Out of these 100 discoveries, 60 are true discoveries and 40 are false discoveries. Thus, the false discovery rate is 40/100 = 40%.

Soric’s (1989) insight makes it easy to examine empirically whether a literature tests many false hypotheses, because the maximum false discovery rate can be computed with a simple formula from the observed discovery rate, that is, the percentage of significant results. All we need to do is count significant results and use simple math to obtain valuable information about the false discovery risk.
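The formula itself is not written out above, but under the assumptions of Table 1 (100% power for true hypotheses, false positives at the rate alpha) it reduces to maximum FDR = (1/DR - 1) * alpha / (1 - alpha). Below is a minimal R sketch (the function name is my own) that reproduces Table 1 by counting and then checks the count against the closed-form expression:

```r
# Soric's maximum false discovery rate under the assumptions of Table 1:
# 100% power for tests of true hypotheses and false positives at rate alpha
# for tests of false hypotheses. The function name is my own.
soric_max_fdr <- function(dr, alpha = .05) {
  # dr: discovery rate, i.e., the proportion of all tests that are significant
  (1 / dr - 1) * alpha / (1 - alpha)
}

# Reproduce Table 1 by counting: 60 true and 800 false hypotheses.
sig_true  <- 60          # 100% power: every true hypothesis yields a significant result
sig_false <- 800 * .05   # alpha = .05: 40 false positive results
dr  <- (sig_true + sig_false) / (60 + 800)   # 100 / 860 = 11.6%
fdr <- sig_false / (sig_true + sig_false)    # 40 / 100  = 40%

# The closed-form expression returns the same maximum FDR.
soric_max_fdr(dr)        # ~0.40
```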

However, a major problem with Soric’s approach is that the observed discovery rate in a literature may be misleading because journals are more likely to publish significant results than non-significant results. This is known as publication bias or the file-drawer problem (Rosenthal, 1979). In some sciences, publication bias is a big problem. Sterling (1959; see also Sterling et al., 1995) found that the observed discovery rate in psychology is over 90%. Rather than suggesting that psychologists never test false hypotheses, this suggests that publication bias is particularly strong in psychology (Fanelli, 2010). Using these inflated discovery rates to estimate the maximum FDR would severely underestimate the actual risk of false positive results.
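To illustrate with the sketch above, an observed discovery rate of 90% would make the bound look reassuringly small even if the high discovery rate were entirely a product of publication bias:

```r
# An observed discovery rate inflated to 90% by publication bias (Sterling,
# 1959) yields a misleadingly low bound on the false discovery rate.
soric_max_fdr(0.90)   # ~0.006, i.e., a maximum FDR below 1%
```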

Recently, Bartoš and Schimmack (2020) developed a statistical model that can correct for publication bias and produce a bias-corrected estimate of the discovery rate. This is called the expected discovery rate. A comparison of the observed discovery rate (ODR) and the expected discovery rate (EDR) can be used to assess the presence and extent of publication bias. In addition, the EDR can be used to compute Soric’s maximum false discovery rate when publication bias is present and inflates the ODR.
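A sketch of this workflow is shown below, assuming the CRAN package zcurve that accompanies Bartoš and Schimmack (2020); the exact interface may differ from what is shown here, and p_values stands for a vector of extracted two-sided p-values that is not shown:

```r
# Bias-corrected workflow (sketch): estimate the EDR with z-curve and use it,
# rather than the inflated ODR, as the discovery rate in Soric's bound.
# install.packages("zcurve")
library(zcurve)

# p_values: vector of extracted two-sided p-values (assumed to exist)
z <- abs(qnorm(p_values / 2))   # convert p-values to absolute z-scores
fit <- zcurve(z)                # fit the z-curve model
summary(fit)                    # reports the EDR (and ERR) with confidence intervals

# The reported EDR can then be passed to soric_max_fdr() in place of the ODR.
```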

To demonstrate this approach, I use test statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian mixture model to estimate that about 40% of results published in this journal are false positive results. Using Soric’s formula in reverse shows that this estimate implies that cognitive psychologists test fewer than 10% true hypotheses (Table 3; false discovery rate = 72/172 = 42%; see the sketch after Table 3). This is close to Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, and Johannesson’s (2015) estimate of only 9% true hypotheses in cognitive psychology.

         NS     SIG    Total
TRUE      0     100      100
FALSE  1368      72     1440
Total  1368     172     1540

Table 3
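To see where the counts in Table 3 come from, Soric’s logic can be run in reverse: given an assumed false discovery rate and alpha = .05, solve for the implied share of tests of true hypotheses. A minimal R sketch follows (the algebra and function name are my own, but they are consistent with Table 3):

```r
# Reverse application: given an assumed false discovery rate (fdr) and alpha,
# solve for the implied share of tests of true hypotheses, again assuming
# 100% power for tests of true hypotheses.
implied_true_share <- function(fdr, alpha = .05) {
  prop_false <- fdr / (alpha + fdr * (1 - alpha))   # share of tests of false hypotheses
  1 - prop_false                                    # share of tests of true hypotheses
}

implied_true_share(72 / 172)   # ~0.065, i.e., fewer than 10% true hypotheses
```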

These results are implausible because rather different results are obtained when Soric’s method is applied to the results of the Open Science Collaboration (2015) project, which conducted actual replication studies and found that 50% of published significant results could be replicated; that is, they produced a significant result again in the replication study. As there was no publication bias in the replication studies, the ODR of 50% can be used to compute the maximum false discovery rate, which is only 5%. This is much lower than the estimate obtained with Gronau et al.’s (2017) mixture model.
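Applying the same bound to the replication studies (my own use of the sketch above) reproduces this number:

```r
# With no publication bias, the replication rate of 50% can serve as the
# discovery rate, which bounds the false discovery rate at roughly 5%.
soric_max_fdr(0.50)   # ~0.053
```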

I used an R-script to automatically extract test-statistics from articles that were published in Psychonomic Bulletin and Review from 2000 to 2010. I limited the analysis to this period because concerns about replicability and false positives might have changed research practices after 2010. The program extracted 13,571 test statistics.

Figure 1 shows clear evidence of selection bias. The observed discovery rate of 70% is much higher than the expected discovery rate of 35%, and the 95% confidence interval of the EDR, 25% to 53%, does not include the ODR. As a result, the ODR provides an inflated estimate of the actual discovery rate and cannot be used to compute the maximum false discovery rate.

However, even with the much lower expected discovery rate of 35%, the maximum false discovery rate is only 10%. Even with the lower bound of the EDR confidence interval, 25%, the maximum FDR is only 16%.

Figure 2 shows the results for a replication analysis with test statistics from 2011 to 2019. Although changes in research practices could have produced different results, the results are virtually unchanged: the ODR is 69% vs. 70%, the EDR is 38% vs. 35%, and the point estimate of the maximum FDR is 9% vs. 10%. This close replication also implies that research practices in cognitive psychology have not changed much over the past decade.

The maximum FDR estimates of 9% to 10%, based on a much larger sample of test statistics, confirm the results based on the replication rate in a small set of actual replication studies (OSC, 2015). The results also show that Gronau et al.’s mixture model produces dramatically inflated estimates of the false discovery rate (see also Schimmack, 2019, for a detailed discussion of their flawed model).

In contrast to cognitive psychology, social psychology has seen more replication failures. The OSC project estimated a discovery rate of only 25% for social psychology. Even this low rate would imply that a maximum of 16% of discoveries in social psychology are false positives. However, a z-curve analysis of a representative sample of 678 focal tests in social psychology produced an expected discovery rate of 19%, with a 95% confidence interval ranging from 6% to 36% (Schimmack, 2020). The point estimate implies a maximum FDR of 22%, but the lower limit of the confidence interval allows for a maximum FDR of 82%. Thus, social psychology may be a literature in which most published results are false. However, the replication crisis in social psychology should not be generalized to other disciplines.
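The same computation for social psychology (again my own use of the sketch above) reproduces these bounds:

```r
# Social psychology: point estimate and worst case, based on the z-curve
# EDR of 19% with a 95% confidence interval from 6% to 36%.
soric_max_fdr(0.19)   # ~0.22
soric_max_fdr(0.06)   # ~0.82, i.e., most published significant results could be false
```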

Conclusion

Numerous articles have claimed that false discoveries are rampant (Dreber et al., 2015; Gronau et al., 2017; Ioannidis, 2005; Simmons et al., 2011). However, these articles did not provide empirical data to support this claim. In contrast, empirical studies of the false discovery risk usually show much lower rates of false discoveries (Jager & Leek, 2013), but this finding has been dismissed (Ioannidis, 2014) or ignored (Gronau et al., 2017). Here I used a simpler approach to estimate the maximum false discovery rate and showed that most significant results in cognitive psychology are true discoveries. I hope this demonstration revives attempts to estimate the science-wise false discovery rate (Jager & Leek, 2013) rather than relying on hypothetical scenarios or models that reflect researchers’ prior beliefs, which may not match actual data (Gronau et al., 2017; Ioannidis, 2005).

References

Bartoš, F., & Schimmack, U. (2020, January 10). Z-Curve.2.0: Estimating Replication Rates and Discovery Rates. https://doi.org/10.31234/osf.io/urgtn

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343–15347. https://doi.org/10.1073/pnas.1516179112

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLOS ONE, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068

Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from H₀. Journal of Experimental Psychology: General, 146(9), 1223–1233. https://doi.org/10.1037/xge0000324

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLOS Medicine 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. (2014). Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics, 15(1), 28–36. https://doi.org/10.1093/biostatistics/kxt036

Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1–12. https://doi.org/10.1093/biostatistics/kxt007

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, Pt.1), 151–159. https://doi.org/10.1037/h0026141

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1–8.

Schimmack, U. (2019). The Bayesian Mixture Model is fundamentally flawed. https://replicationindex.com/2019/04/01/the-bayesian-mixture-model-is-fundamentally-flawed/

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Soric, B. (1989). Statistical “Discoveries” and Effect-Size Estimation. Journal of the American Statistical Association, 84(406), 608-610. doi:10.2307/2289950

Zhao, Y. (2011). Posterior Probability of Discovery and Expected Rate of Discovery for Multiple Hypothesis Testing and High Throughput Assays. Journal of the American Statistical Association, 106, 984-996, DOI: 10.1198/jasa.2011.tm09737