
Ioannidis is Wrong Most of the Time

John P. A. Ioannidis is a rock star in the world of science (wikipedia).

By traditional standards of science, he is one of the most prolific and influential scientists alive. He has published over 1,000 articles that have been cited over 100,000 times.

He is best known for the title of his article “Why most published research findings are false,” which has been cited nearly 5,000 times. The irony of this title is that it may also apply to Ioannidis, especially because there is a trade-off between quality and quantity in publishing.

Fact Checking Ioannidis

The title of Ioannidis’s article implies a factual statement: “Most published results ARE false.” However, the actual article does not contain empirical data to support this claim. Rather, Ioannidis presents some hypothetical scenarios that show under what conditions published results MAY BE false.

To produce mostly false findings, a literature has to meet two conditions.

First, it has to test mostly false hypotheses.
Second, it has to test hypotheses in studies with low statistical power, that is a low probability of producing true positive results.

To give a simple example, imagine a field that tests only 10% true hypotheses with just 20% power. As power is the probability of a true discovery, only 2 of the 10 true hypotheses will produce a significant result. Meanwhile, the alpha criterion of 5% implies that 4.5 of the 90 false hypotheses will also produce a significant result. As a result, there will be more than twice as many false positives (4.5 per 100 tests) as true positives (2 per 100 tests).
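This arithmetic is easy to verify with a few lines of code (a minimal sketch of the scenario above; the variable names are mine):

```python
# Scenario: 100 hypotheses, 10% true, 20% power, alpha = .05
n_hypotheses = 100
share_true = 0.10
power = 0.20
alpha = 0.05

true_hypotheses = n_hypotheses * share_true          # 10 true hypotheses
false_hypotheses = n_hypotheses - true_hypotheses    # 90 false hypotheses

true_positives = true_hypotheses * power             # significant true results
false_positives = false_hypotheses * alpha           # significant false results

false_discovery_rate = false_positives / (true_positives + false_positives)
print(true_positives, false_positives)        # 2.0 4.5
print(round(false_discovery_rate, 2))         # 0.69 -> most discoveries are false
```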

These relatively simple calculations were well known by 2005 (Soric, 1989). Why, then, did Ioannidis's article have such a big impact? The answer is that Ioannidis convinced many people that his hypothetical examples are realistic and describe most areas of science.

The year 2020 has shown that Ioannidis's claim does not apply to all areas of science. With amazing speed, bio-tech companies were able to develop not just one but several highly effective vaccines. Clearly, some sciences are making real progress. On the other hand, other areas of science suggest that Ioannidis's claims were accurate. For example, the literature on single-gene variations as predictors of human behavior has produced mostly false claims. Social psychology has a replication crisis in which only 25% of published results could be replicated (OSC, 2015).

Aside from this sporadic and anecdotal evidence, it remains unclear how many false results are published in science as a whole. The reason is that it is impossible to quantify the number of false positive results in science. Fortunately, it is not necessary to know the actual rate of false positives to test Ioannidis’s prediction that most published results are false positives. All we need to know is the discovery rate of a field (Soric, 1989). The discovery rate makes it possible to quantify the maximum percentage of false positive discoveries. If the maximum false discovery rate is well below 50%, we can reject Ioannidis’s hypothesis that most published results are false.

The empirical problem is that the observed discovery rate in a field may be inflated by publication bias. It is therefore necessary to estimate the amount of publication bias and, if bias is present, to correct the discovery rate accordingly.

In 2005, Ioannidis and Trikalinos (2005) developed their own test for publication bias, but this test had a number of shortcomings. First, it could be biased in heterogeneous literatures. Second, it required effect sizes to compute power. Third, it only provided information about the presence of publication bias and did not quantify it. Fourth, it did not provide bias-corrected estimates of the true discovery rate.

When the replication crisis became apparent in psychology, I started to develop new bias tests that address these limitations (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012). The newest tool, called z-curve.2.0 (and yes, there is an app for that), overcomes all of the limitations of Ioannidis's approach. Most importantly, it makes it possible to compute a bias-corrected discovery rate, called the expected discovery rate. The expected discovery rate can be used to examine and quantify publication bias by comparing it to the observed discovery rate. Moreover, the expected discovery rate can be used to compute the maximum false discovery rate.

The Data

The data were compiled by Simon Schwab from the Cochrane database (https://www.cochrane.org/) that covers results from thousands of clinical trials. The data are publicly available (https://osf.io/xjv9g/) under a CC-By Attribution 4.0 International license (“Re-estimating 400,000 treatment effects from intervention studies in the Cochrane Database of Systematic Reviews”; see also van Zwet, Schwab, & Senn, 2020).

Studies often report results for several outcomes. I selected only results for the primary outcome. It is often suggested that researchers switch outcomes to produce significant results. Thus, primary outcomes are the most likely to show evidence of publication bias, while secondary outcomes might even be biased to show more negative results for the same reason. The choice of primary outcomes also ensures that the test statistics are statistically independent because they are based on independent samples.

Results

I first fitted the default model to the data. The default model assumes that publication bias is present and only uses statistically significant results to fit the model. Z-curve.2.0 uses a finite mixture model to approximate the observed distribution of z-scores with a limited number of non-centrality parameters. After finding optimal weights for the components, power can be computed as the weighted average of the implied power of the components (Bartos & Schimmack, 2020). Bootstrapping is used to compute 95% confidence intervals that have been shown to have good coverage in simulation studies (Bartos & Schimmack, 2020).

The main finding with the default model is that the model (grey curve) fits the observed distribution of z-scores very well in the range of significant results. However, z-curve has problems extrapolating from significant results to the distribution of non-significant results. In this case, the model (grey curve) underestimates the number of non-significant results. Thus, there is no evidence of publication bias. This can be seen in a comparison of the observed and expected discovery rates: the observed discovery rate of 26% is lower than the expected discovery rate of 38%.

When there is no evidence of publication bias, there is no reason to fit the model only to the significant results. Rather, the model can be fitted to the full distribution of all test statistics. The results are shown in Figure 2.

The key finding for this blog post is that the estimated discovery rate of 27% closely matches the observed discovery rate of 26%. Thus, there is no evidence of publication bias. In this case, simply counting the percentage of significant results provides a valid estimate of the discovery rate in clinical trials. Roughly one-quarter of trials end up with a positive result. The new question is how many of these results might be false positives.

To maximize the rate of false positives, we have to assume that true positives were obtained with maximum power (Soric, 1989). In this scenario, we could get as many as 14% (4 over 27) false positive results.

Even if we use the upper limit of the 95% confidence interval, we only get 19% false positives. Moreover, it is clear that Soric's (1989) scenario overestimates the false discovery rate because it is unlikely that all tests of true hypotheses have 100% power.
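Soric's (1989) worst case can be written as a one-line formula: with all true hypotheses tested at 100% power, the maximum false discovery rate depends only on the discovery rate and alpha. A minimal sketch (the function name is mine):

```python
def max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) maximum false discovery rate: the worst case in which
    every true hypothesis is tested with 100% power."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# Discovery rate of 27% in the Cochrane data
print(round(max_fdr(0.27), 2))  # 0.14 -> at most ~14% false positives
```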

In short, an empirical test of Ioannidis's hypothesis that most published results in science are false shows that this claim is at best a wild overgeneralization. It is not true for clinical trials in medicine. In fact, the real problem is that many clinical trials may be underpowered to detect clinically relevant effects. This can be seen in the estimated replication rate of 61%, which is the mean power of studies with significant results. This estimate includes false positives, which have only 5% power. If we assume that 14% of the significant results are false positives, the conditional power of the true discoveries is estimated to be 70% (.14 * .05 + .86 * .70 = .61).
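The decomposition of the replication rate can be checked by solving for the conditional power of the true discoveries (a sketch using the rounded estimates from above):

```python
# Mean power of significant results (the replication rate) is a weighted average
# of the power of false positives (= alpha) and the power of true positives.
replication_rate = 0.61  # z-curve estimate
fdr = 0.14               # assumed share of false positives among significant results
alpha = 0.05             # a false positive replicates at the rate alpha

# replication_rate = fdr * alpha + (1 - fdr) * power_true -> solve for power_true
power_true = (replication_rate - fdr * alpha) / (1 - fdr)
print(round(power_true, 2))  # 0.7
```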

With information about power, we can modify Soric's worst case scenario and change power from 100% to 70%. This has only a small influence on the false positive discovery rate, which decreases to 11% (3 over 27). However, the rate of false negatives increases from 0 to 14% (10 over 73). This also means that there are now three times more false negatives than false positives (10 over 3).

Even this scenario overestimates the power of studies that produced non-significant results because, when power is heterogeneous, the power of studies with significant results is higher than the power of studies with non-significant results (Brunner & Schimmack, 2020). In the worst case scenario, the null-hypothesis may rarely be true and the power of studies with non-significant results could be as low as 14.5%. To explain: if we redo all of the studies, we expect that 61% of the significant studies will produce a significant result again, yielding 16.5% significant results (27 * .61). We also expect the discovery rate to be 27% again. Thus, the remaining 73% of studies have to make up the difference between 27% and 16.5%, which is 10.5%. For 73 studies to produce 10.5 significant results, the studies have to have 14.5% power (27 = 27 * .61 + 73 * .145).
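The worst-case power of the non-significant studies follows from the same bookkeeping (a sketch using the rounded estimates above):

```python
# If all studies were redone, significant studies replicate at the replication
# rate, and the overall discovery rate must come out at 27% again. The
# non-significant studies have to make up the difference.
discovery_rate = 0.27
replication_rate = 0.61

sig_again = discovery_rate * replication_rate            # ~16.5% of all studies
power_nonsig = (discovery_rate - sig_again) / (1 - discovery_rate)
print(round(power_nonsig, 3))  # ~0.144, i.e., roughly 14.5% power
```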

In short, while Ioannidis predicted that most published results are false positives, it is much more likely that most published results are false negatives. This problem is of course not new. To make conclusions about effectiveness of treatments, medical researchers usually do not rely on a single clinical trial. Rather results of several studies are combined in a meta-analysis. As long as there is no publication bias, meta-analyses of original studies can boost power and reduce the risk of false negative results. It is therefore encouraging that the present results suggest that there is relatively little publication bias in these studies. Additional analyses for subgroups of studies can be conducted, but are beyond the main point of this blog post.

Conclusion

Ioannidis wrote an influential article that used hypothetical scenarios to make the prediction that most published results are false positives. Although this article is often cited as if it contained evidence to support this claim, the article contained no empirical evidence. Surprisingly, there have also been few attempts to test Ioannidis's claim empirically. Probably the main reason is that nobody knew how to test it. Here I showed a way to test Ioannidis's claim and presented clear empirical evidence that contradicts it in Ioannidis's own field of science, namely medicine.

The main feature that distinguishes science and fiction is not that science is always right. Rather, science is superior because proper use of the scientific method allows for science to correct itself, when better data become available. In 2005, Ioannidis had no data and no statistical method to prove his claim. Fifteen years later, we have good data and a scientific method to test his claim. It is time for science to correct itself and to stop making unfounded claims that science is more often wrong than right.

The danger of not trusting science has been on display this year, when millions of Americans ignored good scientific evidence, leading to many unnecessary deaths. So far, 330,000 US Americans are estimated to have died of Covid-19. In Canada, a comparable country, 14,000 people have died so far. Adjusted for population, this is about 1,000 deaths per million in the USA versus 400 per million in Canada. The unscientific approach to the pandemic in the US may explain some of this discrepancy. Along with the development of vaccines, it is clear that science is not always wrong and can save lives. Ioannidis (2005) made unfounded claims that success stories are the exception rather than the norm. At least in medicine, intervention studies show real successes more often than false ones.

The Covid-19 pandemic also provides another example where Ioannidis used off-the-cuff calculations to make big claims without any evidence. In a popular article titled “A fiasco in the making” he speculated that the Covid-19 virus might be less deadly than the flu and suggested that policies to curb the spread of the virus were irrational.

As the evidence accumulated, it became clear that the Covid-19 virus is claiming many more lives than the flu, despite policies that Ioannidis considered to be irrational. Scientific estimates suggest that Covid-19 is 5 to 10 times more deadly than the flu (BNN), not less deadly as Ioannidis implied. Once more, Ioannidis's quick, unempirical claims were contradicted by hard evidence. It is not clear how many of his other 1,000-plus articles are equally questionable.

To conclude, Ioannidis should be the last one to be surprised that several of his claims are wrong. Why should he be better than other scientists? The question is only how he deals with this information. For science, however, it does not matter whether individual scientists correct themselves. Science corrects itself by replacing old, false information with better information. The question is what science does with false and misleading information that is highly cited.

If YouTube can remove a video with Ioannidis’s false claims about Covid-19 (WP), maybe PLOS Medicine can retract an article with the false claim that “most published results in science are false”.


The attention-grabbing title is simply misleading because nothing in the article supports the claim. Moreover, actual empirical data contradict the claim, at least in some domains. Most claims in science are not false, and in a world with growing science skepticism, spreading false claims about science may be just as deadly as spreading false claims about Covid-19.

If we learned anything from 2020, it is that science and democracy are not perfect, but a lot better than superstition and demagogy.

I wish you all a happier 2021.

Soric’s Maximum False Discovery Rate

Originally published January 31, 2020
Revised December 27, 2020

Psychologists, social scientists, and medical researchers often conduct empirical studies with the goal of demonstrating an effect (e.g., a drug is effective). They do so by rejecting the null-hypothesis that there is no effect when a test statistic falls into a region of improbable test-statistics, p < .05. This is called null-hypothesis significance testing (NHST).

The utility of NHST has been a topic of debate. One of the oldest criticisms of NHST is that the null-hypothesis is likely to be false most of the time (Lykken, 1968). As a result, demonstrating a significant result adds little information, while failing to do so because studies have low power creates false information and confusion.

This changed in the 2000s, when the opinion emerged that most published significant results are false (Ioannidis, 2005; Simmons, Nelson, & Simonsohn, 2011). In response, there have been some attempts to estimate the actual number of false positive results (Jager & Leek, 2013). However, there has been surprisingly little progress towards this goal.

One problem for empirical tests of the false discovery rate is that the null-hypothesis is an abstraction. Just as it is impossible to say how many points make up the letter X, it is impossible to count null-hypotheses because the true population effect size is always unknown (Zhao, 2011, JASA).

An article by Soric (1989, JASA) provides a simple solution to this problem. Although this article was influential in stimulating methods for genome-wide association studies (Benjamini & Hochberg, 1995, has over 40,000 citations), the article itself has garnered fewer than 100 citations. Yet it provides a simple and attractive way to examine how often researchers may be obtaining significant results when the null-hypothesis is true. Rather than trying to estimate the actual false discovery rate, the method estimates the maximum false discovery rate. If a literature has a low maximum false discovery rate, readers can be assured that most significant results are true positives.

The method is simple because researchers do not have to determine whether a specific finding was a true or false positive result. Rather, the maximum false discovery rate can be computed from the actual discovery rate (i.e., the percentage of significant results for all tests).

The logic of Soric's (1989) approach is illustrated in Table 1.

            NS     SIG   Total
TRUE         0      60      60
FALSE      760      40     800
Total      760     100     860

Table 1

To maximize the false discovery rate, we make the simplifying assumption that all tests of true hypotheses (i.e., the null-hypothesis is false) are conducted with 100% power (i.e., all tests of true hypotheses produce a significant result). In Table 1, this leads to 60 significant results for 60 true hypotheses. The percentage of significant results for false hypotheses (i.e., the null-hypothesis is true) is given by the significance criterion, which is set at the typical level of 5%. This means that for every 20 tests, there are 19 non-significant results and one false positive result. In Table 1 this leads to 40 false positive results for 800 tests.

In this example, the discovery rate is (40 + 60)/860 = 11.6%. Out of these 100 discoveries, 60 are true discoveries and 40 are false discoveries. Thus, the false discovery rate is 40/100 = 40%.

Soric’s (1989) insight makes it easy to examine empirically whether a literature tests many false hypotheses, using a simple formula to compute the maximum false discovery rate from the observed discovery rate; that is, the percentage of significant results. All we need to do is count and use simple math to obtain valuable information about the false discovery rate.
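The "simple math" can be made explicit. With discovery rate DR and significance criterion alpha, Soric's maximum false discovery rate is (1/DR - 1) * alpha/(1 - alpha). Applied to Table 1 (a sketch; the function name is mine):

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Maximum false discovery rate if all true hypotheses were tested with
    100% power (Soric, 1989)."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# Table 1: 100 significant results out of 860 tests
print(round(soric_max_fdr(100 / 860), 2))  # 0.4, the 40% computed above
```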

However, a major problem with Soric's approach is that the observed discovery rate in a literature may be misleading because journals are more likely to publish significant results than non-significant results. This is known as publication bias or the file-drawer problem (Rosenthal, 1979). In some sciences, publication bias is a big problem. Sterling (1959; see also Sterling et al., 1995) found that the observed discovery rate in psychology is over 90%. Rather than suggesting that psychologists never test false hypotheses, this suggests that publication bias is particularly strong in psychology (Fanelli, 2010). Using these inflated discovery rates to estimate the maximum FDR would severely underestimate the actual risk of false positive results.

Recently, Bartoš and Schimmack (2020) developed a statistical model that can correct for publication bias and produce a bias-corrected estimate of the discovery rate. This is called the expected discovery rate. A comparison of the observed discovery rate (ODR) and the expected discovery rate (EDR) can be used to assess the presence and extent of publication bias. In addition, the EDR can be used to compute Soric’s maximum false discovery rate when publication bias is present and inflates the ODR.

To demonstrate this approach, I use test-statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian mixture model to estimate that about 40% of results published in this journal are false positive results. Using Soric's formula in reverse shows that this estimate implies that cognitive psychologists test only about 10% true hypotheses (Table 3; 72/172 = 42%). This is close to Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, and Johannesson's (2015) estimate of only 9% true hypotheses in cognitive psychology.

            NS     SIG   Total
TRUE         0     100     100
FALSE     1368      72    1440
Total     1368     172    1540

Table 3

These results are implausible because rather different results are obtained when Soric's method is applied to the results of the Open Science Collaboration (2015) project, which conducted actual replication studies and found that 50% of published significant results could be replicated; that is, they produced a significant result again in the replication study. As there was no publication bias in the replication studies, the ODR of 50% can be used to compute the maximum false discovery rate, which is only 5%. This is much lower than the estimate obtained with Gronau et al.'s (2017) mixture model.

I used an R-script to automatically extract test-statistics from articles that were published in Psychonomic Bulletin and Review from 2000 to 2010. I limited the analysis to this period because concerns about replicability and false positives might have changed research practices after 2010. The program extracted 13,571 test statistics.

Figure 1 shows clear evidence of selection bias. The observed discovery rate of 70% is much higher than the expected discovery rate of 35%, and the 95% CI of the EDR, 25% to 53%, does not include the ODR. As a result, the ODR is an inflated estimate of the actual discovery rate and cannot be used to compute the maximum false discovery rate.

However, even with the much lower expected discovery rate of 35%, the maximum false discovery rate is only 10%. Even with the lower bound of the confidence interval for the EDR of 25%, the maximum FDR is only 16%.

Figure 2 shows the results for a replication with test statistics from 2011 to 2019. Although changes in research practices could have produced different results, the results are unchanged. The ODR is 69% vs. 70%; the EDR is 38% vs. 35% and the point estimate of the maximum FDR is 9% vs. 10%. This close replication also implies that research practices in cognitive psychology have not changed over the past decade.

The maximum FDR estimate of 10% confirms the results based on the replication rate in a small set of actual replication studies (OSC, 2015) with a much larger sample of test statistics. The results also show that Gronau et al.'s mixture model produces dramatically inflated estimates of the false discovery rate (see also Schimmack, 2019, for a detailed discussion of their flawed model).

In contrast to cognitive psychology, social psychology has seen more replication failures. The OSC project estimated a discovery rate of only 25%. Even this low rate would imply that a maximum of 16% of discoveries in social psychology are false positives. A z-curve analysis of a representative sample of 678 focal tests in social psychology produced an estimated discovery rate of 19% with a 95%CI ranging from 6% to 36% (Schimmack, 2020). The point estimate implies a maximum FDR of 22%, but the lower limit of the confidence interval allows for a maximum FDR of 82%. Thus, social psychology may be a literature where most published results are false. However, the replication crisis in social psychology should not be generalized to other disciplines.

Conclusion

Numerous articles have made claims that false discoveries are rampant (Dreber et al., 2015; Gronau et al., 2017; Ioannidis, 2005; Simmons et al., 2011). However, these articles did not provide empirical data to support their claims. In contrast, empirical studies of the false discovery risk usually show much lower rates of false discoveries (Jager & Leek, 2013), but this finding has been dismissed (Ioannidis, 2014) or ignored (Gronau et al., 2018). Here I used a simpler approach to estimate the maximum false discovery rate and showed that most significant results in cognitive psychology are true discoveries. I hope that this demonstration revives attempts to estimate the science-wise false discovery rate (Jager & Leek, 2013) rather than relying on hypothetical scenarios or models that reflect researchers' prior beliefs that may not match actual data (Gronau et al., 2018; Ioannidis, 2005).

References

Bartoš, F., & Schimmack, U. (2020, January 10). Z-Curve.2.0: Estimating Replication Rates and Discovery Rates. https://doi.org/10.31234/osf.io/urgtn

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. Proceedings of the National Academy of Sciences, 112(50), 15343-15347. https://doi.org/10.1073/pnas.1516179112

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLOS ONE, 5(4), e10068. https://doi.org/10.1371/journal.pone.0010068

Gronau, Q. F., Duizer, M., Bakker, M., & Wagenmakers, E.-J. (2017). Bayesian mixture modeling of significant p values: A meta-analytic method to estimate the degree of contamination from H₀. Journal of Experimental Psychology: General, 146(9), 1223–1233. https://doi.org/10.1037/xge0000324

Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLOS Medicine 2(8): e124. https://doi.org/10.1371/journal.pmed.0020124

Ioannidis, J. P. (2014). Why “An estimate of the science-wise false discovery rate and application to the top medical literature” is false. Biostatistics, 15(1), 28-36. https://doi.org/10.1093/biostatistics/kxt036

Jager, L. R., & Leek, J. T. (2014). An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics, 15(1), 1-12. https://doi.org/10.1093/biostatistics/kxt007

Lykken, D. T. (1968). Statistical significance in psychological research. Psychological Bulletin, 70(3, Pt.1), 151–159. https://doi.org/10.1037/h0026141

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1–8.

Schimmack, U. (2019). The Bayesian Mixture Model is fundamentally flawed. https://replicationindex.com/2019/04/01/the-bayesian-mixture-model-is-fundamentally-flawed/

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Soric, B. (1989). Statistical “Discoveries” and Effect-Size Estimation. Journal of the American Statistical Association, 84(406), 608-610. doi:10.2307/2289950

Zhao, Y. (2011). Posterior Probability of Discovery and Expected Rate of Discovery for Multiple Hypothesis Testing and High Throughput Assays. Journal of the American Statistical Association, 106, 984-996, DOI: 10.1198/jasa.2011.tm09737

Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations make questionable assumptions and demonstrates with empirical data that Ioannidis's simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

TP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the percentage of true hypotheses (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error probability (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))
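Equations (1) to (4) are easy to put into a small function (a sketch; the names are mine):

```python
def ppv(pth, power, alpha=0.05):
    """Positive predictive value: the share of significant results that are
    true positives (equation 4)."""
    true_pos = pth * power          # equation (2)
    false_pos = (1 - pth) * alpha   # equation (3)
    return true_pos / (true_pos + false_pos)

# Example: 10% true hypotheses tested with 20% power
print(round(ppv(0.10, 0.20), 2))  # 0.31 -> most significant results are false
```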

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true in more than 50% of published results that falsely rejected it.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplified to the inequality

(6) alpha > PTH/PFH * power

We can rearrange formula (6), substituting PFH with (1 – PTH), to determine the maximum proportion of true hypotheses that still produces more than 50% false positive results.

(7a) alpha = PTH/(1 – PTH) * power

(7b) alpha * (1 – PTH) = PTH * power

(7c) alpha – PTH * alpha = PTH * power

(7d) alpha = PTH * alpha + PTH * power

(7e) alpha = PTH * (alpha + power)

(7f) PTH = alpha/(alpha + power)

Table 1 shows the results.

Power      PTH / PFH
90%         5 / 95
80%         6 / 94
70%         7 / 93
60%         8 / 92
50%         9 / 91
40%        11 / 89
30%        14 / 86
20%        20 / 80
10%        33 / 67
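Table 1 can be reproduced directly from equation (7f) (a short sketch):

```python
# PTH = alpha / (alpha + power): the largest proportion of true hypotheses
# at which false positives still outnumber true positives.
alpha = 0.05
powers = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
pth_table = {p: alpha / (alpha + p) for p in powers}
for p, pth in pth_table.items():
    print(f"power {p:.0%}: PTH/PFH = {pth:.0%} / {1 - pth:.0%}")
```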

Even if researchers conducted studies with only 20% power to detect true positive results, we would only obtain more than 50% false positive results if fewer than 20% of tested hypotheses were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced by various questionable research practices that help researchers report significant results. The main effect of these practices is that they increase the probability that a false positive result becomes significant.

Simmons et al. (2011) showed that massive use of several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 60%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power    PTH / PFH
90%      40 / 60
80%      43 / 57
70%      46 / 54
60%      50 / 50
50%      55 / 45
40%      60 / 40
30%      67 / 33
20%      75 / 25
10%      86 / 14

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypotheses with 50% power and 50% of the tested hypotheses were false.
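These numbers can be checked by computing the PPV directly from the three quantities involved; a minimal sketch (the function name is mine):

```python
def ppv(pth, power, alpha):
    """Positive predictive value: the share of significant results that are true positives."""
    tp = pth * power          # rate of true positives
    fp = (1 - pth) * alpha    # rate of false positives
    return tp / (tp + fp)

# With a bias-inflated alpha of .60 and 50% power, 50% true hypotheses
# already yield more false than true positives:
print(round(ppv(0.50, 0.50, 0.60), 2))  # ~0.45, i.e., a majority of false positives
```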

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypotheses and false hypotheses. This assumption is questionable because it is easier to produce a significant result when an effect exists than when no effect exists. Ioannidis's assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypotheses produce a significant result. However, with a bias factor of .4, 40% of the false negative results become significant as well, adding .4*.5 = 20 percentage points to the rate of true positive results. This gives a total of 70% positive results, a 40% relative increase over the rate that would have been obtained without bias. This increase in true positive results pales in comparison to the effect that the same bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38 percentage points of false positive results. Thus, bias increases the rate of false positive results from 5% to 43%, a 760% relative increase. The effect of bias on the PPV is therefore not equal: a 40% bias inflates false positives much more strongly than true positives. Ioannidis provides no rationale for this bias model.
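The arithmetic in this example is easy to verify:

```python
# Verifying the worked example: bias converts a fraction `bias` of the
# remaining negatives into positives, for true and false hypotheses alike.
power, alpha, bias = 0.50, 0.05, 0.40

tp_rate = power + bias * (1 - power)  # 0.50 + 0.40*0.50 = 0.70 (was 0.50)
fp_rate = alpha + bias * (1 - alpha)  # 0.05 + 0.40*0.95 = 0.43 (was 0.05)

print(tp_rate, round(tp_rate / power - 1, 2))   # 40% relative increase
print(fp_rate, round(fp_rate / alpha - 1, 2))   # 760% relative increase
```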

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that, for every 1,000 studies, bias adds only 1000*.30*.10*.20 = 6 true positive results compared to 1000*.30*.90*.95 = 257 false positive results (roughly a 43:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that "simulations show that for most study designs and settings, it is more likely for a research claim to be false than true." (e124).
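Under the bias model described above, where a fraction u of the remaining negatives becomes significant, the PPV can be computed directly; a minimal sketch (the function name is mine) reproduces both numbers:

```python
def ppv_with_bias(pth, power, alpha, u):
    """PPV when a fraction u of negatives is converted into positives
    (the bias model as described in the text)."""
    tp = pth * (power + u * (1 - power))          # true positive rate
    fp = (1 - pth) * (alpha + u * (1 - alpha))    # false positive rate
    return tp / (tp + fp)

pth = 1 / 11  # 1 out of 11 hypotheses is true
print(round(ppv_with_bias(pth, 0.80, 0.05, 0.00), 2))  # ~0.62 without bias
print(round(ppv_with_bias(pth, 0.80, 0.05, 0.30), 2))  # ~0.20 with 30% bias
```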

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis's claims rest on the opposite assumption that most hypotheses are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, with 50% power even massive bias would still produce more true than false positive results, as long as more than 55% of the tested hypotheses are true.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.
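For a reported two-sided p-value, the conversion to a z-value amounts to an inverse-normal transformation; a minimal sketch using Python's standard library (the exact conversion depends on the reported test statistic):

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a two-sided p-value to the corresponding absolute z-score."""
    return NormalDist().inv_cdf(1 - p / 2)

print(round(p_to_z(0.05), 2))   # 1.96: the conventional significance criterion
print(round(p_to_z(0.001), 2))  # 3.29
```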

Figure 1 illustrates the distribution of z-values that is expected for Ioannidis's model of an "adequately powered exploratory epidemiological study" (Simulation 6 in Table 4). Ioannidis assumes that for every true hypothesis, there are 10 false hypotheses (R = 1:10). He also assumed that studies have 80% power to detect a true positive. In addition, he assumed 30% bias.

[Figure 1: ioannidis-fig6]

A 30% bias implies that for every 100 false hypotheses, there would be 33.5 (100*[.05 + .30*.95]) rather than 5 false positive results. The effect on true hypotheses is much smaller: with 80% power, the rate of positive results rises from 80 to 86 per 100 (100*[.80 + .30*.20]). Given the assumed 1:10 ratio of true to false hypotheses, this yields 335 false positives per 1,000 false hypotheses and 86 true positives per 100 true hypotheses. The simulation assumed that researchers tested 100,000 false hypotheses and observed 33,500 false positive results, and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of tests until the predicted ratio of true and false positive results was obtained.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values fall in the range between 1.96 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than .001, even if they were reported as significant with p < .05.
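The 20% figure can be verified from the simulation's own assumptions: significant results are a mixture of true positives that replicate with 80% power and false positives that replicate at the 5% alpha level.

```python
# Average replication probability of significant results in Simulation 6:
# rates per hypothesis follow from 80% power, 5% alpha, and 30% bias.
tp = 100 * 0.86    # true positives per 100 true hypotheses
fp = 1000 * 0.335  # false positives per 1,000 false hypotheses (1:10 ratio)

share_tp = tp / (tp + fp)  # share of significant results that are true positives
avg_replicability = share_tp * 0.80 + (1 - share_tp) * 0.05
print(round(avg_replicability, 2))  # ~0.20
```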

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

[Figure 2: ioannidis-fig1]

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false. The maximum value that could be obtained with a PPV of 50% and 100% power for the true positive results is a replicability estimate of .05*.50 + 1*.50 = 52.5%. As power is much lower than 100% in practice, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from false negatives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicability that could be obtained with 50% false positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%. However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power, 100% bias, and an equal percentage of true and false hypotheses. The true replicability for this scenario is .05*.50 + .90*.50 = 47.5%. Z-curve slightly overestimates replicability and produced an estimate of 51%. Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis's hypothesis that most published positive results are false. Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicability estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggests that there is a high proportion of studies that produced true positive results with low power.

[Figure 3: ioannidis-fig3]

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicability Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

[Powergraph for the Journal of Experimental Psychology: Learning, Memory and Cognition (JEP-LMC3.g)]

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

[Powergraph for Psychological Science (PsySci3.g)]

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (.05 > p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2, which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

[Powergraph for the Journal of Personality and Social Psychology: Attitudes and Social Cognition (JPSP-ASC3.g)]

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim applies only to tests of an original, novel, or important finding, whereas articles often also report significance tests for other effects. For example, an intervention study may show a strong overall decrease in depression, when only the interaction of time with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

 

Power Failure in Neuroscience

Original: December 5, 2014
Revised: December 28, 2020


An article in Nature Reviews Neuroscience suggested that the median power in neuroscience studies is just 21% (Katherine S. Button, John P. A. Ioannidis, Claire Mokrysz, Brian A. Nosek, Jonathan Flint, Emma S. J. Robinson, and Marcus R. Munafò, 2013).

The authors of this article examined meta-analyses of primary studies in neuroscience that were published in 2011. They analyzed 49 meta-analyses that were based on a total of 730 original studies (on average, 15 studies per meta-analysis, range 2 to 57).

For each primary study, the authors computed observed power based on the sample size and the estimated effect size in the meta-analysis.

Based on their analyses, the authors concluded that the median power in neuroscience is 21%.

There is a major problem with this estimate that the authors overlooked. The power estimate is implausibly low because a median observed power of 21% corresponds to a median p-value of p = .25. If median power were 21%, more than 50% of the original studies in the meta-analyses would have reported a non-significant result (p > .05). This seems rather unlikely because journals tend to publish mostly significant results.
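The correspondence between observed power and the p-value can be checked with an inverse-normal calculation (a one-sided approximation that assumes z-tests):

```python
from statistics import NormalDist

nd = NormalDist()

def observed_power(p, criterion=1.96):
    """Observed power implied by a two-sided p-value (one-sided approximation)."""
    z = nd.inv_cdf(1 - p / 2)        # observed z-score for the p-value
    return 1 - nd.cdf(criterion - z) # probability of exceeding the criterion

print(round(observed_power(0.25), 2))  # ~0.21: p = .25 corresponds to 21% power
print(round(observed_power(0.05), 2))  # 0.50: p = .05 corresponds to 50% power
```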

The estimate is even less plausible because it is based on meta-analytic averages without any correction for bias. These effect sizes are likely to be inflated, which means that the median power estimate is inflated as well. Thus, true power would be even lower than 21% and even more results would be non-significant.

What could explain this implausible result?

  1. A meta-analysis includes published and unpublished studies. It is possible that the published studies reported significant results with observed power greater than 50% (p < .05) and the unpublished studies reported non-significant results with power less than 50%. However, this would imply that meta-analysts were able to retrieve as many unpublished studies as published studies. The authors did not report whether power of published and unpublished studies differed.
  2. A second possibility is that the power analyses produced false results. The authors relied on Ioannidis and Trikalinos’s (2007) approach to the estimation of power. This approach assumes that studies in a meta-analysis have the same true effect size and that the meta-analytic average (weighted mean) provides the best estimate of the true effect size. This estimate of the true effect size is then used to estimate power in individual studies based on the sample size of the study. As already noted by Ioannidis and Trikalinos (2007), this approach can produce biased results when effect sizes in a meta-analysis are heterogeneous.
  3. Estimating power simply on the basis of effect size and sample size can be misleading when the design is not a simple comparison of two groups. Between-subject designs are common in animal studies in neuroscience. However, many fMRI studies use within-subject designs that achieve high statistical power with a few participants because participants serve as their own controls.

Schimmack (2012) proposed an alternative procedure that does not have this limitation. Power is estimated individually for each study based on the observed effect size in this study. This approach makes it possible to estimate median power for heterogeneous sets of studies with different effect sizes. Moreover, this approach makes it possible to compute power when power is not simply a function of sample size and effect size (e.g., within-subject designs).
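A minimal sketch of this idea, with the observed z-score standing in for each study's own effect size and sample size (the z-values and the function name here are hypothetical illustrations, not data from the article):

```python
from statistics import NormalDist, median

nd = NormalDist()

def study_power(z_observed, criterion=1.96):
    """Observed power of a single study, estimated from its own observed z-score."""
    return 1 - nd.cdf(criterion - z_observed)

z_values = [2.3, 1.8, 3.1, 2.0, 2.6]  # hypothetical observed z-scores
powers = [study_power(z) for z in z_values]
print(round(median(powers), 2))  # median observed power across studies
```

Because each study contributes its own power estimate, the median is meaningful even when effect sizes are heterogeneous across studies.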

R-Index of Nature Neuroscience: Analysis

To examine the replicability of research published in Nature Neuroscience, I retrieved the most cited articles in this journal until I had a sample of 20 studies. I needed 14 articles to meet this goal. The number of studies in these articles ranged from 1 to 7.

The success rate for focal significance tests was 97%. This implies that the vast majority of significance tests reported a significant result. The median observed power was 84%. The inflation rate is 13% (97% - 84% = 13%). The R-Index is 71% (84% - 13% = 71%). Based on these numbers, the R-Index predicts that the majority of studies in Nature Neuroscience would replicate in an exact replication study.
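The R-Index arithmetic from this paragraph, as a sketch: inflation is the gap between the success rate and median observed power, and the R-Index subtracts that inflation from median observed power.

```python
def r_index(success_rate, median_observed_power):
    """R-Index: median observed power penalized by the inflation of the success rate."""
    inflation = success_rate - median_observed_power
    return median_observed_power - inflation

print(round(r_index(0.97, 0.84), 2))  # 0.71
```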

This conclusion differs dramatically from Button et al.’s (2013) conclusion. I therefore examined some of the articles that were used for Button et al.’s analyses.

A study by Davidson et al. (2003) examined treatment effects in 12 depressed patients and compared them to 5 healthy controls. The main findings in this article were three significant interactions between time of treatment and group with z-scores of 3.84, 4.60, and 4.08. Observed power for these values with p = .05 is over 95%. If a more conservative significance level of p = .001 is used, power is still over 70%. However, the meta-analysis focused on the correlation between brain activity at baseline and changes in depression over time. This correlation is shown with a scatterplot without reporting the actual correlation or testing it for significance. The text further states that a similar correlation was observed for an alternative depression measure with r = .46 and noting correctly that this correlation is not significant, t(10) = 1.64, p = .13, d = .95, obs. power = 32%. The meta-analysis found a mean effect size of .92. A power analysis with d = .92 and N = 12 yields a power estimate of 30%. Presumably, this is the value that Button et al. used to estimate power for the Davidson et al. (2003) article. However, the meta-analysis did not include the more powerful analyses that compared patients and controls over time.

Conclusion

In the current replication crisis, there is a lot of confusion about the replicability of published findings. Button et al. (2013) aimed to provide some objective information about the replicability of neuroscience research. They concluded that replicability is very low with a median estimate of 21%. In this post, I point out some problems with their statistical approach and the focus on meta-analyses as a way to make inferences about replicability of published studies. My own analysis shows a relatively high R-Index of 71%. To make sense of this index it is instructive to compare it to the following R-Indices.

In a replication project of psychological studies, I found an R-Index of 43% and 28% of studies were successfully replicated.

In the many-labs replication project, 10 out of 12 studies were successfully replicated, a replication rate of 83% and the R-Index was 72%.

Caveat

Neuroscience studies may have high observed power and still not replicate very well in exact replications. The reason is that measuring brain activity is difficult and requires many steps to convert and reduce observed data into measures of brain activity in specific regions. Actual replication studies are needed to examine the replicability of published results.