Ioannidis (2005) was wrong: Most published research findings are not false

Fifteen years ago, Ioannidis (2005) sounded the alarm bells about the quality of published research findings. To be clear, I fully agree that research practices are lax and published results often have low credibility; but are they false?

To claim that most published results are false requires a definition of a true or false result. Ioannidis’s definition of a false result is clear and consistent with the logic of null-hypothesis testing. Accordingly, a research finding is false if it leads to the conclusion that there is an effect in the population when, in fact, there is no effect in the population. This is typically called a false positive (FP) or a type-I error.

Table 1. Truth of Hypothesis by Outcome of Significance Test

         NS       SIG      Sum
TRUE     P(FN)    P(TP)    P(True)
FALSE    P(TN)    P(FP)    P(False)
Sum      P(NS)    P(SIG)   k

Table 1 shows the proportion of false positive results. P(FP) is the probability that a hypothesis is false and a significant result was obtained. P(FP) is controlled by the significance criterion. With the standard criterion of alpha = .05 and P(False) = 100, P(FP) = 5 and P(TN) = 95. That is, the significance test leads to the wrong conclusion that an effect exists in 5% of all tests of a false hypothesis.

In the top row, the proportion of true positive results depends on the statistical power of a test. With 100% power and if all hypotheses are true, P(TP) would be 100%.

The quantity of interest is the proportion of false positives, P(FP), among all significant results, P(SIG) = P(FP) + P(TP). I call this quantity the false discovery rate: P(FP) / P(SIG).

This quantity depends on the proportion of true hypotheses, P(True), and false hypotheses, P(False). If all hypotheses that are being tested are false, the false discovery rate is 100%: 5 / (5 + 0) = 1. If all hypotheses that are being tested are true, the false discovery rate is 0, independent of the power of the test: 0 / (P(TP) + 0) = 0.
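As a minimal illustration of these two limiting cases, the following sketch computes the false discovery rate from the cells of Table 1; the function name and the illustrative counts are my own, chosen to match the example above.

```python
# Minimal sketch of the quantities in Table 1 (illustrative counts, not real data).

def false_discovery_rate(false_positives, true_positives):
    """Proportion of significant results that are false positives: P(FP) / P(SIG)."""
    significant = false_positives + true_positives
    return false_positives / significant if significant > 0 else float("nan")

# All 100 tested hypotheses are false: alpha = .05 yields 5 false positives, 0 true positives.
print(false_discovery_rate(false_positives=5, true_positives=0))   # 1.0, i.e. 100%

# All 100 tested hypotheses are true (assuming, say, 80% power): no false positives.
print(false_discovery_rate(false_positives=0, true_positives=80))  # 0.0, i.e. 0%
```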

We can now state Ioannidis’s claim that most published research findings are false as the prediction that the false discovery rate is greater than 50%. This prediction is implied when Ioannidis writes “most research findings are false for most research designs and for most fields” because “in the described framework, a PPV exceeding 50% is quite difficult to get” (p. 699), where PPV stands for Positive Predictive Value, which is defined as the proportion of true positives among significant results; PPV = P(TP) / P(SIG). Thus, Ioannidis clearly claims that most fields have a false discovery rate greater than 50% or, equivalently, a true discovery rate of less than 50%.

False Discovery Rate versus False Discovery Risk

The false discovery rate has also been called the false positive report probability (Wacholder et al., 2004). Table 1 makes it clear that the false discovery rate depends on the proportion of true and false hypotheses that are being tested. It is well known that it is impossible to provide conclusive evidence for the null-hypothesis that there is absolutely no effect (Cohen, 1994; Tukey, 1991). Thus, it is impossible to count the occurrences of true or false effects, and it is impossible to determine the false discovery rate directly. Calculations of the false discovery rate therefore require assumptions about the proportion of true and false hypotheses (Wacholder et al., 2004; Ioannidis, 2005).

In contrast, the false discovery risk (FDR) is the maximum proportion of significant results that can be false positives for an observed proportion of significant versus non-significant results. Thus, the false discovery risk can be calculated without such assumptions. This statistical fact is demonstrated with a few examples and then derived from a formula that relates the FDR to the discovery rate, P(SIG) / k.

Table 2 shows a scenario where all tested hypotheses are false. In this case, all of the significant results are false positives. Table 2 also shows that it is easy to identify this scenario because the proportion of significant results, P(SIG)/k, matches the significance criterion.

Table 2. Truth of Hypothesis by Outcome of Significance Test

         NS    SIG   Sum
TRUE     0     0     0
FALSE    95    5     100
Sum      95    5     100

Thus, if the relative frequency of significant results matches the significance criterion (alpha), the false discovery risk is 100%. It is a risk rather than a certainty because, in principle, all hypotheses could be true and all significant results could be true positives, if the studies have extremely low power (Table 3). Even with 100 true hypotheses, the success rate is indistinguishable from alpha.

Table 3. Truth of Hypothesis by Outcome of Significance Test

         NS      SIG    Sum
TRUE     94.99   5.01   100
FALSE    0       0      0
Sum      94.99   5.01   100

More interesting are scenarios when the percentage of significant results exceeds alpha. For this event to occur, at least some of the significant results must have tested a true hypothesis. The greater the number of significant results, the more true hypotheses must have been tested.

To determine the false discovery risk, we assume that all non-significant results stem from tests of false hypotheses and that tests of true hypotheses have 100% power. This scenario maximizes the false discovery rate because this scenario maximizes the number of false positives, given a fixed proportion of significant results.

Table 4. Truth of Hypothesis by Outcome of Significance Test

         NS                            SIG                                 Sum
TRUE     P(FN) = 0                     P(TP) = P(True)                     P(True)
FALSE    P(TN) = (1-alpha)*P(False)    P(FP) = alpha*P(False)              P(False)
Sum      P(NS) = (1-alpha)*P(False)    P(SIG) = P(True) + alpha*P(False)   1

For example, if k = 100 and 40% of all hypotheses are true, P(True) = 40 and P(False) = 100-40 = 60. Given the assumption of 100% power, there are 40 true positives and zero false negatives. With 60 false hypotheses there are 60*.05 = 3 false positives and 57 true negatives. With 40 true positives and 3 false positives, the false discovery risk is 3/(40+3) = .07.
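The same bookkeeping can be written out as a short sketch. Under the stated worst-case assumptions (100% power for true hypotheses, alpha = .05) it reproduces the numbers of the example; the function name and rounding are illustrative choices, not part of the original analysis.

```python
# Worst-case bookkeeping of Table 4: tests of true hypotheses have 100% power,
# so all non-significant results come from tests of false hypotheses.

def worst_case_cells(k, p_true, alpha=0.05):
    n_true = p_true * k            # tests of true hypotheses
    n_false = k - n_true           # tests of false hypotheses
    tp = n_true                    # 100% power: every true hypothesis turns significant
    fn = 0.0
    fp = alpha * n_false           # false positives determined by alpha
    tn = (1 - alpha) * n_false
    fdr_risk = fp / (fp + tp)      # maximum false discovery rate for this discovery rate
    return tp, fn, fp, tn, round(fdr_risk, 2)

print(worst_case_cells(k=100, p_true=0.40))
# (40.0, 0.0, 3.0, 57.0, 0.07): 40 true positives, 3 false positives, FDR risk of .07
```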

Importantly, the relationship is deterministic and it is possible to calculate the false discovery risk (FDR) from the observed discovery rate, P(Sig)/k, as shown in the following derivation based on the cells in Table 4.

P(Sig) = P(True) + alpha * P(False)

As P(False) = 1 – P(True), we can rewrite this as

P(Sig) = P(True) + alpha*(1-P(True))

and solve for P(True)

P(Sig) = P(True) + alpha – alpha*P(True)

P(Sig) = P(True)*(1 – alpha) + alpha

P(Sig) – alpha = P(True)*(1-alpha)

(P(Sig) – alpha)/(1-alpha) = P(True)

We can now substitute P(True) into the formula for the false discovery risk, FDR = P(FP) / P(SIG) = alpha*(1 – P(True)) / P(SIG). Simplified, the formula reduces to

FDR = (1/P(SIG) – 1) * (alpha/(1 – alpha)) = (1/P(SIG) – 1) * (.05/.95), with alpha = .05

Figure 1 plots the FDR as a function of the discovery rate, DR = P(SIG)/k.

Figure 1 shows that the false discovery risk is above 50% when the discovery rate is below 9.5%.  Thus, Ioannidis’s claim that most published results are false implies that there are no more than 9.5 significant results for every 100 attempts. 
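A rough sketch of the relation plotted in Figure 1, using the formula derived above; the grid of discovery rates is chosen for illustration only.

```python
# False discovery risk as a function of the observed discovery rate (Figure 1).

ALPHA = 0.05

def false_discovery_risk(discovery_rate, alpha=ALPHA):
    # Maximum proportion of false positives among significant results,
    # assuming only that power cannot exceed 1.
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# The FDR exceeds 50% exactly when the discovery rate falls below alpha / (0.5 + 0.5*alpha).
print(round(ALPHA / (0.5 + 0.5 * ALPHA), 3))        # 0.095, i.e. a 9.5% discovery rate

for dr in (0.06, 0.095, 0.25, 0.50):
    print(dr, round(false_discovery_risk(dr), 3))
# 0.06 -> 0.825, 0.095 -> 0.501, 0.25 -> 0.158, 0.5 -> 0.053
```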

The novel contribution of shifting from rates to risk is clear when Ioannidis writes that “it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs” (p. 701). This makes calculations of rates dependent on assumptions that are difficult to verify. However, the false discovery risk depends only on the assumption that power cannot exceed 1, which is true by definition. It is therefore possible to assess false discovery risks without speculating about proportions of true hypotheses. This makes it possible to test Ioannidis’s claim empirically by computing the false discovery risk based on observable discovery rates.

Ioannidis Scenario 1

Ioannidis presented a few hypothetical scenarios with a false discovery rate greater than 50%. The first scenario was called “meta-analysis of small inconclusive studies” and assumed 80% power, alpha = .05, 25% true hypotheses vs. 75% false hypotheses, and a bias component of .4. The bias component essentially changes alpha and beta from their nominal levels through the use of questionable research practices (John et al., 2012). This is evident when we fill out the 2 x 2 table for this scenario.

Table 5. Scenario “Meta-analysis of small inconclusive studies”

         NS    SIG   Sum
TRUE     3     22    25
FALSE    43    32    75
Sum      46    54    100

The false discovery rate for this scenario is 32/54 = 59%, which is above 50%. Due to bias, there are now 32 false positives for 75 false hypotheses, which implies an effective type-I error rate of 32/75 = 43%, whereas unbiased research would have produced only 75*.05 = 3.75 false positives. To make this a plausible scenario, it is important to understand how researchers can inflate the number of false positive results from 3.75 to 32.

One way to increase the percentage of false positives is to test the same hypothesis repeatedly and to publish only significant results. This bias is called publication bias.

“What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true” (p. 697)

However, this form of bias assumes that researchers are conducting more studies than Table 5 implies. As, on average, 20 tests of a false hypothesis are needed to produce 1 false positive result, the actual number of studies would be much larger than 100. Moreover, repeated testing of true hypotheses does not alter power. The success rate is inflated only by running more studies. Table 6 shows the frequencies assuming an actual alpha of 5% and power of 80% to produce a false discovery rate of 59% (a PPV of 41%).

Table 6. Counting all Tests

         NS     SIG   Sum
TRUE     0.8    3.3   4.1
FALSE    91.1   4.8   95.9
Sum      91.9   8.1   100

The false discovery rate is the same as in Table 5: 4.8/8.1 = 59%. Once all of the attempted studies are included, it is obvious that many more false hypotheses were tested than the 75% rate in the scenario implies, because false hypotheses were tested repeatedly to produce a false positive result. As a result, the discovery rate is not 54% as implied by Table 5, but only 8.1%. To verify that the numbers in Table 6 are correct, we can see that the real alpha is 4.8/95.9 = 5%, and power is 3.3/4.1 = 80%. In this scenario, the false discovery risk, 59.7%, is only slightly higher than the false discovery rate, 59.3%.
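The numbers in Table 6 can be reproduced by asking which mix of true and false hypotheses yields a 59% false discovery rate when every test actually runs at alpha = .05 with 80% power. The following sketch solves for that mix; the solver and its rounding are my own illustration of the logic, not a calculation from Ioannidis’s paper.

```python
# Reconstruct Table 6: find the mix of true and false hypotheses that yields the
# target false discovery rate when every test runs at nominal alpha and power.

def implied_design(target_fdr, alpha=0.05, power=0.80, k=100):
    # Solve alpha*(1-t) / (alpha*(1-t) + power*t) = target_fdr for t = P(True).
    t = alpha * (1 - target_fdr) / (target_fdr * power + alpha * (1 - target_fdr))
    true_positives = power * t * k
    false_positives = alpha * (1 - t) * k
    discovery_rate = (true_positives + false_positives) / k
    return (round(t * k, 1), round(true_positives, 1),
            round(false_positives, 1), round(discovery_rate, 3))

print(implied_design(target_fdr=32 / 54))
# (4.1, 3.3, 4.8, 0.081): 4.1 true hypotheses among 100 tests, 3.3 true positives,
# 4.8 false positives, and a discovery rate of only 8.1%.
```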

The other scenarios constructed by Ioannidis have higher false discovery rates or, equivalently, a lower positive predictive value (PPV = 1 – false discovery rate). As a result, the implied discovery rates are also lower. In fact, they are close to 5%, which is close to the lower limit set by alpha. Evidently, discovery rates of 5 to 6 percent would raise red flags if they were observed.

Name                                                                   PPV       DR
Underpowered, but well-performed phase I/II RCT                        23%       6%
Underpowered, poorly performed phase I/II RCT                          17%       6%
Adequately powered exploratory epidemiological study                   20%       6%
Underpowered exploratory epidemiological study                         12%       5%
Discovery-oriented exploratory research with massive testing           0.001%    5%
As in previous example, but with more limited bias (more standardized) 0.0015%   5%
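Under the same repeated-testing interpretation of bias, a scenario’s discovery rate follows from its PPV once a power level is assumed. Below is a hedged sketch of that calculation, illustrated with the 80% power and 41% PPV of the first scenario (the power values assumed for the other scenarios are not restated here).

```python
# Discovery rate implied by a scenario's PPV when bias is modeled as repeated
# testing with nominal alpha and an assumed power level.

def discovery_rate_from_ppv(ppv, power, alpha=0.05):
    # Proportion of true hypotheses among all tests that is consistent with the PPV.
    t = ppv * alpha / (ppv * alpha + power * (1 - ppv))
    return power * t + alpha * (1 - t)

# Scenario 1 ("meta-analysis of small inconclusive studies"): PPV = 41%, power = 80%.
print(round(discovery_rate_from_ppv(ppv=0.41, power=0.80), 3))   # ~0.081, i.e. 8.1%
```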

In conclusion, if bias is modeled as repeated testing of false hypotheses, the discovery rates in Ioannidis’s scenarios are very low. It seems unrealistic that researchers have sufficient resources to conduct 100 studies to produce only 6 significant results.

Other Questionable Research Practices

Ioannidis also hints at fraud as a reason for a high false discovery rate.

Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias.

However, there is little evidence that data manipulation is rampant (John et al., 2012). Ioannidis also suggests that researchers might exploit multiple dependent variables to report a significant result for at least one of them.

True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised. (p. 698)

However, there is no need to think about the entries in the 2 x 2 table as independent studies. Rather, the unit of analysis is the statistical test. Conducting multiple statistical tests in a single study is simply a more efficient way to produce true and false significant results. Alpha = .05 implies that there will be one false positive result for every 20 statistical tests of a false hypothesis. This is true whether these tests are conducted in 20 separate studies or within a single study. The same applies to multiple independent variables. Thus, Ioannidis’s scenarios still imply that researchers observe only 5 or 6 significant results for every 100 statistical tests.

It follows that Ioannidis’s claim that “Most Research Findings Are False for Most Research Designs and for Most Fields” implies that the long-run discovery rate in most fields is at most 10 significant results for every 100 statistical analyses. As discovery rates do not depend on unknown priors, this prediction can be tested empirically.

What is the Actual Discovery Rate?

To test Ioannidis’s claim, it is necessary to know the discovery rates of scientific fields. Unfortunately, this information is harder to obtain than one might think because of the pervasive influence of publication bias (Sterling, 1959). For example, psychology journals publish 95% significant results. If we would take this discovery rate at face value, it would imply that the false discovery risk in psychology is 0.3%. The problem is that the reported discovery rate in psychology is not the actual discovery rate because non-significant results are often not reported.

Recently, Brunner and Schimmack (2018) developed a statistical method, z-curve, that makes it possible to estimate the number of unpublished non-significant results based on the statistical evidence of published significant results. The figure below illustrates the method with a representative sample of focal hypothesis tests in social psychology journals (Motyl et al., 2017).

The red vertical line at z = 1.96 represents the criterion for statistical significance with alpha = .05. The histogram of z-scores to the right of z = 1.96 shows the distribution of observed z-scores with significant results. Z-curve fits a mixture model to this observed distribution. The fitted model is then used to project the distribution into the range of non-significant results. The area under the grey curve shows the estimated size of the file-drawer of unpublished non-significant results. The file-drawer ratio shows how many non-significant studies are predicted for every published significant result. The ratio of 3.29:1 implies a discovery rate of 1/(1+3.29) = 23%, and with a discovery rate of 23%, the false discovery risk is 17.6%. This is considerably below Ioannidis’s prediction that false discovery rates are above 50% in most fields.

Other areas in psychology have higher discovery rates because they conduct studies with more power (Open Science Collaboration, 2015). The next graph shows results for a representative sample of focal tests published in the Journal of Experimental Psychology: Learning, Memory, and Cognition. As predicted, cognitive psychology has a smaller file-drawer of 1.63 non-significant results for every published significant result. This translates into an estimated discovery rate of 1/(1+1.63) = 38% and a false discovery risk of 8.6%.
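Both estimates can be checked from the reported file-drawer ratios with the FDR formula derived earlier; rounding the discovery rate to two decimals, as in the text, reproduces the reported values.

```python
# From z-curve file-drawer ratios to discovery rates and false discovery risks.

ALPHA = 0.05

def risk_from_file_drawer(ratio, alpha=ALPHA):
    # Discovery rate rounded to two decimals, as reported in the text.
    discovery_rate = round(1 / (1 + ratio), 2)
    fdr_risk = (1 / discovery_rate - 1) * (alpha / (1 - alpha))
    return discovery_rate, round(fdr_risk, 3)

print(risk_from_file_drawer(3.29))   # social psychology:    (0.23, 0.176)
print(risk_from_file_drawer(1.63))   # cognitive psychology: (0.38, 0.086)
```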

Conclusion

This article makes several contributions to meta-science. First, it introduces the concept of false discovery risk as the maximum false discovery rate that is consistent with an observed discovery rate (i.e., the percentage of significant results). Second, it shows that the false discovery risk is determined by the discovery rate and can be estimated without making assumptions about the prior odds of hypotheses being true or false. Third, it shows that Ioannidis’s claim that most published results are false translates into the prediction that discovery rates are below 10%. On the flip side, this means that fields with discovery rates above 10% do not publish more than 50% false positive results. It is also shown that most of Ioannidis’s scenarios for different research fields translate into discovery rates of 5 or 6 percent. This seems an implausible scenario for most fields of research. Finally, I compute false discovery risks for social psychology and cognitive psychology using z-curve (Brunner & Schimmack, 2018), which makes it possible to estimate the percentage of unpublished non-significant results based on published significant results. The estimated false discovery risks in these two research fields are 17.6% and 8.6%, respectively. These estimates contradict Ioannidis’s claim that most research fields publish more than 50% false discoveries. Thus, Ioannidis’s claim that most published results are false is itself false. More importantly, the article shows how it is possible to estimate the false discovery risk of various fields, which makes it possible to evaluate fields in terms of their ability to produce true discoveries.

10 thoughts on “Ioannidis (2005) was wrong: Most published research findings are not false”

  1. Interesting article. I like the idea of taking into account the percentage of significant findings in a field to obtain a bound on the false discovery rate. I think, however, that there is a serious error in the following reasoning:

    “To determine the false discovery risk, we assume that all non-significant results stem from tests of false hypotheses and that tests of true hypotheses have 100% power. This scenario maximizes the false discovery rate because this scenario maximizes the number of false positives, given a fixed proportion of significant results.”

    This is not true.

    100% power DECREASES the false discovery rate rather than maximize it!

    Suppose you and I test the same two hypotheses, one true and one false. You have 100% power and I have 80% power. It just so happens that you do not make any type I error (and by assumption you do not make a type II error either). It follows that you get a P(sig) of 50%.

    Suppose we cannot communicate our results to each other and so have to compute our own FDRs. But, in this hypothetical example it so happens that we both see the same P(sig) of 50%, which is lucky because it makes our results comparable.

    Now the problem becomes clear: your FDR, GIVEN a P(sig) of 50% and P(TRUE) of 50% is 0 while mine is a bit more complicated:

    Knowing that I have one significant and one non-significant result, I know I either made no type I and no type II error (probability 0.8*(1 – alpha), conditioning on one of the hypotheses being true and the other false) and hence no false discovery, OR I made BOTH a type I and a type II error (probability 0.2*alpha) and hence a false discovery. So MY false discovery risk, GIVEN a P(sig) of 50% and a P(TRUE) of 50%, is 0.8*(1 – alpha) + 0.2*alpha = 0.8 – 0.6*alpha, which (given that alpha is at most 1) is considerably greater than your value of 0.

    Of course this is an overly mathematical way of stating something that is intuitively obvious: with greater power comes a smaller risk of making a false claim. So when assuming 100% power you are UNDERESTIMATING the FDR rather than overestimating it, as the cited paragraph claims.

    There is an even simpler way to see this. FDR decreases with P(sig). Your formula P(sig) = P(true) + alpha*P(false) INCREASES P(sig) relative to the true value P(sig) = (1-beta)*P(true) + alpha*P(false). But since FDR decreases with increasing P(sig), this means that using this formula lowers the FDR.

  2. Hmmm, I did some more thinking and it seems that my own previous comment is wrong as well, although I cannot yet precisely pin-point how. Maybe I’ll come back to it.

  3. Your mathematical formula is interesting, but I will still side with Dr. Ioannidis because of pervasive money in the US system. Many of these papers are not even written by scientists, but by industry persons – like ALEC in the lawmaking process, pharma has disintegrated the scientific process whereby many drugs (statins) and vaccine reports have completely circumvented the true scientific process. I understand your wanting to make sure the process is as clean as possible, but the issue of fraud isn’t just a mathematical one – it’s a philosophical one. If your job is to maximize shareholder profit, you will make sure every paper published shows great results. Here is where the distrust of not only fellow scientists, but the public at large comes in.

  4. The premise of the analysis is not correct.
    “With the standard criterion of alpha = .05 and P(False) = 100, P(FP) = 5 and P(TN) = 95.”
    This is not how p-value analysis works. In a p-value analysis, the conditional probability p(data | NULL hypothesis is true) < 0.05. But in general, p(data | NULL) != p(NULL | data). This is the fundamental problem.

    Only when p(data) = 1, that is, when the result is reproducible, can we deduce p(data | NULL) = p(NULL | data). But when most of the results are not reproducible, a link in the chain of the deduction is broken, so the whole result is then based on a fallacy.

    1. It always amazes me how academics can disagree about everything, even math.

      Imagine two urns. One is filled with white balls (true hypotheses that are significant) and black balls (true hypotheses that are not significant). Power (1 – beta) determines the proportion of white and black balls. With 50% power, I would have an equal number of black and white balls. If I sample from this urn, I have a 50% chance to get a significant result for a true hypothesis.

      The other urn is also filled with white and black balls, but this time white balls are significant results when the hypothesis is false and black balls are non-significant results when the hypothesis is false. By setting alpha = .05, I am ensuring that this urn is filled with 19 black balls for every white ball (or 1 out of 20 balls is white). If I sample from this urn, I am going to get 5% white balls in the long run.

      This is what alpha = .05 means.

      This has nothing to do with p(data | NULL) or p(NULL | data), which Bayesians try to apply to single studies. Here we are talking about the long-run frequencies of certain outcomes.

  5. Get this critique peer reviewed and start the conversation. This is how peer review is supposed to work. Anyone can write in isolation and seem correct.
