Open Science Reveals Most Published Results are Not False

CORRECTION: Open science also means that our mistakes are open and transparent. Shortly after I posted this blog, Spencer Greenberg pointed out that I made a mistake when I used the discovery rate in the OSC replications to estimate the discovery rate in psychological science. I am glad he caught my mistake quickly, and I can warn readers that my conclusions do not hold. A 50% success rate for replications in cognitive psychology suggests that most results in cognitive psychology are not false positives, but the low replication rate of 25% for social psychology does allow for a much higher false discovery rate than I estimated in this blog post.

===========================================================================

Money may not make the world go round, and it cannot buy love, but it pays for pretty much everything else. Money is behind most scientific discoveries. Just like investments in stock markets, investments in science are unpredictable. Some of these investments are successful (e.g., Covid-19 vaccines), but others are not.

Most scientists, like myself, rely on government funding that is distributed in a peer-reviewed process by scientists to scientists. It is difficult to see how scientists would fund research that aims to show that most of their work is useless, if not fraudulent. This is where private money comes in.

The Arnold Foundation handed out two big grants to reform science (“Arnold Foundation Awards $6 Million to Boost Quality of Research”; “The Center for Open Science receives $7.5 million in additional funding from the Laura and John Arnold Foundation”).

One grant was given to Ioannidis, who was famous for declaring that “most published results are false” (Ioannidis, 2005). The other grant was given to Nosek to establish the Center for Open Science.

Ioannidis and Nosek also worked together as co-authors (Button et al., 2013). In terms of traditional metrics of impact, the Arnold Foundation’s investment paid off. Ioannidis’s (2005) article has been cited over 4,000 times. Button et al.’s article has been cited over 2,000 times. And an influential article by Nosek and many others that replicated 100 studies from psychology has been cited over 2,000 times.

These articles are go-to citations for authors who claim that science is in a replication crisis, that most published results are false, and that major reforms to scientific practices are needed. It is no secret that many authors who cite these articles have not actually read them. This may explain why, among thousands of citing articles, not a single one points out that the Open Science Collaboration findings contradict Ioannidis’s claim that most published results are false.

The Claim

Ioannidis (2005) used hypothetical examples to speculate that most published results are false. The main assumption underlying these scenarios was that researchers are much more likely to test false hypotheses (a vaccine has no effect) than true hypotheses (a vaccine has an effect). The second assumption was that even when researchers test true hypotheses, they do so with a low probability to provide enough evidence (p < .05) that an effect occurred.

Under these assumptions, most empirical tests of hypotheses produce non-significant results (p > .05) and among those that are significant, the majority come from the large number of tests that tested a false hypothesis (false positives).
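To make the arithmetic concrete, here is a minimal sketch of such a scenario in Python. The specific numbers (1,000 tests, 10% true hypotheses, 20% power) are illustrative assumptions of mine, not Ioannidis’s, but they show how a low rate of true hypotheses combined with low power yields mostly false positives among the significant results.

```python
# Illustrative sketch of the argument; the numbers are assumptions, not Ioannidis's.
n_tests = 1000   # hypothesis tests conducted
n_true = 100     # of these, only 10% test a true hypothesis (assumption)
power = 0.20     # probability of p < .05 when the effect is real (assumption)
alpha = 0.05     # probability of p < .05 when the null hypothesis is true

true_positives = n_true * power                # 20 significant results from true effects
false_positives = (n_tests - n_true) * alpha   # 45 significant results from null effects

share_false = false_positives / (true_positives + false_positives)
print(f"{share_false:.0%} of significant results are false positives")  # ~69%
```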

In theory, it would be easy to verify Ioannidis’s predictions because he predicts that most results are not significant, p > .05. Thus, a simple count of significant and non-significant results would reveal that many published results are false. The problem is that not all hypothesis tests are published and that significant results are more likely to be published than non-significant results. This bias in the selection of results is known as publication bias. Ioannidis (2005) called it researcher bias. As the amount of researcher bias is unknown, there is ample room to suggest that it is large enough to fit Ioannidis’s prediction that most published significant results are false positives.

The Missing Piece

Fifteen years after Ioannidis claimed that most published results are false, there have been few attempts to test this hypothesis empirically. One attempt was made by Jager and Leek (2014). This article made two important contributions. First, Jager and Leek created a program to harvest statistical results from abstracts in medical journals. Second, they developed a model to analyze the harvested p-values and estimate the percentage of false positive results in the medical literature. They ended up with an estimate of 14%, which is well below Ioannidis’s claim that over 50% of published results are false.

Ioannidis’s reply made it clear that a multi-million-dollar investment in his idea made it impossible for him to look at this evidence objectively. Clearly, his speculations based on no data must be right, and an actual empirical test must be wrong if it did not confirm his prediction. In science this is known as confirmation bias. Ironically, confirmation bias is one of the main obstacles that prevent science from making progress and correcting false beliefs.

[Quotation from Ioannidis’s (2014) reply, p. 34]

Fortunately, there is a much easier way to test Ioannidis’s claim than Jager and Leek’s model, which may have underestimated the false discovery risk. All we need to estimate the false discovery rate under the worst-case scenario is a credible estimate of the discovery rate (i.e., the percentage of significant results). Once we know how many tests produced a positive result, we can compute the maximum false discovery rate using a simple formula developed by Soric (1989).

Maximum False Discovery Rate = (1/Discovery Rate – 1)*(.05/.95)
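Expressed in code, Soric’s bound is a one-liner. The sketch below assumes the conventional alpha of .05; the function name is mine.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery rate,
    given the observed discovery rate (share of significant results)."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

# For example, if half of all tests produce a significant result,
# at most about 5% of those significant results can be false positives.
print(round(soric_max_fdr(0.50), 3))  # 0.053
```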

The only challenge is to find a discovery rate that is not inflated by publication bias. And that is where Nosek and the Center for Open Science come in.

The Reproducibility Project

It has been known for decades that psychology has a publication bias problem. Sterling (1959) observed that over 90% of published results report a statistically significant result. This finding was replicated in 1995 (Sterling et al., 1995) and again in 2015, when a large team of psychologists replicated 100 studies and found that 97% of the original studies had reported a statistically significant result (Open Science Collaboration, 2015).

Using Soric’s formula, this would imply a maximum false discovery rate close to zero. However, the replication studies showed that this high discovery rate is inflated by publication bias. More importantly, the replication studies provide an unbiased estimate of the actual discovery rate in psychology. Thus, these results can be used to estimate the maximum false discovery rate in psychology, using Soric’s formula.

The headline finding of this article was that 36% (35/97) of the replication studies reproduced a significant result.

Using Soric’s formula, this implies a maximum (!) false discovery rate of 9%, which is well below the 50% predicted by Ioannidis. The difference is so large that no statistical test is needed to infer that Nosek’s results falsify Ioannidis’s claim.

Table 1 of the OSC article also shows the discovery rates for specific journals and research areas. The discovery rate for cognitive psychology in the journal Psychological Science is 53%, which implies a maximum FDR of 5%. For cognitive psychology published in the Journal of Experimental Psychology: Learning, Memory, and Cognition, the DR of 48% implies a maximum FDR of 6%.

Things look worse for social psychology, which has also seen a string of major replication failures (Schimmack, 2020). However, even here we do not get false discovery rates over 50%. For social psychology published in Psychological Science, the discovery rate of 29% implies a maximum false discovery rate of 13%, and social psychology published in JPSP has a discovery rate of 23% and a maximum false discovery rate of 18%.
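For completeness, these maximum false discovery rates can be reproduced with the soric_max_fdr sketch from above; the labels in the dictionary are mine, and the discovery rates are the ones cited in the text.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    # Soric's (1989) upper bound on the false discovery rate.
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

discovery_rates = {
    "Original studies (inflated by publication bias)": 0.97,
    "All replication studies (OSC, 2015)": 0.36,
    "Cognitive psychology, Psychological Science": 0.53,
    "Cognitive psychology, JEP:LMC": 0.48,
    "Social psychology, Psychological Science": 0.29,
    "Social psychology, JPSP": 0.23,
}
for label, dr in discovery_rates.items():
    print(f"{label}: maximum FDR = {soric_max_fdr(dr):.0%}")
# Prints roughly 0%, 9%, 5%, 6%, 13%, and 18%, matching the values in the text.
```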

These results do not imply that everything is going well in social psychology, but they do show how unrealistic the scenarios were that led Ioannidis to predict false discovery rates over 50%.

Conclusion

The Arnold Foundation has funded major attempts to improve science. This is a laudable goal, and I have spent the past 10 years working towards the same goal. Here I simply point out that one big successful initiative, the reproducibility project (Open Science Collaboration, 2015), produced valuable data that can be used to test a fundamental assumption of the open science movement, namely the fear that most published results are false. Using the empirical data from the Open Science Collaboration, we find no empirical support for this claim. Rather, the results are in line with Jager and Leek’s (2014) finding that strictly false results, where the null hypothesis is true, are the exception rather than the norm.

This does not mean that everything is going well in science, because rejecting the null hypothesis is only a first step towards testing a theory. However, it is also not helpful to spread false claims about science that may undermine trust in science. “Most published results are false” is an eye-catching claim, but it lacks empirical support. In fact, it has been falsified in every empirical test that has been conducted. Ironically, the strongest empirical evidence, based on actual replication studies, comes from a project that used open science practices and that would not have happened without Ioannidis’s alarmist claim. This shows the advantages of open science practices, and implementing these practices remains a valuable goal even if most published results are not strictly false positives.

Comments

  1. Spencer Greenberg posted this comment in the Facebook Discussion Group. He had some problems posting it here, so I am doing it on his behalf.

    “This is really interesting Uli, thank you for sharing it. It’s also clearly a very important question. I’m a bit confused about these findings though – if you have time it would be great if you could clarify them.
    Looking at this formula:

    Maximum False Discovery Rate = (1/Discovery Rate – 1)*(.05/.95)

    I assume the Discovery Rate is the fraction of all statistical tests performed in the field (on, let’s say, potentially publishable questions of interest) that give a positive result (reject the null).

    I’m confused though how this fraction could be estimated from the reproducibility project results.

    The post says they found “36% (35/97) of the replication studies reproduced a significant result,” but I don’t get why this would be expected to match the percent of all statistical tests performed in the field that are positive.
    Imagine a hypothetical world where for every 100 papers scientists find 98 negatives, 1 true positive, and 1 false positive, but they only publish the positive ones.

    [update: as Uli pointed out to me, this is unrealistic when alpha=0.05 as the cutoff for p-values, I’m going to stick with this example though, since I think the logic still holds for the point I’m trying to make, and it keeps the situation simple]

    In that case, the reproducibility project, if conducted perfectly with massive sample sizes, would have found a 50% replication rate if I’m not mistaken, but the discovery rate of science would be only 2%, wouldn’t it?

    Thinking about this another way: let’s suppose that indeed 36% of studies in psychology are not false positives and are truly replicable (with a perfectly conducted replication), and the reproducibility project did perfect replications with huge sample sizes. They would find that 36% replicate, and the false discovery rate would be 64%, yet the formula would imply it’s tiny.

    Thanks!”
