A New Look at False Discoveries in the Open Science Collaboration Reproducibility Project

In a groundbreaking article, a team of psychologists replicated 97 published studies that had reported a significant result. The key finding was that only 36% of the 97 significant results could be replicated; that is, the replication study reproduced a significant result.

One conclusion that can be drawn from this result is that the average success rate in psychology research is around 40%, whereas journals publish over 90% significant results. This shows that the published record is biased in favor of supporting evidence.

However, the result does not tell us how many of the published results were false positives. In this post, I use the replication studies to estimate the false discovery risk; that is, the maximum false discovery rate (Soric, 1989).

Soric demonstrated that the maximum false discovery rate is determined by the discovery rate; that is, the percentage of significant results among all statistical tests that were conducted. The problem is that we typically see only a biased sample of mostly significant results, so the discovery rate is unknown.

Brunner and Schimmack (2018) developed a method, z-curve, that makes it possible to estimate the discovery rate based on the power of the significant results. For example, if a significant result was obtained with 20% power, on average 5 studies are needed to produce one significant result (1/0.20 = 5). Thus, the expected value is 5. For false positives, the probability of a significant result is alpha, which is typically 5%. So, on average 20 studies are needed to get one significant result.
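The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration, not code from the z-curve package: the function name is my own, and it assumes independent studies with a constant probability of a significant result, so that the number of studies needed for the first significant result follows a geometric distribution with mean 1/p.

```python
def expected_studies_per_significant_result(p_significant: float) -> float:
    """Mean number of independent studies needed for one significant result.

    Hypothetical helper for illustration; assumes a geometric distribution.
    """
    if not 0 < p_significant <= 1:
        raise ValueError("p_significant must be in (0, 1]")
    return 1.0 / p_significant


# A true effect studied with 20% power
studies_at_20_percent_power = expected_studies_per_significant_result(0.20)

# A false positive: the chance of significance equals alpha = .05
studies_for_false_positive = expected_studies_per_significant_result(0.05)

print(studies_at_20_percent_power, studies_for_false_positive)
```

The lower the probability of a significant result, the more attempts are hidden behind each significant result that we see.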

Previously, I used z-curve for sets of studies published in journals that were selected for significance. Here I use the results of the replication studies from the reproducibility project to estimate the false discovery risk in psychological science, or at least in the three journals that were used for the project (JPSP, JEP-LMC, Psych Science).

The dataset consists of 88 studies. Nine studies were excluded because the replication study was less than ideal (e.g., smaller sample size than the original study). Because there is no selection for significance, z-curve used all studies to estimate the weights for different levels of power that could reproduce the observed distribution of z-scores.

The first finding is that the proportion of significant results in the reproducibility project, the discovery rate, was 38%. This is consistent with the estimated discovery rate of 40% based on the power estimates. This confirms that the replication results are an unbiased sample.

The other statistics in the figure are less interesting because they focus on the studies that produced a significant result again. For example, the estimated replication rate of 74% suggests that the success rate would increase to 74% if only the 35 studies with significant results were replicated again (re-replicated). Soric’s FDR tells us that no more than 9% of the 35 studies with significant results are false discoveries. However, the more interesting question is how many of the 88 studies that were replicated could be false discoveries. This would be an estimate of the false discovery rate in psychology.

Obtaining this estimate is straightforward. We can simply use the weights of the model, which do not distinguish between significant and non-significant results; they apply to the whole distribution. This does not change anything about the number of studies that would be needed to produce a significant result. So, we can divide each weight by its power level and sum the results to get the average number of studies that would be required to get one significant result for each of the 88 studies. The estimate is 4.18 studies for each significant result, which translates into a discovery rate of 24% (1/4.18). This suggests that experimental psychologists conduct on average 4 studies for every significant result that gets published.
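The weighting step can be illustrated with a small Python sketch. The power levels and weights below are made up for illustration; they are not the weights estimated by z-curve for the replication data, so the result is not meant to reproduce the 4.18 figure.

```python
# Hypothetical mixture: these power levels and weights are invented for
# illustration and are NOT the estimates from the z-curve analysis.
power_levels = [0.05, 0.20, 0.50, 0.80]
weights = [0.10, 0.30, 0.30, 0.30]  # share of studies at each power level

# The weights describe the whole distribution, so they must sum to 1
assert abs(sum(weights) - 1.0) < 1e-9

# A study with power p needs 1/p attempts on average for one significant
# result, so the weighted sum of 1/p is the average number of attempts
mean_attempts = sum(w / p for w, p in zip(weights, power_levels))

# The implied discovery rate is the inverse of the average number of attempts
discovery_rate = 1.0 / mean_attempts

print(round(mean_attempts, 3), round(discovery_rate, 3))
```

Note that the component with 5% power contributes the most to the average number of attempts, even though it has the smallest weight. This is why a modest share of low-powered studies or false positives can pull the implied discovery rate down considerably.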

We can then use Soric’s formula and find that a discovery rate of 24% yields a false discovery risk of 17%.
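For readers who want to check this number, here is a minimal Python sketch of Soric's (1989) bound, maximum FDR = (1/DR - 1) * alpha / (1 - alpha), where DR is the discovery rate. The function name and the cap at 1 are my own choices for illustration; alpha = .05 is assumed, as elsewhere in this post.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Soric's upper bound on the false discovery rate.

    Hypothetical helper for illustration. The bound is capped at 1 because
    it exceeds 1 when the discovery rate is lower than alpha.
    """
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1.0 / discovery_rate - 1.0) * alpha / (1.0 - alpha)
    return min(1.0, bound)


# Discovery rate of 24% implied by 4.18 studies per significant result
fdr_all_studies = soric_max_fdr(0.24)
print(f"{fdr_all_studies:.0%}")  # prints 17%
```

The bound shrinks quickly as the discovery rate increases, and it reaches zero when every test produces a significant result.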

This estimate is somewhat larger than the estimate based on z-curve analysis of the original studies, which was only 10% (see Figure 2).

The reason could be that it is difficult to adjust for the use of questionable research practices in the original studies. It is also possible that problems with some replication studies produced false positives that inflate the FDR estimate based on the replication studies. Either way, both estimates show that most published results in psychology journals are not false positives.

Although this is good news, it is important to realize that Soric’s FDR focuses on the nil-hypothesis that the population effect size is zero or even in the opposite direction. A bigger concern is that many published results have dramatically inflated effect sizes, so that the true effects may be theoretically or practically irrelevant. Z-curve provides a way to estimate the FDR that treats studies with very low power as false positives. Z-curve is fitted to the data with varying percentages of false positives. If model fit is not much different from the free model, the data are consistent with the specified percentage of false positives. This value is reported in Figure 1 and shows that up to 35% of published results could be false positives if studies with less than 17% power are considered false positives. This estimate changes with the definition of false positives.


In conclusion, this post showed how z-curve can be used to estimate the false discovery risk in psychological science based on a set of unbiased replication studies. As more replication studies are being conducted, z-curve can provide valuable information about the false discovery risk in psychological science.
