How Often Do Researchers Test True Hypotheses?

Psychologists conduct empirical studies with the goal of demonstrating an effect by rejecting the null-hypothesis that there is no effect, p < .05. The utility of doing so depends on the a priori probability that the null-hypothesis is false. One of the oldest criticisms of null-hypothesis significance testing (NHST) is that the null-hypothesis is virtually always false (Lykken, 1968). As a result, demonstrating a significant result adds little information, while failing to do so because studies have low power creates false information and confusion.

Since 2011, an alternative view has been that many published results are false positives (Simmons, Nelson, & Simonsohn, 2011). The narrative is that researchers use questionable research practices that make it extremely likely that true null-hypotheses produce significant results. This argument assumes that psychologists often conduct studies where the effect size is zero. That is, Bem’s (2011) crazy studies of extrasensory perception are representative of most studies in psychology. This seems unlikely for all areas of psychology, especially for correlational studies that examine how naturally occurring phenomena covary.

Debates like this are difficult to settle because the null-hypothesis is an elusive phenomenon. As all empirical studies have some sampling error (and there is some uncertainty about small effect sizes that may be artifacts), even large samples or meta-analyses can never affirm the null-hypothesis. However, it is possible to estimate the maximum percentage of true null-hypotheses that are being tested, and thus the minimum percentage of true hypotheses, on the basis of the discovery rate; that is, the percentage of hypothesis tests that produced a significant result (Soric, 1989). The logic of this approach is illustrated in Tables 1 and 2.

Table 1

Table 1 illustrates an example where only 7% (60/860) of the hypotheses that are being tested are true (there is an effect). These hypotheses are tested with 100% power. As a result, there are no additional true hypotheses with non-significant results. With alpha = .05, only 1 out of 20 false hypotheses produces a significant result. Thus, there are 19 times more false hypotheses with non-significant results (40 * 19 = 760) than false positives (40). In this example the discovery rate is 100/860 = 11.6%.
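The counts in this example can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described above; the function name `outcome_table` and its dictionary keys are made up for illustration and do not come from any package.

```python
def outcome_table(n_true, n_false, power, alpha=0.05):
    """Expected counts when true hypotheses are tested with the given power
    and false hypotheses (true nulls) are rejected at rate alpha."""
    true_sig = n_true * power        # true hypotheses with significant results
    true_ns = n_true - true_sig      # true hypotheses with non-significant results
    false_sig = n_false * alpha      # false positives
    false_ns = n_false - false_sig   # false hypotheses with non-significant results
    total = n_true + n_false
    return {
        "true_sig": true_sig,
        "true_ns": true_ns,
        "false_sig": false_sig,
        "false_ns": false_ns,
        "discovery_rate": (true_sig + false_sig) / total,
        "true_hypothesis_rate": n_true / total,
    }


# Scenario from the text: 60 true and 800 false hypotheses, 100% power.
table1 = outcome_table(n_true=60, n_false=800, power=1.0)
for name, value in table1.items():
    print(f"{name}: {value:.3f}")
```

With these inputs the function returns 40 false positives, 760 non-significant false hypotheses, and a discovery rate of 100/860.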

The assumption that power is 100% is unrealistic, but it makes it possible to estimate the minimum rate of true hypotheses. For a given discovery rate, the implied rate of true hypotheses increases as power decreases. This is illustrated by lowering mean power to 20%, while keeping the discovery rate the same.

Table 2

Once more the discovery rate is 100/860 = 11.6%. However, now a lot more of the hypotheses are true hypotheses (380/860 = 44%). As there are many non-significant true hypotheses, there are also a lot fewer non-significant false hypotheses. If we lower power further, the number of true hypotheses increases and reaches 100% in the limit when power falls to the discovery rate (here 11.6%). That is, all hypotheses may have very small effect sizes that are different from zero. In that case, not a single hypothesis tests an effect that is exactly zero.
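The relation between the discovery rate, power, and the rate of true hypotheses can be written down directly. If a proportion THR of hypotheses is true and tested with average power, and the rest are false and rejected at rate alpha, then discovery rate = THR * power + (1 - THR) * alpha. Solving for THR gives the sketch below; the function name is chosen here for illustration only.

```python
def true_hypothesis_rate(discovery_rate, power, alpha=0.05):
    """Solve discovery_rate = THR * power + (1 - THR) * alpha for THR."""
    if not (alpha <= discovery_rate <= power) or power <= alpha:
        raise ValueError("requires alpha <= discovery_rate <= power and power > alpha")
    return (discovery_rate - alpha) / (power - alpha)


dr = 100 / 860  # discovery rate used in both examples

thr_full_power = true_hypothesis_rate(dr, power=1.0)   # the 100% power scenario
thr_low_power = true_hypothesis_rate(dr, power=0.20)   # the 20% power scenario

print(f"power 100%: THR = {thr_full_power:.1%}")
print(f"power  20%: THR = {thr_low_power:.1%}")
```

The two calls recover the 60/860 and 380/860 rates of the two scenarios from the discovery rate alone.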

Soric’s (1989) insight makes it possible to examine empirically whether a literature tests many true hypotheses. However, a problem for the application of the approach to actual data is that the true discovery rate is unknown because psychology journals tend to publish nearly exclusively significant results, while non-significant results remain hidden in the proverbial file-drawer (Rosenthal, 1979). Recently, Bartos and Schimmack (2020) developed a statistical model that solves this problem. The model, called z-curve, makes it possible to estimate the discovery rate on the basis of the distribution of published significant results. The method is called z-curve because published test statistics are converted into absolute z-scores.
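As a rough illustration of the conversion step, a two-sided p-value can be turned into an absolute z-score with the inverse of the standard normal distribution. The sketch below uses only the Python standard library and assumes that the p-values have already been computed from the reported test statistics; it is not the z-curve software itself, and `p_to_abs_z` is a name chosen for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_abs_z(p_two_sided):
    """Absolute z-score whose two-sided p-value equals p_two_sided."""
    if not 0 < p_two_sided < 1:
        raise ValueError("p-value must lie strictly between 0 and 1")
    # Half of the p-value sits in each tail; the lower-tail quantile is negative.
    return -STD_NORMAL.inv_cdf(p_two_sided / 2)


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: |z| = {p_to_abs_z(p):.2f}")
```

A result at the significance criterion, p = .05, corresponds to an absolute z-score of about 1.96, so significant results are those with larger absolute z-scores.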

To demonstrate z-curve estimates of the true hypothesis rate (THR), I use test-statistics from the journal Psychonomic Bulletin and Review. The choice of this journal is motivated by prior meta-psychological investigations of results published in this journal. Gronau, Duizer, Bakker, and Wagenmakers (2017) used a Bayesian Mixture Model (BMM) to estimate that about 40% of results published in this journal are false positive results. Table 3 shows that a 40% False Positive Rate (72/172 = .42) corresponds roughly to an estimate that cognitive psychologists test only 10% true hypotheses. This is a surprisingly low estimate, but it matches estimates that psychologists in general test only 9% true hypotheses (Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, & Johannesson, 2015).

Table 3

Given these estimates, it is not surprising that some psychologists attribute replication failures to a high rate of false positive results that are published in psychology journals. The problem, however, is that the BMM is fundamentally flawed and uses a dogmatic prior to produce inflated estimates of the false positive rate (Schimmack & Brunner, 2019).

To examine this issue with z-curve, I used automatically extracted test-statistics that were published in Psychonomic Bulletin and Review from 2000 to 2010. The program extracted 25,248 test statistics, which allows for much more robust estimates than the 855 t-tests used by Gronau et al. Figure 1 shows clear evidence of selection bias, which makes the observed discovery rate of 70%, 95%CI = 69% to 70%, unusable. Z-curve suggests that the true discovery rate is only 36%. Despite the large sample size, there is considerable uncertainty about this estimate, 95%CI = 26% to 54%. These results suggest that for every significant result, researchers obtain 1 to 3 non-significant results (File-Drawer Ratio, 1.75; 95%CI = 0.85, 2.85).
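The file-drawer ratio follows from the discovery rate as the number of non-significant results per significant result, (1 - rate) / rate. A minimal sketch, using the rounded percentages reported above:

```python
def file_drawer_ratio(discovery_rate):
    """Non-significant results expected for every significant result."""
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery rate must be in (0, 1]")
    return (1 - discovery_rate) / discovery_rate


# Point estimate and confidence limits of the discovery rate from the text.
# A low discovery rate implies a large file drawer, so the limits swap places.
for label, rate in (("estimate", 0.36), ("lower limit", 0.26), ("upper limit", 0.54)):
    print(f"{label}: rate = {rate:.2f}, file-drawer ratio = {file_drawer_ratio(rate):.2f}")
```

The confidence limits reproduce the reported 2.85 and 0.85. The rounded point estimate of 36% gives about 1.78 rather than 1.75, presumably because the reported ratio was computed from the unrounded estimate.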

The expected discovery rate of 36% implies a minimum rate of true hypotheses of 33%. This estimate is considerably higher than the estimates of 10% based on much smaller datasets and questionable statistical models (###, Gronau et al., 2017).
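The step from 36% to 33% can be written out as a worked equation. It assumes, as in the first example, that all true hypotheses are tested with 100% power and that alpha = .05; EDR denotes the expected discovery rate.

```latex
\begin{align*}
\mathrm{EDR} &= \mathrm{THR}_{\min} \cdot 1 + (1 - \mathrm{THR}_{\min}) \cdot \alpha \\
\mathrm{THR}_{\min} &= \frac{\mathrm{EDR} - \alpha}{1 - \alpha}
  = \frac{.36 - .05}{1 - .05}
  = \frac{.31}{.95}
  \approx .33
\end{align*}
```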

The estimate of the minimum THR assumes that power is 100%. This is an unrealistic assumption. Z-curve is able to provide a more realistic maximum value of power based on the expected replication rate (ERR) of 76% and the estimated maximum false positive rate of 9%. With a false positive rate of 9%, power for the remaining 91% of true positives has to be 83% to produce the ERR of 76% (9% * .05 + 91% * .83 = 76%). With fewer false positives, the required power would decrease, down to a minimum of 76% when there are no false positives. Thus, power cannot be higher than 83%. Using a value of 83% power yields a true hypothesis rate of 38%; that is, 62% of the tested hypotheses are false. This scenario is illustrated in Table 4.
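The two numbers in this step can be checked with a short sketch. The maximum false positive rate is computed here with Soric's (1989) formula from the discovery rate of 36%, and the power of the true positives is obtained by solving the mixture equation in the parentheses above. The function names are illustrative and are not part of the z-curve software.

```python
def soric_max_false_positive_rate(discovery_rate, alpha=0.05):
    """Soric's upper bound on the share of significant results that are false positives."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)


def power_of_true_positives(replication_rate, false_positive_rate, alpha=0.05):
    """Solve replication_rate = fpr * alpha + (1 - fpr) * power for power."""
    return (replication_rate - false_positive_rate * alpha) / (1 - false_positive_rate)


max_fpr = soric_max_false_positive_rate(0.36)
max_power = power_of_true_positives(0.76, 0.09)  # with the maximum of 9% false positives
min_power = power_of_true_positives(0.76, 0.0)   # with no false positives at all

print(f"maximum false positive rate: {max_fpr:.1%}")
print(f"power with 9% false positives: {max_power:.1%}")
print(f"power with 0% false positives: {min_power:.1%}")
```

Under these assumptions the bound on false positives is about 9%, and the implied power ranges from 76% to 83%.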

Table 4

Power is 316/381 = 83%, the False Positive Rate is 31/347 = 9%, and the True Hypothesis Rate is 381/1000 = 38%. This is still a conservative estimate because power for the non-significant results is lower than power for the significant results, but it is more realistic than simply assuming that power is 100%. In this case, the ERR is fairly high, and the estimate of the minimum THR changes only slightly from 33% to 38%.
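The 38% can also be derived from the false positive rate and power alone. Among significant results, the false positive rate is alpha * (1 - THR) / (alpha * (1 - THR) + power * THR); solving for THR gives the function below. The second part recomputes the significant counts for 1,000 hypotheses from the rounded inputs, as a sketch under the same assumptions (alpha = .05, 83% power).

```python
def thr_from_false_positive_rate(false_positive_rate, power, alpha=0.05):
    """True hypothesis rate implied by the false positive rate among significant results."""
    false_part = alpha * (1 - false_positive_rate)
    true_part = false_positive_rate * power
    return false_part / (false_part + true_part)


thr = thr_from_false_positive_rate(0.09, power=0.83)
print(f"true hypothesis rate: {thr:.1%}")

# Recompute the significant counts with 381 true and 619 false hypotheses.
n_true, n_false = 381, 619
true_sig = n_true * 0.83    # true hypotheses with significant results
false_sig = n_false * 0.05  # false positives
print(f"significant: {true_sig:.0f} true, {false_sig:.0f} false")
print(f"false positive rate: {false_sig / (true_sig + false_sig):.1%}")
```

The small gap between the computed rate and 381/1000 reflects rounding of the 9% and 83% inputs.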


Z-curve makes it possible to provide an empirical answer to the question of how often researchers test true versus false hypotheses. For the journal Psychonomic Bulletin and Review, the estimate suggests that at least a third of the hypotheses are true hypotheses. This minimum estimate is more plausible than the estimate that only 10% of hypotheses are true (Gronau et al., 2017). At the same time, there is no evidence that psychologists only test true hypotheses, which was a concern for decades (Lykken, 1968; Cohen, 1994). This claim was based on the idea that effect sizes are never exactly zero. However, effect sizes can be very close to zero and would require extremely large sample sizes to have sufficient power to detect them reliably. Moreover, these effects would be theoretically meaningless. Figure 1 suggests that cognitive psychologists are sometimes testing effects like this. The problem is that this disconfirming evidence is rarely reported, which impedes theory development. Thus, psychologists must report all theoretically important results even when they fail to support their predictions. Failing to do so is unscientific.
