A Comparison of False Discovery Rates and Type-S Error Rates

Applied researchers with substantive research questions (e.g., is there a wage gap between men and women?) rely on statistical methods that were invented over 100 years ago. This statistical approach is known as Null-Hypothesis Significance Testing (NHST), and every year thousands of new students in the social sciences are introduced to NHST to understand published articles and to conduct their own research. Thus, NHST is as fundamental to quantitative research as microscopes are for biologists and telescopes are for astronomers. However, unlike builders of microscopes and telescopes, many statisticians believe that NHST is flawed and needs to be replaced, ideally with their own statistical approach. After all, statisticians are human and are influenced by the same incentives that motivate other scientists. The holy grail is to replace NHST and to become the father (most of these driven statisticians are male) of the new statistics.

The religious fervor is especially notable among statisticians who call themselves Bayesians. Like atheists, whose only shared belief is that there is no God, Bayesians have little in common beyond their hatred of NHST, which explains why they have failed to provide a coherent alternative to it. A prominent example of the religious zeal of Bayesians is Andrew Gelman’s blog. For example, a recent blog post was titled “Bayesians moving from defense to offense.”

Criticism of NHST is as old as NHST itself and is often cited. However, there have also been articles in support of NHST that are less well known and often neglected, especially by Bayesian critics of NHST. One article that made a lot of sense to me was Tukey’s (1991) defense of NHST that I featured on my blog (Schimmack, 2019).

The article makes it clear that there is a fundamental misunderstanding of the hypotheses that are typically tested in NHST. Take the wage gap as an example. NHST would address this question by postulating the null-hypothesis, which is typically the assumption that there is no wage gap. This hypothesis is usually a strawman and nobody believes it to be true. The only reason to postulate the null-hypothesis that there is no wage gap is to use empirical data to falsify or reject it when the data provide sufficient evidence that the hypothesis is false. This happens when the observed difference in wages is about twice the size of the sampling error, where sampling error reflects the amount of variability in the wage differences obtained in random samples. The estimate of the real wage gap in the population will vary from sample to sample. However, sampling error alone is unlikely to produce differences that are twice the size of the sampling error. In fact, we can state that a wage gap that is twice the sampling error or more will occur in only 5% of all attempts if there is no wage gap in the population. When this happens, p-values below .05 allow researchers to reject the null-hypothesis that the wage gap is zero and to conclude that there is a wage gap.
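
To make this concrete, here is a minimal R sketch of the two-sided test just described; the wage gap and its sampling error are made-up numbers for illustration.

```r
# Hypothetical numbers for illustration: observed wage gap and its sampling error
observed_gap   <- 2.10   # e.g., dollars per hour, men minus women (made up)
sampling_error <- 1.00   # standard error of the gap in this sample (made up)

z <- observed_gap / sampling_error   # test statistic
p <- 2 * pnorm(-abs(z))              # two-sided p-value
p                                    # ~ .036: a gap about twice the sampling error gives p < .05
```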

This version of NHST can be easily criticized. For starters, what is the point of collecting data just to reject a hypothesis that nobody believed anyway? The best defense of this practice would be that we still need empirical data to be sure. After all, there is a 1/1,000,000 probability that the null-hypothesis might be true. Fair enough, but that only leads to the next problem. We really do not care about the conclusion that there is a wage gap, because the immediate next question is the direction of the difference. Do men earn more than women or do women earn more than men? Strictly speaking, the rejection of the null-hypothesis that wages are exactly the same does not allow us to make claims about the direction of the effect. Does that mean we need to do another study to test this hypothesis, and if so, how would we analyze the data if NHST does not allow us to draw inferences about the direction of an effect?

To use NHST to draw inferences about the direction of an effect, we need to test a directional hypothesis. In our example, we might specify H0 as the presumably false hypothesis that women earn as much as or more than men and use the statistical significance criterion to reject this hypothesis when our sample shows that men earn more than women and the difference is significant, p < .05. When we test this directional hypothesis, we only need an effect size that is 1.65 times the sampling error to reject the null-hypothesis that women earn as much as or more than men. The use of NHST with directional hypotheses is known as one-tailed or one-sided testing.

With one-sided tests, p < .05 quantifies the risk of drawing a false conclusion about the sign of an effect. That is, the results of a study might show that men earn more than women with a p-value below .05, but in the population there is no wage gap (zero difference) or women actually earn more than men. NHST does not differentiate between an outcome where the difference is exactly zero (equal pay) and an outcome where the difference is in the opposite direction (women earn more than men). Both outcomes are errors because the study suggested that men earn more when this is not the case.

The risk of drawing a false conclusion in NHST is called the Type-1 error. The statistical test produced the wrong result because sampling error produced a very unlikely outcome (e.g., the sample included many unemployed men). The 5% criterion is a convention that ensures that no more than 5% of statistical tests show significance when the null-hypothesis is true. This is like using a rapid Covid test that has a 5% probability of showing that you are positive when you are actually negative. Statisticians have also debated the 5% criterion, but NHST allows researchers to use other criterion values. Thus, for the discussion of NHST and criticisms of NHST, it is not important which Type-1 error rate we find acceptable, and I will continue to use the conventional 5% value.

Now you may wonder what you can do when you are not sure about the sign of an effect (e.g., are men or women more extraverted?). It would be silly to make up a directional hypothesis, just to falsify it, and then conclude that the opposite is true. Moreover, you might find a significant result in the correct direction if you specified the null-hypothesis correctly, but you would not find significance if you picked the wrong hypothesis. That makes the procedure rather arbitrary and useless. Fortunately, there is a solution to this problem: you just do two one-sided tests. First, you try to reject the hypothesis that “men are as extraverted or more extraverted than women,” and then you try to reject the hypothesis that “women are as extraverted or more extraverted than men.” If you obtain significance for one of these tests, you can infer that the alternative hypothesis is true. That is, if your sample shows that men are more extraverted than women and p < .05, you are allowed to infer that men are more extraverted than women in the population. If your sample shows that women are more extraverted than men and p < .05, you are allowed to infer that women are more extraverted than men in the population. In short, you can generalize the sign of the effect in your sample to the population.

However, there is a catch. You tested two hypotheses, and the Type-1 error risk increases each time you test a new hypothesis. A 5% error rate implies that one in every 20 tests of a false hypothesis will produce a Type-1 error in the long run. So, if you conduct two one-sided tests, each with the traditional 5% risk of a Type-1 error, your combined risk is actually 10%. Fortunately, there is a simple solution: you can lower the Type-1 error risk of the directional tests. If you cut the risk in half, you have a 2.5% risk of making a Type-1 error in one direction and a 2.5% risk of making a Type-1 error in the opposite direction, and your combined risk of making an error in either direction is 5%.

Maybe you already realized it, but conducting two one-sided tests with 2.5% Type-1 error rates is identical to conducting a two-sided test with a 5% Type-1 error rate. That is the essence of Tukey’s defense of NHST. While it looks as if we are testing an implausible null-hypothesis that there is no wage gap, we are really testing whether men earn more than women or women earn more than men, and we are allowed to infer from a significant result (p < .05) with higher pay for men than for women in a sample that men earn more than women in the population under investigation. Rather than refuting a silly nil-hypothesis (Cohen, 1994), NHST is a statistical tool to draw inferences from the direction of an effect in a sample about the direction of the effect in the population.
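
The equivalence is easy to verify in R by comparing the critical values of the tests (a minimal sketch):

```r
qnorm(.95)    # 1.645: critical value for a single one-sided test with alpha = .05
qnorm(.975)   # 1.960: critical value for a one-sided test with alpha = .025,
              # which is exactly the two-sided 5% criterion |z| > 1.96

# If the population effect is exactly zero, the two rejection regions
# (z > 1.96 and z < -1.96) are disjoint, so the combined error risk is .025 + .025 = .05
```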

Unfortunately, Tukey’s (1991) insight that a two-tailed test is really a convenient way to conduct two one-sided tests for significance in both directions is often ignored by critics of NHST. Gelman introduced sign errors as an alternative to dumb NHST under the assumption that NHST is only used to reject the hypothesis that an effect size is zero and is never used to test the direction of an effect. This is a misrepresentation of the way NHST is used, especially in clinical trials. Nobody would argue that a treatment is beneficial if the p-value is below .05 but the results show more benefits in the placebo condition. The problem with Gelman’s criticism of NHST is that it is a criticism of dumb NHST and not of NHST as it is used in practice. In reality, researchers follow Tukey’s (1991) logic and use NHST to test two directional hypotheses simultaneously. Thus, Type-1 errors include sign errors: a Type-1 error occurs when the sign of a significant result differs from the sign of the population effect size or when the population effect size is zero.

The problem with Gelman’s Type-S error is that it ignores the possibility that the population effect size can be zero. Gelman’s Type-S error is not defined when the population effect size is zero (personal communication, Gelman, 2024). The omission of the classic Type-1 error (rejecting the point-null or nil-hypothesis) is difficult to justify. The main justification for this assumption is that most population effect sizes are unlikely to be exactly zero. For example, the wage gap between men and women is unlikely to be less than 1/100000000 dollars. However, what about time-reversed causality and extrasensory perception (Bem, 2011)? Gelman clearly does not believe in ESP, even with effect sizes of 0.00000001 standard deviations. He is also often critical of other findings, like ovulation effects on preferences for Obama. He might argue that these effect sizes are much smaller than the inflated estimates in small samples, but not zero. However, why couldn’t some effects be zero or so close to zero that we do not care about the effect size?

Assuming that there are no zero effects can easily explain why vanZwet et al. (2023) ended up with an estimate that only 2% of significant results in clinical trials are false, whereas Schimmack and Bartos (2023) estimated that up to 14% of significant results could be false. The difference might be due to the fact that estimates of the false discovery rate count significant results with a population effect size of zero as errors, whereas the Type-S error rate assumes that such errors do not exist.

Unfortunately, we cannot quantify the proportion of population effect sizes that are exactly zero because sampling error will always result in imprecise estimates of the population effect size. A solution to this problem is rounding. At some point, we simply do not care about a difference from 0. For example, a wage gap of $0.00001 is practically the same as a wage gap of $0. Even a difference of $0.01 (1 cent) is meaningless. Rounding has obvious implications for the estimation of Type-S error rates and FDRs. Let’s say a set of 1,000 clinical trials (some would argue homeopathy trials fit the bill) have effect sizes that are close to zero, less than 1/1000th of a standard deviation, but not exactly zero. Rounding would turn these minuscule effects into zero effects, and significant findings in either direction would be considered false positives. However, Type-S errors are cut in half because significant results with the same sign are not considered errors, even if the effect size is a fraction of a cent.
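
A minimal R sketch illustrates this point for a single hypothetical trial with a true effect of one-thousandth of a standard deviation and 50 participants per group (numbers chosen for illustration):

```r
# A hypothetical trial: true effect is minuscule (d = .001), 50 participants per group
d  <- 0.001
se <- sqrt(2 / 50)   # sampling error of the standardized mean difference

p_sig_pos <- pnorm( 1.96, mean = d / se, lower.tail = FALSE)  # significant with the "correct" sign
p_sig_neg <- pnorm(-1.96, mean = d / se)                      # significant with the wrong sign

p_sig_pos + p_sig_neg                 # ~ .05: all of these are false positives if d rounds to zero
p_sig_neg / (p_sig_pos + p_sig_neg)   # ~ .50: but only about half of them count as sign errors
```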

In conclusion, the interpretation of NHST as two one-sided tests with alpha/2 implies that sign errors are less likely than Type-1 errors and that the percentage of sign errors among significant results (Type-S error rate) is always lower than the false discovery rate (i.e., the percentage of false positives among significant results).

The real question is whether this difference is large enough to explain the difference between vanZwet et al.’s (2023) results and other studies that estimated the FDR (Jager & Leek, 2014; Schimmack, 2023). I had to resort to simulation studies to find the answer, and I am happy to share the results of this simulation (R code on OSF).

The simulation is based on Ioannidis’s scenario for underpowered but well-performed clinical trials with 1 true hypothesis for every 5 false hypotheses and low power (20%). To simplify the simulation, it did not include bias. The false hypotheses were simulated with a small standard deviation of population effect sizes (SD = .01). This ensures that none of the population effect sizes are exactly zero, but the effect sizes of the false hypotheses are close to zero (d = -.05 to .05).
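
The following R sketch illustrates the logic of this simulation. It is not the OSF code; for example, the true effect size is derived with power.t.test to give roughly 20% power, and other details are simplifying assumptions of mine.

```r
set.seed(123)

k  <- 1e6            # number of simulated studies
n  <- 50             # participants per group
se <- sqrt(2 / n)    # sampling error of the standardized mean difference

# effect size that gives roughly 20% power with n = 50 per group (two-sided, alpha = .05)
d_true <- power.t.test(n = n, power = .20, sig.level = .05)$delta

# 1 true hypothesis for every 5 false hypotheses; false hypotheses have tiny effects (SD = .01)
true_hyp <- rbinom(k, 1, 1/6) == 1
d        <- ifelse(true_hyp, d_true, rnorm(k, 0, .01))

z   <- rnorm(k, mean = d / se, sd = 1)   # observed z-statistics (no bias)
sig <- abs(z) > 1.96                     # significant at alpha = .05 (two-sided)

# Without rounding, no effect size is exactly zero, so "wrong sign" and
# "wrong sign or zero" pick out the same studies: Type-S error rate = FDR
c(type_s = mean(sign(z[sig]) != sign(d[sig])),
  fdr    = mean(d[sig] == 0 | sign(z[sig]) != sign(d[sig])))

# With effect sizes rounded to one decimal, the tiny effects become exact zeros:
# they no longer count as sign errors, but all their significant results are false positives
d_r <- round(d, 1)
c(type_s = mean(d_r[sig] != 0 & sign(z[sig]) != sign(d_r[sig])),
  fdr    = mean(d_r[sig] == 0 | sign(z[sig]) != sign(d_r[sig])))
```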

Figure 1 shows the distribution of effect sizes for a between-subject design with N = 100 (50 per group).

This simulation produces equal Type-S and FDR estimates of 24%. The reason these estimates are the same is that counting only sign errors (“<”) and counting sign errors or zero effects (“<=”) identify the same results when no population effect size is exactly zero.

Figure 2 shows the same simulation, but effect sizes are rounded to one decimal. As a result, the small effect sizes around 0 all become zero.

In this scenario, the FDR is 53% and the Type-S error rate is only 0.4%. 

In sum: first, dumb NHST produces no sign errors because it only tests whether the effect size is exactly zero, which it rarely is. Second, the smart version of NHST (Tukey, 1991) tests directional hypotheses and treats both sign errors and rejections of a true point-null hypothesis as errors. Third, Gelman’s Type-S error counts only sign errors and therefore underestimates error rates when many effect sizes are practically zero. Thus, it is a mistake to exclude zero effect sizes from the computation of false discovery rates; Type-S error rates underestimate error rates when some studies have an effect size that is practically zero.

The second simulation used the parameters of vanZwet et al.’s (2023) model to compute the FDR and to compare it to their estimate of a Type-S error rate of 2%. The model assumes that the distribution of effect sizes in the Cochrane data is a mixture of four normal distributions with mean 0 and standard deviations of 0.61, 1.42, 2.16, and 5.64. The first three components are weighted about equally (.32, .31, and .30), and the last component is weighted less (.07).
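
The following R sketch illustrates how such a simulation can be set up. I read the four components as distributions of the true signal-to-noise ratios (true z-values), so the observed z-statistic is the true value plus standard normal noise, and the implied effect size for N = 100 (50 per group) is the true value times the sampling error; these scaling choices are my assumptions and may differ in detail from vanZwet et al.'s implementation.

```r
set.seed(456)

k <- 1e6
w <- c(.32, .31, .30, .07)        # component weights from the text
s <- c(0.61, 1.42, 2.16, 5.64)    # component standard deviations from the text

comp <- sample(1:4, k, replace = TRUE, prob = w)
mu   <- rnorm(k, mean = 0, sd = s[comp])   # true signal-to-noise ratios (my reading of the model)

se <- sqrt(2 / 50)    # sampling error for N = 100 (50 per group)
d  <- mu * se         # implied effect sizes

z   <- rnorm(k, mean = mu, sd = 1)   # observed z-statistics
sig <- abs(z) > 1.96                 # significant at alpha = .05 (two-sided)

# Type-S error rate without rounding (equal to the FDR, since no effect is exactly zero)
mean(sign(z[sig]) != sign(mu[sig]))

# After rounding effect sizes to one decimal, a share of them becomes exactly zero
d_r <- round(d, 1)
mean(d_r == 0)                                           # share of exact zeros
mean(d_r[sig] == 0 | sign(z[sig]) != sign(d_r[sig]))     # FDR
mean(d_r[sig] != 0 & sign(z[sig]) != sign(d_r[sig]))     # Type-S error rate
```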

Figure 3 shows the distribution of the implied effect sizes for N = 100.

Although the model fixes the mode of the effect sizes at zero and effect sizes close to zero are the most frequent, the probability of an exact value of zero is zero. Thus, the model assumes that there can only be sign errors. The simulation reproduced vanZwet et al.’s estimate that the Type-S error rate is 2%. As there are no exact zero values, the FDR is also 2%.

I then repeated the simulation with effect sizes rounded to one decimal. The distribution of effect sizes is not notably different (Figure 4).

However, 17% of the effect sizes are now zero. This increases the FDR to 4%, while the Type-S error rate decreases to 0.8% because zero effect sizes cannot produce sign errors in Gelman’s formula. In short, these results show that effect sizes close to zero can produce a difference between Type-S error rates and FDRs, but this difference is relatively small and does not explain the difference between a Type-S error rate of 2% and an FDR of 14%.

The real reason for the different results is the different model assumptions. The same density distribution can be fitted with different models that make different assumptions about the underlying components. The key difference is that vanZwet et al.’s (2023) model assumes that there is not a large proportion of effect sizes close to zero, whereas other models allow for a large proportion of effect sizes to be zero. Figure 5 shows the fit of vanZwet et al.’s model and z-curve (Bartos and Schimmack, 2022) to the Cochrane data. Z-curve was slightly modified to have only four components with means of 0, 2, 4, and 6. This modification models low z-scores as a mixture of studies with Type-1 errors (z = 0) and modest power (50%), but does not allow for studies with low power (5% to 50%). This is of course an arbitrary assumption, but it is no more arbitrary than vanZwet et al.’s assumption that there are no effect sizes of exactly zero (p(z = 0) = 0).

Both models fit these data equally well. However, the z-curve model allocated a weight of 57% to the component with a mean of zero. This implies a false discovery rate of 10%. Thus, the same data are compatible with a Type-S error rate of 2% and an FDR of 10%.

This brings up the question of which of these estimates is closer to the truth. The honest answer is that we do not know. At least, the distribution of z-scores alone does not provide this answer. I have tried for a year to find a way to estimate the true FDR, but simulation studies showed that it is simply not possible.

To avoid the problem of overly precise estimates that are based on unproven and untestable assumptions, Bartos and Schimmack (2022) suggested focusing on the false discovery risk. The false discovery risk is the maximum rate of false discoveries that is consistent with the data. Z-curve 2.0 relies on the discovery rate to determine this maximum using a formula developed by Soric (1989).

Soric’s formula relies on the fact that the maximum false discovery rate for a given discovery rate occurs when all true hypotheses are tested with 100% power. With 29% significant results and no evidence of publication bias, the assumption is that the 29% significant results are produced by testing 25% true hypotheses with 100% power (25% significant results) and 75% true null-hypotheses with a 5% probability of producing a significant result (3.75% significant results).
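
Soric’s bound can be written as maximum FDR = (1/DR - 1) * alpha/(1 - alpha), where DR is the discovery rate. A minimal R sketch is shown below; with the rounded 29% discovery rate and alpha = .05 it returns roughly 13% (the published 14% presumably reflects the exact, unrounded discovery-rate estimate), and lowering alpha to .01 keeps the bound below 5%.

```r
# Soric's (1989) upper bound on the false discovery rate for a given discovery rate (dr):
# the bound is reached when all true hypotheses are tested with 100% power
soric_fdr <- function(dr, alpha = .05) {
  (1 / dr - 1) * alpha / (1 - alpha)
}

soric_fdr(dr = .29)               # ~.13 with a 29% discovery rate and alpha = .05
soric_fdr(dr = .29, alpha = .01)  # ~.02: lowering alpha to .01 keeps the bound below 5%
```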

We realize that it is implausible to assume that true hypotheses in clinical trials are tested with 100% power. However, we do not know the average power of tests of true hypotheses, and thus we do not know the real FDR. The benefit of estimating the false discovery risk is that we can say that there are no more than 14% false positive results. In contrast, vanZwet et al.’s (2023) estimate of 2% sign errors is only correct when we assume that there are no effect sizes that are exactly zero and that there is only a small percentage of effect sizes close to zero. Thus, the problem with this result is that it depends on assumptions that may be false, whereas the FDR estimate of 14% is a worst-case scenario. In this regard, the FDR is similar to the Type-1 error rate, which is based on the worst-case scenario that all false hypotheses have an effect size of zero. The risk of a sign error decreases the more effect sizes differ from zero. Not everybody may be happy with risk assessments that are conservative and based on worst-case scenarios, but we think this approach is useful to ensure the credibility of scientific results.

We also agree with Goodman (2014) that it is less interesting to know the FDR for the conventional criterion of .05 to reject the null-hypothesis. A more important question is how the alpha criterion can be adjusted to ensure an acceptable maximum percentage of false discoveries. With 29% discoveries, alpha can be set to .01 to produce a false discovery risk below 5%. It is often argued that many researchers confuse alpha with the FDR and assume that alpha = .05 ensures a false discovery risk of 5% or less. By setting alpha to .01, they can actually claim that the false discovery risk is below 5%. Of course, any error rate allows for errors, and a single significant result with alpha = .05 or alpha = .01 requires replication and honest reporting of replication failures.

In conclusion, statistics is needed to make sense of data, especially when data are noisy and effect sizes are small. However, statistics can only produce useful results for applied researchers if the assumptions underlying statistical models are consistent with reality, and when assumptions cannot be tested, it is important to conduct sensitivity analyses or to consider worst-case scenarios. This blog post examined why vanZwet et al. (2023) estimated that only 2% of clinical trials produce a significant result with a sign error, whereas Schimmack and Bartos (2023) found a false discovery risk of 14%. These differences were not explained by the analysis of different data sets. After correcting for selection bias, Schimmack and Bartos found a bias-corrected/expected discovery rate that was identical to the discovery rate in vanZwet et al.’s (2023) data. Thus, both datasets implied a false discovery risk of 14%, using Soric’s (1989) formula. The distinction between sign errors (Type-S error rates) and false discovery rates also did not explain the differences. The key factor was the specification of the mixture model. vanZwet et al.’s model assumes that effect sizes follow a mixture of normal distributions centered at zero. Thus, the model does not allow for a cluster of effect sizes close to zero. Other models allow for a large number of effect sizes close to zero, and these models fit the data equally well. Thus, it is impossible to determine the false discovery rate in the Cochrane data. However, the discovery rate of 29% does not allow for more than 14% false discoveries with an effect size of zero or an effect size in the wrong direction.

The 14% FDR is clearly inconsistent with Ioannidis’s (2005) scenario in which only 1 out of 6 hypotheses tested in clinical trials is true and power is only 20%, which leads to the prediction that most results are false positives. The results are more consistent with Ioannidis’s scenario 1, with adequately powered clinical trials that have only a small amount of bias and test 1 true hypothesis for every false hypothesis. For this scenario, Ioannidis (2005) predicted 15% false discoveries. Thus, our results suggest that Cochrane reviews and abstracts in leading medical journals match this scenario. We hope that z-curve analyses of other types of studies can provide empirical tests of Ioannidis’s predictions for those studies. We are pleased to provide researchers interested in the credibility of science with a tool that can provide sound empirical evidence about the false discovery risk.
