Bayes-Factors in Favor of the Nil-Hypothesis are Meaningless

Zoltan Dienes just published an article in the journal that is supposed to save psychological science: Advances in Methods and Practices in Psychological Science. It is a tutorial about Bayes-Factors, which are advocated by Dienes and others as a solution to alleged problems with null-hypothesis significance testing (NHST).

The advantage of Bayes-Factors is supposed to be their ability to provide evidence for the null-hypothesis, while NHST is supposed to be one-sided and can only reject the null-hypothesis. The claim is that this one-sidedness led to the problem that authors only published articles with p-values less than .05.

“Significance testing is a tool that is commonly used for this purpose; however, nonsignificance is not itself evidence that something does not exist. On the other hand, a Bayes factor can provide a measure of evidence for a model of something existing versus a model of it not existing”

The problem with this attack on NHST is that it is false. The main reason why NHST is unable to provide evidence for the non-existence of an effect is that it is logically impossible to show with empirical data that something does not exist or that the difference between two populations is exactly zero. For this reason, it has been pointed out again and again that it is silly to test the nil-hypothesis that there is no effect or that a mean difference or correlation is exactly zero.

This does not mean that it is impossible to provide evidence for the absence of an effect. The solution is simply to specify a range of values that are sufficiently small to consider these differences negligible. Once the null-hypothesis is specified as a region of values, it becomes empirically testable with NHST or with Bayesian methods. However, neither NHST nor Bayesian methods can provide evidence for a point hypothesis, and the idea that Bayes-Factors can be used to do so is an illusion.

The real problem for demonstrations of the absence of an effect is that small samples with between-subject designs produce large regions of plausible values because small samples have large sampling errors. As a result, the mean differences or correlations move around considerably from sample to sample, and it is difficult to say something about the effect size in the population. The data cannot tell us whether the population effect size is within a region around zero (H0) or outside this region (H1).

Let’s illustrate this with Dienes’ first example. “Theory A claims that autistic subjects will perform worse on a novel task than control subjects will. Theory B claims that the two groups will perform the same.” A researcher tests these two “theories” in a study with 30 participants in each group.

The statistical results that serve as the input for NHST or Bayesian statistics are the effect size, the sampling error, and the degrees of freedom.

The autistic group had a score of 8 percent with a sampling error of 6 percentage points. The 95%CI ranges from -4 to 20.

The control group had a score of 10 percent with a sampling error of 5 percentage points. The 95%CI ranges from 0 to 20.

Evidently, the confidence intervals overlap, but they also allow for large differences between the two populations from which these small samples were recruited.

A comparison of the two groups yields a standardized effect size of d = .05, se = .2, t = .05/.20 = 0.25. The 95%CI for the standardized mean difference between the two groups ranges from d = -.35 to .45, and includes values for a small negative (d = -.2) or a small positive effect (d = .2).
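The arithmetic behind this interval can be sketched in a few lines of Python. This is only an illustration under a normal approximation: the function name summarize_effect is made up for this example, and because it uses 1.96 rather than the rounded multiplier of 2, the bounds come out slightly narrower than -.35 and .45.

```python
from statistics import NormalDist

def summarize_effect(d, se, level=0.95):
    """Return the test statistic and a normal-approximation CI for d."""
    # Critical value for a two-sided interval (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    t = d / se
    return t, (d - z_crit * se, d + z_crit * se)

# Values from the example: d = .05 with a sampling error of .20.
t, (lower, upper) = summarize_effect(d=0.05, se=0.20)
print(f"t = {t:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")

# Check whether the small-effect boundaries lie inside the interval.
print("d = -.2 inside CI:", lower < -0.2 < upper)
print("d = +.2 inside CI:", lower < 0.2 < upper)
```

Both checks come out true, which is the point of the paragraph above: the data are compatible with a small effect in either direction.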

Nevertheless, the default prior that is advocated by Wagenmakers and Rouder yields a Bayes-Factor of 0.27, which is below the arbitrary and low criterion of 1/3 that is used to claim that the data favor the model that claims there is absolutely no performance difference. It is hard to reconcile this claim with the 95%CI that allows for values as large as d = .4. However, to maintain the illusion that Bayes-Factors can miraculously provide evidence for the nil-hypothesis, Bayesian propaganda claims that confidence intervals are misleading. Even if we do not trust confidence intervals, we can ask how a study in which the sampling error (se = .2) is four times as large as the effect size (d = .05) can assure us that the true population effect size is 0. It cannot.

A standard NHST analysis produces an unimpressive p-value of .80. Everybody knows that this p-value cannot be used to claim that there is no effect, but few people know why this p-value is uninformative. First, it is uninformative because it used d = 0 as the null-hypothesis. We can never prove that this hypothesis is true. However, we could set d = .2 as the lowest effect size that we consider a meaningful difference. Thus, we can compute the t-value for a test of whether the observed value of d = .05 is significantly below d = .2. This is standard NHST. As we are only expecting worse performance, this is a one-sided test. We may also recognize that the sample size is rather small, adjust our alpha criterion accordingly, and allow for a 20% chance of falsely rejecting the null-hypothesis that the effect size is d = .2 or larger.

In R, pt(.05/.20,28,.20/.20) gives us a p-value of .226. That is still above even our lenient criterion of .20, so we cannot reject the null-hypothesis that the true performance difference in the population is d = .2 or larger. The problem is that the study with 30 participants in a between-subject design simply has too much sampling error to draw inferences about the population.
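For readers without R at hand, both tests can be approximated with Python's standard library. This is a rough sketch, not a reproduction of the R call: it replaces the noncentral t distribution with a normal approximation, and the function names are invented for this illustration. The results land close to the values reported above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used as an approximation to t

def two_sided_p(d_obs, se):
    """Two-sided p-value for the nil-hypothesis d = 0."""
    z = abs(d_obs / se)
    return 2 * (1 - STD_NORMAL.cdf(z))

def one_sided_p_below(d_obs, d_bound, se):
    """Probability of observing d_obs or less if the true effect is d_bound."""
    return STD_NORMAL.cdf((d_obs - d_bound) / se)

p_nil = two_sided_p(d_obs=0.05, se=0.20)
p_min = one_sided_p_below(d_obs=0.05, d_bound=0.20, se=0.20)

print(f"test against d = 0:   p = {p_nil:.3f}")
print(f"test against d >= .2: p = {p_min:.3f}")
print("reject d >= .2 at alpha = .20:", p_min < 0.20)
```

Under this approximation neither test is decisive: the first p-value is about .80 and the second about .23, so the study can neither show an effect nor rule out a small one.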

Thus, there are two reasons why psychologists rarely provide evidence for the absence of an effect. First, they always specify the null-hypothesis as a point value. This makes it impossible to provide evidence for the null-hypothesis. Second, the sampling error is typically too large to draw firm conclusions about the absence of an effect.

What is the solution to improve psychological science? Theories need to be specified with some minimal effect size. For example, tests of ego-depletion, facial feedback, or terror management (to name just a few) need to make explicit predictions about effect sizes. If even small effects are considered theoretically meaningful, studies that aim to demonstrate these effects need to be powered accordingly. For example, testing an effect of d = .2 with an 80% chance of a successful outcome, if the theory is right, requires N = 788 participants. If this study were to produce a non-significant result, one would also be justified in inferring that the population effect size is trivial (d < .20) with an error probability of 20%. So, true tests of theories require specification of a boundary effect size that distinguishes meaningful effects from negligible ones. And theorists who claim that their theory is meaningful even if effect sizes are small (e.g., Greenwald’s predictive validity of IAT scores) have to pay the price and conduct studies that can detect these effects.
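The sample size quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes a two-group between-subject design, a two-sided alpha of .05, and a normal approximation to the t-test; the function name n_per_group is made up for this example. The approximation gives 393 participants per group (786 in total), a hair below the 394 per group (N = 788) that a t-based power calculation returns.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    # Standard normal-approximation formula: n = 2 * ((z_a + z_b) / d)^2
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

n = n_per_group(d=0.2)
print(f"per group: {n}, total: {2 * n}")
```

Because the required n grows with 1/d squared, halving the smallest effect of interest roughly quadruples the sample size, which is the price that theories of small effects have to pay.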

In conclusion, how do we advance psychological science? With better data. Statisticians are getting paid for publishing statistics articles. They have been unhelpful in advancing statistics for the past one-hundred years in their in-fighting about finding the right statistical tool for inconclusive data (between-subject N = 30). Let them keep fighting, but let’s ignore them. We will only make progress by reducing sampling error so that we can see signals or the absence of signals clearly. And the only statistician you need to read is Jacob Cohen. The real enemy is not NHST or p-values, but sampling error.

Replicability-Index

Improving the replicability of empirical research
