I have written a few posts before that are critical of Bayesian Hypothesis Testing with Bayes Factors (Rouder et al., 2009; Wagenmakers et al., 2010, 2011).
The main problem with this approach is that it compares a single effect size (typically 0) with an alternative hypothesis that is a composite of all other effect sizes. The alternative is often specified as a weighted average, with a Cauchy distribution supplying the weights for the effect sizes. This leads to a comparison of H0:d=0 vs. H1:d=Cauchy(es,0,r), with r being a scaling factor that specifies the median absolute effect size for the alternative hypothesis.
It is well recognized by critics and proponents of this test that the comparison of H0 and H1 favors H0 more and more as the scaling factor is increased. This makes the test sensitive to the specification of H1.
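The sensitivity to the scaling factor can be illustrated with a minimal sketch. It is not the script used for the results in this post: it assumes a normal approximation to the sampling distribution of d, an illustrative observed effect size of d = .2 with a sampling error of .2, and a grid of effect sizes truncated at d = 14. The function name `bf01_point_null` is hypothetical.

```r
# Sketch: Bayes Factor (H0/H1) for a point null as a function of the Cauchy scale r.
# Assumption: normal approximation to the sampling distribution of d.
obs_d <- 0.2                          # observed effect size (illustrative)
se    <- 0.2                          # sampling error (illustrative)
delta <- seq(-14, 14, by = 0.01)      # grid of population effect sizes

bf01_point_null <- function(r) {
  # density of the observed effect size if the true effect size is 0
  like_h0 <- dnorm(obs_d, mean = 0, sd = se)
  # density under H1: likelihood averaged over the (truncated) Cauchy weights
  w       <- dcauchy(delta, 0, r)
  like_h1 <- sum(dnorm(obs_d, mean = delta, sd = se) * w) / sum(w)
  like_h0 / like_h1
}

scales <- c(0.25, 0.5, 1, 2)
bf01   <- sapply(scales, bf01_point_null)
print(round(data.frame(scale = scales, BF01 = bf01), 2))
```

Under these assumptions, the Bayes Factor in favor of H0 grows with every increase in the scaling factor, even though the data stay the same.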
Another problem is that Bayesian hypothesis testing either uses arbitrary cutoff values (BF > 3) to interpret the results of a study or asks readers to specify their own prior odds of H0 and H1. I have started to criticize this approach because the use of a subjective prior in combination with an objective specification of the alternative hypothesis can lead to false conclusions. If I compare H0:d = 0 with H1:d = .2, I am comparing two hypotheses with a single value each. If I am very uncertain about the results of a study, I can assign an equal prior probability to both effect sizes and the prior odds of H0/H1 are .5/.5 = 1. Thus, a Bayes Factor can be directly interpreted as the posterior odds of H0 and H1 given the data.
Bayes Ratio (H0/H1) = Prior Odds (H0/H1) * Bayes Factor (H0/H1)
However, if I increase the range of possible effect sizes for H1 and I am uncertain about the actual effect sizes, the a priori probability of H1 increases, just like my odds of winning increase when I spread my bet over several possible outcomes (lottery numbers, horses in the Kentucky Derby, or numbers in a roulette game). Betting on effect sizes is no different, and the prior odds in favor of H1 increase the more effect sizes I consider plausible.
I therefore propose to use the prior distribution of effect sizes to specify my uncertainty about what could happen in a study. If I think the null-hypothesis is most likely, I can weight it more heavily than other effect sizes (e.g., with a Cauchy or normal distribution centered at 0). I can then use this distribution to compute (a) the prior odds of H0 and H1, and (b) the conditional probabilities of the observed test statistic (e.g., a t-value) given H0 and H1.
Instead of interpreting Bayes Factors directly, which is not Bayesian and confuses conditional probabilities of data given hypotheses with conditional probabilities of hypotheses given data, Bayes-Factors are multiplied with the prior odds to get Bayes Ratios, which many Bayesians consider to be the answer to the real question researchers want to answer: How much should I believe H0 or H1 after I collected data and computed a test-statistic like a t-value?
This approach is more principled and Bayesian than the use of Bayes Factors with arbitrary cut-off values that are easily misinterpreted as evidence for H0 or H1.
One reason why this approach may not have been used before is that H0 is often specified as a point-value (d = 0) and the a priori probability of a single point effect size is 0. Thus, the prior odds (H0/H1) are zero and the Bayes Ratio is also zero. This problem can be avoided by restricting H1 to a reasonably small range of effect sizes and by specifying the null-hypothesis as a small range of effect sizes around zero. As a result, it becomes possible to obtain non-zero prior odds for H0 and to obtain interpretable Bayes Ratios.
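A minimal sketch of how a regional null-hypothesis yields non-zero prior odds. The values are illustrative assumptions: a Cauchy prior centered at 0 with a scaling factor of 1, an H0 region from d = -.1 to .1, and a truncation of effect sizes at d = 14. The function name `prior_prob_h0` is hypothetical, and the sketch uses the cumulative distribution function rather than the grid of weights in the script at the end of this post, so results can differ slightly in the third decimal.

```r
# Prior probability of a regional H0 under a Cauchy prior truncated at +/- limit.
prior_prob_h0 <- function(h0_range, scale = 1, limit = 14) {
  # area under the prior inside the H0 region
  area_h0    <- pcauchy(h0_range[2], 0, scale) - pcauchy(h0_range[1], 0, scale)
  # total area of the prior between -limit and +limit
  area_total <- pcauchy(limit, 0, scale) - pcauchy(-limit, 0, scale)
  area_h0 / area_total
}

p_h0          <- prior_prob_h0(c(-0.1, 0.1), scale = 1)
prior_odds_01 <- p_h0 / (1 - p_h0)        # prior odds H0/H1
p_point       <- prior_prob_h0(c(0, 0))   # a point null has zero area

print(round(c(p_h0 = p_h0, prior_odds_01 = prior_odds_01, p_point = p_point), 3))
```

The point null receives a prior probability of exactly zero, whereas the region around zero receives a small but non-zero probability that can enter the Bayes Ratio.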
The inferences based on Bayes Ratios are not only more principled than those based on Bayes Factors, they are also more in line with inferences that one would draw on the basis of other methods that can be used to test H0 and H1, such as confidence intervals or Bayesian credibility intervals.
For example, imagine a researcher who wants to provide evidence for the null-hypothesis that there are no gender differences in intelligence. The researcher decided a priori that small differences of less than 1.5 IQ points (0.1 SD) will be considered as sufficient to support the null-hypothesis. He collects data from 50 men and 50 women and finds a mean difference of 3 IQ points in one or the other direction (conveniently, it doesn’t matter in which direction).
The t-value with a standardized mean difference of d = 3/15 = .2 and a sampling error of SE = 2/sqrt(100) = .2 is t = .2/.2 = 1. A t-value of 1 is not statistically significant. Thus, it is clear that the data do not provide evidence against H0 that there are no gender differences in intelligence. However, do the data provide sufficient positive evidence for the null-hypothesis? p-values are not designed to answer this question. The 95%CI around the observed standardized effect size is -.19 to .59. This confidence interval is wide. It includes 0, but it also includes d = .2 (a small effect size) and d = .5 (a moderate effect size), which would translate into a difference of 7.5 IQ points. Based on this finding it would be questionable to interpret the data as support for the null-hypothesis.
With a default specification of the alternative hypothesis with a Cauchy distribution scaled to 1, the Bayes-Factor (H0/H1) favors H0 over H1 4.95:1. The most appropriate interpretation of this finding is that the prior odds should be updated by a factor of 5:1 in favor of H0, whatever these prior odds are. However, many users who compute Bayes-Factors interpret them directly with reference to Jeffreys's criterion values, and a value greater than 3 can be and has been used to suggest that the data provide support for the null-hypothesis.
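The arithmetic for this example can be retraced with a few lines of R. This is only a sketch with hypothetical variable names; it uses the same approximation for the sampling error as the script at the end of this post (2/sqrt(N) for two groups).

```r
iq_diff <- 3      # observed mean difference in IQ points
iq_sd   <- 15     # standard deviation of IQ scores
n_total <- 100    # 50 men and 50 women

d       <- iq_diff / iq_sd               # standardized mean difference
se      <- 2 / sqrt(n_total)             # approximate sampling error of d
t_value <- d / se                        # t-value
p_value <- 2 * pt(-abs(t_value), df = n_total - 2)   # two-sided p-value
ci      <- d + c(-1, 1) * 1.96 * se      # 95% confidence interval

print(round(c(d = d, se = se, t = t_value), 2))
print(round(ci, 2))   # -0.19  0.59
```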
This interpretation ignores that the a priori distribution of effect sizes allocates only a small probability (p = .07) to H0 and a much larger area to H1 (p = .93). When the Bayes Factor is combined with the prior odds (H0/H1) of .07/.93 = .075/1, the resulting Bayes Ratio shows that support for H0 increased, but that it is still more likely that H1 is true than that H0 is true, .075 * 4.95 = .37. This conclusion is consistent with the finding that the 95%CI overlaps with the region of effect sizes for H0 (d = -.1, .1).
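As a quick check of this arithmetic, the Bayes Ratio can be computed from the rounded values reported above. The function name `bayes_ratio` is hypothetical and the sketch only multiplies the two reported quantities.

```r
# Bayes Ratio = Prior Odds x Bayes Factor (both expressed as H0/H1)
bayes_ratio <- function(prior_odds, bayes_factor) prior_odds * bayes_factor

prior_odds_01 <- 0.07 / 0.93   # prior probability of H0 over prior probability of H1
bf01          <- 4.95          # Bayes Factor (H0/H1) reported above
br01          <- bayes_ratio(prior_odds_01, bf01)

print(round(br01, 2))          # 0.37
```

A value below 1 means that H1 remains more likely than H0 after the data, even though the Bayes Factor itself exceeds the criterion value of 3.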
We can increase the prior odds of H0 by restricting the range of effect sizes that are plausible under H1. For example, we can restrict effect sizes to 1 or we can set the scaling parameter of the Cauchy distribution to .5. This way, 50% of the distribution falls into the range between d = -.5 and .5.
The t-value and 95%CI remain unchanged because they do not require a specification of H1. By cutting the range of effect sizes for H1 roughly in half (from scaling parameter 1 to .5), the Bayes-Factor in favor of H0 is also cut roughly in half and is no longer above the criterion value of 3, BF (H0/H1) = 2.88.
The change of the alternative hypothesis has the opposite effect on prior odds. The probability of H0 nearly doubled (p = .13) and the prior odds are now .13/.87 = .15. The resulting Bayes Ratio in favor of H0 remains similar to the Bayes Ratio with the wider Cauchy distribution, Bayes Ratio = .15 * 2.88 = 0.45. In fact, it is a bit stronger than the Bayes Ratio with the wider specification of effect sizes (BR (H0/H1) = .45 vs. .37). However, both Bayes Ratios lead to the same conclusion, which is also consistent with the observed effect size, d = .2, and the confidence interval around it, d = -.19 to d = .59. That is, given the small sample size, the observed effect size provides insufficient information to draw any firm conclusions about H0 or H1. More data are required to decide empirically which hypothesis is more likely to be true.
The example used an arbitrary observed effect size of d = .2. Evidently, effect sizes much larger than this would lead to the rejection of H0 with p-values, confidence intervals, Bayes Factors, or Bayes-Ratios. A more interesting question is what the results would be like if the observed effect size had provided maximum support for the null-hypothesis, that is, an observed effect size of 0, which also produces a t-value of 0. With the default prior of Cauchy(M=0,V=1), the Bayes-Factor in favor of H0 is 9.42, which is close to the next criterion value of BF > 10 that is sometimes used to stop data collection because the results are decisive. However, the Bayes Ratio is still slightly in favor of H1, BR (H1/H0) = 1.42. The 95%CI ranges from -.39 to .39 and extends well beyond the criterion range of effect sizes from -.1 to .1. Thus, the Bayes Ratio shows that even an observed effect size of 0 in a sample of N = 100 provides insufficient evidence to infer that the null-hypothesis is true.
When we increase sample size to N = 2,000, the 95%CI around d = 0 ranges from -.09 to .09. This finding means that the data support the null-hypothesis and that we would make a mistake in our inferences that use the same approach in no more than 5% of our tests (not just those that provide evidence for H0, but all tests that use this approach). The Bayes-Factor also favors H0 with a massive BF (H0/H1) = 711.27. The Bayes-Ratio also favors H0, with a Bayes-Ratio of 53.35. As Bayes-Ratios are the ratio of two complementary probabilities, p(H0) + p(H1) = 1, we can compute the probability of H0 being true with the formula BR(H0/H1) / (BR(H0/H1) + 1), which yields a probability of 98%. We see how the Bayes-Ratio is consistent with the information provided by the confidence interval. The long-run error frequency for inferring H0 from the data was less than 5% and the probability of H1 being true given the data is 1-.98 = .02.
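The conversion from a Bayes Ratio to a probability, and the confidence interval for the larger sample, can be retraced with a short sketch. The function name `odds_to_prob` is hypothetical; the Bayes Ratio of 53.35 is the value reported above.

```r
# Convert posterior odds (H0/H1) into the probability of H0 given the data
odds_to_prob <- function(odds) odds / (odds + 1)

br01      <- 53.35                 # Bayes Ratio (H0/H1) reported above
p_h0_post <- odds_to_prob(br01)    # probability of H0 given the data
p_h1_post <- 1 - p_h0_post         # probability of H1 given the data

# 95% confidence interval around d = 0 with N = 2,000 (two groups)
se_2000 <- 2 / sqrt(2000)
ci_2000 <- 0 + c(-1, 1) * 1.96 * se_2000

print(round(c(p_h0 = p_h0_post, p_h1 = p_h1_post), 2))   # 0.98 0.02
print(round(ci_2000, 2))                                 # -0.09 0.09
```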
Bayesian Hypothesis Testing has received increased interest among empirical psychologists, especially in situations when researchers aim to demonstrate the lack of an effect. Increasingly, researchers use Bayes-Factors with criterion values to claim that their data provide evidence for the null-hypothesis. This is wrong for three reasons.
First, it is impossible to test a hypothesis that is specified as one effect size out of an infinite number of alternative effect sizes. Researchers appear to believe, mistakenly, that Bayes Factors in favor of H0 can be used to suggest that all other effect sizes are implausible. This is not the case because Bayes Factors do not compare H0 to all other effect sizes. They compare H0 to a composite hypothesis of all other effect sizes, and Bayes Factors depend on the way the composite is created. Falsification of one composite does not ensure that the null-hypothesis is true (the only viable hypothesis still standing) because other composites can still fit the data better than H0.
Second, the use of Bayes-Factors with criterion values also suffers from the problem that it ignores the a priori odds of H0 and H1. A fully Bayesian inference requires taking the prior odds into account and computing posterior odds or Bayes Ratios. The problem for the point-null hypothesis (d = 0) is that the prior odds for H0 over H1 are 0. The reason is that the prior distribution of effect sizes adds up to 1 (the true effect size has to be somewhere), leaving zero probability for d = 0. It is possible to compute Bayes-Factors for d = 0 because Bayes-Factors use densities. For the computation of Bayes Factors the distinction between densities and probabilities is not important, but for the computation of prior odds, the distinction is important. A single effect size has a density on the Cauchy distribution, but it has zero probability.
The fundamental inferential problem of Bayes-Factors that compare H0:d=0 can be avoided by specifying H0 as a critical region around d=0. It is then possible to compute prior odds based on the area under the curve for H0 and the area under the curve for H1. It is also possible to compute Bayes Factors for H0 and H1 when H0 and H1 are specified as complementary regions of effect sizes. The two ratios can be multiplied to obtain a Bayes Ratio. Furthermore, Bayes Ratios can be converted into the probability of H0 given the data and the probability of H1 given the data. The results of this test are consistent with other approaches to the testing of regional null-hypotheses and they are robust to misspecifications of the alternative hypothesis that allocate too much weight to large effect sizes. Thus, I recommend Bayes Ratios for principled Bayesian Hypothesis testing.
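The distinction between densities and probabilities can be made concrete with a small sketch. It assumes a Cauchy prior centered at 0 with a scaling factor of 1; the variable names are hypothetical.

```r
# Density of the prior at exactly d = 0: positive, so it can enter a Bayes Factor
density_at_zero <- dcauchy(0, location = 0, scale = 1)   # equals 1/pi

# Probability of ever narrower regions around d = 0: shrinks to exactly zero
half_widths <- c(0.1, 0.01, 0.001, 0)
prob_region <- pcauchy(half_widths, 0, 1) - pcauchy(-half_widths, 0, 1)

print(round(density_at_zero, 3))
print(signif(prob_region, 3))
```

The density at d = 0 stays the same however the region is drawn, but the area under the curve, and with it the prior probability of H0, vanishes as the region collapses to a single point.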
R-Code for the analyses reported in this post.
### set input
### What is the total sample size?
N = 2000
### How many groups? One sample or two sample?
gr = 2
### what is the observed effect size
obs.es = 0
### Set the range for H0, H1 is defined as all other effect sizes outside this range
H0.range = c(-.1,.1) #c(-.2,.2) # 0 for classic point null
### What is the limit for maximum effect size, d = 14 = r = .99
limit = 14
### What is the mode of the a priori distribution of effect sizes?
mode = 0
### What is the variability (SD for normal, scaling parameter for Cauchy) of the a priori distribution of effect sizes?
var = 1
### What is the shape of the a priori distribution of effect sizes
shape = "Cauchy" # Uniform, Normal, Cauchy; Uniform needs limit
### End of Input
### R computes Likelihood ratios and Weighted Mean Likelihood Ratio (Bayes Factor)
prec = 100 #set precision, 100 is sufficient for 2 decimal
df = N-gr
se = gr/sqrt(N)
pop.es = mode
if (var > 0) pop.es = seq(-limit*prec,limit*prec)/prec
weights = 1
if (var > 0 & shape == "Cauchy") weights = dcauchy(pop.es,mode,var)
if (var > 0 & shape == "Normal") weights = dnorm(pop.es,mode,var)
if (var > 0 & shape == "Uniform") weights = dunif(pop.es,-limit,limit)
H0.mat = cbind(0,1)
H1.mat = cbind(mode,1)
if (var > 0) H0.mat = cbind(pop.es,weights)[pop.es >= H0.range[1] & pop.es <= H0.range[2],]
if (var > 0) H1.mat = cbind(pop.es,weights)[pop.es < H0.range[1] | pop.es > H0.range[2],]
H0.mat = matrix(H0.mat,,2)
H1.mat = matrix(H1.mat,,2)
H0 = sum(dt(obs.es/se,df,H0.mat[,1]/se)*H0.mat[,2])/sum(H0.mat[,2])
H1 = sum(dt(obs.es/se,df,H1.mat[,1]/se)*H1.mat[,2])/sum(H1.mat[,2])
BF10 = H1/H0
BF01 = H0/H1
Pr.H0 = sum(H0.mat[,2]) / sum(weights)
Pr.H1 = sum(H1.mat[,2]) / sum(weights)
PriorOdds = Pr.H1/Pr.H0
Bayes.Ratio10 = PriorOdds*BF10
Bayes.Ratio01 = 1/Bayes.Ratio10
### R creates output file
text = c()
text[1] = paste0('The observed t-value with d = ',obs.es,' and N = ',N,' is t(',df,') = ',round(obs.es/se,2))
text[2] = paste0('The 95% confidence interval is ',round(obs.es-1.96*se,2),' to ',round(obs.es+1.96*se,2))
text[3] = paste0('Weighted Mean Density(H0:d >= ',H0.range[1],' & <= ',H0.range[2],') = ',round(H0,5))
text[4] = paste0('Weighted Mean Density(H1:d < ',H0.range[1],' | > ',H0.range[2],') = ',round(H1,5))
text[5] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H0/H1: ',round(BF01,2))
text[6] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H1/H0: ',round(BF10,2))
text[7] = paste0('The a priori likelihood ratio of H1/H0 is ',round(Pr.H1,2),'/',round(Pr.H0,2),' = ',round(PriorOdds,2))
text[8] = paste0('The Bayes Ratio(H1/H0) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio10,2))
text[9] = paste0('The Bayes Ratio(H0/H1) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio01,2))
### print output