I learned about Bayes' theorem in the 1990s and used Bayes's famous formula in my first JPSP article (Schimmack & Reisenzein, 1997). When Wagenmakers et al. (2011) published their criticism of Bem (2011), I did not know about Bayesian statistics. I have since learned more about Bayesian statistics, and I am aware that there are many different approaches to using priors in statistical inference. This post is about a single Bayesian statistical approach, namely Bayesian Null-Hypothesis Testing (BNHT), which has been attributed to Jeffreys, was introduced into psychology by Rouder, Speckman, and Sun (2009), and was used by Wagenmakers et al. (2011) to suggest that Bem's evidence for ESP rested on flawed p-values, whereas Bayes-Factors showed no evidence for ESP (although they did not show evidence for the absence of ESP, either). Since then, I have learned more about Bayes-Factors, in part from reading blog posts by Jeff Rouder, including R-Code to run my own simulation studies, and from discussions with Jeff Rouder on social media. I am not an expert on Bayesian modeling, but I understand the basic logic underlying Bayes-Factors.
Rouder et al.'s (2009) article has been cited over 800 times and was cited over 200 times in 2016 and 2017. An influential article like this cannot be ignored. Like all other inferential statistical methods, JBFs (Jeffreys's Bayes Factors, or Jeff's Bayes Factors) examine statistical properties of data (effect size, sampling error) in relation to sampling distributions of test statistics. Rouder et al. (2009) focused on the t-distributions that underlie the comparison of means with t-tests. Although most research articles in psychology continue to use traditional significance testing, the use of Bayes-Factors is on the rise. It is therefore important to critically examine how Bayes-Factors are being used and whether inferences based on Bayes-Factors are valid.
Inferences about Sampling Error as Causes of Observed Effects.
The main objective of inferential statistics in psychological research is to rule out the possibility that an observed effect is merely a statistical fluke. If the evidence obtained in a study is strong enough given some specified criterion value, researchers are allowed to reject the hypothesis that an observed effect was merely produced by chance (a false positive effect) and interpret the result as being caused by some effect. Although Bayes-Factors could be reported without drawing conclusions (just like t-values or p-values could be reported without drawing inferences), most empirical articles that use Bayes-Factors use them to draw inferences about effects. Thus, the aim of this blog post is to examine whether empirical researchers use JBFs correctly.
Two Types of Errors
Inferential statistics are error prone. The outcome of empirical studies is not deterministic and results from samples may not generalize to populations. There are two possible errors that can occur, the so-called type-I errors and type-II errors. Type-I errors are false positive results. A false positive result occurs when there is no real effect in the population, but the results of a study led to the rejection of the null-hypothesis that sampling error alone caused the observed mean differences. The second error is the false inference that sampling error alone caused an observed difference in a sample, while a test of the entire population would show that there is an actual effect. This is called a false negative result. The main problem in assessing type-II errors (false negatives) is that the probability of a type-II error depends on the magnitude of the effect. Large effects can be easily observed even in small samples and the risk of a type-II error is small. However, as effect sizes become smaller and approach zero, it becomes harder and harder to distinguish the effect from pure sampling error. Once effect sizes become really small (say 0.0000001 percent of a standard deviation), it is practically impossible to distinguish results of a study with a tiny real effect from results of a study with no effect at all.
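The dependence of the type-II error rate on effect size can be illustrated with a small simulation. This is my own sketch, not code from any cited paper; the function name, sample size, and effect sizes are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def type2_rate(d, n=50, sims=2000, alpha=0.05):
    """Fraction of simulated one-sample t-tests (true effect size d,
    n observations) that fail to reach significance: the type-II error rate."""
    misses = 0
    for _ in range(sims):
        x = rng.normal(d, 1.0, n)  # sample with true standardized effect d
        _, p = stats.ttest_1samp(x, 0.0)
        misses += p >= alpha
    return misses / sims

# Large effects are almost never missed; tiny effects are almost always missed.
for d in (0.8, 0.3, 0.0000001):
    print(d, type2_rate(d))
```

With n = 50, an effect of d = .8 is missed almost never, d = .3 is missed roughly half the time, and a near-zero effect is indistinguishable from no effect at all, mirroring the point above.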
For reasons that are irrelevant here, psychologists have ignored type-II errors. A type-II error can only be made when researchers conclude that an effect is absent. However, empirical psychologists were trained to ignore non-significant results as inconclusive rather than drawing the inferences that an effect is absent, and risking making a type-II error. This led to the belief that p-values cannot be used to test the hypothesis that an effect is absent. This was not much of a problem because most of the time psychologists made predictions that an effect should occur (reward should increase behavior; learning should improve recall, etc.). However, it became a problem when Bem (2011) claimed to demonstrate that subliminal priming can influence behavior even if the prime is presented AFTER the behavior occurred. Wagenmakers et al. (2011) and others found this hypothesis implausible and the evidence for it unbelievable. However, rather than being satisfied with demonstrating that the evidence is flawed, it would be even better to demonstrate that this implausible effect does not exist. Traditional statistical methods that focus on rejecting the null-hypothesis did not allow this. Wagenmakers et al. (2011) suggested that Bayes-Factors solve this problem because they can be used to test the plausible hypothesis that time-reversed priming does not exist (the effect size is truly zero). Many subsequent articles have used JBFs for exactly this purpose; that is, to provide empirical evidence in support of the null-hypothesis that an observed mean difference is entirely due to sampling error. Like all inductive inferences, inferences in favor of H0 can be false. While psychologists have traditionally ignored type-II errors because they did not make inferences in favor of H0, the rise of inferences in favor of H0 by means of JBFs makes it necessary to examine the validity of these inferences.
The main problem of using JBFs to provide evidence for the null-hypothesis is that Bayes-Factors are ratios of two hypotheses. The data can be more or less compatible with each of the two hypotheses. Say the data favor H0 with a likelihood of .2 and H1 with a likelihood of .1; the ratio of the two likelihoods is .2/.1 = 2. The greater the likelihood in favor of H0, the more likely it is that an observed mean difference is purely sampling error. As JBFs are ratios of two likelihoods, they depend on the specification of H1. For t-tests with continuous variables, H1 is specified as a weighted distribution of effect sizes. Although H1 covers all possible effect sizes, it is possible to create an infinite number of alternative hypotheses (H1.1, H1.2, H1.3 … H1.∞). The Bayes-Factor changes as a function of the way H1 is specified. Thus, while one specific H1 may produce a JBF of 1000000:1 in favor of H0, another one may produce a JBF of 1:1. It is therefore a logical fallacy to infer from a specific JBF for one particular H1 that H0 is supported, true, or that there is evidence for the absence of an effect. The logically correct inference is that, with extremely high probability, the alternative hypothesis is false; but that does not justify the inverse inference that H0 is true, because H0 and H1 do not specify the full event space of all possible hypotheses that could be tested. It is easy to overlook this because every H1 covers the full range of effect sizes, but these effect sizes can be used to create an infinite number of alternative hypotheses.
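The dependence on the specification of H1 can be made concrete. The sketch below is my own illustration (not code from Rouder et al.): it computes a default one-sample Bayes factor by averaging the noncentral t likelihood over a Cauchy prior on the standardized effect size. Changing the prior scale r changes H1, and with it the Bayes factor for the very same data:

```python
from scipy import integrate, stats

def jzs_bf10(t, n, r=0.707):
    """Bayes factor (H1 over H0) for a one-sample t-test: the noncentral t
    likelihood averaged over a Cauchy(0, r) prior on effect size delta,
    divided by the likelihood under delta = 0. Integration is truncated at
    |delta| = 2, where the likelihood is negligible for the t-values used here."""
    df = n - 1
    null_lik = stats.t.pdf(t, df)  # likelihood of the observed t when delta = 0
    alt_lik, _ = integrate.quad(
        lambda d: stats.nct.pdf(t, df, d * n**0.5) * stats.cauchy.pdf(d, 0, r),
        -2, 2, points=[t / n**0.5])  # hint the peak at the observed effect size
    return alt_lik / null_lik

# Same data (t = 1.5, n = 50), different specifications of H1,
# different Bayes factors:
for scale in (0.1, 0.707, 2.0):
    print(scale, jzs_bf10(1.5, 50, scale))
```

For the same t-value, a narrow prior on small effects and a wide prior on large effects yield clearly different Bayes factors, which is the point made above: a JBF is always relative to one particular H1.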
To make it simple, let's forget about sampling distributions and likelihoods. Using JBFs to claim that the data support the null-hypothesis in some absolute sense is like a guessing game in which you can pick any number you want: I guess that you picked 7 (because people like the number 7), you say it was not 7, and I now infer that you must have picked 0, as if 7 and 0 were the only options. If you think this is silly, you are right, and it is equally silly to infer from rejecting one out of an infinite number of possible H1s that H0 must be true.
So, a correct use of JBFs would be to state conclusions in terms of the H1 that was specified to compute the JBF. For example, in Wagenmakers et al.'s analyses of Bem's data, the authors specified H1 as a hypothesis that allocated 25% probability to effect sizes of d less than -1 (a strong effect in the direction opposite to the predicted effect) and 25% probability to d greater than 1 (a very strong effect, similar to gender differences in height). Even if the JBF had strongly favored H0, which it did not, it would not justify the inference that time-reversed priming does not exist. It would merely justify the inference that the effect size is unlikely to be greater than 1, one way or the other. However, if Wagenmakers et al. (2011) had presented their results correctly, nobody would have bothered to take notice of such a trivial inference. It was only the incorrect presentation of JBFs as a test of the null-hypothesis that led to the false belief that JBFs can provide evidence for the absence of an effect (e.g., that the true effect size in Bem's studies is zero). In fact, Wagenmakers played the game where H1 guessed that the effect size is 1, H1 was wrong, and the conclusion was that the effect size must be 0. This is an invalid inference because there are still an infinite number of plausible effect sizes between 0 and 1.
There is nothing inherently wrong with calculating likelihood ratios and using them to test competing predictions. However, the use of Bayes-Factors as a way to provide evidence for the absence of an effect is misguided because it is logically impossible to provide evidence for one specific effect size out of an infinite set of possible effect sizes. It doesn't matter whether the specific effect size is 0 or any other value. A likelihood ratio can only compare two hypotheses out of an infinite set of hypotheses. If one hypothesis is rejected, it does not justify inferring that the other hypothesis is true. This is why we can falsify H0: when we reject H0, we do not infer that one specific effect size is true; we merely infer that it is not 0, leaving open the possibility that it is any one of the other infinite number of effect sizes. We cannot reverse this, because we cannot test the hypothesis that the effect is zero against a single hypothesis that covers all other effect sizes. JBFs can only test H0 against all other effect sizes by assigning weights to them, and as there is an infinite number of ways to weight effect sizes, there is an infinite set of alternative hypotheses. Thus, we can reject the hypothesis that sampling error alone produced an effect, but practically we can never demonstrate that sampling error alone caused the outcome of a study.
To demonstrate that an effect does not exist it is necessary to specify a region of effect sizes around zero. The smaller the region, the more resources are needed to provide evidence that the effect size is at best very small. One negative consequence of the JBF approach has been that small samples were used to claim support for the point null-hypothesis, with a high probability that this conclusion was a false negative result. Researchers should always report the 95% confidence interval around the observed effect size. If this interval includes effect sizes of .2 standard deviations, the inference in favor of a null-result is questionable because many effect sizes in psychology are small. Confidence intervals (or Bayesian credibility intervals with plausible priors) are more useful for claims about the absence of an effect than misleading statistics that pretend to provide strong evidence in favor of a false null-hypothesis.
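The suggested check is easy to carry out. Here is a minimal sketch (the standard error is the usual large-sample approximation for Cohen's d; the numbers are made up for illustration):

```python
import math

def d_confint(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d in a two-group
    comparison, using the large-sample normal approximation to its SE."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# A small observed effect in a small sample:
lo, hi = d_confint(0.05, 50, 50)
print(lo, hi)
# If the interval still covers d = .2, claiming the effect is absent is premature.
print(hi >= 0.2)
```

With 50 participants per group, the interval around d = .05 spans well beyond ±.2, so a "no effect" conclusion from such a sample is exactly the questionable inference described above; with thousands per group, the interval narrows enough to exclude .2.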
5 thoughts on “The Deductive Fallacy in (some) Bayesian Inductive Inferences”
Schimmack's thoughts about Bayes factors are serious and contemplative. So it is my pleasure to address them.
In 2001 I teamed up with Bayesian statisticians Paul Speckman and Dongchu Sun to develop Bayesian hierarchical models for RT, and published a number of hierarchical applications to solve difficult problems in cognition. We avoided model comparison at the time, content to do parameter estimation. But by 2008, it was clear we needed more—estimation was fine, but we needed some method of comparing models that made competing theoretical commitments. EJ and Dongchu were both pushing objective Bayes and Bayes factors toward 2007; Mike Lee had dabbled with them earlier, and we were all moved by Jay Myung's seminal papers and talks from the late 1990s on integrating out parameters. Morey was my student at the time, and the journey was ours together. Working with Richard is something I cherish deeply; he is probably the most thoughtful and judicious person I have ever met.
This sets the stage for the 2009 paper, our first on Bayes factors. Not too surprisingly, our thoughts have evolved. I have mixed feelings when reading the 2009 paper. On the positive side, it is pretty good for a first foray, and it was far more successful than I could have imagined. Yet, I can see where our former frequentist training affected how we presented things. I am much happier with the freely available Collabra paper of 2016 and recommend it as an introduction to Bayes factors.
Schimmack's main critique is that the default BF tends to misdiagnose small true effects as evidence for the point null, and that this behavior may be amplified in moderately sized samples. Schimmack's claim about the dependence of Bayes factors on hypothetical point truths is absolutely correct. Indeed, we highlight it in most of our publications, including the 2009 one. My colleagues and I view it as a strong positive, and we think it should persuade you that BFs make sense. Here is why:
1. We need a more mature description of terms and goals than provided by Schimmack. Our goals are not to "rule out the possibility that an observed effect is merely a statistical fluke." First, we can't rule out possibilities; the mere act of rejecting a null model does not make it more or less plausible. Second, "statistical fluke" is a funny and imprecise term. If something is not a statistical fluke, what is it? We start with models (not hypotheses, which are also rather vague), and these consist of random variables and parameters. We take the one-sample case—data are distributed as a normal—and we believe the mean of this normal tells us something theoretically important about the phenomena under study. The mean is a parameter, called mu, and we wish to draw inferences about it. Since mu is theoretically important, different values of it convey different, competing theoretical positions. Inference on mu helps us understand the plausibility of these positions. The language here matters.
2. It is an abuse of language to think of these models and parameters as true or false. They are abstractions, much like a subway map is an abstraction of the subway. They are abstract descriptions, and we are aiming to get the best abstract description that we can given the data at hand. Our models describe some structure in the data that we think is important—for a subway map it is the ordering of stops. In our case, it is some statement about the value or ranges of mu. Whatever we infer, we know it is provisional on the data at hand, and it may be updated as new data come forth or if new theoretical statements become pertinent. Truth and falseness are unhelpful concepts (unless you want to say everything is false, which is equally unhelpful).
3. There are two possible critiques of Bayes factors that Schimmack conflates. The first is that the Bayes factor method of model comparison is flawed. This view is taken by Kruschke, Gelman, Senn, and Efron, among many others, though for differing reasons. None of these critiques, however, are about truth and misdiagnosis. The second is that Bayes factors are just fine, but the default models recommended by Rouder et al. are inappropriate to capture theoretically important statements. It is hard to tell which critique Schimmack is making. Perhaps it is both. I invite clarification.
4. We need to have a theoretical reason to do testing (or model comparison). A lot of people test because they think they have to, and the field can be described as having testing-fetishism (someone else came up with this term, but I can't remember who). So, why would we test? Well, we think the null is a theoretically interesting model, and we think the alternative is theoretically interesting as well. (If we only think one is of theoretical interest, then, by default, it is the most useful description.) OK, here we go.
5. Now, if you care about the null and the alternative as theoretically useful statements, you probably have a pretty good idea what an effect looks like should it exist. If you don't know whether your effect should be in microseconds, milliseconds, seconds, hours, days, years, or millennia, you have no business doing a test. As a rule of thumb, if you can't predict the order of magnitude of your effect beforehand, you may be better off forgoing testing altogether.
6. So, the Schimmack claim is that evidence for the null is provisional, dependent on sample size. We think this is highly appropriate. As we note, if you have a moderately sized experiment capable of detecting moderate effects (say .3 in effect size), but you observed small effects (say .05 in effect size), then the null is often an excellent, parsimonious, theoretically relevant description. Nobody is so bold as to claim the null is true—truth is a concept beyond our mortal abilities—but it is an excellent provisional description. Should we then increase the sample size so that we have the resolution to detect effects as small as .01, then .05 becomes evidence for the alternative.
7. Schimmack, in my view, is trying to fit the round peg of Bayesian probability into the square hole of frequentist probability. We all do this because we all start with frequentist views of probability. In fact, there is an excellent monograph to be written on frequentist influences in stating and understanding Bayesian ideas. However, if you wish to understand and evaluate Bayesian probability, you need to do it in a way that is fairer than saying it doesn't meet frequentist goals. That gets us nowhere except reifying notions of truths, errors, decisions, long-term error rates, and the like. If these things are important to you, then be a frequentist. But clinging to them is not a critique of Bayesian logic. The reader of Schimmack's critique should be aware that some parts of the description of what a Bayes factor is are not sufficiently precise. The Bayes factor is not a likelihood ratio except in very trivial cases. Likelihoods are functions of parameters; they are not bounded between 0 and 1, nor do they obey the probability axioms. Bayes factors are ratios of proper probabilities, and they are not functions of parameters. The Bayes factor is the ratio of the marginal probabilities of the data under the models—that is, a comparison of predictions about where the data should fall under given models. And that is pretty intuitive and neat.
8. If you use BFs, you make a commitment to well-specified alternatives. They must be specified as a proper distribution. In exchange, you actually predict where data should be conditional on a model, and you use these predictions to draw inferences. Inferences are not decisions; they are statements of evidence, of how much better one model predicted the data than another. Your inferences are based on the data at hand. They are statements about this experiment directly, rather than statements about a sequence of hypothetical experiments like this one. If you are unprepared to make this commitment, you may not be ready for testing or model comparison. My sense, however, is that if you know just the orders of magnitude of the effects you expect if there is an effect, then you are ready.
It is not true that "the logically correct inference is that, with extremely high probability, the alternative hypothesis is false." Where does that come from?
Remember that BFs are comparative and do not provide evidence or inference for a claim, but rather for one claim being favored over another, given a certain measure of favoring. The real problem Schimmack is leading up to, I take it, is that the method of BFs does not have error control. They can't say there's a low probability of finding a BF favoring A over B, even if A is false. But BF advocates will happily deny they are interested in the error probabilities of methods. That's the essential difference between error-statistical and BF accounts. Now in the special case of predesignated point vs. point hypotheses, there is pre-data error control, but even it is vitiated by data-dependent selections. Finally, the Bayesian might say, as they often do, that they are unconcerned with error control because that would only matter if one were interested in long-run performance, whereas they are measuring comparative degrees of belief (or the like). That, again, is a very useful way to draw out the philosophical distinction between the two approaches. However, the error statistician can also deny that her interest is in long-run performance, and argue that poor error control prevents considering a resulting inference as well tested. That is my argument.
I am not a philosopher or a statistician, but I have some formal training in logical reasoning and probability theory.
I learned that the probabilities of a set of exhaustive and mutually exclusive events add up to 1.
So, if we have two options (the famous coin flip), the probability of one event (heads) is 1 minus the probability of the other event (tails). Or, stated differently:
p(heads) + p(tails) = 1
This implies that as the probability of one event decreases to 0, the other probability approaches 1. In induction it is never zero, but we can have a criterion close to 0, to say we reject this possibility. If there are only two events, the rejection of one possible outcome means we automatically accept the other.
When we apply this to the statistical problem of determining whether a mean difference or a correlation is positive or negative, we again have only two possible outcomes (the boundary 0 is irrelevant, as it is never exactly 0 and we cannot really show that it is 0 when it is). So, if we have a correlation of r = .8 in a sample of N = 100,000, we can reject H0: r < 0 with a p-value that is practically 0 and accept H1: r > 0.
Note that in this case of a one-sided test, p-values and BF are not in conflict.
Now we come to the scenario where we have more than two events. Once more they have to add up to 1.
A classic example would be a die.
p(1) + p(2) + p(3) + p(4) + p(5) + p(6) = 1
Let's say we roll a die 1,000 times and never observe a 1. This would be highly improbable if the die were a normal and fair die. The probability of this event is practically zero. We therefore reject the hypothesis that this die is fair. We can also accept the hypothesis that the die will never produce a 1. We can do so by combining 2-6 into a single (not-1) event and applying the logic of two events. However, now we are contrasting the event 1 with the compound event 2-6. As a result, rejection of the event "1" allows us to accept the compound event "2-6," but we cannot say anything about the individual probabilities of 2, 3, 4, 5, or 6.
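The probability in this example can be checked directly:

```python
# Chance of rolling a fair die 1000 times and never seeing a 1:
p_no_ones = (5 / 6) ** 1000
print(p_no_ones)  # smaller than 1e-79: practically zero, so the fair-die model is rejected
```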
In this example, we could still accept a single event if the data support only a single event (the die always shows 6).
However, once we move to effect sizes (correlations, standardized mean differences), the event space is no longer clearly defined because there are infinitely many events (all numbers from -Inf to +Inf).
We can still reject specific values, if they are in a region of highly unlikely events, but we can no longer accept a single event (e.g. the value is exactly 0 = 1/Inf) because there are always other values outside the region of rejection that are still possible.
As a result, Bayes-Factors in favor of the point null have to be interpreted as "not yet rejected, but probably to be rejected eventually as sample sizes increase further."
This is at the core of the Jeffreys/Lindley paradox. This “paradox” is not a paradox at all. It arises from the misinterpretation of Bayes-Factors as supporting H0: no difference in population, when the proper interpretation is “given the current sample size, there is no evidence to reject H0, but we do not know what will happen when the sample size increases further.” Any other conclusion is a “fallacy of acceptance.” (Spanos, 2013).
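This sample-size dependence can be illustrated with a rough sketch (my own code, using a large-sample normal approximation to the t-statistic and a default Cauchy(0, .707) alternative; all numbers are illustrative):

```python
from scipy import integrate, stats

def bf01_approx(t, n, r=0.707):
    """Rough default Bayes factor in favor of H0, using the large-sample
    approximation t | d ~ Normal(d * sqrt(n), 1) and a Cauchy(0, r)
    prior on the standardized effect size d under H1."""
    null_lik = stats.norm.pdf(t)  # likelihood of t when the true effect is 0
    alt_lik, _ = integrate.quad(
        lambda d: stats.norm.pdf(t - d * n**0.5) * stats.cauchy.pdf(d, 0, r),
        -8, 8, points=[t / n**0.5])  # hint the peak at the observed effect size
    return null_lik / alt_lik

# A fixed small true effect (d = .05): the default BF favors H0 in small
# samples and flips once the sample is large enough to resolve the effect.
for n in (100, 1000, 100_000):
    t = 0.05 * n**0.5  # expected t-value when the true effect is d = .05
    print(n, bf01_approx(t, n))
```

For the same small true effect, the BF supports H0 at n = 100 and n = 1,000 but decisively rejects it at n = 100,000, which is exactly the "not yet rejected" reading advocated above.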
The simple solution to this "paradox" is to define H0 as a region around zero, which produces two clearly defined and distinct events: the population parameter falls into the region around zero (H0 is supported and H1 is rejected), or it does not fall into this region (H1 is supported and H0 is rejected).
Once we have clearly defined regions, we can test the hypotheses with one-sided tests (either just one boundary for positive effects, or boundaries for both positive and negative effects), and p-values and BFs agree in their conclusions.
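A minimal sketch of this region-based approach uses two one-sided t-tests (the standard TOST equivalence-testing logic; the function name and the ±.2 bounds are my own illustration, assuming a standardized outcome):

```python
import numpy as np
from scipy import stats

def tost(x, low=-0.2, high=0.2):
    """Two one-sided tests: both boundary nulls (mu <= low and mu >= high)
    must be rejected to conclude the mean lies inside the region (low, high).
    Returns the larger of the two one-sided p-values."""
    n = len(x)
    se = np.std(x, ddof=1) / np.sqrt(n)
    p_low = stats.t.sf((np.mean(x) - low) / se, n - 1)    # H0: mu <= low
    p_high = stats.t.cdf((np.mean(x) - high) / se, n - 1)  # H0: mu >= high
    return max(p_low, p_high)

rng = np.random.default_rng(1)
null_data = rng.normal(0.0, 1.0, 2000)    # true effect zero
effect_data = rng.normal(0.5, 1.0, 2000)  # true effect well outside the region
print(tost(null_data))    # small p: the parameter falls inside the region around zero
print(tost(effect_data))  # large p: no support for the region around zero
```

Unlike a point-null BF, this test can genuinely support the "no meaningful effect" region, and its conclusion does not flip as the sample grows when the true effect really is inside the region.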
So, the paradox arises from the ill-defined problem of testing a point value (typically 0) against a region (all other values).