Even statistically sophisticated psychologists struggle with the interpretation of replication studies (Maxwell et al., 2015). This article gives a basic introduction to the interpretation of statistical results within the Neyman Pearson approach to statistical inferences.
I make two important points and correct some potential misunderstandings in Maxwell et al.’s discussion of replication failures. First, there is a difference between providing sufficient evidence for the null-hypothesis (evidence of absence) and providing insufficient evidence against the null-hypothesis (absence of evidence). Replication studies are useful even if they simply produce absence of evidence without evidence that an effect is absent. Second, I point out that publication bias undermines the credibility of significant results in original studies. When publication bias is present, open replication studies are valuable because they provide an unbiased test of the null-hypothesis, while original studies are rigged to reject the null-hypothesis.
DEFINITION OF REPLICATING A STATISTICAL RESULT
Replicating something means to get the same result. If I make the first free throw, replicating this outcome means to also make the second free throw. When we talk about replication studies in psychology we borrow from the common meaning of the term “to replicate.”
If we conduct psychological studies, we can control many factors, but some factors are not under our control. Participants in two independent studies differ from each other and the variation in the dependent variable across samples introduces sampling error. Hence, it is practically impossible to get identical results, even if the two studies are exact copies of each other. It is therefore more complicated to compare the results of two studies than to compare the outcome of two free throws.
To determine whether the results of two studies are identical or not, we need to focus on the outcome of a study. The most common outcome in psychological studies is a significant or non-significant result. The goal of a study is to produce a significant result and for this reason a significant result is often called a success. A successful replication study is a study that also produces a significant result. Obtaining two significant results is akin to making two free throws. This is one of the few agreements between Maxwell and me.
“Generally speaking, a published original study has in all likelihood demonstrated a statistically significant effect. In the current zeitgeist, a replication study is usually interpreted as successful if it also demonstrates a statistically significant effect.” (p. 488)
The more interesting and controversial scenario is a replication failure. That is, the original study produced a significant result (success) and the replication study produced a non-significant result (failure).
I propose that a lot of confusion arises from the distinction between original and replication studies. If a replication study is an exact copy of the first study, the outcome probabilities of original and replication studies are identical. Otherwise, the replication study is not really a replication study.
There are only three possible outcomes in a set of two studies: (a) both studies are successful, (b) one study is a success and one is a failure, or (c) both studies are failures. The probability of these outcomes depends on whether the significance criterion (the type-I error probability) when the null-hypothesis is true and the statistical power of a study when the null-hypothesis is false.
Table 1 shows the probability of the outcomes in two studies. The uncontroversial scenario of two significant results is very unlikely, if the null-hypothesis is true. With conventional alpha = .05, the probability is .0025 or 1 out of 400 attempts. This shows the value of replication studies. False positives are unlikely to repeat themselves and a series of replication studies with significant results is unlikely to occur by chance alone.
2 sig, 0 ns
1 sig, 1 ns
0 sig, 2 ns
H0 is True
H1 is True
The probability of a successful replication of a true effect is a function of statistical power (1 – type-II error probability). High power is needed to get significant results in a pair of studies (an original study and a replication study). For example, if power is only 50%, the chance of this outcome is only 25% (Schimmack, 2012). Even with conventionally acceptable power of 80%, only 2/3 (64%) of replication attempts would produce this outcome. However, studies in psychology do not have 80% power and estimates of power can be as low as 37% (OSC, 2015). With 40% power, a pair of studies would produce significant results in no more than 16 out of 100 attempts. Although successful replications of true effects with low power are unlikely, they are still much more likely then significant results when the null-hypothesis is true (16/100 vs. 1/400 = 64:1). It is therefore reasonable to infer from two significant results that the null-hypothesis is false.
If the null-hypothesis is true, it is extremely likely that both studies produce a non-significant result (.95^2 = 90.25%). In contrast, it is unlikely that even a study with modest power would produce two non-significant results. For example, if power is 50%, there is a 75% chance that at least one of the two studies produces a significant result. If power is 80%, the probability of obtaining two non-significant results is only 4%. This means, it is much more likely (22.5 : 1) that the null-hypothesis is true than that the alternative hypothesis is true. This does not mean that the null-hypothesis is true in an absolute sense because power depends on the effect size. For example, if 80% power were obtained with a standardized effect size of Cohen’s d = .5, two non-significant results would suggest that the effect size is smaller than .5, but it does not warrant the conclusion that H0 is true and the effect size is exactly 0. Once more, it is important to distinguish between the absence of evidence for an effect and the evidence of absence of an effect.
The most controversial scenario assumes that the two studies produced inconsistent outcomes. Although theoretically there is no difference between the first and the second study, it is common to focus on a successful outcome followed by a replication failure (Maxwell et al., 2015). When the null-hypothesis is true, the probability of this outcome is low; .05 * (1-.05) = .0425. The same probability exists for the reverse pattern that a non-significant result is followed by a significant one. A probability of 4.25% shows that it is unlikely to observe a significant result followed by a non-significant result when the null-hypothesis is true. However, the low probability is mostly due to the low probability of obtaining a significant result in the first study, while the replication failure is extremely likely.
Although inconsistent results are unlikely when the null-hypothesis is true, they can also be unlikely when the null-hypothesis is false. The probability of this outcome depends on statistical power. A pair of studies with very high power (95%) is very unlikely to produce an inconsistent outcome because both studies are expected to produce a significant result. The probability of this rare event can be as low, or lower, than the probability with a true null effect; .95 * (1-.95) = .0425. Thus, an inconsistent result provides little information about the probability of a type-I or type-II error and is difficult to interpret.
In conclusion, a pair of significance tests can produce three outcomes. All three outcomes can occur when the null-hypothesis is true and when it is false. Inconsistent outcomes are likely unless the null-hypothesis is true or the null-hypothesis is false and power is very high. When two studies produce inconsistent results, statistical significance provides no basis for statistical inferences.
The counting of successes and failures is an old way to integrate information from multiple studies. This approach has low power and is no longer used. A more powerful approach is effect size meta-analysis. Effect size meta-analysis was one way to interpret replication results in the Open Science Collaboration (2015) reproducibility project. Surprisingly, Maxwell et al. (2015) do not consider this approach to the interpretation of failed replication studies. To be clear, Maxwell et al. (2015) mention meta-analysis, but they are talking about meta-analyzing a larger set of replication studies, rather than meta-analyzing the results of an original and a replication study.
“This raises a question about how to analyze the data obtained from multiple studies. The natural answer is to use meta-analysis.” (p. 495)
I am going to show that effect-size meta-analysis solves the problem of interpreting inconsistent results in pairs of studies. Importantly, effect size meta-analysis does not care about significance in individual studies. A meta-analysis of a pair of studies with inconsistent results is no different from a meta-analysis of a pair of studies with consistent results.
Maxwell et al.’s (2015) introduced an example of a between-subject (BS) design with n = 40 per group (total N = 80) and a standardized effect size of Cohen’s d = .5 (a medium effect size). This study has 59% power to obtain a significant result. Thus, it is quite likely that a pair of studies produces inconsistent results (48.38%). However, a pair of studies with N = 80 has the power of a total sample size of N = 160, which means a fixed-effects meta-analysis will produce a significant result in 88% of all attempts. Thus, it is not difficult at all to interpret the results of pairs of studies with inconsistent results if the studies have acceptable power (> 50%). Even if the results are inconsistent, a meta-analysis will provide the correct answer that there is an effect most of the time.
A more interesting scenario are inconsistent results when the null-hypothesis is true. I turned to simulations to examine this scenario more closely. The simulation showed that a meta-analysis of inconsistent studies produced a significant result in 34% of all cases. The percentage slightly varies as a function of sample size. With a small sample of N = 40, the percentage is 35%. With a large sample of 1,000 participants it is 33%. This finding shows that in two-thirds of attempts, a failed replication reverses the inference about the null-hypothesis based on a significant original study. Thus, if an original study produced a false-positive results, a failed replication study corrects this error in 2 out of 3 cases. Importantly, this finding does not warrant the conclusion that the null-hypothesis is true. It merely reverses the result of the original study that falsely rejected the null-hypothesis.
In conclusion, meta-analysis of effect sizes is a powerful tool to interpret the results of replication studies, especially failed replication studies. If the null-hypothesis is true, failed replication studies can reduce false positives by 66%.
DIFFERENCES IN SAMPLE SIZES
We can all agree that, everything else being equal, larger samples are better than smaller samples (Cohen, 1990). This rule applies equally to original and replication studies. Sometimes it is recommended that replication studies should use much larger samples than original studies, but it is not clear to me why researchers who conduct replication studies should have to invest more resources than original researchers. If original researchers conducted studies with adequate power, an exact replication study with the same sample size would also have adequate power. If the original study was a type-I error, the replication study is unlikely to replicate the result no matter what the sample size. As demonstrated above, even a replication study with the same sample size as the original study can be effective in reversing false rejections of the null-hypothesis.
From a meta-analytic perspective, it does not matter whether a replication study had a larger or smaller sample size. Studies with larger sample sizes are given more weight than studies with smaller samples. Thus, researchers who invest more resources are rewarded by giving their studies more weight. Large original studies require large replication studies to reverse false inferences, whereas small original studies require only small replication studies to do the same. Nevertheless, failed replications with larger samples are more likely to reverse false rejections of the null-hypothesis, but there is no magical number about the size of a replication study to be useful.
I simulated a scenario with a sample size of N = 80 in the original study and a sample size of N = 200 in the replication study (a factor of 2.5). In this simulation, only 21% of meta-analyses produced a significant result. This is 13 percentage points lower than in the simulation with equal sample sizes (34%). If the sample size of the replication study is 10 times larger (N = 80 and N = 800), the percentage of remaining false positive results in the meta-analysis shrinks to 10%.
The main conclusion is that even replication studies with the same sample size as the original study have value and can help to reverse false positive findings. Larger sample sizes simply give replication studies more weight than original studies, but it is by no means necessary to increase sample sizes of replication studies to make replication failures meaningful. Given unlimited resources, larger replications are better, but these analysis show that large replication studies are not necessary. A replication study with the same sample size as the original study is more valuable than no replication study at all.
CONFUSING ABSENCE OF EVIDENCE WITH EVIDENCE OF ABSENCE
One problem in Maxwell et al’s (2015) article is to conflate two possible goals of replication studies. One goal is to probe the robustness of the evidence against the null-hypothesis. If the original result was a false positive result, an unsuccessful replication study can reverse the initial inference and produce a non-significant result in a meta-analysis. This finding would mean that evidence for an effect is absent. The status of a hypothesis (e.g., humans have supernatural abilities; Bem, 2011) is back to where it was before the original study found a significant result and the burden of proof is shifted back to proponents of the hypothesis to provide unbiased credible evidence for it.
Another goal of replication studies can be to provide conclusive evidence that an original study reported a false positive result (i..e, humans do not have supernatural abilities). Throughout their article, Maxwell et al. assume that the goal of replication studies is to prove the absence of an effect. They make many correct observations about the difficulties of achieving this goal, but it is not clear why replication studies have to be conclusive when original studies are not held to the same standard.
This makes it easy to produce (potentially false) positive results and very hard to remove false positive results from the literature. It also creates a perverse incentive to conduct underpowered original studies and to claim victory when a large replication study finds a significant result with an effect size that is 90% smaller than the effect size in an original study. The authors of the original article may claim that they do not care about effect sizes and that their theoretical claim was supported. To avoid this problem that replication researchers have to invest large amount of resources for little gain, it is important to realize that even a failure to replicate an original finding with the same sample size can undermine original claims and force researchers to provide stronger evidence for their original ideas in original articles. If they are right and the evidence is strong, others will be able to replicate the result in an exact replication study with the same sample size.
THE DIRTY BIG SECRET
The main problem of Maxwell et al.’s (2015) article is that the authors blissfully ignore the problem of publication bias. They mention publication bias twice to warn readers that publication bias inflates effect sizes and biases power analyses, but they completely ignore the influence of publication bias on the credibility of successful original results (Schimmack, 2012; Sterling; 1959; Sterling et al., 1995).
It is hard to believe that Maxwell is unaware of this problem, if only because Maxwell was action editor of my article that demonstrated how publication bias undermines the credibility of replication studies that are selected for significance (Schimmack, 2012).
I used Bem’s infamous article on supernatural abilities as an example, which appeared to show 8 successful replications of supernatural abilities. Ironically, Maxwell et al. (2015) also cites Bem’s article to argue that failed replication studies can be misinterpreted as evidence of absence of an effect.
“Similarly, Ritchie, Wiseman, and French (2012) state that their failure to obtain significant results in attempting to replicate Bem (2011) “leads us to favor the ‘experimental artifacts’ explanation for Bem’s original result” (p. 4)”
This quote is not only an insult to Ritchie et al.; it also ignores the concerns that have been raised about Bem’s research practices. First, Ritchie et al. do not claim that they have provided conclusive evidence against ESP. They merely express their own opinion that they “favor the ‘experimental artifacts’ explanation. There is nothing wrong with this statement, even if it is grounded in a healthy skepticism about supernatural abilities.
More important, Maxwell et al. ignore the broader context of these studies. Schimmack (2012) discussed many questionable practices in Bem’s original studies and I presented statistical evidence that the significant results in Bem’s article were obtained with the help of questionable research practices. Given this wider context, it is entirely reasonable to favor the experimental artifact explanation over the alternative hypothesis that learning after an exam can still alter the exam outcome.
It is not clear why Maxwell et al. (2015) picked Bem’s article to discuss problems with failed replication studies and ignores that questionable research practices undermine the credibility of significant results in original research articles. One reason why failed replication studies are so credible is that insiders know how incredible some original findings are.
Maxwell et al. (2015) were not aware that in the same year, the OSC (2015) reproducibilty project would replicate only 37% of statistically significant results in top psychology journals, while the apparent success rate in these journals is over 90%. The stark contrast between the apparent success rate and the true power to produce successful outcomes in original studies provided strong evidence that psychology is suffering from a replication crisis. This does not mean that all failed replications are false positives, but it does mean that it is not clear which findings are false positives and which findings are not. Whether this makes things better is a matter of opinion.
Publication bias also undermines the usefulness of meta-analysis for hypothesis testing. In the OSC reproducibility project, a meta-analysis of original and replication studies produced 68% significant results. This result is meaningless because publication bias inflates effect sizes and the probability of obtaining a false positive result in the meta-analysis. Thus, when publication bias is present, unbiased replication studies provide the most credible evidence and the large number of replication failures means that more replication studies with larger samples are needed to see which hypothesis predict real effects with practical significance.
DOES PSYCHOLOGY HAVE A REPLICATION CRISIS?
Maxwell et al.’s (2015) answer to this question is captured in this sentence. “Despite raising doubts about the extent to which apparent failures to replicate necessarily reveal that psychology is in crisis,we do not intend to dismiss concerns about documented methodological flaws in the field.” (p. 496). The most important part of this quote is “raising doubt,” the rest is Orwellian double-talk.
The whole point of Maxwell et al.’s article is to assure fellow psychologists that psychology is not in crisis and that failed replication studies should not be a major concern. As I have pointed out, this conclusion is based on some misconceptions about the purpose of replication studies and by blissful ignorance about publication bias and questionable research practices that made it possible to publish successful replications of supernatural phenomena, while discrediting authors who spend time and resources on demonstrating that unbiased replication studies fail.
The real answer to Maxwell et al.’s question was provided by the OSC (2015) finding that only 37% of published significant results could be replicated. In my opinion that is not only a crisis, but a scandal because psychologists routinely apply for funding with power analyses that claim 80% power. The reproducibilty project shows that the true power to obtain significant results in original and replication studies is much lower than this and that the 90% success rate is no more meaningful than 90% votes for a candidate in communist elections.
In the end, Maxwell et al. draw the misleading conclusion that “the proper design and interpretation of replication studies is less straightforward than conventional practice would suggest.” They suggest that “most importantly, the mere fact that a replication study yields a nonsignificant statistical result should not by itself lead to a conclusion that the corresponding original study was somehow deficient and should no longer be trusted.”
As I have demonstrated, this is exactly the conclusion that readers should draw from failed replication studies, especially if (a) the original study was not preregistered, (b) the original study produced weak evidence (e.g., p = .04), the original study was published in a journal that only publishes significant results, (d) the replication study had a larger sample, (e) the replication study would have been published independent of outcome, and (f) the replication study was preregistered.
We can only speculate why the American Psychologists published a flawed and misleading article that gives original studies the benefit of the doubt and casts doubt on the value of replication studies when they fail. Fortunately, APA can no longer control what is published because scientists can avoid the censorship of peer-reviewed journals by publishing blogs and by criticize peer-reviewed articles in open post-publication peer review on social media.
Long life the replicability revolution. !!!
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.
Maxwell, S.E, Lau, M. Y., & Howard, G. S. (2015). Is psychology suffering from a replication crisis? What does ‘failure to replicate’ really mean? American Psychologist, 70, 487-498. http://dx.doi.org/10.1037/a0039400.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487
D1: In plain English, statistical power is the likelihood that a study will detect an effect when there is an effect there to be detected. If statistical power is high, the probability of making a Type II error, or concluding there is no effect when, in fact, there is one, goes down (first hit on Google)
D2: The power or sensitivity of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. (Wikipedia)
D3: The probability of not committing a Type II error is called the power of a hypothesis test. (Stat Trek)
The concept of statistical power arose from Neyman and Pearson’s approach to statistical inferences. Neyman and Pearson distinguished between two types of errors that could occur when a researcher draws conclusions about a population from observations in a sample. The first error (type-I error) is to infer a systematic relationship (in tests of causality this is an effect) when no relationship (no effect) exists. This error is also known as a false-positive as in a pregnancy test that shows a positive result (pregnant) when a women is not pregnant. The second error (type-II error) is to fail to detect a systematic relationship that actually exists. This error is also known as a false negative as when a pregnancy shows a negative result (not pregnant) when a woman is actually pregnant.
Ideally researchers would never make type-I or type-II errors, but it is inevitable that researchers will make both types of mistakes. However, researchers have some control over the probability of making these two mistakes. Statistical power is simply the probability of not making a type-II mistake; that is to avoid negative results when effects are present.
Many definitions of statistical power imply that the probability of avoiding a type-II error is equivalent to the long-run frequency of statistical significant results because statistical significance is used to decide whether an effect is present or not. By definition statistically non-significant results are negative results when an effect exists in the population. However, it does not automatically follow that all significant results are positive results when an effect is present. Significant results and positive results are only identical in one-sided hypotheses tests. For example, if the hypothesis is that men are taller than women and a one-sided statistical tests is used only significant results that show a greater mean for men than for women will be significant. A study that shows a large difference in the opposite direction would not produce a significant result no matter how large the difference is.
The equivalence between significant results and positive results no longer holds in the more commonly used two-tailed tests of statistical significance. In this case, the relationship in the population is either positive or negative. It cannot be both positive or negative. Only significant results that also show the correct direction of the effect (either as predicted by a correct prediction or as demonstrated by consistency with the majority of other significant results) are positive results. Significant results in the other direction are false positive results in that they show a false effect, which becomes only visible in a two-tailed test when the sign of the effect is taken into account.
How important is the distinction between the rate of positive results and the rate of significant results in a two-tailed test? Actually it is not very important. The largest number of false positive results is obtained when no effect exists at all. If the 5% significance criterion is used, no more than 5% of tests will produce false positive results. It will also become apparent after some time that there is no effect because half the studies will show a positive effect and the other half will show a negative effect. The inconsistency in the sign of the effect shows that significant results are not caused by a systematic relation. As the power of a test increases, more and more significant results will have the correct sign and fewer and fewer results will be false positives. The picture on top shows an example with 13% power. As can be seen most of this percentage comes from the fat right tail of the blue distribution. However, a small portion comes from the left tail that is more extreme than the criterion for significance (the green line).
For a study with 50% power to produce a true positive result (a significant result with the correct sign) is 50%. The probability of a false-positive result (a significant result with the wrong sign) is 0 to the second decimal, but not exactly zero (~0.05%). In other words, even in studies with modes power, false positive results have a negligible effect. A much bigger concern is that 50% of results are expected to be false negative results.
In conclusion, the sign of an effect matters. Two-tailed significance testing ignores the sign of an effect. Power is the long-run probability of obtaining a significant result with the correct sign. This probability is identical to the probability of a statistically significant result in a one-tailed test. It is not identical to the probability of a statistically significant results in a two-tailed test, but for practical purposes the difference is negligible. Nevertheless, it is probably most accurate to use a definition that is equally applicable to one-tailed and two-tailed tests.
D4: Statistical power is the probability of drawing the correct conclusion from a statistically significant result when an effect is present. If the effect is positive, the correct inference is that a positive effect exists. If an effect is negative, the correct inference is that a negative effect exists. When the inference is that the effect is negative (positive), but the effect is positive (negative), a statistically significant result does not count towards the power of a statistical test.
This definition differs from other definitions of power because it distinguishes between true positive and false positive results. Other definitions of power treat all non-negative results (false positive and true positive) as equivalent.
The scientific method is well-equipped to demonstrate regularities in nature as well as human behaviors. It works by repeating a scientific procedure (experiment or natural observation) many times. In the absence of a regular pattern, the empirical data will follow a random pattern. When a systematic pattern exists, the data will deviate from the pattern predicted by randomness. The deviation of an observed empirical result from a predicted random pattern is often quantified as a probability (p-value). The p-value itself is based on the ratio of the observed deviation from zero (effect size) and the amount of random error. As the signal-to-noise ratio increases, it becomes increasingly unlikely that the observed effect is simply a random event. As a result, it becomes more likely that an effect is present. The amount of noise in a set of observations can be reduced by repeating the scientific procedure many times. As the number of observations increases, noise decreases. For strong effects (large deviations from randomness), a relative small number of observations can be sufficient to produce extremely low p-values. However, for small effects it may require rather large samples to obtain a high signal-to-noise ratio that produces a very small p-value. This makes it difficult to test the null-hypothesis that there is no effect. The reason is that it is always possible to find an effect size that is so small that the noise in a study is too large to determine whether a small effect is present or whether there is really no effect at all; that is, the effect size is exactly zero (1 / infinity).
The problem that it is impossible to demonstrate scientifically that an effect is absent may explain why the scientific method has been unable to resolve conflicting views around controversial topics such as the existence of parapsychological phenomena or homeopathic medicine that lack a scientific explanation, but are believed by many to be real phenomena. The scientific method could show that these phenomena are real, if they were real, but the lack of evidence for these effects cannot rule out the possibility that a small effect may exist. In this post, I explore two statistical solutions to the problem of demonstrating that an effect is absent.
Neyman-Pearson Significance Testing (NPST)
The first solution is to follow Neyman-Pearsons’s orthodox significance test. NPST differs from the widely practiced null-hypothesis significance test (NHST) in that non-significant results are interpreted as evidence for the null-hypothesis. Thus, using the standard criterion of p = .05 as the criterion for significance, a p-value below .05 is used to reject the null-hypothesis and to infer that an effect is present. Importantly, if the p-value is greater than .05 the results are used to accept the null-hypothesis; that is, the hypothesis that there is no effect is true. As all statistical inferences, it is possible that the evidence is misleading and leads to the wrong conclusion. NPST distinguishes between two types or errors that are called type-I and type-II error. Type-I errors are errors when a p-value is below the criterion value (p < .05), but the null-hypothesis is actually true; that is there is no effect and the observed effect size was caused by a rare random event. Type-II errors are made when the null-hypothesis is accepted, but the null-hypothesis is false; there actually is an effect. The probability of making a type-II error depends on the size of the effect and the amount of noise in the data. Strong effects are unlikely to produce a type-II error even with noise data. Studies with very little noise are also unlikely to produce type-II errors because even small effects can still produce a high signal-to-noise ratio and significant results (p-values below the criterion value). Type-II error rates can be very high in studies with small effects and a large amount of noise. NPST makes it possible to quantify the probability of a type-II error for a given effect size. By investing a large amount of resources, it is possible to reduce noise to a level that is sufficient to have a very low type-II error probability for very small effect sizes. The only requirement for using NPST to provide evidence for the null-hypothesis is to determine a margin of error that is considered acceptable. For example, it may be acceptable to infer that a weight-loss-medication has no effect on weight if weight loss is less than 1 pound over a one month period. It is impossible to demonstrate that the medication has absolutely no effect, but it is possible to demonstrate with high probability that the effect is unlikely to be more than 1 pound.
The main difference between Bayes-Factors and NPST is that NPST yields type-II error rates for an a priori effect size. In contrast, Bayes-Factors do not postulate a single effect size, but use an a priori distribution of effect sizes. Bayes-Factors are based on the probability that the observed effect sizes is based on a true effect size of zero relative to the probability that the observed effect size was based on a true effect size within a range of a priori effect sizes. Bayes-Factors are the ratio of the probabilities for the two hypotheses. It is arbitrary, which hypothesis is in the numerator and which hypothesis is in the denominator. When the null-hypothesis is placed in the numerator and the alternative hypothesis is placed in the denominator, Bayes-Factors (BF01) decrease towards zero the more the data suggest that an effect is present. In this way, Bayes-Factors behave very much like p-values. As the signal-to-noise ratio increases, p-values and BF01 decrease.
There are two practical problems in the use of Bayes-Factors. One problem is that Bayes-Factors depend on the specification of the a priori distribution of effect sizes. It is therefore important that results can never be interpreted as evidence for the null-hypothesis or against the null-hypothesis per se. A Bayes-Factor that favors the null-hypothesis in the comparison to one a priori distribution can favor the alternative hypothesis for another a priori distribution of effect sizes. This makes Bayes-Factors impractical for the purpose of demonstrating that an effect does not exist (e.g., a drug does not have positive treatment effects). The second problem is that Bayes-Factors only provide quantitative information about the two hypotheses. Without a clear criterion value, Bayes-Factors cannot be used to claim that an effect is present or absent.
Selecting a Criterion Value for Bayes-Factors
A number of criterion values seem plausible. NPST always leads to a decision depending on the criterion for p-values. An equivalent criterion value for Bayes-Factors would be a value of 1. Values greater than 1 favor the null-hypothesis over the alternative, whereas values less than 1 favor the alternative hypothesis. This criterion avoids inconclusive results. The disadvantage with this criterion is that Bayes-Factors close to 1 are very variable and prone to have high type-I and type-II error rates. To avoid this problem, it is possible to use more stringent criterion values. This reduces the type-I and type-II error rates, but it also increases the rate of inconclusive results in noisy studies. Bayes-Factors of 3 (a 3 to 1 ratio in favor of the null over an alternative hypothesis) are often used to suggest that the data favor one hypothesis over another, and Bayes-Factors of 10 or more are often considered strong support. One problem with these criterion values is that there have been no systematic studies of the type-I and type-II error rates for these criterion values. Moreover, there have been no systematic sensitivity studies; that is, the ability of studies to reach a criterion value for different signal-to-noise ratios.
Wagenmakers et al. (2011) argued that p-values can be misleading and that Bayes-Factors provide more meaningful results. To make their point, they investigated Bem’s (2011) controversial studies that seemed to demonstrate the ability to anticipate random events in the future (time –reversed causality). Using a significance criterion of p < .05 (one-tailed), 9 out of 10 studies showed evidence of an effect. For example, in Study 1, participants were able to predict the location of erotic pictures 54% of the time, even before a computer randomly generated the location of the picture. Using a more liberal type-I error rate of p < .10 (one-tailed), all 10 studies produced evidence for extrasensory perception.
Wagenmakers et al. (2011) re-examined the data with Bayes-Factors. They used a Bayes-Factor of 3 as the criterion value. Using this value, six tests were inconclusive, three provided substantial support for the null-hypothesis (the observed effect was just due to noise in the data) and only one test produced substantial support for ESP. The most important point here is that the authors interpreted their results using a Bayes-Factor of 3 as criterion. If they had used a Bayes-Factor of 10 as criterion, they would have concluded that all studies were inconclusive. If they had used a Bayes-Factor of 1 as criterion, they would have concluded that 6 studies favored the null-hypothesis and 4 studies favored the presence of an effect.
Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers used Bayes-Factors in a design with optional stopping. They agreed to stop data-collection when the Bayes-Factor reached a criterion value of 10 in favor of either hypothesis. The implementation of a decision to stop data collection suggests that a Bayes-Factor of 10 was considered decisive. One reason for this stopping rule would be that it is extremely unlikely that a Bayes-Factor might swing to favoring the alternative hypothesis if more data were collected. By the same logic, a Bayes-Factor of 10 that favors the presence of an effect in an ESP effect would suggest that further data collection would be unnecessary because the evidence already shows rather strong evidence that an effect is present.
Tan, Dienes, Jansari, and Goh, (2014) report a Bayes-Factor of 11.67 and interpret as being “greater than 3 and strong evidence for the alternative over the null” (p. 19). Armstrong and Dienes (2013) report a Bayes-Factor of 0.87 and state that no conclusion follows from this finding because the Bayes-Factor is between 3 and 1/3. This statement implies that Bayes-Factors that meet the criterion value are conclusive.
In sum, a criterion-value of 3 has often been used to interpret empirical data and a criterion of 10 has been used as strong evidence in favor of an effect or in favor of the null-hypothesis.
Meta-Analysis of Multiple Studies
As sample sizes increase, noise decreases and the signal-to-noise ratio increases. Rather than increasing the sample size of a single study, it is also possible to conduct multiple smaller studies and to combine the evidence of studies in a meta-analysis. The effect is the same. A meta-analysis based on several original studies reduces random noise in the data and can produce higher signal-to-noise ratios when an effect is present. On the flip side, a low signal-to-noise ratio in a meta-analysis implies that the signal is very weak and that the true effect size is close to zero. As the evidence in a meta-analysis is based on the aggregation of several smaller studies, the results should be consistent. That is, the effect size in the smaller studies and the meta-analysis is the same. The only difference is that aggregation of studies reduces noise, which increases the signal-to-noise ratio. A meta-analysis therefore can highlight the problem of interpreting a low signal-to-noise ratio (BF10 < 1, p > .05) in small studies as evidence for the null-hypothesis. In NPST this result would be flagged as not trustworthy because the type-II error probability is high. For example, a non-significant result with a type-II error of 80% (20% power) is not particularly interesting and nobody would want to accept the null-hypothesis with such a high error probability. Holding the effect size constant, the type-II error probability decreases as the number of studies in a meta-analysis increases and it becomes increasingly more probable that the true effect size is below the value that was considered necessary to demonstrate an effect. Similarly, Bayes-Factors can be misleading in small samples and they become more conclusive as more information becomes available.
A simple demonstration of the influence of sample size on Bayes-Factors comes from Rouder and Morey (2011). The authors point out that it is not possible to combine Bayes-Factors by multiplying Bayes-Factors of individual studies. To address this problem, they created a new method to combine Bayes-Factors. This Bayesian meta-analysis is implemented in the Bayes-Factor r-package. Rouder and Morey (2011) applied their method to a subset of Bem’s data. However, they did not use it to examine the combined Bayes-Factor for the 10 studies that Wagenmakers et al. (2011) examined individually. I submitted the t-values and sample sizes of all 10 studies to a Bayesian meta-analysis and obtained a strong Bayes-Factor in favor of an effect, BF10 = 16e7, that is, 16 million to 1 in favor of ESP. Thus, a meta-analysis of all 10 studies strongly suggests that Bem’s data are not random.
Another way to meta-analyze Bem’s 10 studies is to compute a Bayes-Factor based on the finding that 9 out of 10 studies produced a significant result. The p-value for this outcome under the null-hypothesis is extremely small; 1.86e-11, that is p < .00000000002. It is also possible to compute a Bayes-Factor for the binomial probability of 9 out of 10 successes with a probability of 5% to have a success under the null-hypothesis. The alternative hypothesis can be specified in several ways, but one common option is to use a uniform distribution from 0 to 1 (beta(1,1). This distribution allows for the power of a study to range anywhere from 0 to 1 and makes no a priori assumptions about the true power of Bem’s studies. The Bayes-Factor strongly favors the presence of an effect, BF10 = 20e9. In sum, a meta-analysis of Bem’s 10 studies strongly supports the presence of an effect and rejects the null-hypothesis.
The meta-analytic results raise concerns about the validity of Wagenmakers et al.’s (2011) claim that Bem presented weak evidence and that p-values misleading information. Instead, Wagenmakers et al.’s Bayes-Factors are misleading and fail to detect an effect that is clearly present in the data.
The Devil is in the Priors: What is the Alternative Hypothesis in the Default Bayesian t-test?
Wagenmakers et al. (2011) computed Bayes-Factors using the default Bayesian t-test. The default Bayesian t-test uses a Cauchy distribution centered over zero as the alternative hypothesis. The Cauchy distribution has a scaling factor. Wagenmakers et al. (2011) used a default scaling factor of 1. Since then, the default scaling parameter has changed to .707.Figure 1 illustrates Cauchi distributions with scaling factors .2, .5, .707, and 1.
The black line shows the Cauchy distribution with a scaling factor of d = .2. A scaling factor of d = .2 implies that 50% of the density of the distribution is in the interval between d = -.2 and d = .2. As the Cauchy-distribution is centered over 0, this specification also implies that the null-hypothesis is considered much more likely than many other effect sizes, but it gives equal weight to effect sizes below and above an absolute value of d = .2. As the scaling factor increases, the distribution gets wider. With a scaling factor of 1, 50% of the density distribution is within the range from -1 to 1 and 50% covers effect sizes greater than 1. The choice of the scaling parameter has predictable consequences on the Bayes-Factor. As long as the true effect size is more extreme than the scaling parameter, Bayes-Factors will favor the alternative hypothesis and Bayes-Factors will increase towards infinity as sampling error decreases. However, for true effect sizes that are below the scaling parameter, Bayes-Factors may initially favor the null-hypothesis because the alternative hypothesis includes effect sizes that are more extreme than the alternative hypothesis. As sample sizes increase, the Bayes-Factor will change from favoring the null-hypothesis to favoring the alternative hypothesis. This can explain why Wagenmakers et al. (2011) found no support for ESP when Bem’s studies were examined individually, but a meta-analysis of all studies shows strong evidence in favor of an effect.
The effect of the scaling parameter on Bayes-Factors is illustrated in the following Figure.
The straight lines show Bayes-Factors (y-axis) as a function of sample size for a scaling parameter of 1. The black line shows Bayes-Factors favoring an effect of d = .2 when the effect size is actually d = .2 (BF10) and the red line shows Bayes-Factor favoring the null-hypothesis when the effect size is actually 0. The green line implies a criterion value of 3 to suggest “substantial” support for either hypothesis (Wagenmakers et al., 2011). The figure shows that Bem’s sample sizes of 50 to 150 participants could never produce substantial evidence for an effect when the observed effect size is d = .2. In contrast, an effect size of 0 would produce provide substantial support for the null-hypothesis. Of course, actual effect sizes in samples will deviated from these hypothetical values, but sampling error will average out. Thus, for studies that occasionally show support for an effect there will also be studies that underestimate support for an effect. The dotted lines illustrate how the choice of the scaling factor influences Bayes-Factors. With a scaling factor of d = .2, Bayes-Factors would never favor the null-hypothesis. They would also not support the alternative hypothesis in studies with less than 150 participants and even in these studies the Bayes-Factor is likely to be just above 3.
Figure 2 explains why Wagenmakers et al.’s (2011) did mainly find inconclusive results. On the one hand, the effect size was typically around d = .2. As a result, the Bayes-Factor did not provide clear support for the null-hypothesis. On the other hand, an effect size of d = .2 in studies with 80% power is insufficient to produce Bayes-Factors favoring the presence of an effect, when the alternative hypothesis is specified as a Cauchy distribution centered over 0. This is especially true when the scaling parameter is larger, but even for a seemingly small scaling parameter Bayes-Factors would not provide strong support for a small effect. The reason is that the alternative hypothesis is centered over 0. As a result, it is difficult to distinguish the null-hypothesis from the alternative hypothesis.
A True Alternative Hypothesis: Centering the Prior Distribution over a Non-Null Effect Size
A Cauchy-distribution is just one possible way to formulate an alternative hypothesis. It is also possible to formulate alternative hypothesis as (a) a uniform distribution of effect sizes in a fixed range (e.g., the effect size is probably small to moderate, d = .2 to .5) or as a normal distribution centered over an effect size (e.g., the effect is most likely to be small, but there is some uncertainty about how small, d = 2 +/- SD = .1) (Dienes, 2014).
Dienes provided an online app to compute Bayes-Factors for these prior distributions. I used the posted r-code by John Christie to create the following figure. It shows Bayes-Factors for three a priori uniform distributions. Solid lines show Bayes-Factors for effect sizes in the range from 0 to 1. Dotted lines show effect sizes in the range from 0 to .5. The dot-line pattern shows Bayes-Factors for effect sizes in the range from .1 to .3. The most noteworthy observation is that prior distributions that are not centered over zero can actually provide evidence for a small effect with Bem’s (2011) sample sizes. The second observation is that these priors can also favor the null-hypothesis when the true effect size is zero (red lines). Bayes-Factors become more conclusive for more precisely formulate alternative hypotheses. The strongest evidence is obtained by contrasting the null-hypothesis with a narrow interval of possible effect sizes in the .1 to .3 range. The reason is that in this comparison weak effects below .1 clearly favor the null-hypothesis. For an expected effect size of d = .2, a range of values from 0 to .5 seems reasonable and can produce Bayes-Factors that exceed a value of 3 in studies with 100 to 200 participants. Thus, this is a reasonable prior for Bem’s studies.
It is also possible to formulate alternative hypotheses with normal distributions around an a priori effect size. Dienes recommends setting the mean to 0 and to set the standard deviation of the expected effect size. The problem with this approach is again that the alternative hypothesis is centered over 0 (in a two-tailed test). Moreover, the true effect size is not known. Like the scaling factor in the Cauchy distribution, using a higher value leads to a wider spread of alternative effect sizes and makes it harder to show evidence for small effects and easier to find evidence in favor of H0. However, the r-code also allows specifying non-null means for the alternative hypothesis. The next figure shows Bayes-Factors for three normally distributed alternative hypotheses. The solid lines show Bayes-Factors with mean = 0 and SD = .2. The dotted line shows Bayes-Factors for d = .2 (a small effect and the effect predicted by Bem) and a relatively wide standard deviation of .5. This means 95% of effect sizes are in the range from -.8 to 1.2. The broken (dot/dash) line shows Bayes-Factors with a mean of d = .2 and a narrower SD of d = .2. The 95% CI still covers a rather wide range of effect sizes from -.2 to .6, but due to the normal distribution effect sizes close to the expected effect size of d = .2 are weighted more heavily.
The first observation is that centering the normal distribution over 0 leads to the same problem as the Cauchy-distribution. When the effect size is really 0, Bayes-Factors provide clear support for the null-hypothesis. However, when the effect size is small, d = .2, Bayes-Factors fail to provide support for the presence for samples with fewer than 150 participants (this is a ones-sample design, the equivalent sample size for between-subject designs is N = 600). The dotted line shows that simply moving the mean from d = 0 to d = .2 has relatively little effect on Bayes-Factors. Due to the wide range of effect sizes, a small effect is not sufficient to produce Bayes-Factors greater than 3 in small samples. The broken line shows more promising results. With d = .2 and SD = .2, Bayes-Factors in small samples with less than 100 participants are inconclusive. For sample sizes of more than 100 participants, both lines are above the criterion value of 3. This means, a Bayes-Factor of 3 or more can support the null-hypothesis when it is true and it can show that a small effect is present when an effect is present.
Another way to specify the alternative hypothesis is to use a one-tailed alternative hypothesis (a half-normal). The mode (the center of the normal-distribution) of the distribution is 0. The solid line shows a standard deviation of .8. The dotted line shows results with standard deviation = .5 and the broken line shows results for a standard deviation of d = .2. The solid line favors the null-hypothesis and it requires sample sizes of more than 130 participants before an effect size of d = .2 produces a Bayes-Factor of 3 or more. In contrast, the broken line discriminates against the null-hypothesis and practically never supports the null-hypothesis when it is true. The dotted line with a standard deviation of .5 works best. It always shows support for the null-hypothesis when it is true and it can produce Bayes-Factors greater than 3 with a bit more than 100 participants.
In conclusion, the simulations show that Bayes-Factors depend on the specification of the prior distribution and sample size. This has two implications. Unreasonable priors will lower the sensitivity/power of Bayes-Factors to support either the null-hypothesis or the alternative hypothesis when these hypotheses are true. Unreasonable priors will also bias the results in favor of one of the two hypotheses. As a result, researchers need to justify the choice of their priors and they need to be careful when they interpret results. It is particularly difficult to interpret Bayes-Factors when the alternative hypothesis is diffuse and the null-hypothesis is supported. In this case, the evidence merely shows that the null-hypothesis fits the data better than the alternative, but the alternative is a composite of many effect sizes and some of these effect sizes may fit the data better than the null-hypothesis.
Comparison of Different Prior Distributions with Bem’s (2011) ESP Experiments
To examine the influence of prior distributions on Bayes-Factors, I computed Bayes-Factors using several prior distributions. I used a d~Cauchy(1) distribution because this distribution was used by Wagenmakers et al. (2011). I used three uniform prior distributions with ranges of effect sizes from 0 to 1, 0 to .5, and .1 to .3. Based on Dienes recommendation, I also used a normal distribution centered on zero with the expected effect size as the standard deviation. I used both two-tailed and one-tailed (half-normal) distributions. Based on a twitter-recommendation by Alexander Etz, I also centered the normal distribution on the effect size, d = .2, with a standard deviation of d = .2.
The d~Cauchy(1) prior used by Wagenmakers et al. (2011) gives the weakest support for an effect. The table also includes the product of Bayes-Factors. The results confirm that the product is not a meaningful statistic that can be used to conduct a meta-analysis with Bayes-Factors. The last column shows Bayes-Factors based on a traditional fixed-effect meta-analysis of effect sizes in all 10 studies. Even the d~Cauchy(1) prior now shows strong support for the presence of an effect even though it often favored the null-hypotheses for individual studies. This finding shows that inferences about small effects in small samples cannot be trusted as evidence that the null-hypothesis is correct.
Table 1 also shows that all other prior distributions tend to favor the presence of an effect even in individual studies. Thus, these priors show consistent results for individual studies and for a meta-analysis of all studies. The strength of evidence for an effect is predictable from the precision of the alternative hypothesis. The uniform distribution with a wide range of effect sizes from 0 to 1, gives the weakest support, but it still supports the presence of an effect. This further emphasizes how unrealistic the Cauchy-distribution with a scaling factor of 1 is for most studies in psychology. For most studies in psychology effect sizes greater than 1 are rare. Moreover, effect sizes greater than one do not need fancy statistics. A simple visual inspection of a scatter plot is sufficient to reject the null-hypothesis. The strongest support for an effect is obtained for the uniform distribution with a range of effect sizes from .1 to .3. The advantage of this range is that the lower bound is not 0. Thus, effect sizes below the lower bound provide evidence for H0 and effect sizes above the lower bound provide evidence for an effect. The lower bound can be set by a meaningful consideration of what effect sizes might be theoretically or practically so small that they would be rather uninteresting even if they are real. Personally, I find uniform distributions appealing because they best express uncertainty about an effect size. Most theories in psychology do not make predictions about effect sizes. Thus, it seems impossible to say that an effect is expected to be small (d = .2) or moderate (d = .5). It seems easier to say that an effect is expected to be small (d = .1 to .3) or moderate (.3 to .6) or large (.6 to 1). Cohen used fixed values only because power analysis requires a single value. As Bayesian statistics allows the specification of ranges, it makes sense to specify a range of values with the need to make predictions which values in this range are more likely. However, results for the normal distribution provide similar results. Again, the strength of evidence of an effect increases with the precision of the predicted effect. The weakest support for an effect is obtained with a normal distribution centered over 0 and a two-tailed test. This specification is similar to a Cauchy distribution but it uses the normal distribution. However, by setting the standard deviation to the expected effect sizes, Bayes-Factors show evidence for an effect. The evidence for an effect becomes stronger by centering the distribution over the expected effect size or by using a half-normal (one-tailed) test that makes predictions about the direction of the effect.
To summarize, the main point is that Bayes-Factors depend on the choice of the alternative distribution. Bayesian statisticians are of course well aware of this fact. However, in practical applications of Bayesian statistics, the importance of the prior distribution is often ignored, especially when Bayes-Factors favor the null-hypothesis. Although this finding only means that the data support the null-hypothesis more than the alternative hypothesis, the alternative hypothesis is often described in vague terms as a hypothesis that predicted an effect. However, the alternative hypothesis does not just predict that there is an effect. It makes predictions about the strength of effects and it is always possible to specify an alternative that predicts an effect that is still consistent with the data by choosing a small effect size. Thus, Bayesian statistics can only produce meaningful results if researchers specify a meaningful alternative hypothesis. It is therefore surprising how little attention Bayesian statisticians have devoted to the issue of specifying the prior distribution. The most useful advice comes from Dienes recommendation to specify the prior distribution as a normal distribution centered over 0 and to set the standard deviation to the expected effect size. If researchers are uncertain about the effect size, they could try different values for small (d = .2), moderate (d = .5), or large (d = .8) effect sizes. Researchers should be aware that the current default setting of .707 in Rouder’s online app implies an expectation of a strong effect and that this setting will make it harder to show evidence for small effects and inflates the risk of obtaining false support for the null-hypothesis.
Why Psychologists Should not Change the Way They Analyze Their Data
Wagenmakers et al. (2011) did not simply use Bayes-Factors to re-examine Bem’s claims about ESP. Like several other authors, they considered Bem’s (2011) article an example of major flaws in psychological science. Thus, they titled their article with the rather strong admonition that “Psychologists Must Change The Way They Analyze Their Data.” They blame the use of p-values and significance tests as the root cause of all problems in psychological science. “We conclude that Bem’s p values do not indicate evidence in favor of precognition; instead, they indicate that experimental psychologists need to change the way they conduct their experiments and analyze their data” (p. 426). The crusade against p-values starts with the claim that it is easy to obtain data that reject the null-hypothesis even when the null-hypothesis is true. “These experiments highlight the relative ease with which an inventive researcher can produce significant results even when the null hypothesis is true” (p. 427). However, this statement is incorrect. The probability of getting significant results is clearly specified by the type-I error rate. When the null-hypothesis is true, a significant result will emerge only 5% of the time; that is in 1 out of 20 studies. The probability of making a type-I error repeatedly decrease exponentially. For two studies, the probability to obtain two type-I errors is only p = .0025 or 1 out of 400 (20 * 20 studies). If some non-significant results are obtained, the binomial probability gives the probability that the frequency of significant results that could have been obtained if the null-hypothesis were true. Bem obtained 9 out of 10 significant results. With a probability of p = .05, the binomial probability is 18e-10. Thus, there is strong evidence that Bem’s results are not type-I errors. He did not just go in his lab and run 10 studies and obtained 9 significant results by chance alone. P-values correctly quantify how unlikely this event is in a single study and how this probability decrease as the number of studies increases. The table also shows that all Bayes-Factors confirm this conclusion when the results of all studies are combined in a meta-analysis. It is hard to see how p-values can be misleading when they lead to the same conclusion as Bayes-Factors. The combined evidence presented by Bem cannot be explained by random sampling error. The data are inconsistent with the null-hypothesis. The only misleading statistic is provided by a Bayes-Factor with an unreasonable prior distribution of effect sizes in small samples. All other statistics agree that the data show an effect.
Wagenmakers et al. (2011) next argument is that p-values only consider the conditional probability when the null-hypothesis is true, but that it is also important to consider the conditional probability if the alternative hypothesis is true. They fail to mention, however, that this alternative hypothesis is equivalent to the concept of statistical power. A p-values of less than .05 means that a significant result would be obtained only 5% of the time when the null-hypothesis is true. The probability of a significant result when an effect is present depends on the size of the effect and sampling error and can be computed using standard tools for power analysis. Importantly, Bem (2011) actually carried out an a priori power analysis and planned his studies to have 80% power. In a one-sample t-test, standard error is defined as 1/sqrt(N). Thus, with 100 participants, the standard error is .1. With an effect size of d = .2, the signal-to-noise ratio is .2/.1 = 2. Using a one-tailed significance test, the criterion value for significance is 1.66. The implied power is 63%. Bem used an effect size of d = .25 to suggest that he has 80% power. Even with a conservative estimate of 50% power, the likelihood ratio of obtaining a significant is .50/.05 = 10. This likelihood ratio can be interpreted like Bayes-Factors. Thus, in a study with 50% power, it is 10 times more likely to obtain a significant result when an effect is present than when the null-hypothesis is true. Thus, even in studies with modest power, favors the alternative hypothesis much more than the null-hypothesis. To argue that p-values provide weak evidence for an effect implies that a study had very low power to show an effect. For example, if a study has only 10% power, the likelihood ratio is only 2 in favor of an effect being present. Importantly, low power cannot explain Bem’s results because low power would imply that most studies produced non-significant results. However, he obtained 9 significant results in 10 studies. This success rate is itself an estimate of power and would suggest that Bem had 90% power in his studies. With 90% power, the likelihood ratio is .90/.05 = 18. The Bayesian argument against p-values is only valid for the interpretation of p-values in a single study in the absence of any information about power. Not surprisingly, Bayesians often focus on Fisher’s use of p-values. However, Neyman-Pearson emphasized the need to also consider type-II error rates and Cohen has emphasized the need to conduct power analysis to ensure that small effects can be detected. In recent years, there has been an encouraging trend to increase power of studies. One important consequence of high powered studies is that significant results increase the evidential value of significant results because a significant result is much more likely to emerge when an effect is present than when it is not present. However, it is important to note that the most likely outcome in underpowered studies is a non-significant result. Thus, it is unlikely that a set of studies can produce false evidence for an effect because a meta-analysis would reveal that most studies fail to show an effect. The main reason for the replication crisis in psychology is the practice not to report non-significant results. This is not a problem of p-values, but a problem of selective reporting. However, Bayes-Factors are not immune to reporting biases. As Table 1 shows, it would have been possible to provide strong evidence for ESP using Bayes-Factors as well.
To demonstrate the virtues of Bayesian statistics, Wagenmakers et al. (2011) then presented their Bayesian analyses of Bem’s data. What is important here, is how the authors explain the choice of their priors and how the authors interpret their results in the context of the choice of their priors. The authors state that they “computed a default Bayesian t test” (p. 430). The important word is default. This word makes it possible to present a Bayesian analysis without a justification of the prior distribution. The prior distribution is the default distribution, a one-size-fits-all prior that does not need any further elaboration. The authors do note that “more specific assumptions about the effect size of psi would result in a different test.” (p. 430). They do not mention that these different tests would also lead to different conclusions because the conclusion is always relative to the specified alternative hypothesis. Even less convincing is their claim that “we decided to first apply the default test because we did not feel qualified to make these more specific assumptions, especially not in an area as contentious as psi” (p. 430). It is true that the authors are not experts on PSI, but that is hardly necessary when Bem (2011) presented a meta-analysis and made an a prior prediction about effect size. Moreover, they could have at least used a half-Cauchy given that Bem used one-tailed tests.
The results of the default t-test are then used to suggest that “a default Bayesian test confirms the intuition that, for large sample sizes, one-sided p values higher than .01 are not compelling” (p. 430). This statement ignores their own critique of p-values that the compelingness of p-values depends on the power of a study. A p-value of .01 in a study with 10% power is not compelling because it is very unlikely outcome no matter whether an effect is present or not. However, in a study with 50% power, a p-value of .01 is very compelling because the likelihood ratio is 50. That is, it is 50 times more likely to get a significant result at p = .01 in a study with 50% power when an effect is present than when an effect is not present.
The authors then emphasize that they “did not select priors to obtain a desired result” (p. 430). This statement can be confusing to non-Bayesian readers. What this statement means is that Bayes-Factors do not entail statements about the probability that ESP exists or does not exist. However, Bayes-Factors do require specification of a prior distribution. Thus, the authors did select a prior distribution, namely the default distribution, and Table 1 shows that their choice of the prior distribution influenced the results.
The authors do directly address the choice of the prior distribution and state “we also examined other options, however, and found that our conclusions were robust. For a wide range of different non-default prior distributions on effect sizes, the evidence for precognition is either non-existent or negligible” (p. 430). These results are reported in a supplementary document. In these materials., the authors show how the scaling factor clearly influences results and that small scaling factors suggest an effect is present whereas larger scaling factors favor the null-hypothesis. However, Bayes-Factors in favor of an effect are not very strong. The reason is that the prior distribution is centered over 0 and a two-tailed test is being used. This makes it very difficult to distinguish the null-hypothesis from the alternative hypothesis. As shown in Table 1, priors that contrast the null-hypothesis with an effect provide much stronger evidence for the presence of an effect. In their conclusion, the authors state “In sum, we conclude that our results are robust to different specifiications of the scale parameter for the effect size prior under H1 “ This statement is more correct than the statement in the article, where they claim that they considered a wide range of non-default prior distributions. They did not consider a wide range of different distributions. They considered a wide range of scaling parameters for a single distribution; a Cauchy-distribution centered over 0. If they had considered a wide range of prior distributions, like I did in Table 1, they would have found that Bayes-Factors for some prior distributions suggest that an effect is present.
The authors then deal with the concern that Bayes-Factors depend on sample size and that larger samples might lead to different conclusions, especially when smaller samples favor the null-hypothesis. “At this point, one may wonder whether it is feasible to use the Bayesian t test and eventually obtain enough evidence against the null hypothesis to overcome the prior skepticism outlined in the previous section.” The authors claimed that they are biased against the presence of an effect by a factor of 10e-24. Thus, it would require a Bayes-Factor greater than 10e24 to sway them that ESP exists. They then point out that the default Bayesian t-test, a Cauchi(0,1) prior distribution, would produce this Bayes-Factor in a sample of 2,000 participants. They then propose that a sample size of N = 2,000 is excessive. This is not a principled robustness analysis. A much easier way to examine what would happen in a larger sample, is to conduct a meta-analysis of the 10 studies, which already included 1,196 participants. As shown in Table 1, the meta-analysis would have revealed that even the default t-test favors the presence of an effect over the null-hypothesis by a factor of 6.55e10. This is still not sufficient to overcome prejudice against an effect of a magnitude of 10e-24, but it would have made readers wonder about the claim that Bayes-Factors are superior than p-values. There is also no need to use Bayesian statistics to be more skeptical. Skeptical researchers can also adjust the criterion value of a p-value if they want to lower the risk of a type-I error. Editors could have asked Bem to demonstrate ESP with p < .001 rather than .05 in each study, but they considered 9 out of 10 significant results at p < .05 (one-tailed) sufficient. As Bayesians provide no clear criterion values when Bayes-Factors are sufficient, Bayesian statistics does not help editors in the decision process how strong evidence has to be.
Does This Mean ESP Exists?
As I have demonstrated, even Bayes-Factors using the most unfavorable prior distribution favors the presence of an effect in a meta-analysis of Bem’s 10 studies. Thus, Bayes-Factors and p-values strongly suggest that Bem’s data are not the result of random sampling error. It is simply too improbable that 9 out of 10 studies produce significant results when the null-hypothesis is true. However, this does not mean that Bem’s data provide evidence for a real effect because there are two explanations for systematic deviations from a random pattern (Schimmack, 2012). One explanation is that a true effect is present and that a study had good statistical power to produce a signal-to-noise ratio that produces a significant outcome. The other explanation is that no true effect is present, but that the reported results were obtained with the help of questionable research practices that inflate the type-I error rate. In a multiple study article, publication bias cannot explain the result because all studies were carried out by the same researcher. Publication bias can only occur when a researcher conducts a single study and reports a significant result that was obtained by chance alone. However, if a researcher conducts multiple studies, type-I errors will not occur again and again and questionable research practices (or fraud) are the only explanation for significant results when the null-hypothesis is actually true.
There have been numerous analyses of Bem’s (2011) data that show signs of questionable research practices (Francis, 2012; Schimmack, 2012; Schimmack, 2015). Moreover, other researchers have failed to replicate Bem’s results. Thus, there is no reason to believe in ESP based on Bem’s data even though Bayes-Factors and p-values strongly reject the hypothesis that sample means are just random deviations from 0. However, the problem is not that the data were analyzed with the wrong statistical method. The reason is that the data are not credible. It would be problematic to replace the standard t-test with the default Bayesian t-test because the default Bayesian t-test gives the right answer with questionable data. The reason is that it would give the wrong answer with credible data, namely it would suggest that no effect is present when a researcher conducts 10 studies with 50% power and honestly reports 5 non-significant results. Rather than correctly inferring from this pattern of results that an effect is present, the default-Bayesian t-test, when applied to each study individually, would suggest that the evidence is inconclusive.
There are many ways to analyze data. There are also many ways to conduct Bayesian analysis. The stronger the empirical evidence is, the less important the statistical approach will be. When different statistical approaches produce different results, it is important to carefully examine the different assumptions of statistical tests that lead to the different conclusions based on the same data. There is no superior statistical method. Never trust a statistician who tells you that you are using the wrong statistical method. Always ask for an explanation why one statistical method produces one result and why another statistical method produces a different result. If one method seems to make more reasonable assumptions than another (data are not normally distributed, unequal variances, unreasonable assumptions about effect size), use the more reasonable statistical method. I have repeatedly asked Dr. Wagenmakers to justify his choice of the Cauchi(0,1) prior, but he has not provide any theoretical or statistical arguments for this extremely wide range of effect sizes.
So, I do not think that psychologists need to change the way they analyze their data. In studies with reasonable power (50% or more), significant results are much more likely to occur when an effect is present than when an effect is not present, and likelihood ratios will show similar results as Bayes-Factors with reasonable priors. Moreover, the probability of a type-I errors in a single study is less important for researchers and science than long-term rate of type-II errors. Researchers need to conduct many studies to build up a CV, get jobs, grants, and take care of their graduate students. Low powered studies will lead to many non-significant results that provide inconclusive results. Thus, they need to conduct powerful studies to be successful. In the past, researchers often used questionable research practices to increase power without declaring the increased risk of a type-I error. However, in part due to Bem’s (2011) infamous article, questionable research practices are becoming less acceptable and direct replication attempts more quickly reveal questionable evidence. In this new culture of open science, only researchers who carefully plan studies will be able to provide consistent empirical support for a theory because the theory actually makes correct predictions. Once researchers report all of the relevant data, it is less important how these data are analyzed. In this new world of psychological science, it will be problematic to ignore power and to use the default Bayesian t-test because it will typically show no effect. Unless researches are planning to build a career on confirming the absence of effects, they should conduct studies with high-power and control type-I error rates by replicating and extending their own work.