Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior

The scientific method is well-equipped to demonstrate regularities in nature as well as in human behavior. It works by repeating a scientific procedure (experiment or natural observation) many times. In the absence of a regular pattern, the empirical data will follow a random pattern. When a systematic pattern exists, the data will deviate from the pattern predicted by randomness. The deviation of an observed empirical result from a predicted random pattern is often quantified as a probability (p-value). The p-value itself is based on the ratio of the observed deviation from zero (effect size) to the amount of random error. As this signal-to-noise ratio increases, it becomes increasingly unlikely that the observed effect is simply a random event. As a result, it becomes more likely that an effect is present. The amount of noise in a set of observations can be reduced by repeating the scientific procedure many times. As the number of observations increases, noise decreases. For strong effects (large deviations from randomness), a relatively small number of observations can be sufficient to produce extremely low p-values. However, small effects may require rather large samples to obtain a signal-to-noise ratio that is high enough to produce a very small p-value. This makes it difficult to test the null-hypothesis that there is no effect. The reason is that it is always possible to find an effect size that is so small that the noise in a study is too large to determine whether a small effect is present or whether there is really no effect at all; that is, whether the effect size is exactly zero (1 / infinity).
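The relation between effect size, sample size, and p-value can be illustrated with a minimal sketch. It assumes a one-sample z-test on a standardized effect size with known unit variance, so that the noise is 1/sqrt(N); the function names are made up for this illustration.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def two_sided_p_value(d, n):
    """Two-sided p-value for a standardized effect d observed in n cases.

    Assumption: z-test with unit variance, so the noise is 1 / sqrt(n).
    """
    signal_to_noise = d * math.sqrt(n)
    return 2.0 * (1.0 - normal_cdf(abs(signal_to_noise)))

# The same small effect (d = .2) yields smaller p-values as n grows.
for n in (25, 100, 400):
    print(n, two_sided_p_value(0.2, n))
```

Holding the effect size constant, quadrupling the sample size doubles the signal-to-noise ratio.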

The problem that it is impossible to demonstrate scientifically that an effect is absent may explain why the scientific method has been unable to resolve conflicting views around controversial topics such as the existence of parapsychological phenomena or homeopathic medicine that lack a scientific explanation, but are believed by many to be real phenomena. The scientific method could show that these phenomena are real, if they were real, but the lack of evidence for these effects cannot rule out the possibility that a small effect may exist. In this post, I explore two statistical solutions to the problem of demonstrating that an effect is absent.

Neyman-Pearson Significance Testing (NPST)

The first solution is to follow Neyman-Pearson’s orthodox significance test. NPST differs from the widely practiced null-hypothesis significance test (NHST) in that non-significant results are interpreted as evidence for the null-hypothesis. Thus, using the standard criterion of p = .05 as the criterion for significance, a p-value below .05 is used to reject the null-hypothesis and to infer that an effect is present. Importantly, if the p-value is greater than .05, the results are used to accept the null-hypothesis; that is, to infer that the hypothesis that there is no effect is true. As with all statistical inferences, it is possible that the evidence is misleading and leads to the wrong conclusion. NPST distinguishes between two types of errors that are called type-I and type-II errors. Type-I errors occur when a p-value is below the criterion value (p < .05), but the null-hypothesis is actually true; that is, there is no effect and the observed effect size was caused by a rare random event. Type-II errors are made when the null-hypothesis is accepted, but the null-hypothesis is false; there actually is an effect. The probability of making a type-II error depends on the size of the effect and the amount of noise in the data. Strong effects are unlikely to produce a type-II error even with noisy data. Studies with very little noise are also unlikely to produce type-II errors because even small effects can still produce a high signal-to-noise ratio and significant results (p-values below the criterion value). Type-II error rates can be very high in studies with small effects and a large amount of noise. NPST makes it possible to quantify the probability of a type-II error for a given effect size. By investing a large amount of resources, it is possible to reduce noise to a level that is sufficient to have a very low type-II error probability for very small effect sizes.
The only requirement for using NPST to provide evidence for the null-hypothesis is to determine a margin of error that is considered acceptable. For example, it may be acceptable to infer that a weight-loss-medication has no effect on weight if weight loss is less than 1 pound over a one month period. It is impossible to demonstrate that the medication has absolutely no effect, but it is possible to demonstrate with high probability that the effect is unlikely to be more than 1 pound.

Bayes-Factors

The main difference between Bayes-Factors and NPST is that NPST yields type-II error rates for a single a priori effect size. In contrast, Bayes-Factors do not postulate a single effect size, but use an a priori distribution of effect sizes. Bayes-Factors are based on the probability that the observed effect size is based on a true effect size of zero relative to the probability that the observed effect size is based on a true effect size within a range of a priori effect sizes. Bayes-Factors are the ratio of the probabilities for the two hypotheses. It is arbitrary which hypothesis is in the numerator and which hypothesis is in the denominator. When the null-hypothesis is placed in the numerator and the alternative hypothesis is placed in the denominator, Bayes-Factors (BF01) decrease towards zero the more the data suggest that an effect is present. In this way, Bayes-Factors behave very much like p-values. As the signal-to-noise ratio increases, p-values and BF01 decrease.
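In symbols, the ratio described above can be written as follows. The notation is an assumption introduced here for illustration: D stands for the observed data, \delta for the true effect size, and \pi(\delta) for the a priori distribution of effect sizes under the alternative hypothesis.

```latex
BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)}
        = \frac{p(D \mid \delta = 0)}{\int p(D \mid \delta)\, \pi(\delta)\, d\delta},
\qquad
BF_{10} = \frac{1}{BF_{01}}
```

The denominator averages the fit of all effect sizes covered by the prior, which is why the same data can yield different Bayes-Factors for different priors.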

There are two practical problems in the use of Bayes-Factors. One problem is that Bayes-Factors depend on the specification of the a priori distribution of effect sizes. It is therefore important to keep in mind that results can never be interpreted as evidence for the null-hypothesis or against the null-hypothesis per se. A Bayes-Factor that favors the null-hypothesis in the comparison to one a priori distribution can favor the alternative hypothesis for another a priori distribution of effect sizes. This makes Bayes-Factors impractical for the purpose of demonstrating that an effect does not exist (e.g., a drug does not have positive treatment effects). The second problem is that Bayes-Factors only provide quantitative information about the relative support for the two hypotheses. Without a clear criterion value, Bayes-Factors cannot be used to claim that an effect is present or absent.

Selecting a Criterion Value for Bayes-Factors

A number of criterion values seem plausible. NPST always leads to a decision depending on the criterion for p-values. An equivalent criterion value for Bayes-Factors would be a value of 1. Values greater than 1 favor the null-hypothesis over the alternative, whereas values less than 1 favor the alternative hypothesis. This criterion avoids inconclusive results. The disadvantage with this criterion is that Bayes-Factors close to 1 are very variable and prone to have high type-I and type-II error rates. To avoid this problem, it is possible to use more stringent criterion values. This reduces the type-I and type-II error rates, but it also increases the rate of inconclusive results in noisy studies. Bayes-Factors of 3 (a 3 to 1 ratio in favor of the null over an alternative hypothesis) are often used to suggest that the data favor one hypothesis over another, and Bayes-Factors of 10 or more are often considered strong support. One problem with these criterion values is that there have been no systematic studies of the type-I and type-II error rates for these criterion values. Moreover, there have been no systematic sensitivity studies; that is, the ability of studies to reach a criterion value for different signal-to-noise ratios.

Wagenmakers et al. (2011) argued that p-values can be misleading and that Bayes-Factors provide more meaningful results. To make their point, they investigated Bem’s (2011) controversial studies that seemed to demonstrate the ability to anticipate random events in the future (time-reversed causality). Using a significance criterion of p < .05 (one-tailed), 9 out of 10 studies showed evidence of an effect. For example, in Study 1, participants were able to predict the location of erotic pictures 54% of the time, even before a computer randomly generated the location of the picture. Using a more liberal type-I error rate of p < .10 (one-tailed), all 10 studies produced evidence for extrasensory perception.

Wagenmakers et al. (2011) re-examined the data with Bayes-Factors. They used a Bayes-Factor of 3 as the criterion value. Using this value, six tests were inconclusive, three provided substantial support for the null-hypothesis (the observed effect was just due to noise in the data) and only one test produced substantial support for ESP.   The most important point here is that the authors interpreted their results using a Bayes-Factor of 3 as criterion. If they had used a Bayes-Factor of 10 as criterion, they would have concluded that all studies were inconclusive. If they had used a Bayes-Factor of 1 as criterion, they would have concluded that 6 studies favored the null-hypothesis and 4 studies favored the presence of an effect.

Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers used Bayes-Factors in a design with optional stopping. They agreed to stop data collection when the Bayes-Factor reached a criterion value of 10 in favor of either hypothesis. The implementation of a decision to stop data collection suggests that a Bayes-Factor of 10 was considered decisive. One reason for this stopping rule would be that it is extremely unlikely that a Bayes-Factor might swing to favoring the other hypothesis if more data were collected. By the same logic, a Bayes-Factor of 10 that favors the presence of an effect in an ESP study would suggest that further data collection is unnecessary because the data already provide rather strong evidence that an effect is present.

Tan, Dienes, Jansari, and Goh (2014) report a Bayes-Factor of 11.67 and interpret it as being “greater than 3 and strong evidence for the alternative over the null” (p. 19). Armstrong and Dienes (2013) report a Bayes-Factor of 0.87 and state that no conclusion follows from this finding because the Bayes-Factor is between 3 and 1/3. This statement implies that Bayes-Factors that meet the criterion value are conclusive.

In sum, a criterion-value of 3 has often been used to interpret empirical data and a criterion of 10 has been used as strong evidence in favor of an effect or in favor of the null-hypothesis.

Meta-Analysis of Multiple Studies

As sample sizes increase, noise decreases and the signal-to-noise ratio increases. Rather than increasing the sample size of a single study, it is also possible to conduct multiple smaller studies and to combine the evidence of studies in a meta-analysis. The effect is the same. A meta-analysis based on several original studies reduces random noise in the data and can produce higher signal-to-noise ratios when an effect is present. On the flip side, a low signal-to-noise ratio in a meta-analysis implies that the signal is very weak and that the true effect size is close to zero. As the evidence in a meta-analysis is based on the aggregation of several smaller studies, the results should be consistent. That is, the effect size in the smaller studies and the meta-analysis is the same. The only difference is that aggregation of studies reduces noise, which increases the signal-to-noise ratio.   A meta-analysis therefore can highlight the problem of interpreting a low signal-to-noise ratio (BF10 < 1, p > .05) in small studies as evidence for the null-hypothesis. In NPST this result would be flagged as not trustworthy because the type-II error probability is high. For example, a non-significant result with a type-II error of 80% (20% power) is not particularly interesting and nobody would want to accept the null-hypothesis with such a high error probability. Holding the effect size constant, the type-II error probability decreases as the number of studies in a meta-analysis increases and it becomes increasingly more probable that the true effect size is below the value that was considered necessary to demonstrate an effect. Similarly, Bayes-Factors can be misleading in small samples and they become more conclusive as more information becomes available.

A simple demonstration of the influence of sample size on Bayes-Factors comes from Rouder and Morey (2011). The authors point out that it is not possible to combine Bayes-Factors by multiplying Bayes-Factors of individual studies. To address this problem, they created a new method to combine Bayes-Factors. This Bayesian meta-analysis is implemented in the BayesFactor R package. Rouder and Morey (2011) applied their method to a subset of Bem’s data. However, they did not use it to examine the combined Bayes-Factor for the 10 studies that Wagenmakers et al. (2011) examined individually. I submitted the t-values and sample sizes of all 10 studies to a Bayesian meta-analysis and obtained a strong Bayes-Factor in favor of an effect, BF10 = 16e7, that is, 16 million to 1 in favor of ESP. Thus, a meta-analysis of all 10 studies strongly suggests that Bem’s data are not random.

Another way to meta-analyze Bem’s 10 studies is to compute a Bayes-Factor based on the finding that 9 out of 10 studies produced a significant result. The p-value for this outcome under the null-hypothesis is extremely small: 1.86e-11, that is, p < .00000000002. It is also possible to compute a Bayes-Factor for the binomial probability of 9 out of 10 successes with a probability of 5% to have a success under the null-hypothesis. The alternative hypothesis can be specified in several ways, but one common option is to use a uniform distribution from 0 to 1 (beta(1,1)). This distribution allows for the power of a study to range anywhere from 0 to 1 and makes no a priori assumptions about the true power of Bem’s studies. The Bayes-Factor strongly favors the presence of an effect, BF10 = 20e9. In sum, a meta-analysis of Bem’s 10 studies strongly supports the presence of an effect and rejects the null-hypothesis.
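The binomial calculation can be sketched in a few lines. The tail probability follows directly from the binomial distribution with a success probability of .05. The Bayes-Factor part is only one possible setup, assumed here for illustration: it compares the probability of exactly 9 out of 10 significant results under the null-hypothesis with the average probability under a uniform beta(1,1) prior, which equals 1/(n + 1). It yields odds of billions to one, but it is not meant to reproduce the exact value reported above, which may rest on a different setup.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_studies, n_significant, alpha = 10, 9, 0.05

# Probability of 9 or more significant results if only type-I errors occur.
p_tail = sum(binom_pmf(k, n_studies, alpha)
             for k in range(n_significant, n_studies + 1))

# Assumed setup: under a uniform beta(1,1) prior on the success rate, every
# count from 0 to n is equally likely, so the marginal probability is 1/(n+1).
marginal_h1 = 1.0 / (n_studies + 1)
bf10_binomial = marginal_h1 / binom_pmf(n_significant, n_studies, alpha)

print(p_tail)
print(bf10_binomial)
```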

The meta-analytic results raise concerns about the validity of Wagenmakers et al.’s (2011) claim that Bem presented weak evidence and that p-values provide misleading information. Instead, Wagenmakers et al.’s Bayes-Factors are misleading and fail to detect an effect that is clearly present in the data.

The Devil is in the Priors: What is the Alternative Hypothesis in the Default Bayesian t-test?

Wagenmakers et al. (2011) computed Bayes-Factors using the default Bayesian t-test. The default Bayesian t-test uses a Cauchy distribution centered over zero as the alternative hypothesis. The Cauchy distribution has a scaling factor. Wagenmakers et al. (2011) used a default scaling factor of 1. Since then, the default scaling parameter has changed to .707. Figure 1 illustrates Cauchy distributions with scaling factors .2, .5, .707, and 1.

[Figure 1: Cauchy distributions with scaling factors of .2, .5, .707, and 1]

The black line shows the Cauchy distribution with a scaling factor of d = .2. A scaling factor of d = .2 implies that 50% of the density of the distribution is in the interval between d = -.2 and d = .2. As the Cauchy distribution is centered over 0, this specification also implies that the null-hypothesis is considered much more likely than many other effect sizes, but it gives equal weight to effect sizes below and above an absolute value of d = .2. As the scaling factor increases, the distribution gets wider. With a scaling factor of 1, 50% of the density is within the range from -1 to 1 and 50% covers effect sizes with an absolute value greater than 1. The choice of the scaling parameter has predictable consequences for the Bayes-Factor. As long as the true effect size is more extreme than the scaling parameter, Bayes-Factors will favor the alternative hypothesis and Bayes-Factors will increase towards infinity as sampling error decreases. However, for true effect sizes that are below the scaling parameter, Bayes-Factors may initially favor the null-hypothesis because the alternative hypothesis includes effect sizes that are more extreme than the true effect size. As sample sizes increase, the Bayes-Factor will change from favoring the null-hypothesis to favoring the alternative hypothesis. This can explain why Wagenmakers et al. (2011) found no support for ESP when Bem’s studies were examined individually, but a meta-analysis of all studies shows strong evidence in favor of an effect.
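The statement that half of the prior density falls within plus or minus the scaling factor can be checked with the closed-form Cauchy distribution function. This is a minimal sketch; the function names are chosen for this illustration only.

```python
import math

def cauchy_cdf(x, scale):
    """Cumulative distribution function of a Cauchy centered over zero."""
    return 0.5 + math.atan(x / scale) / math.pi

def mass_within(limit, scale):
    """Prior probability that the effect size lies between -limit and +limit."""
    return cauchy_cdf(limit, scale) - cauchy_cdf(-limit, scale)

for scale in (0.2, 0.5, 0.707, 1.0):
    # Share of the prior within +/- scale, and share beyond |d| = 1.
    print(scale, mass_within(scale, scale), 1.0 - mass_within(1.0, scale))
```

The second number in each row shows how much prior weight a given scaling factor places on effect sizes with an absolute value greater than 1.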

The effect of the scaling parameter on Bayes-Factors is illustrated in the following Figure.

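[Figure 2: Bayes-Factors as a function of sample size for Cauchy priors with scaling factors of 1 and .2]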

The solid lines show Bayes-Factors (y-axis) as a function of sample size for a scaling parameter of 1. The black line shows Bayes-Factors favoring an effect (BF10) when the effect size is actually d = .2 and the red line shows Bayes-Factors favoring the null-hypothesis when the effect size is actually 0. The green line marks a criterion value of 3 to suggest “substantial” support for either hypothesis (Wagenmakers et al., 2011). The figure shows that Bem’s sample sizes of 50 to 150 participants could never produce substantial evidence for an effect when the observed effect size is d = .2. In contrast, an effect size of 0 would provide substantial support for the null-hypothesis. Of course, actual effect sizes in samples will deviate from these hypothetical values, but sampling error will average out. Thus, for studies that occasionally show support for an effect there will also be studies that underestimate support for an effect. The dotted lines illustrate how the choice of the scaling factor influences Bayes-Factors. With a scaling factor of d = .2, Bayes-Factors would never favor the null-hypothesis. They would also not support the alternative hypothesis in studies with fewer than 150 participants, and even in these studies the Bayes-Factor is likely to be just above 3.
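The pattern in the figure can be approximated with a rough sketch. Note the assumptions: this is not the default Bayesian t-test itself. It replaces the t-likelihood with a normal likelihood for the observed effect size (standard error 1/sqrt(N), one-sample design) and integrates numerically over a Cauchy prior. The function names are hypothetical and the values will differ somewhat from those of the actual default test.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, scale):
    return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

def bf10_cauchy(d_obs, n, scale, steps=4000):
    """Approximate BF10 for an observed effect size d_obs in n cases.

    Assumption: normal likelihood with standard error 1/sqrt(n); the
    marginal likelihood under H1 is computed with the trapezoid rule.
    """
    se = 1.0 / math.sqrt(n)
    lo, hi = d_obs - 10.0 * se, d_obs + 10.0 * se  # likelihood is ~0 outside
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        delta = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_pdf(d_obs, delta, se) * cauchy_pdf(delta, scale)
    marginal_h1 = total * h
    return marginal_h1 / normal_pdf(d_obs, 0.0, se)

for n in (50, 100, 150):
    print(n, bf10_cauchy(0.2, n, 1.0), bf10_cauchy(0.2, n, 0.2),
          1.0 / bf10_cauchy(0.0, n, 1.0))
```

Under these assumptions, an observed effect of d = .2 stays below a BF10 of 3 for a scaling factor of 1 across these sample sizes, whereas an observed effect of 0 yields a BF01 above 3.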

Figure 2 explains why Wagenmakers et al. (2011) mainly found inconclusive results. On the one hand, the effect size was typically around d = .2. As a result, the Bayes-Factor did not provide clear support for the null-hypothesis. On the other hand, an effect size of d = .2 in studies with 80% power is insufficient to produce Bayes-Factors favoring the presence of an effect when the alternative hypothesis is specified as a Cauchy distribution centered over 0. This is especially true when the scaling parameter is larger, but even for a seemingly small scaling parameter Bayes-Factors would not provide strong support for a small effect. The reason is that the alternative hypothesis is centered over 0. As a result, it is difficult to distinguish the null-hypothesis from the alternative hypothesis.

A True Alternative Hypothesis: Centering the Prior Distribution over a Non-Null Effect Size

A Cauchy distribution is just one possible way to formulate an alternative hypothesis. It is also possible to formulate the alternative hypothesis as (a) a uniform distribution of effect sizes in a fixed range (e.g., the effect size is probably small to moderate, d = .2 to .5) or (b) a normal distribution centered over an effect size (e.g., the effect is most likely to be small, but there is some uncertainty about how small, d = .2 +/- SD = .1) (Dienes, 2014).

Dienes provided an online app to compute Bayes-Factors for these prior distributions. I used the R code posted by John Christie to create the following figure. It shows Bayes-Factors for three a priori uniform distributions. Solid lines show Bayes-Factors for effect sizes in the range from 0 to 1. Dotted lines show Bayes-Factors for effect sizes in the range from 0 to .5. The dot-line pattern shows Bayes-Factors for effect sizes in the range from .1 to .3. The most noteworthy observation is that prior distributions that are not centered over zero can actually provide evidence for a small effect with Bem’s (2011) sample sizes. The second observation is that these priors can also favor the null-hypothesis when the true effect size is zero (red lines). Bayes-Factors become more conclusive for more precisely formulated alternative hypotheses. The strongest evidence is obtained by contrasting the null-hypothesis with a narrow interval of possible effect sizes in the .1 to .3 range. The reason is that in this comparison weak effects below .1 clearly favor the null-hypothesis. For an expected effect size of d = .2, a range of values from 0 to .5 seems reasonable and can produce Bayes-Factors that exceed a value of 3 in studies with 100 to 200 participants. Thus, this is a reasonable prior for Bem’s studies.

[Figure 3: Bayes-Factors as a function of sample size for three uniform prior distributions]
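For uniform priors, the Bayes-Factor has a simple closed form if the likelihood of the observed effect size is approximated by a normal distribution with a standard error of 1/sqrt(N). The following sketch rests on that assumption and on a one-sample design; it is not the R code used for the figure, and the function names are made up for this illustration.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bf10_uniform(d_obs, n, lower, upper):
    """Approximate BF10 for a uniform prior on effect sizes in [lower, upper].

    Assumption: normal likelihood with standard error 1/sqrt(n). Averaging
    this likelihood over the interval gives a difference of two normal CDFs
    divided by the width of the interval.
    """
    se = 1.0 / math.sqrt(n)
    area = normal_cdf((upper - d_obs) / se) - normal_cdf((lower - d_obs) / se)
    marginal_h1 = area / (upper - lower)
    return marginal_h1 / normal_pdf(d_obs, 0.0, se)

# Observed effect of d = .2 versus d = 0 with 100 participants.
for lower, upper in ((0.0, 1.0), (0.0, 0.5), (0.1, 0.3)):
    print((lower, upper), bf10_uniform(0.2, 100, lower, upper),
          1.0 / bf10_uniform(0.0, 100, lower, upper))
```

Under these assumptions, narrower intervals around d = .2 give stronger support for an effect when the observed effect size is d = .2.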

It is also possible to formulate alternative hypotheses with normal distributions around an a priori effect size. Dienes recommends setting the mean to 0 and setting the standard deviation to the expected effect size. The problem with this approach is again that the alternative hypothesis is centered over 0 (in a two-tailed test). Moreover, the true effect size is not known. Like the scaling factor in the Cauchy distribution, using a higher value leads to a wider spread of alternative effect sizes and makes it harder to show evidence for small effects and easier to find evidence in favor of H0. However, the R code also allows specifying non-null means for the alternative hypothesis. The next figure shows Bayes-Factors for three normally distributed alternative hypotheses. The solid lines show Bayes-Factors with mean = 0 and SD = .2. The dotted line shows Bayes-Factors for a mean of d = .2 (a small effect and the effect predicted by Bem) and a relatively wide standard deviation of .5. This means 95% of effect sizes are in the range from -.8 to 1.2. The broken (dot/dash) line shows Bayes-Factors with a mean of d = .2 and a narrower SD of .2. The 95% interval still covers a rather wide range of effect sizes from -.2 to .6, but due to the normal distribution, effect sizes close to the expected effect size of d = .2 are weighted more heavily.

[Figure 4: Bayes-Factors as a function of sample size for three normal prior distributions]

The first observation is that centering the normal distribution over 0 leads to the same problem as the Cauchy distribution. When the effect size is really 0, Bayes-Factors provide clear support for the null-hypothesis. However, when the effect size is small, d = .2, Bayes-Factors fail to provide support for the presence of an effect in samples with fewer than 150 participants (this is a one-sample design; the equivalent sample size for between-subject designs is N = 600). The dotted line shows that simply moving the mean from d = 0 to d = .2 has relatively little effect on Bayes-Factors. Due to the wide range of effect sizes, a small effect is not sufficient to produce Bayes-Factors greater than 3 in small samples. The broken line shows more promising results. With d = .2 and SD = .2, Bayes-Factors in small samples with fewer than 100 participants are inconclusive. For sample sizes of more than 100 participants, both lines are above the criterion value of 3. This means that a Bayes-Factor of 3 or more can support the null-hypothesis when it is true and it can show that a small effect is present when an effect is present.
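For two-tailed normal priors, the comparison can be sketched in closed form. The sketch assumes a normal likelihood for the observed effect size with a standard error of 1/sqrt(N) (one-sample design); under this assumption the marginal distribution of the observed effect size under the alternative is again normal, with variance equal to the prior variance plus the squared standard error. It is not the R code behind the figure, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def bf10_normal(d_obs, n, prior_mean, prior_sd):
    """Approximate BF10 for a normal prior on the true effect size.

    Assumption: normal likelihood with standard error 1/sqrt(n), so the
    observed effect under H1 is N(prior_mean, sqrt(prior_sd^2 + se^2)).
    """
    se = 1.0 / math.sqrt(n)
    marginal_sd = math.sqrt(prior_sd**2 + se**2)
    return normal_pdf(d_obs, prior_mean, marginal_sd) / normal_pdf(d_obs, 0.0, se)

# The three priors from the figure, evaluated for 100 participants.
for prior_mean, prior_sd in ((0.0, 0.2), (0.2, 0.5), (0.2, 0.2)):
    print((prior_mean, prior_sd),
          bf10_normal(0.2, 100, prior_mean, prior_sd),        # effect d = .2
          1.0 / bf10_normal(0.0, 100, prior_mean, prior_sd))  # effect d = 0
```

Under these assumptions, only the prior with a mean of .2 and an SD of .2 passes the criterion value of 3 for both an observed effect of d = .2 and an observed effect of 0 with 100 participants.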

Another way to specify the alternative hypothesis is to use a one-tailed alternative hypothesis (a half-normal).   The mode (the center of the normal-distribution) of the distribution is 0. The solid line shows a standard deviation of .8. The dotted line shows results with standard deviation = .5 and the broken line shows results for a standard deviation of d = .2. The solid line favors the null-hypothesis and it requires sample sizes of more than 130 participants before an effect size of d = .2 produces a Bayes-Factor of 3 or more. In contrast, the broken line discriminates against the null-hypothesis and practically never supports the null-hypothesis when it is true. The dotted line with a standard deviation of .5 works best. It always shows support for the null-hypothesis when it is true and it can produce Bayes-Factors greater than 3 with a bit more than 100 participants.

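[Figure 5: Bayes-Factors as a function of sample size for three half-normal (one-tailed) prior distributions]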

In conclusion, the simulations show that Bayes-Factors depend on the specification of the prior distribution and sample size. This has two implications. Unreasonable priors will lower the sensitivity/power of Bayes-Factors to support either the null-hypothesis or the alternative hypothesis when these hypotheses are true. Unreasonable priors will also bias the results in favor of one of the two hypotheses. As a result, researchers need to justify the choice of their priors and they need to be careful when they interpret results. It is particularly difficult to interpret Bayes-Factors when the alternative hypothesis is diffuse and the null-hypothesis is supported. In this case, the evidence merely shows that the null-hypothesis fits the data better than the alternative, but the alternative is a composite of many effect sizes and some of these effect sizes may fit the data better than the null-hypothesis.

Comparison of Different Prior Distributions with Bem’s (2011) ESP Experiments

To examine the influence of prior distributions on Bayes-Factors, I computed Bayes-Factors using several prior distributions. I used a d~Cauchy(1) distribution because this distribution was used by Wagenmakers et al. (2011). I used three uniform prior distributions with ranges of effect sizes from 0 to 1, 0 to .5, and .1 to .3. Based on Dienes’s recommendation, I also used a normal distribution centered on zero with the expected effect size as the standard deviation. I used both two-tailed and one-tailed (half-normal) distributions. Based on a Twitter recommendation by Alexander Etz, I also centered the normal distribution on the effect size, d = .2, with a standard deviation of .2.

[Table 1: Bayes-Factors for Bem’s (2011) studies with different prior distributions]

The d~Cauchy(1) prior used by Wagenmakers et al. (2011) gives the weakest support for an effect. The table also includes the product of Bayes-Factors. The results confirm that the product is not a meaningful statistic that can be used to conduct a meta-analysis with Bayes-Factors. The last column shows Bayes-Factors based on a traditional fixed-effect meta-analysis of effect sizes in all 10 studies. Even the d~Cauchy(1) prior now shows strong support for the presence of an effect even though it often favored the null-hypotheses for individual studies. This finding shows that inferences about small effects in small samples cannot be trusted as evidence that the null-hypothesis is correct.

Table 1 also shows that all other prior distributions tend to favor the presence of an effect even in individual studies. Thus, these priors show consistent results for individual studies and for a meta-analysis of all studies. The strength of evidence for an effect is predictable from the precision of the alternative hypothesis. The uniform distribution with a wide range of effect sizes from 0 to 1, gives the weakest support, but it still supports the presence of an effect. This further emphasizes how unrealistic the Cauchy-distribution with a scaling factor of 1 is for most studies in psychology. For most studies in psychology effect sizes greater than 1 are rare. Moreover, effect sizes greater than one do not need fancy statistics. A simple visual inspection of a scatter plot is sufficient to reject the null-hypothesis. The strongest support for an effect is obtained for the uniform distribution with a range of effect sizes from .1 to .3. The advantage of this range is that the lower bound is not 0. Thus, effect sizes below the lower bound provide evidence for H0 and effect sizes above the lower bound provide evidence for an effect. The lower bound can be set by a meaningful consideration of what effect sizes might be theoretically or practically so small that they would be rather uninteresting even if they are real. Personally, I find uniform distributions appealing because they best express uncertainty about an effect size. Most theories in psychology do not make predictions about effect sizes. Thus, it seems impossible to say that an effect is expected to be small (d = .2) or moderate (d = .5). It seems easier to say that an effect is expected to be small (d = .1 to .3) or moderate (.3 to .6) or large (.6 to 1). Cohen used fixed values only because power analysis requires a single value. 
As Bayesian statistics allows the specification of ranges, it makes sense to specify a range of values without the need to make predictions about which values in this range are more likely. However, the normal distribution provides similar results. Again, the strength of evidence for an effect increases with the precision of the predicted effect. The weakest support for an effect is obtained with a normal distribution centered over 0 and a two-tailed test. This specification is similar to a Cauchy distribution, but it uses the normal distribution. However, by setting the standard deviation to the expected effect size, Bayes-Factors show evidence for an effect. The evidence for an effect becomes stronger by centering the distribution over the expected effect size or by using a half-normal (one-tailed) test that makes predictions about the direction of the effect.

To summarize, the main point is that Bayes-Factors depend on the choice of the alternative distribution. Bayesian statisticians are of course well aware of this fact. However, in practical applications of Bayesian statistics, the importance of the prior distribution is often ignored, especially when Bayes-Factors favor the null-hypothesis. Although this finding only means that the data support the null-hypothesis more than the alternative hypothesis, the alternative hypothesis is often described in vague terms as a hypothesis that predicted an effect. However, the alternative hypothesis does not just predict that there is an effect. It makes predictions about the strength of effects, and it is always possible to specify an alternative that predicts an effect that is still consistent with the data by choosing a small effect size. Thus, Bayesian statistics can only produce meaningful results if researchers specify a meaningful alternative hypothesis. It is therefore surprising how little attention Bayesian statisticians have devoted to the issue of specifying the prior distribution. The most useful advice comes from Dienes’s recommendation to specify the prior distribution as a normal distribution centered over 0 and to set the standard deviation to the expected effect size. If researchers are uncertain about the effect size, they could try different values for small (d = .2), moderate (d = .5), or large (d = .8) effect sizes. Researchers should be aware that the current default setting of .707 in Rouder’s online app implies an expectation of a strong effect and that this setting will make it harder to show evidence for small effects and will inflate the risk of obtaining false support for the null-hypothesis.

Why Psychologists Should not Change the Way They Analyze Their Data

Wagenmakers et al. (2011) did not simply use Bayes-Factors to re-examine Bem’s claims about ESP. Like several other authors, they considered Bem’s (2011) article an example of major flaws in psychological science. Thus, they titled their article with the rather strong admonition that “Psychologists Must Change The Way They Analyze Their Data.” They blame the use of p-values and significance tests as the root cause of all problems in psychological science. “We conclude that Bem’s p values do not indicate evidence in favor of precognition; instead, they indicate that experimental psychologists need to change the way they conduct their experiments and analyze their data” (p. 426). The crusade against p-values starts with the claim that it is easy to obtain data that reject the null-hypothesis even when the null-hypothesis is true. “These experiments highlight the relative ease with which an inventive researcher can produce significant results even when the null hypothesis is true” (p. 427). However, this statement is incorrect. The probability of getting significant results is clearly specified by the type-I error rate. When the null-hypothesis is true, a significant result will emerge only 5% of the time; that is, in 1 out of 20 studies. The probability of making a type-I error repeatedly decreases exponentially. For two studies, the probability of obtaining two type-I errors is only p = .0025 or 1 out of 400 (20 * 20 studies). If some non-significant results are obtained, the binomial distribution gives the probability of obtaining the observed frequency of significant results if the null-hypothesis were true. Bem obtained 9 out of 10 significant results. With a probability of p = .05, the binomial probability is 1.86e-11. Thus, there is strong evidence that Bem’s results are not type-I errors. He did not just go into his lab, run 10 studies, and obtain 9 significant results by chance alone.
P-values correctly quantify how unlikely this event is in a single study and how this probability decrease as the number of studies increases. The table also shows that all Bayes-Factors confirm this conclusion when the results of all studies are combined in a meta-analysis.   It is hard to see how p-values can be misleading when they lead to the same conclusion as Bayes-Factors. The combined evidence presented by Bem cannot be explained by random sampling error. The data are inconsistent with the null-hypothesis. The only misleading statistic is provided by a Bayes-Factor with an unreasonable prior distribution of effect sizes in small samples. All other statistics agree that the data show an effect.
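
The type-I error arithmetic in this section can be checked with a few lines of R. This is only a minimal sketch: it assumes 10 independent studies, each tested at a type-I error rate of .05, and the variable names are illustrative rather than taken from any published analysis.

```r
# Assumptions: independent studies, each tested at alpha = .05.
alpha <- 0.05
n_studies <- 10
n_significant <- 9

# Probability of two type-I errors in two studies (1 out of 400).
p_two_errors <- alpha^2

# Probability of exactly 9 significant results in 10 studies under the null.
p_exactly_9 <- dbinom(n_significant, n_studies, alpha)

# Probability of 9 or more significant results in 10 studies under the null.
p_at_least_9 <- pbinom(n_significant - 1, n_studies, alpha, lower.tail = FALSE)

print(c(p_two_errors = p_two_errors,
        p_exactly_9 = p_exactly_9,
        p_at_least_9 = p_at_least_9))
```

Whether one reports the probability of exactly 9 or of at least 9 significant results makes little practical difference here, because 10 out of 10 is far less likely still.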

Wagenmakers et al.’s (2011) next argument is that p-values only consider the conditional probability of the data when the null-hypothesis is true, but that it is also important to consider the conditional probability when the alternative hypothesis is true. They fail to mention, however, that this second conditional probability is equivalent to the concept of statistical power. A p-value of less than .05 means that a significant result would be obtained only 5% of the time when the null-hypothesis is true. The probability of a significant result when an effect is present depends on the size of the effect and sampling error and can be computed using standard tools for power analysis. Importantly, Bem (2011) actually carried out an a priori power analysis and planned his studies to have 80% power. In a one-sample t-test with standardized effect sizes, the standard error is 1/sqrt(N). Thus, with 100 participants, the standard error is .1. With an effect size of d = .2, the signal-to-noise ratio is .2/.1 = 2. Using a one-tailed significance test, the criterion value for significance is 1.66. The implied power is 63%. Bem used an effect size of d = .25 to suggest that he had 80% power. Even with a conservative estimate of 50% power, the likelihood ratio of obtaining a significant result is .50/.05 = 10. This likelihood ratio can be interpreted like a Bayes-Factor. Thus, in a study with 50% power, it is 10 times more likely to obtain a significant result when an effect is present than when the null-hypothesis is true. Thus, even in studies with modest power, a significant result favors the alternative hypothesis much more than the null-hypothesis. To argue that p-values provide weak evidence for an effect implies that a study had very low power to show an effect. For example, if a study has only 10% power, the likelihood ratio is only 2 in favor of an effect being present. Importantly, low power cannot explain Bem’s results because low power would imply that most studies produced non-significant results.
However, he obtained 9 significant results in 10 studies. This success rate is itself an estimate of power and would suggest that Bem had 90% power in his studies. With 90% power, the likelihood ratio is .90/.05 = 18. The Bayesian argument against p-values is only valid for the interpretation of p-values in a single study in the absence of any information about power. Not surprisingly, Bayesians often focus on Fisher’s use of p-values. However, Neyman and Pearson emphasized the need to also consider type-II error rates, and Cohen has emphasized the need to conduct power analysis to ensure that small effects can be detected. In recent years, there has been an encouraging trend to increase the power of studies. One important consequence of high-powered studies is that high power increases the evidential value of significant results because a significant result is much more likely to emerge when an effect is present than when it is not present. However, it is important to note that the most likely outcome in underpowered studies is a non-significant result. Thus, it is unlikely that a set of studies can produce false evidence for an effect because a meta-analysis would reveal that most studies fail to show an effect. The main reason for the replication crisis in psychology is the practice of not reporting non-significant results. This is not a problem of p-values, but a problem of selective reporting. However, Bayes-Factors are not immune to reporting biases. As Table 1 shows, it would have been possible to provide strong evidence for ESP using Bayes-Factors as well.
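
The power values and likelihood ratios used in this argument can be reproduced roughly as follows. This is a sketch under stated assumptions: a one-sample, one-tailed t-test with alpha = .05 and N = 100, with power taken from the noncentral t-distribution. The function names `power_one_sample` and `likelihood_ratio` are my own illustrative helpers, not part of any package.

```r
# Assumptions: one-sample t-test, one-tailed, standardized effect size d,
# so that the standard error is 1 / sqrt(n).
power_one_sample <- function(d, n, alpha = 0.05) {
  df <- n - 1
  t_crit <- qt(1 - alpha, df)   # one-tailed criterion value (about 1.66 for n = 100)
  ncp <- d * sqrt(n)            # signal-to-noise ratio
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE)
}

# Likelihood ratio of a significant result: P(sig | effect) / P(sig | null).
likelihood_ratio <- function(power, alpha = 0.05) power / alpha

power_d20 <- power_one_sample(d = 0.20, n = 100)
power_d25 <- power_one_sample(d = 0.25, n = 100)
print(round(c(power_d20 = power_d20, power_d25 = power_d25), 2))

# Likelihood ratios for 10%, 50%, and 90% power at alpha = .05.
print(likelihood_ratio(c(0.10, 0.50, 0.90)))
```

The likelihood ratio is simply power divided by the type-I error rate, which is why it grows in direct proportion to power.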

To demonstrate the virtues of Bayesian statistics, Wagenmakers et al. (2011) then presented their Bayesian analyses of Bem’s data. What is important here is how the authors explain the choice of their priors and how they interpret their results in the context of that choice. The authors state that they “computed a default Bayesian t test” (p. 430). The important word is default. This word makes it possible to present a Bayesian analysis without a justification of the prior distribution. The prior distribution is the default distribution, a one-size-fits-all prior that does not need any further elaboration. The authors do note that “more specific assumptions about the effect size of psi would result in a different test.” (p. 430). They do not mention that these different tests would also lead to different conclusions because the conclusion is always relative to the specified alternative hypothesis. Even less convincing is their claim that “we decided to first apply the default test because we did not feel qualified to make these more specific assumptions, especially not in an area as contentious as psi” (p. 430). It is true that the authors are not experts on psi, but that is hardly necessary when Bem (2011) presented a meta-analysis and made an a priori prediction about effect size. Moreover, they could have at least used a half-Cauchy given that Bem used one-tailed tests.
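
To get a feel for what a default Cauchy prior assumes about effect sizes, one can compute how much prior probability it places on large effects. The sketch below is illustrative only: the helper `prior_mass_beyond` is hypothetical, and the thresholds of .2, .5, and .8 are Cohen's conventional benchmarks for small, medium, and large effects, added here for orientation.

```r
# Prior probability that the absolute effect size exceeds a threshold
# under a Cauchy prior centered on 0 with a given scaling factor.
prior_mass_beyond <- function(threshold, scale) {
  2 * pcauchy(threshold, location = 0, scale = scale, lower.tail = FALSE)
}

scales <- c(r_1 = 1, r_0.707 = 0.707)   # scaling factors of 1 and .707
thresholds <- c(0.2, 0.5, 0.8, 1)       # Cohen's benchmarks, plus d = 1

mass <- sapply(scales, function(s) prior_mass_beyond(thresholds, s))
rownames(mass) <- paste0("|d| > ", thresholds)
print(round(mass, 2))
```

With a scaling factor of 1, half of the prior probability falls on effect sizes larger than 1 in absolute value, which follows directly from the definition of the Cauchy scale parameter.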

The results of the default t-test are then used to suggest that “a default Bayesian test confirms the intuition that, for large sample sizes, one-sided p values higher than .01 are not compelling” (p. 430). This statement ignores their own critique of p-values, namely that how compelling a p-value is depends on the power of a study. A p-value of .01 in a study with 10% power is not compelling because it is a very unlikely outcome no matter whether an effect is present or not. However, in a study with 50% power, a p-value of .01 is very compelling because the likelihood ratio is 50. That is, it is 50 times more likely to get a significant result at p = .01 in a study with 50% power when an effect is present than when an effect is not present.

The authors then emphasize that they “did not select priors to obtain a desired result” (p. 430). This statement can be confusing to non-Bayesian readers. What this statement means is that Bayes-Factors do not entail statements about the probability that ESP exists or does not exist. However, Bayes-Factors do require specification of a prior distribution. Thus, the authors did select a prior distribution, namely the default distribution, and Table 1 shows that their choice of the prior distribution influenced the results.

The authors do directly address the choice of the prior distribution and state “we also examined other options, however, and found that our conclusions were robust. For a wide range of different non-default prior distributions on effect sizes, the evidence for precognition is either non-existent or negligible” (p. 430). These results are reported in a supplementary document. In these materials, the authors show how the scaling factor clearly influences results: small scaling factors suggest an effect is present, whereas larger scaling factors favor the null-hypothesis. However, Bayes-Factors in favor of an effect are not very strong. The reason is that the prior distribution is centered over 0 and a two-tailed test is being used. This makes it very difficult to distinguish the null-hypothesis from the alternative hypothesis. As shown in Table 1, priors that contrast the null-hypothesis with an effect provide much stronger evidence for the presence of an effect. In their conclusion, the authors state “In sum, we conclude that our results are robust to different specifications of the scale parameter for the effect size prior under H1.” This statement is more correct than the statement in the article, where they claim that they considered a wide range of non-default prior distributions. They did not consider a wide range of different distributions. They considered a wide range of scaling parameters for a single distribution: a Cauchy-distribution centered over 0. If they had considered a wide range of prior distributions, as I did in Table 1, they would have found that Bayes-Factors for some prior distributions suggest that an effect is present.
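
How strongly a Bayes-Factor depends on the scaling factor and on the use of a one-sided prior can be illustrated with base R. The sketch below is not the computation used by Wagenmakers et al. (2011) or for Table 1. It assumes a hypothetical one-sample result of t = 2 with N = 100 and computes the Bayes-Factor as the marginal density of the observed t-value under the prior on the effect size, divided by its density under the null-hypothesis. The function `bf10` and its arguments are my own illustrative names.

```r
# Hypothetical inputs (not Bem's data): one-sample t-test, t = 2, N = 100.
t_obs <- 2
n <- 100

# BF10 = marginal density of t under H1 (effect size d drawn from `prior`)
# divided by the density of t under H0 (d = 0).
bf10 <- function(t, n, prior, lower = -Inf) {
  df <- n - 1
  center <- t / sqrt(n)        # effect size where the likelihood peaks
  half_width <- 10 / sqrt(n)   # the likelihood is negligible beyond this range
  from <- max(lower, center - half_width)
  to <- center + half_width
  integrand <- function(d) dt(t, df, ncp = d * sqrt(n)) * prior(d)
  integrate(integrand, from, to)$value / dt(t, df)
}

# Two-sided Cauchy priors centered on 0 with different scaling factors.
scales <- c(0.2, 0.5, 1, 2)
bf_two_sided <- sapply(scales, function(r) {
  bf10(t_obs, n, function(d) dcauchy(d, 0, r))
})

# One-sided (half-Cauchy) prior with scale 1: all prior mass on d > 0.
half_cauchy <- function(d) 2 * dcauchy(d, 0, 1)
bf_one_sided <- bf10(t_obs, n, half_cauchy, lower = 0)

print(round(rbind(scale = scales, BF10 = bf_two_sided), 2))
print(round(bf_one_sided, 2))
```

With these hypothetical inputs, the smallest scaling factor yields a Bayes-Factor above 1 and the largest a Bayes-Factor below 1, and the half-Cauchy prior yields a larger Bayes-Factor than the two-sided Cauchy prior with the same scale. The same data therefore point in different directions depending on the prior.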

The authors then deal with the concern that Bayes-Factors depend on sample size and that larger samples might lead to different conclusions, especially when smaller samples favor the null-hypothesis. “At this point, one may wonder whether it is feasible to use the Bayesian t test and eventually obtain enough evidence against the null hypothesis to overcome the prior skepticism outlined in the previous section.” The authors claimed that they are biased against the presence of an effect by a factor of 10e-24. Thus, it would require a Bayes-Factor greater than 10e24 to convince them that ESP exists. They then point out that the default Bayesian t-test, with a Cauchy(0,1) prior distribution, would produce this Bayes-Factor in a sample of 2,000 participants. They then propose that a sample size of N = 2,000 is excessive. This is not a principled robustness analysis. A much easier way to examine what would happen in a larger sample is to conduct a meta-analysis of the 10 studies, which already included 1,196 participants. As shown in Table 1, the meta-analysis would have revealed that even the default t-test favors the presence of an effect over the null-hypothesis by a factor of 6.55e10. This is still not sufficient to overcome prejudice against an effect of a magnitude of 10e-24, but it would have made readers wonder about the claim that Bayes-Factors are superior to p-values. There is also no need to use Bayesian statistics to be more skeptical. Skeptical researchers can also adjust the criterion value of a p-value if they want to lower the risk of a type-I error. Editors could have asked Bem to demonstrate ESP with p < .001 rather than .05 in each study, but they considered 9 out of 10 significant results at p < .05 (one-tailed) sufficient. As Bayesians provide no clear criterion values for when Bayes-Factors are sufficient, Bayesian statistics does not help editors decide how strong evidence has to be.
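
The relation between prior skepticism and the Bayes-Factor is a simple multiplication: posterior odds equal prior odds times the Bayes-Factor. The sketch below assumes that the prior odds quoted above are meant as 10^-24 (written as 1e-24 in R); that reading is my assumption. The Bayes-Factor is the meta-analytic value from Table 1.

```r
# Assumption: the skeptical prior odds in favor of ESP are 10^-24.
prior_odds <- 1e-24
bf_meta <- 6.55e10   # default Bayes-Factor for the meta-analysis (Table 1)

# Posterior odds = prior odds * Bayes-Factor.
posterior_odds <- prior_odds * bf_meta

# Bayes-Factor needed to move this skeptic to even odds.
bf_needed <- 1 / prior_odds

print(c(posterior_odds = posterior_odds, bf_needed = bf_needed))
```

The Bayes-Factor quantifies the evidence in the data, while the prior odds quantify skepticism. A skeptic can remain unconvinced even when the evidence itself is strong.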

Does This Mean ESP Exists?

As I have demonstrated, even Bayes-Factors using the most unfavorable prior distribution favor the presence of an effect in a meta-analysis of Bem’s 10 studies. Thus, Bayes-Factors and p-values strongly suggest that Bem’s data are not the result of random sampling error. It is simply too improbable that 9 out of 10 studies produce significant results when the null-hypothesis is true. However, this does not mean that Bem’s data provide evidence for a real effect because there are two explanations for systematic deviations from a random pattern (Schimmack, 2012). One explanation is that a true effect is present and that a study had good statistical power to produce a signal-to-noise ratio that produces a significant outcome. The other explanation is that no true effect is present, but that the reported results were obtained with the help of questionable research practices that inflate the type-I error rate. In a multiple-study article, publication bias cannot explain the result because all studies were carried out by the same researcher. Publication bias can only occur when a researcher conducts a single study and reports a significant result that was obtained by chance alone. However, if a researcher conducts multiple studies, type-I errors will not occur again and again, and questionable research practices (or fraud) are the only explanation for significant results when the null-hypothesis is actually true.

There have been numerous analyses of Bem’s (2011) data that show signs of questionable research practices (Francis, 2012; Schimmack, 2012; Schimmack, 2015). Moreover, other researchers have failed to replicate Bem’s results. Thus, there is no reason to believe in ESP based on Bem’s data even though Bayes-Factors and p-values strongly reject the hypothesis that sample means are just random deviations from 0. However, the problem is not that the data were analyzed with the wrong statistical method. The problem is that the data are not credible. It would be problematic to replace the standard t-test with the default Bayesian t-test just because the default Bayesian t-test gives the right answer with questionable data. The reason is that it would give the wrong answer with credible data. For example, it would suggest that no effect is present when a researcher conducts 10 studies with 50% power and honestly reports 5 non-significant results. Rather than correctly inferring from this pattern of results that an effect is present, the default Bayesian t-test, when applied to each study individually, would suggest that the evidence is inconclusive.

Conclusion

There are many ways to analyze data. There are also many ways to conduct Bayesian analysis. The stronger the empirical evidence is, the less important the statistical approach will be. When different statistical approaches produce different results, it is important to carefully examine the different assumptions of statistical tests that lead to the different conclusions based on the same data. There is no superior statistical method. Never trust a statistician who tells you that you are using the wrong statistical method. Always ask for an explanation why one statistical method produces one result and why another statistical method produces a different result. If one method seems to make more reasonable assumptions than another (data are not normally distributed, unequal variances, unreasonable assumptions about effect size), use the more reasonable statistical method. I have repeatedly asked Dr. Wagenmakers to justify his choice of the Cauchy(0,1) prior, but he has not provided any theoretical or statistical arguments for this extremely wide range of effect sizes.

So, I do not think that psychologists need to change the way they analyze their data. In studies with reasonable power (50% or more), significant results are much more likely to occur when an effect is present than when an effect is not present, and likelihood ratios will show similar results as Bayes-Factors with reasonable priors. Moreover, the probability of a type-I error in a single study is less important for researchers and science than the long-term rate of type-II errors. Researchers need to conduct many studies to build up a CV, get jobs and grants, and take care of their graduate students. Low-powered studies will lead to many non-significant results that are inconclusive. Thus, researchers need to conduct powerful studies to be successful. In the past, researchers often used questionable research practices to increase power without declaring the increased risk of a type-I error. However, in part due to Bem’s (2011) infamous article, questionable research practices are becoming less acceptable and direct replication attempts more quickly reveal questionable evidence. In this new culture of open science, only researchers who carefully plan studies will be able to provide consistent empirical support for a theory because the theory actually makes correct predictions. Once researchers report all of the relevant data, it is less important how these data are analyzed. In this new world of psychological science, it will be problematic to ignore power and to use the default Bayesian t-test because it will typically show no effect. Unless researchers are planning to build a career on confirming the absence of effects, they should conduct studies with high power and control type-I error rates by replicating and extending their own work.

19 thoughts on “Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior”

  1. BF-hacker

    I also developed a Bayes-Factor hacking tool for R. You need to install the BayesFactor package for it to work.

    You have to specify the critical BF you want (for a BF of 3 in favor of the null-hypothesis, keep the default value of 1/3), the p-value you got, whether it was a one-tailed or two-tailed test, the total sample size, and the number of groups (1 or 2).

    The code computes the scaling factor for the Cauchy distribution that you need to put in to get support for the null-hypothesis. Read out rsc and put in as a priori default value.

    Now you are ready to go to find strong support for the null-hypothesis, no matter what your actual results are.

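    ### settings
    library(BayesFactor)     # provides ttest.tstat()
    crit_bf = 1/3            # target BF10 (1/3 = BF of 3 in favor of the null-hypothesis)
    p = .02                  # observed p-value
    tail = 1                 # 1 = one-tailed, 2 = two-tailed
    N = 200                  # total sample size
    G = 1                    # number of groups (1 or 2)
    ### computation
    df = N - G               # degrees of freedom of the t-test
    t = qt(1-(p/tail),df)    # t-value implied by the p-value
    n1 = N / G               # sample size per group (equal groups assumed)
    n2 = if (G == 2) N / G else 0
    rsc = .1
    bf = 1
    # widen the Cauchy prior in steps of .1 until BF10 drops below crit_bf
    while(bf > crit_bf) {
    rsc = rsc + .1
    bf <- ttest.tstat(t, n1, n2, rscale = rsc, simple = TRUE)
    }
    ### result
    rsc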

  2. The prior distribution expresses your beliefs about the effect size before the data are taken into account. You can’t just pick and choose one in order to arrive at a preferred conclusion. You don’t have a range of priors, you only have the one.

    The only thing your R script is useful for is to determine how crazy your prior convictions must have been for you to see the observed evidence as compelling.

  3. Dear Joachim,

    thank you for your comment. I fully agree that it would be a questionable research practice to pick a prior to obtain a preferred result. That is exactly the point of my post. If you compute Bayes-Factors, report Bayes-Factors, and interpret Bayes-Factors, you need to provide a theoretical account of the prior distribution because your interpretation holds only for the chosen prior.

    My problem is with applications of Bayes-Factors where researchers pick a prior that seems unreasonable, like a Cauchy-distribution with a scaling factor of 1, provide no explanation for this choice, and explicitly write on Twitter that they have a subjective preference for this prior. In my opinion, a Cauchy-distribution with a scaling factor of 1 or even .707 is rarely justified in psychological research where effect sizes are in the 0 to .8 range with a mean of .5 and a mode of .3. The consequence of the default prior is only to make it more likely to provide support for the null-hypothesis and harder to find evidence for an effect.

    My post is not anti-Bayesian. It is aimed at the practice of pairing a theory that is supposed to make researchers think about their priors with a default prior that seems to suggest we can arrive at an objective conclusion about the null-hypothesis without thinking.

    Sincerely, Dr. R

    1. I think few people would argue against this. In fact, the choice of prior is what decades of literature on Bayesian hypothesis testing has been about and I bet every Bayesian you will ask will tell you that it is a critical consideration.

      I am not going to argue about the default prior – it’s up to those who proposed these priors, say Jeff Rouder and Richard Morey, whether they want to comment on this. The way I understand this is that the default prior is aimed to be widely usable for situations typically encountered in psychology research. If you want to use a different scaling factor than the default of 0.7ish then you should justify this – otherwise use the default. Similarly, if you want to use a non-zero mean prior you should have a good justification for that. The point of the default is that it should be applicable to most situations. This does not preclude you from proposing the use of a different prior if you feel that the default is inappropriate.

      I can see how it would be justified to use a non-zero mean prior in some situations with small effect sizes. I don’t think this is the case for Bem’s psi results though. When it comes to psi people frequently make the argument that you merely need to show evidence for an effect, regardless of how small. I think this is incorrect. What these researchers should be doing is to make explicit predictions of what effect size you should expect under a theoretical account for things like precognition and telepathy. My guess is that for most proposals (say quantum entanglement) the expected effect size should be orders of magnitude *smaller* than the effects these experiments actually observe. As such the experimental evidence actually supports more parsimonious explanations like QRPs or experimental artifacts.

      Of course, making such predictions is not always easy or even possible. This is what the default prior is for. If you don’t like that you can use my bootstrapped evidence methods :P. It doesn’t need a prior, just the uncertainty with which the effect size can be estimated under either hypothesis. I’ve reanalysed Bem’s data with that and find the same pattern of results as Wagenmakers et al. Only for experiment 9 the data supports H1 – but in this case the evidence is just barely at the criterion for what I’d call conclusive evidence.

      Whatever you do, extraordinary claims require extraordinary evidence, and I think only under indefensible choices do Bem’s data support H1 sufficiently.

      1. Dear Sam,

        thank you for your comment. I agree that many Bayesians are aware of the critical importance of choosing the right prior. However, many non-Bayesians do not understand how priors influence results. I disagree that the default prior is a reasonable prior for most research questions in psychology. As I show in Table 1, the prior is biased against finding an effect even when studies have reasonable power to detect an effect. I don’t think this is desirable. If a significant result in a study with 50% power does not count as reasonable evidence for an effect, we are essentially reducing power by increasing the criterion for evidence to be strong enough to show an effect. I think everybody needs to justify their priors and interpret Bayes-Factors as relative to the chosen prior.

        I also think you are confusing Bayes-Factors with the a priori hypothesis that an effect should be observed. If you are skeptical about an effect, you may need very strong evidence to convince you that an effect is present. The Bayes-Factor tells you how strong the evidence is given the data. This has nothing to do with your skepticism about an effect. Every skeptic will have to justify why they reject evidence that supports the presence of an effect by a ratio of a billion to 1. You do not have to revise your skeptical opinion, but you cannot reject the evidence as weak. I feel free to be skeptical about PSI because bias tests reveal that the reported results are biased and because replication studies often fail to show the effect.

        Sincerely, Dr. R

      2. Sam: “If you want to use a different scaling factor than the default of 0.7ish then you should justify this – otherwise use the default.”

        I don’t see why the default should require any less justification than the other values. More broadly I think our goals are perhaps misunderstood here; I am not in the business of arguing that the Bayes factors we advocate — the default families of Bayes factors — somehow yield objectively better answers than all other Bayes factors one could compute. As one who advocates that analysts think about the meaning of the priors, I am in a funny position: I can honestly say that I hope that people evaluate them and reject them if they feel they don’t line up well enough with the problem they’re trying to solve. The best thing about this is that in *rejecting* the prior, they have thought about it!

        I certainly do not want my methods to become universal, and I sincerely hope that in 10 or 15 years no one is using them, because we’ve moved beyond them. One might consider them like Bayesian training wheels; they provide enough constraint that one gets meaningful answers from them, they can be understood by everyone, and they have good, basic relationships to test statistics that people are familiar with. But no one should use training wheels forever, and anyone who teaches someone else to ride a bike hopes that those training wheels come off. The best outcome for our methods is that people eventually reject them for more flexible Bayes factors, as they learn to better instantiate their theoretical positions as reasonable statistical models.

      3. Dear Dr. Morey,

        I think we are in agreement that Bayes-Factors should not be used without thinking about the prior and how the choice of a prior influences results. Moreover, publications should clearly explain why a particular prior was selected.

        My blog points out that it is not useful to use default or objective priors as if it would not be necessary to think about the effect of the default prior on results.

        Your analogy of default priors and training wheels is interesting. It might make sense to use training wheels while you are still learning to ride a bike, but it doesn’t make sense to use training wheels in the Tour de France.

        Wagenmakers et al. (2011) used the default setting to critique empirical studies in a top journal and in the process advocate that we should all use “training wheel Bayesian statistics” when we analyze our data. My post shows that conducting our analyses in this way would lead to false conclusions.

        I think the next step for advocates of Bayesian statistics is to develop clear guidelines for how priors should be selected. When is a Cauchy distribution appropriate and when is it not appropriate? When should the prior be a full or a half distribution? When is a uniform distribution better than a normal or Cauchy distribution? How does the choice of a prior influence sensitivity and type-I and type-II error rates? To my knowledge these questions have not been addressed.

        One advantage of p-values is that these questions do not arise because everybody gets the same p-value for the same set of data.

        Sincerely, Dr. R

    2. I agree that this post is not anti-Bayesian. In fact, by recognizing the importance of the prior, you are moving closer to what I think of as Bayesianism. That said, there is a focus on objectivity in it that I think is a little misguided. We’ve known for a while that inference is necessarily subjective, and the fact that you need a prior to do inference is simply the formalization of that.

      Some people used a particular prior in their analysis of ESP data. You find this prior unappealing. So did Bem, Utts, and Johnson. This is perfectly fine — inference is conditional on a prior, and priors are subjective. You can have yours, and if your prior is that the effect is going to be small, you are more likely to conclude that there is a small effect if the effect is indeed small.

      I’m not objecting to any of these statements, but your post does read like a lot of huffing and lamenting of something that is simply the case whether one is a Bayesian or not. Inference is subjective. Even the pitiful p value has a subjective element; two researchers can collect the exact same data set and (correctly) find different p values. Lindley gives a canonical example of such a situation (http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9639.1993.tb00252.x/abstract), but two researchers using the same data and coming to different conclusions is so common it is almost the rule rather than the exception (e.g., shall we assume homoskedasticity or not? Approximate Likert scales with a normal or not? Do we use the RT data or do the analysis only on accuracy? etc.)

  4. Anyway this is both the beauty and a curse of Bayesian inference! You need to justify your prior belief to evaluate the statistical evidence. It is perfectly fine to say that you don’t believe the default prior of mu=0, sigma=0.7 is adequate. You are free to propose an alternative, for the given experimental data or even for “default” situations.

    All that this does is give a formal statistical framework for the discussions about the appropriateness of empirical evidence that are central to all science. Instead of just saying that a p<0.05 is unlikely due to confounding reason X, you have the statistical tools to directly question whether the evidence is consistent with a particular hypothesis as specified by the prior.

    I think this is a beautiful feature of the approach. Some regard this as a curse but it isn't in most situations. However, for those cases where there is a clear lack of consensus you can use my method instead ;).

    “I also think you are confusing Bayes-Factors with the a priori hypothesis that an effect should be observed.”

    I don’t know why you would take that from what I wrote. My point is that if you have a specific hypothesis for a tiny effect size, your BF, a measure of the statistical evidence, will show that the data don’t support that hypothesis when the effect size is larger than tiny (even if it is small). In Bem’s case, I think a specific prior of d=0.2 is indefensibly large.

    1. Dear Sam,
      this is not so much a reply to your post as a further elaboration on what a prior Cauchy(0,1) really means. It allows for 50% of effect sizes to be greater than 1 in absolute value. It is easy to compute the predicted t-values for these effect sizes with N = 100 in a one-sample t-test. These t-values are greater than 10. Now I don’t know what literature you are reading, but in my experience t-values are in the 2 to 5 range. Moreover, t = 2.8 corresponds to approximately 80% power. t-values of 10 correspond to 99.999% power. I am fairly confident that most personality and social psychologists would not consider this a reasonable prior.

      Bayes-Factors might be useful if we would have a principled account of setting priors. My main point is that there is no such approach and that it is not helpful to avoid this problem by introducing a default prior.

      1. As I said, it’s not my place to argue for or against this particular default prior. I assume this is the very reason why the default scale factor was changed from 1 to 0.707 because the former may be too strong for unrealistically large effect sizes.

        The new scale factor seems to make sense to me. This is what I used for comparisons with my BSE procedure and they produce qualitatively fairly similar results. This agreement suggests that when you don’t have a precise prior for the effect size, the default seems to produce reasonable results. Unless you have a sufficiently large sample size, a tiny effect size is support for the null hypothesis.

        You can get away from that by setting a precise prior but greater precision requires stronger justification. So this is essentially your choice. You can say that Bem’s data have a prior N(0.2,0.1) but then I would expect a thorough theoretical argument for why I should expect sizes like that and not d=0.001 which I think is closer to what I would expect if his theory were true.

        Alternatively, you can define a fairly imprecise prior that covers many effect sizes. The default is a way to do just that but you are welcome to choose another one. I do however doubt that a uniform prior like you suggest is very realistic.

  5. “Wagenmakers et al. (2011) used the default setting to critique empirical studies in a top journal and in the process advocate that we should all use “training wheel Bayesian statistics” when we analyze our data.”

    You missed the point. The “training wheel” metaphor doesn’t refer to the statistics – the results aren’t any less credible, if the priors are acceptable to the analyst. The training wheel metaphor refers to the method of *choosing* the prior. As we’ve written, there’s an inherent tradeoff in all analytical techniques between ease of use and interpretability. ALWAYS. Bayes factors are not any different in this regard. We’ve made a particular choice with respect to this, but we’ve been clear about this all along. Your argument rests on a straw man version of our methods, as I’ve pointed out before. Snake oil indeed.

    If you’re unhappy with EJ’s conclusions, that’s fine, but you can’t at once critique the default prior as being thoughtless and then critique EJ for having chosen it. It’s not like EJ doesn’t know how to choose a different prior, if he liked. He knows how to compute other relevant Bayes factors, but he made a choice. Also, Sam is telling you that he thinks that a particular scaling is reasonable, and you’re telling him it’s not. Who’s the one trying to keep people from coming to their own conclusions?

    Finally, I like Dienes’s work, but are you seriously suggesting that you don’t have an intellectual responsibility to read and understand the relevant literature on this topic before making such sweeping claims? This is a topic with many, many decades of history and many threads of argumentation, and I’m a bit embarrassed for you that you think you’ve met your obligation as a scholar by reading just a few papers by psychologists.

  6. Dear Dr. Morey,

    I read more than you think. For example, I just read your comment “The humble Bayesian: Model checking from a fully Bayesian perspective”

    drsmorey.org/bibtex/upload/Morey:etal:2012.pdf

    In the article, you describe a Bayesian who uncritically accepts the results of a Bayesian analysis as an overconfident Bayesian.

    Quote: “To our mind, the overconfident Bayesian is an extreme point in the spectrum of Bayesians. We ourselves routinely perform model checks in our own work … and we believe that most practicing Bayesian statisticians worry about the appropriateness of their models and hence engage in model checking.”

    The problem is that a robustness check will often show prior distributions that lead to different conclusions; at least some conclusive results become inconclusive (BF ~ 1).

    My main objection to the use of Bayes-Factors is that in small samples Bayes-Factors will never produce robust results that are the same for a wide range of priors and I think it is important for reviewers and editors to be aware of this fact.

    As sample sizes increase (higher sensitivity), the prior becomes less important. As I showed in my table, even the Cauchy(1) prior shows strong evidence for ESP in a meta-analysis. I also found it interesting to read that your Bayesian meta-analysis favored the presence of ESP by a factor of 6 billion to 1.

    http://www.ncbi.nlm.nih.gov/pubmed/23294092

    With a high signal-to-noise ratio (t or z ~ 5; 5 sigma rule), BF will agree with p-values even if BF are based on bad priors (Cauchy(1) for small effects).

    http://www.physics.org/article-questions.asp?id=103
    [The 5 sigma rule is used in particle physics]
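
    The dependence on the prior can be sketched numerically. This is only a rough illustration, not the default Bayes factor from the papers discussed here: it assumes a normal approximation in which the observed effect size is distributed as Normal(delta, 1/sqrt(N)), and the helper name `approx_bf10` is made up for this example. It compares H0: delta = 0 against H1: delta ~ Cauchy(0, scale) for z = 2 and z = 5 with N = 100.

    ```python
    import math

    def normal_pdf(x, mean, sd):
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

    def cauchy_pdf(x, scale):
        return 1.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

    def approx_bf10(d_obs, n, scale, lo=-10.0, hi=10.0, steps=20000):
        """Approximate BF10 for a one-sample design (normal approximation).

        H0: delta = 0; H1: delta ~ Cauchy(0, scale).
        Assumption: d_obs ~ Normal(delta, 1/sqrt(n)).
        """
        se = 1.0 / math.sqrt(n)
        h = (hi - lo) / steps
        # Marginal likelihood under H1: trapezoid rule over a grid of delta values.
        total = 0.0
        for i in range(steps + 1):
            delta = lo + i * h
            weight = 0.5 if i in (0, steps) else 1.0
            total += weight * normal_pdf(d_obs, delta, se) * cauchy_pdf(delta, scale)
        # Divide by the likelihood of the observed effect under H0.
        return (total * h) / normal_pdf(d_obs, 0.0, se)

    for d_obs in (0.2, 0.5):  # z = 2 and z = 5 when n = 100
        for scale in (1.0, 0.707, 0.1):
            print(d_obs, scale, approx_bf10(d_obs, 100, scale))
    ```

    Under these assumptions, the z = 2 result falls on different sides of BF = 1 depending on the scale, whereas the z = 5 result favors an effect by a wide margin for all three scales.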

    In conclusion, I think empirical Bayesians need to think more about robustness before they conduct a study and plan sample sizes so that Bayes-Factors can produce robust results. Otherwise, humble Bayesians will have to conclude that their results are inconclusive and we do not need more inconclusive results.

    At a minimum, my post shows how inadequate a sample size of N = 100 would be with a Cauchy(1) prior to demonstrate small effects. And this is for a one-sample t-test. For a standard between-subject study in social psychology, N = 400 would produce the same standard error and the same low sensitivity.
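
    The arithmetic behind the N = 100 versus N = 400 comparison can be checked with a small sketch. It assumes standardized scores (SD = 1) and two equal groups of 200 in the between-subject design; the function names are illustrative only.

    ```python
    import math

    def se_one_sample(n):
        """Standard error of a standardized mean (SD = 1) with n observations."""
        return 1.0 / math.sqrt(n)

    def se_two_groups(n_total):
        """Standard error of a standardized mean difference with two equal groups."""
        n_per_group = n_total / 2
        return math.sqrt(1.0 / n_per_group + 1.0 / n_per_group)

    print(se_one_sample(100))  # one-sample design, N = 100
    print(se_two_groups(400))  # between-subject design, N = 400 (200 per group)
    ```

    Both designs give a standard error of about 0.1, so a between-subject study needs four times as many participants to match the sensitivity of the one-sample test.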

    So, maybe we can agree that empirical scientists need to pay attention to the planning of sample sizes for their studies to produce robust results that can be replicated by other scientists no matter how they analyze their data.

    Sincerely, Dr. R

    1. This is going to be my last comment here. I have more important battles to fight, such as preventing Brexit… 😛

      But to wrap up my involvement, I still think you are missing the point. First of all, as has already been pointed out repeatedly, nobody proposes using a Cauchy(1) prior as the scaling factor is now 0.707.

      Secondly, such a default does a reasonable job at describing uncertain effects. It places more expectation on small effects while still permitting large ones. If your alternative hypothesis is vague, which seems to be the case based on what most people in our field are saying, then this is an appropriate choice to make. However, if you believe that a small effect in a narrow range is likely, you are free to use such a prior instead – just don’t be surprised if people start arguing with such an overconfident prediction. Alternatively, if you agree that the hypothesis is vague, you are free to use other imprecise priors, like the ones suggested by Zoltan Dienes or even a uniform one. I think uniform priors are too vague for most purposes, however.

      What this whole discussion really reminds me of is precisely the problem I have with Bem’s results and psi research in general: We should think more about our hypotheses and find the clearest prediction they can make. A hypothesis that simply predicts the existence of an effect is not nearly as useful as one that predicts what the effect will be. We need more of that and if there is one thing I take home from all this it is that Bayesian model comparison is an excellent way to encourage this.

      1. As long as everybody understands that a Cauchy distribution with a .707 scaling factor is a prior with 50% of the distribution on the right side of d = .707 and that this prior thinks it is as reasonable to expect effects greater than .7 as it is to expect effects in the range from 0 to .7, you can use this default prior. However, meta-analyses of social psychological studies put the typical effect size in the range from 0 to .8. It would make more sense to choose a prior that is sensitive to the typical range of effect sizes in a research area.
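
        How much prior mass sits beyond a given effect size can be checked with the closed-form Cauchy CDF. This is a minimal sketch; the function names are illustrative. For a two-sided Cauchy(0, scale) prior, half of the mass lies on absolute effect sizes larger than the scale. If the prior is restricted to positive effects, half of that one-sided mass lies above d = scale.

        ```python
        import math

        def cauchy_cdf(x, scale):
            """CDF of a Cauchy distribution centered at zero."""
            return 0.5 + math.atan(x / scale) / math.pi

        def mass_beyond(cutoff, scale):
            """Prior mass on |delta| > cutoff under a Cauchy(0, scale) prior."""
            return 2.0 * (1.0 - cauchy_cdf(cutoff, scale))

        scale = 0.707
        print(mass_beyond(0.707, scale))  # mass beyond the scale factor itself
        print(mass_beyond(0.8, scale))    # mass beyond the typical range of 0 to .8
        ```

        The first value is one half by construction, and the second shows that only slightly less than half of the prior mass falls on effect sizes larger than .8 in absolute value.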

        In general, I think everybody needs to think more about effect sizes and sample sizes prior to conducting a study.
