Category Archives: Bayesian Statistics

The Fallacy of Placing Confidence in Bayesian Salvation

Richard D. Morey, Rink Hoekstra, Jeffrey N. Rouder, Michael D. Lee, and
Eric-Jan Wagenmakers (2016), henceforce psycho-Baysians, have a clear goal.  They want psychologists to change the way they analyze their data.

Although this goal motivates the flood of method articles by this group, the most direct attack on other statistical approaches is made in the article “The fallacy of placing confidence in confidence intervals.”   In this article, the authors claim that everybody, including textbook writers in statistics, misunderstood Neyman’s classic article on interval estimation.   What are the prior odds that after 80 years, a group of psychologists discover a fundamental flaw in the interpretation of confidence intervals (H1) versus a few psychologists are either unable or unwilling to understand Neyman’s article?

Underlying this quest for change in statistical practices lies the ultimate attribution error that Fisher’s p-values or Neyman-Pearsons significance testing with or without confidence intervals are responsible for the replication crisis in psychology (Wagenmakers et al., 2011).

This is an error because numerous articles have argued and demonstrated that questionable research practices undermine the credibility of the psychological literature.  The unprincipled use of p-values (undisclosed multiple testing), also called p-hacking, means that many statistically significant results have inflated error rates and the long-run probabilities of false positives are not 5%, as stated in each article, but could be 100% (Rosenthal, 1979; Sterling, 1959; Simmons, Nelson, & Simonsohn, 2011).

You will not find a single article by Psycho-Bayesians that will acknowledge the contribution of unprincipled use of p-values to the replication crisis. The reason is that they want to use the replication crisis as a vehicle to sell Bayesian statistics.

It is hard to believe that classic statistics are fundamentally flawed and misunderstood because they are used in industry to  produce SmartPhones and other technology that requires tight error control in mass production of technology. Nevertheless, this article claims that everybody misunderstood Neyman’s seminal article on confidence intervals.

The authors claim that Neyman wanted us to compute confidence intervals only before we collect data, but warned readers that confidence intervals provide no useful information after the data are collected.

Post-data assessments of probability have never been an advertised feature of CI theory. Neyman, for instance, said “Consider now the case when a sample…is already drawn and the [confidence interval] given…Can we say that in this particular case the probability of the true value of [the parameter] falling between [the limits] is equal to [X%]? The answer is obviously in the negative”

This is utter nonsense. Of course, Neyman was asking us to interpret confidence intervals after we collected data because we need a sample to compute confidence interval. It is hard to believe that this could have passed peer-review in a statistics journal and it is not clear who was qualified to review this paper for Psychonomic Bullshit Review.

The way the psycho-statisticians use Neyman’s quote is unscientific because they omit the context and the following statements.  In fact, Neyman was arguing against Bayesian attempts of estimate probabilities that can be applied to a single event.

It is important to notice that for this conclusion to be true, it is not necessary that the problem of estimation should be the same in all the cases. For instance, during a period of time the statistician may deal with a thousand problems of estimation and in each the parameter M  to be estimated and the probability law of the X’s may be different. As far as in each case the functions L and U are properly calculated and correspond to the same value of alpha, his steps (a), (b), and (c), though different in details of sampling and arithmetic, will have this in common—the probability of their resulting in a correct statement will be the same, alpha. Hence the frequency of actually correct statements will approach alpha. It will be noticed that in the above description the probability statements refer to the problems of estimation with which the statistician will be concerned in the future. In fact, I have repeatedly stated that the frequency of correct results tend to alpha.*

Consider now the case when a sample, S, is already drawn and the calculations have given, say, L = 1 and U = 2. Can we say that in this particular case the probability of the true value of M falling between 1 and 2 is equal to alpha? The answer is obviously in the negative.  

The parameter M is an unknown constant and no probability statement concerning its value may be made, that is except for the hypothetical and trivial ones P{1 < M < 2}) = 1 if 1 < M < 2) or 0 if either M < 1 or 2 < M) ,  which we have decided not to consider. 

The full quote makes it clear that Neyman is considering the problem of quantifying the probability that a population parameter is in a specific interval and dismisses it as trivial because it doesn’t solve the estimation problem.  We don’t even need observe data and compute a confidence interval.  The statement that a specific unknown number is between two other numbers (1 and 2) or not is either TRUE (P = 1) or FALSE (P = 0).  To imply that this trivial observation leads to the conclusion that we cannot make post-data  inferences based on confidence intervals is ridiculous.

Neyman continues.

The theoretical statistician [constructing a confidence interval] may be compared with the organizer of a game of chance in which the gambler has a certain range of possibilities to choose from while, whatever he actually chooses, the probability of his winning and thus the probability of the bank losing has permanently the same value, 1 – alpha. The choice of the gambler on what to bet, which is beyond the control of the bank, corresponds to the uncontrolled possibilities of M having this or that value. The case in which the bank wins the game corresponds to the correct statement of the actual value of M. In both cases the frequency of “ successes ” in a long series of future “ games ” is approximately known. On the other hand, if the owner of the bank, say, in the case of roulette, knows that in a particular game the ball has stopped at the sector No. 1, this information does not help him in any way to guess how the gamblers have betted. Similarly, once the boundaries of the interval are drawn and the values of L and U determined, the calculus of probability adopted here is helpless to provide answer to the question of what is the true value of M.

What Neyman was saying is that population parameters are unknowable and remain unknown even after researchers compute a confidence interval.  Moreover, the construction of a confidence interval doesn’t allow us to quantify the probability that an unknown value is within the constructed interval. This probability remains unspecified. Nevertheless, we can use the property of the long-run success rate of the method to place confidence in the belief that the unknown parameter is within the interval.  This is common sense. If we place bets in roulette or other random events, we rely on long-run frequencies of winnings to calculate our odds of winning in a specific game.

It is absurd to suggest that Neyman himself argued that confidence intervals provide no useful information after data are collected because the computation of a confidence interval requires a sample of data.  That is, while the width of a confidence interval can be determined a priori before data collection (e.g. in precision planning and power calculations),  the actual confidence interval can only be computed based on actual data because the sample statistic determines the location of the confidence interval.

Readers of this blog may face a dilemma. Why should they place confidence in another psycho-statistician?   The probability that I am right is 1, if I am right and 0 if I am wrong, but this doesn’t help readers to adjust their beliefs in confidence intervals.

The good news is that they can use prior information. Neyman is widely regarded as one of the most influential figures in statistics.  His methods are taught in hundreds of text books, and statistical software programs compute confidence intervals. Major advances in statistics have been new ways to compute confidence intervals for complex statistical problems (e.g., confidence intervals for standardized coefficients in structural equation models; MPLUS; Muthen & Muthen).  What are the a priori chances that generations of statisticians misinterpreted Neyman and committed the fallacy of interpreting confidence intervals after data are obtained?

However, if readers need more evidence of psycho-statisticians deceptive practices, it is important to point out that they omitted Neyman’s criticism of their favored approach, namely Bayesian estimation.

The fallacy article gives the impression that Neyman’s (1936) approach to estimation is outdated and should be replaced with more modern, superior approaches like Bayesian credibility intervals.  For example, they cite Jeffrey’s (1961) theory of probability, which gives the impression that Jeffrey’s work followed Neyman’s work. However, an accurate representation of Neyman’s work reveals that Jeffrey’s work preceded Neyman’s work and that Neyman discussed some of the problems with Jeffrey’s approach in great detail.  Neyman’s critical article was even “communicated” by Jeffreys (these were different times where scientists had open conflict with honor and integrity and actually engaged in scientific debates).

Jeffrey.png

Given that Jeffrey’s approach was published just one year before Neyman’s (1936) article, Neyman’s article probably also offers the first thorough assessment of Jeffrey’s approach. Neyman first gives a thorough account of Jeffrey’s approach (those were the days).

Neyman.png

Neyman then offers his critique of Jeffrey’s approach.

It is known that, as far as we work with the conception of probability as adopted in
this paper, the above theoretically perfect solution may be applied in practice only
in quite exceptional cases, and this for two reasons. 

Importantly, he does not challenge the theory.  He only points out that the theory is not practical because it requires knowledge that is often not available.  That is, to estimate the probability that an unknown parameter is within a specific interval, we need to make prior assumptions about unknown parameters.   This is the problem that has plagued subjective Bayesians approaches.

Neyman then discusses Jeffrey’s approach to solving this problem.  I am not claiming that I am a statistical expert to decide whether Neyman or Jeffrey’s are right. Even statisticians have been unable to resolve these issues and I believe the consensus is that Bayesian credibility intervals and Neyman’s confidence intervals are both mathematically viable approaches to interval estimation with different strengths and weaknesses.

Neyman2.png

I am only trying to point out to unassuming readers of the fallacy article that both approaches are as old as statistics and that the presentation of the issue in this article is biased and violates my personal, and probably idealistic, standards of scientific integrity.   Using a selective quote by Neyman to dismiss confidence intervals and then to omit Neyman’s critic of Bayesian credibility intervals is deceptive and shows an unwillingness or inability to engage in open scientific examination of scientific arguments for and against different estimation methods.

It is sad and ironic that Wagenmakers’ efforts to convert psychologists into Bayesian statisticians is similar to Bem’s (2011) attempt to convert psychologists into believers in parapsychology; or at least in parapsychology as a respectable science. While Bem fudged data to show false empirical evidence, Wagenmakers is misrepresenting the way classic statistics works and ignoring the key problem of Bayesian statistics, namely that Bayesian inferences are contingent on prior assumptions that can be gamed to show what a researcher wants to show.  Wagenmaker used this flexibility in Bayesian statistics to suggest that Bem (2011) presented weak evidence for extra-sensory perception.  However, a rebuttle by Bem showed that Bayesian statistics also showed support for extra-sensory perception with different and more reasonable priors.  Thus, Wagenmakers et al. (2011) were simply wrong to suggest that Bayesian methods would have prevented Bem from providing strong evidence for an incredible phenomenon.

The problem with Bem’s article is not the way he “analyzed” the data. The problem is that Bem violated basic principles of science that are required to draw valid statistical inferences from data.  It would be a miracle if Bayesian methods that assume unbiased data could correct for data falsification.   The problem with Bem’s data has been revealed using statistical tools for the detection of bias (Francis, 2012; Schimmack, 2012, 2015, 2118). There has been no rebuttal from Bem and he admits to the use of practices that invalidate the published p-values.  So, the problem is not the use of p-values, confidence intervals, or Bayesian statistics.  The problem is abuse of statistical methods. There are few cases of abuse of Bayesian methods simply because they are used rarely. However, Bayesian statistics can be gamed without data fudging by specifying convenient priors and failing to inform readers about the effect of priors on results (Gronau et al., 2017).

In conclusion, it is not a fallacy to interpret confidence intervals as a method for interval estimation of unknown parameter estimates. It would be a fallacy to cite Morey et al.’s article as a valid criticism of confidence intervals.  This does not mean that Bayesian credibility intervals are bad or could not be better than confidence intervals. It only means that this article is so blatantly biased and dogmatic that it does not add to the understanding of Neyman’s or Jeffrey’s approach to interval estimation.

P.S.  More discussion of the article can be found on Gelman’s blog.

Andrew Gelman himself comments:

My current favorite (hypothetical) example is an epidemiology study of some small effect where the point estimate of the odds ratio is 3.0 with a 95% conf interval of [1.1, 8.2]. As a 95% confidence interval, this is fine (assuming the underlying assumptions regarding sampling, causal identification, etc. are valid). But if you slap on a flat prior you get a Bayes 95% posterior interval of [1.1, 8.2] which will not in general make sense, because real-world odds ratios are much more likely to be near 1.1 than to be near 8.2. In a practical sense, the uniform prior is causing big problems by introducing the possibility of these high values that are not realistic.

I have to admit some Schadenfreude when I see one Bayesian attacking another Bayesian for the use of an ill-informed prior.  While Bayesians are still fighting over the right priors, practical researchers may be better off to use statistical methods that do not require priors, like, hm, confidence intervals?

P.P.S.  Science requires trust.  At some point, we cannot check all assumptions. I trust Neyman, Cohen, and Muthen and Muthen’s confidence intervals in MPLUS.

 

 

 

 

 

 

 

 

 

 

Can the Bayesian Mixture Model Estimate the Percentage of False Positive Results in Psychology Journals?

A method revolution is underway in psychological science.  In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing  (Schimmack, 2012; Wagenmakers et al., 2011).  Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported.  This selective publishing of results dramatically increases the risk of a false positive result from the nominal level of 5% that is typically claimed in publications that report significance tests  (Sterling, 1959).

Although experimental social psychologists think that these practices are defensible, no statistician would agree with them.  In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and claims about statistical significance are meaningless.  Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable until today (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).

For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years.  It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.

First, some methods focus on effect sizes, (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excessive Significance; Incredibility Index; R-Index, Pcurve2.1; Pcurve4.06; Zcurve).

Another important distinction is between methods that assume a fixed parameter and methods that assume heterogeneity.   If all studies have a common effect size or the same strength of evidence,  it is relatively easy to demonstrate bias and to correct for bias (Pcurve2.1; Puniform; TES).  However, heterogeneity in effect sizes or sampling error produces challenges.  Relatively few methods have been developed for this challenging, yet realistic scenario.  For example, Ioannidis and Trikalonis (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies, while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes.  In contrast, I developed the Incredibilty-Index (also called Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).

Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance.  I first published this method on my blog in June 2015, when I called it post-hoc power curves.   These days, the term Zcurve is used more often to refer to this method.  I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.

In September, 2015 I posted replicability rankings of social psychology departments using this method. the post generated a lot of discussions and a question about the method.  Although the details were still unpublished, I described the main approach of the method.  To deal with heterogeneity, the method uses a mixture model.

EJ.Mixture.png

In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve & a Maximul Likelihood Method).  In this article, the mixture model, Zcurve, outperformed other methods, including a maximum-likelihood method developed by Jerry Brunner. The manuscript was rejected from Psychological Methods.

In 2017, Gronau, Duizer, Bakker, and Eric-Jan Wagenmakers published an article titled “A Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0”  in the Journal of Experimental Psychology: General.  The article did not mention z-curve, presumably because it was not published in a peer-reviewed journal.

Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve.  This blog post examines the similarities and differences between the two mixture models, it shows that BMM fails to provide useful estimates with simulations and social priming studies, and it explains why BMM fails. It also shows that Zcurve can provide useful information about replicability of social priming studies, while the BMM estimates are uninformative.

Aims

The Bayesian Mixture Model (BMM) and Zcurve have different aims.  BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).

FDR = False Positives / (False Positives + True Positives)

Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.

Average unconditional power is simply the percentage of false positives times alpha plus the average conditional power of true positive results (Sterling et al., 1995).

Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)

Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.

Approach 

BMM and zcurve use p-values as input.  That is, they ignore the actual sampling distribution that was used to test statistical significance.  The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.

The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution.  However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.

Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2).  Absolute z-scores are used because F-tests or two-sided t-tests do not have a sign and a test score of 0 corresponds to a probability of 1.  Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.

BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores  (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 would be positive, and p-values less than .5 are assigned negative z-scores.  However, because only significant p-values are selected, all z-scores are negative in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).

The non-centrality parameter (i.e., the true parameter that generates the sampling dstribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.

Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous.  Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests.  Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.

BMM uses a single normal distribution with varying standard deviation.  A wider distribution is needed to predict large observed z-scores.

The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means, but does not interpret the weight assigned to a mean of zero as an estimate of false positives (Schimmack & Brunner, 2018).  The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model.  Importantly, this is not relevant for the goal of zurve to estimate average power because the weighted average of the components of the model is a good estimate of the average true power in the data-generating model, even if the weights do not match the weights of the data-generating model.

For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power or 50% of studies with 20% power and 50% studies with 80% power. If all of these studies were exactly replicated, they are expected to produce 50% significant results.

BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results.  It does not estimate the average power of true positives or average unconditional power.

Given my simulation studies with zcruve, I was surprised that BBM solved a problem that weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights.  The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.

A Bayesian Approach

Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not.  Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior.  That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between.  A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.

For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal.  They justify this prior as reasonable based on extensive simulation studies.

Most important, however, is the parameter for the standard deviation.  The prior for this parameter was a uniform distribution with values between 0 and 1.   The authors argue that larger values would produce too many p-values close to 1.

“implausible prediction that p values near 1 are more common under H1 than under H0” (p 1226). 

But why would this be implausible.  If there are very few false positives and many true positives with low power, most p-values close to 1 would be the result of  true positives (H1) than of false positives (H0).

Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).

Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases.  Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.

Simulation Studies

The authors report several simulation studies that suggest BMM estimates are close and robust across many scenarios.

The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances”  (p. 1226).

The first set of simulations uses a sample size of N = 500 (n = 250 per condition).  Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15.  The lowest values are .35, .20, and .05.  With N = 500, these values correspond to  97%, 61%, and 8% power respectively.

d = c(.35,.20,.05); 1-pt(qt(.975,500-2),500-2,d*sqrt(500)/2)

The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).

Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power >  97%; median true power = 99.9%).

Sim1.png

The graph shows a bimodal distribution with clear evidence of truncation (the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with the distribution of significant z-scores.  The sharp drop from z = 1.96 to 3 shows that there are many studies with non-significant results are missing.  The estimate of unconditional power (called replicability = expected success rate in exact replication studies) is 53%.  This estimate is consistent with the simulation of 50% studies with a probability of success of 5% and 50% of studies with a success probability of 99.9% (.5 * .05 + .5 * .999 = 52.5).

The values below the x-axis show average power for  specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% power without selection for significance. Due to selection for significance, the average power is only 9%. Thus the observed power of 50% provides a much inflated estimate of replicability.  A z-score of 3.5 is needed to achieve significance with p < .05, although the nominal p-value for z = 3.5 is p = .0002.  Thus, selection for significance renders nominal p-values meaningless.

The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution.  While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).

Figure 2 shows the results for the simulation with d = .30.  The results are very similar because d = .30 still gives 92% power.  As a result, replicabilty is nearly as high as in the previous example.

Sim2.png

 

The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.

Sim3.png

It is no longer possible to recognize two sampling distributions and average power increases rather gradually from 18% for z = 2, to 35% for z = 3.5.  Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives.   This is surprising because it is easy to generate a similar Zcurve without false positives.

Figure 4 shows a simulation with a mixture distribution but the false positives (d = 0) have been replaced by true positives (d = .06), while the mean for the heterogeneous studies was reduced to from d = .15 to d = .11.  These values were chosen to produce the same average unconditional power (replicability) of 23%.

Sim4.png

I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at https://qfgronau.shinyapps.io/bmmsp/ .  I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values.  The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%.   These results suggest that BMM has problems distinguishing between false positives and true positives with low power.   BMM appears to be able to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these z-scores are due to studies with low power, BMM cannot distinguish between false positives and true positives with low power. As a result, the credibility interval is wide and the point estimates are misleading.

BMM.output.png

With k = 1,500 the influence of the priors is negligible.  However, with smaller sample sizes, the priors do have an influence on results and may lead to overestimation and false credibility intervals.  A simulation with k = 200, produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis by changing model parameters. The most crucial parameter is the standard deviation.  Increasing the standard deviation to 2, increases the upper limit of the 95%CI to 75%.  Thus, without good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this Zcurve.

BMM.k200.png

 

For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.

BMM.k100

In conclusion, these simulation results show that BMM promises more than it can deliver.  It is very difficult to distinguish p-values sampled from H0 (mean z = 0) and those sampled from H1 with weak evidence (e.g., mean z = 0.1).

In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).

The procedure does come with three important caveats.

First, estimating the parameters of the mixture model is an inherently difficult statistical problem. ..  and consequently a relatively large number of p values are required for the mixture model to provide informative results. 

A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.

The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.

Practical Implications

Despite the limitations of BMM, the authors applied BMM to several real data.  The most interesting application selected focal hypothesis tests from social priming studies.  Social priming studies have come under attack as a research area with sloppy research methods as well as fraud (Stapel).  Bias tests show clear evidence that published results were obtained with questionable scientific practices (Schimmack, 2017a, 2017b).

The authors analyzed 159 social priming p-values.  The 95%CI for the percentage of false positives ranged from 48% to 88%.  When the standard deviation was increased to 2, the 95%CI increased slightly to 56% to 91%.  However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%.  These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.

What inferences can be drawn from these results about the social priming literature?  A false positive percentage of 10% doesn’t sound so bad.  A false positive percentage of 88% sound terrible. A priori, the percentage is somewhere between 0 and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large.  Proponents will focus on the 10% estimate and critics will use the 88% estimate.  The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.

In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.

A Zcurve of Social Priming Studies (k = 159)

It is instructive to compare the BMM results to a Zcurve analysis of the same data.

SocialPriming.png

The zcurve graph shows a steep drop and very few z-scores greater than 4, which tend to have a high success rate in actual replication attempts (OSC, 2015).  The average estimated replicability is only 27%.  This is consistent with the more limited analysis of social priming studies in Kahneman’ s Thinking Fast and Slow book (Schimmack, 2017a).

More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%.  Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.

These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6, 10, or 15%.  Good research provides evidence that can be replicated at least with a reasonable degree of statistical power.  Tversky and Kahneman (1971) suggested a minimum of 50% and most social priming studies fail to meet this minimal standard and hardly any studies seem to have been planned with the typical standard of 80% power.

The power estimates below the x-axis show that a nomimal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% risk of a false positive.

The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence.  BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence.  It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur.  However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.

Conclusion

Zcurve and BMM are both mixture models, but they have different statistical approaches, they have different aims.  They also differ in their ability to provide useful estimates.  Zcurve is designed to estimate average unconditional power to obtain significant results without distinguishing between true positives and false positives.  False positives reduce average power, just like low powered studies, and in reality it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.

The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected.  However, this is a logical fallacy.  it is possible to reject an implausible effect sizes (e.g., the nil-hypothesis is probably false if the 95%CI ranges from .8 to  1.2], but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.

The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero.  The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009).  A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data.  However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1.  Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.

To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives.  A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power.  Ideally, researchers should conduct studies with 80% power and honest reporting of statistics and failed replications should control the false discovery rate.  The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science.  As a result, decades of research are worthless and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.

 

 

 

Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on reusing of towels.

How Can We Interpret Inferences with Bayesian Hypothesis Tests?

SUMMARY

In this blog post I show how it is possible to translate the results of a Bayesian Hypothesis Test into an equivalent frequentist statistical test that follows Neyman Pearsons approach of hypthesis testing where hypotheses are specified as ranges of effect sizes (critical regions) and observed effect sizes are used to make inferences about population effect sizes with long-run error rates.

INTRODUCTION

The blog post also explains why it is misleading to consider Bayes Factors that favor the null-hypothesis (d = 0) over an alternative hypothesis (e.g., Jeffrey’s prior) as evidence for the absence of an effect.  This conclusion is only warranted with infinite sample sizes, but with finite sample sizes, especially small sample sizes that are typical in psychology,  Bayes Factors in favor of H0 can only be interpreted as evidence that the population effect size is close to zero, but not as evidence that the population effect size is exactly zero.  How close the effect sizes that are consistent with H0 are depends on sample size and the criterion value that is used to interpret the results of a study as sufficient evidence for H0.

One problem with Bayes Factors is that like p-values, they are a continuous measure of likelihoods, just like p-values are a continuous measure of probabilities, and the observed value is not sufficient to justify an inference or interpretation of the data. This is why psychologists moved from Fisher’s approach to Neyman Pearson’s approach that compared an observed p-value to a specified (by convention or pre-registertation) criterion value. For p-values this is alpha. If p < alpha, we reject H0:d = 0 in favor of H1, there was a (positive or negative) effect.

Most researchers interpret Bayes Factors relative to some criterion value (e.g., BF > 3 or BF > 5 or BF > 10). These criterion values are just as arbitrary as the .05 criterion for p-values and the only justification for these values that I have seen is that (Jeffrey who invented Bayes Factors said so). There is nothing wrong with a conventional criterion value, even if Bayesian’s think there is something wrong with p < .05, but use BF > 3 in just the same way, but it is important to understand the implications of using a particular criterion value for an inference. In NHST the criterion value has a clear meaning. It means that in the long-run, the rate of false inferences (deciding in favor of H1 when H1 is false) will not be higher than the criterion value.  With alpha = .05 as a conventional criterion, a research community decided that it is ok to have a maximum 5% error rate.  Unlike, p-values, criterion values for Bayes-Factors provide no information about error rates.  The best way to understand what a Bayes-Factor of 3 means is that we can assume that H0 and H1 are equally probable before we conduct a study and a Bayes Factor of 3 in favor of H0 makes it 3 times more likely that H0 is true than that H1 is true. If we were gambling on results and the truth were known, we would increase our winning odds from 50:50 to 75:25.   With a Bayes-Factor of 5, the winning odds increase to 80:20.

HYPOTHESIS TESTING VERSUS EFFECT SIZE ESTIMATION

p-values and BF also share another shortcoming. Namely they provide information about the data given a hypothesis or two hypotheses, but they do not provide information about the data. We all know that we should not report results as “X influenced Y, p < .05”. The reason is that this statement provides no information about the effect size.  The effect size could be tiny, d = 0.02, small, d = .20, or larger, d = .80.  Thus, it is now required to provide some information about raw or standardized effect sizes and ideally also about the amount of  raw or standardized sampling error. For example, standardized effect sizes could be reported as the standardized mean difference and sampling error (d = .3, se = .15) or as a  confidence interval, e.g., (d = .3, 95% CI = 0 to .6). This is important information about the actual data, but it does not provide information about hypothesis tests. Thus, if the results of a study are used to test hypothesis, information about effect sizes and sampling errors has to be evaluated with specified criterion values that can be used to examine which hypothesis is consistent with an observed effect size.

RELATING HYPOTHESIS TESTS TO EFFECT SIZE ESTIMATION

In NHST, it is easy to see how p-values are related to effect size estimation.  A confidence interval around the observed effect size is constructed by multiplying the amount of sampling error by  a factor that is defined by alpha.  The 95% confidence interval covers all values around the observed effect size, except the most extreme 5% values in the tails of the sampling distribution.  It follows that any significance test that compares the observed effect size against a value outside the confidence interval will produce a p-value less than the error criterion.

It is not so straightforward to see how Bayes Factors relate to effect size estimates.  Rouder et al. (2016) discuss a scenario where the 95% credibiltiy interval ranges around the most likey effect size of d = .165 ranges from .055 to .275 and excludes zero.  Thus, an evaluation of the null-hypothesis, d = 0, in terms of a 95%CI would lead to the rejection of the point-zero hypothesis.  We cannot conclude from this evidence that an effect is absent. Rather the most reasonable inference is that the population effect size is likely to be small, d ~ .2.   In this scenario, Rouder et al. (2009) obtained a Bayes-Factor of 1.  This Bayes-Factor also does not support H0, but it also does not provide support for H1.  How is it possible that two Bayesian methods seem to produce contradictory results? One method rejects H0:d = 0 and the other method shows no more support for H1 than for H0:d = 0.

Rouder et al. provide no answer to this question.  “Here we have a divergence. By using posterior credible intervals, we might reject the null, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point” (p. 536).   Moreover, they suggest that Bayes Factors give the correct answer and the rejection of d = 0 by means of credibility intervals is unwarranted. “…, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point.Updating with Bayes’ rule directly is the correct approach because it describes appropriate conditioning of belief about the null point on all the information in the data” (p. 536).

The problem with this interpretation of the discrepancy is that Rouder et al. (2009) misinterpret the meaning of a Bayes Factor as if it can be directly interpreted as a test of the null-hypothesis, d = 0.  However, in more thoughtful articles by the same authors, they recognize that (a) Bayes Factors only provide relative information about H0 in comparison to a specific alternative hypothesis H1, (b) the specification of H1 influences Bayes Factors, (c) alternative hypotheses that give a high a priori probability to large effect sizes favor H0 when the observed effect size is small, and (d) it is always possible to specify an alternative hypothesis (H1) that will not favor H0 by limiting the range of effect sizes to small effect sizes. For example, even with a small observed effect size of d = .165, it is possible to provide strong support for H1 and reject H0, if H1 is specified as Cauchy(0,0.1) and the sample size is sufficiently large to test H0 against H1.

BF.N.r.Plot.png
Figure 1 shows how Bayes Factors vary as a function of the specification of H1 and as a function of sample size with the same observed effect size of d = .165.  It is possible to get Bayes-Factors greater than 3 in favor of H0 with a wide Cauchy (0,1) and a small sample size of N = 100 and a Bayes Factor greater than 3 in favor of H1 with a small scaling factor of .4 or smaller and a sample size of N = 250.  In short, it is not possible to interpret Bayes Factors that favor H0 as evidence for the absence of an effect.  The Bayes Factor only tells us that the observed effect size is more consistent with the data than H1, but it is difficult to interpret this result because H1 is not a clearly specified alternative effect size. H1 changes not only with the specification of the range of effect sizes, but also with sample size.  This property is not a design flaw of Bayes Factors.  They were designed to provide more and more stringent tests of H0:d = 0 that would eventually support H1 if the sample size is sufficiently large and H0:d = 0 is false.  However, if H0 is false and H1 includes many large effect sizes (an ultrawide prior), Bayes Factors will first favor H0 and data collection may stop before Bayes Factors switch and provide the correct result that the population effect size is not zero.   This behavior of Bayes-Factors was illustrated by Rouder et al. (2009) with a simulation of a population effect size of d = .02.

 

BFSmallEffect.png
Here we see that the Bayes Factor favors H0 until sample sizes are above N = 5,000 and provides the correct information about the point hypothesis being false with N = 20,000 or more.To avoid confusion in the interpretation of Bayes Factors and to provide a better understanding of the actual regions of effect sizes that are consistent with H0 and H1, I developed simple R-Code that translates the results of a Bayesian Hypothesis Test into a Neyman Pearson hypothesis test.

TRANSLATING RESULTS FROM A BAYESIAN HYPOTHESIS TEST INTO RESULTS FROM A NEYMAN PEARSON HYPOTHESIS TEST

A typical analysis with BF creates three regions. One region of observed effect sizes is defined by BF > BF.crit in favor of H1 over H0. One region is defined by inconclusive BF with BF < BF.crit in favor of H0 and BF < BF.crit for H1 (1/BF crit < BF(H1/H0) < BF.crit.). The third region is defined by effect sizes between 0 and the effect size that matches the criterion for BF > BF.crit in favor of H0.
The width and location of these regions depends on the specification of H1 (a wider or narrower distribution of effect sizes under the assumption that an effect is present), the sample size, and the long-run error rate, where an error is defined as a BF > BF.crit that supports H0 when H1 is true and vice versa.
I examined the properties of BF for two scenarios. In one scenario researchers specify H1 as a Cauchy(0,.4). The value of .4 was chosen because .4 is a reasonable estimate of the median effect size in psychological research. I chose a criterion value of BF.crit = 5 to maintain a relatively low error rate.
I used a one sample t-test with n = 25, 100, 200, 500, and 1,000. The same amount of sampling error would be obtained in a two-sample design with 4x the sample size (N = 100, 400, 800, 2,000, and 4,000).
bf.crit N bf0 ci.low border ci.high alpha
[1,] 5 25 2.974385 NA NA 0.557 NA
[2,] 5 100 5.296013 0.035 0.1535 0.272 0.1194271
[3,] 5 200 7.299299 0.063 0.1300 0.197 0.1722607
[4,] 5 500 11.346805 0.057 0.0930 0.129 0.2106060
[5,] 5 1000 15.951191 0.048 0.0715 0.095 0.2287873
We see that the typical sample size in cognitive psychology with a within-subject design (n = 25) will never produce a result in favor of H0 and it requires an effect size of d = .56 to produce a result in favor of H1. This criterion is somewhat higher than the criterion effect size for p < .05 (two-tailed), which is d = .41, and approximately the same as the effect size needed for with alpha = .01, d = .56.
With N = 100, it is possible to obtain evidence for H0. If the observed effect size is exactly 0, BF = 5.296, and the maximum observed effect size that produces evidence in favor of H0 is d = 0.035. The minimum effect size needed to support H1 is d = .272. We can think about these two criterion values as limits of a confidence interval around the effect size in the middle (d = .1535). The width of the confidence interval implies that in the long run, we would make ~ 11% errors in favor of H0 and 11% errors in favor of H1, if the population effect size is d = .1535(#1). If we treat d = .1535 as the boundary for an interval null-hypothesis, H0:abs(d) < .1535, we do not make a mistake when the population effect size is less than .1535. So, we can interpret a BF > 5 as evidence for H0:abs(d) < .15, with an 11% error rate. The probability of supporting H0 with a statistically small effect size of d = .2 would be less than 11%. In short, we can interpret BF > 5 in favor of H0 as evidence for abs(d) < .15 and BF > 5 in favor of H1 as evidence for H1:abs(d) > .15, with approximate error rates of 10% and a region of inconclusive evidence for observed effect sizes between d = .035 and d = .272.
The results for N = 200, 500, and 1,000 can be interpreted the same way. An increase in sample size has the following effects: (a) the boundary effect size d.b that separates H0:|d| <= d.b and H1:|d| > d.b shrinks. In the limit it reaches zero and only d = 0 supports H0: |d| <= 0. With N = 1,000, the boundary value is d.b = .048 and an observed effect size of d = .0715 provides sufficient evidence for H1. However, the table also shows that the error rate increases. In larger samples a BF of 5 in one direction or the other occurs more easily by chance and the long-term error rate has doubled. Of course, researchers could keep a fixed error rate by adjusting the BF criterion value to match a fixed error rate, but Bayesian Hypthesis tests are not designed to maintain a fixed error rate. If this were a researchers goal, they could just specify alpha and use NHST to test H0:|d| < d.crit vs. H1:|d| > d.crit.
In practice, many researchers use a wider prior and a lower criterion value. For example, EJ Wagenmakers prefers the original Jeffrey prior with a scaling factor of 1 and a criterion value of 3 as noteworthy (but not definitive) evidence.
The next table translates inferences with a Cauchy(0,1) and BF.crit = 3 into effect size regions.
bf.crit N bf0 ci.low border ci.high alpha
[1,] 3 25 6.500319 0.256 0.3925 0.529 0.2507289
[2,] 3 100 12.656083 0.171 0.2240 0.277 0.2986493
[3,] 3 200 17.812296 0.134 0.1680 0.202 0.3155818
[4,] 3 500 28.080784 0.094 0.1140 0.134 0.3274574
[5,] 3 1000 39.672827 0.071 0.0850 0.099 0.3290325

The main effect of using Cauchy(0,1) to specify H1 is that the border value that distinguishes H0 and H1 is higher. The main effect of using BF.crit = 3 as a criterion value is that it is easier to provide evidence for H0 or H1 at the expense of having a higher error rate.

It is now possible to provide evidence for H0 with a small sample of n = 25 in a one-sample t-test. However, when we translate this finding into ranges of effect sizes, we see that the boundary between H0 and H1 is d = .39.  Any observed effect size below .256 yields a BF in favor of H0. So, it would be misleading to interpret this finding as if a BF of 3 in a sample of n = 25 provides evidence for the point null-hypothesis d = 0.  It only shows that an effect size of d < .39 is more consistent with an effect size of 0 than with effect sizes specified in H1 which places a lot of weight on large effect sizes.  As sample sizes increase, the meaning of BF > 3 in favor of H0 changes. With N = 1,000,  a BF of 3  any effect size larger than .071 does no longer provide evidence for H0.  In the limit with an infinite sample size, only d = 0 would provide evidence for H0 and we can infer that H0 is true. However, BF > 3 in finite sample sizes does not justify this inference.

The translation of BF results into hypotheses about effect size regions makes it clear why BF results in small samples often seem to diverge from hypothesis tests with confidence intervals or credibility intervals.  In small samples, BF are sensitive to specification of H1 and even if it is unlikely that the population effect size is 0 (0 is outside the confidence or credibility interval), the BF may show support for H0 because the effect size is below the criterion value that is needed to support H0.  This inconsistency does not mean that different statistical procedures lead to different inferences. It only means that BF > 3 in favor of H0 RELATIVE TO H1 cannot be interpreted as a test of the hypothesis of d = 0.  It can only be interpreted as evidence for H0 relative to H1 and the specification of H1 influences  which effect sizes provide support for H0.

CONCLUSION

Sir Arthur Eddington (cited by Cacioppo & Berntson, 1994) described a hypothetical
scientist who sought to determine the size of the various fish in the sea. The scientist began by weaving a 2-in. mesh net and setting sail across the seas. repeatedly sampling catches and carefully measuring. recording. and analyzing the results of each catch. After extensive sampling. the scientist concluded that there were no fish smaller than 2 in. in the sea.

The moral of this story is that a scientists method influences their results.  Scientists who use p-values to search for significant results in small samples, will rarely discover small effects and may start to believe that most effects are large.  Similarly, scientists who use Bayes-Factors with wide priors may delude themselves that they are searching for small and large effects and falsely believe that effects are either absent or large.  In both cases, scientists make the same mistake.  A small sample is like a net with large holes that can only (reliably) capture big fish.  This is ok, if the goal is to capture only big fish, but it is a problem when the goal is to find out whether a pond contains any fish at all.  A wide net with big holes may never lead to the discovery of a fish in the pond, while there are plenty of small fish in the pond.

Researchers therefore have to be careful when they interpret a Bayes Factor and they should not interpret Bayes-Factors in favor of H0 as evidence for the absence of an effect. This fallacy is just as problematic as the fallacy to interpret a p-value above alpha (p > .05) as evidence for the absence of an effect.  Most researchers are aware that non-significant results do not justify the inference that the population effect size is zero. It may be news to some that a Bayes Factor in favor of H0 suffers from the same problem.  A Bayes-Factor in favor of H0 is better considered a finding that rejects the specific alternative hypothesis that was pitted against d = 0.  Falsification of this specific H1 does not justify the inference that H0:d = 0 is true.  Another model that was not tested could still fit the data better than H0.

Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

I  have written a few posts before that are critical of Bayesian Hypothesis Testing with Bayes Factors (Rouder et al.,. 2009; Wagenmakers et al., 2010, 2011).

The main problem with this approach is that it typically compares a single effect size (typically 0) with an alternative hypothesis that is a composite of all other effect sizes. The alternative is often specified as a weighted average with a Cauchy distribution to weight effect sizes.  This leads to a comparison of H0:d=0 vs. H1:d=Cauchy(es,0,r) with r being a scaling factor that specifies the median absolute effect size for the alternative hypothesis.

It is well recognized by critics and proponents of this test that the comparison of H0 and H1 favors H0 more and more as the scaling factor is increased.  This makes the test sensitive to the specification of H1.

Another problem is that Bayesian hypothesis testing either uses arbitrary cutoff values (BF > 3) to interpret the results of a study or asks readers to specify their own prior odds of H0 and H1.  I have started to criticize this approach because the use of a subjective prior in combination with an objective specification of the alternative hypothesis can lead to false conclusions.  If I compare H0:d = 0 with H1:d = .2, I am comparing two hypothesis with a single value.  If I am very uncertain about the results of a study , I can assign an equal prior probability to both effect sizes and the prior odds of H0/H1 are .5/.5 = 1. Thus, a Bayes Factor can be directly interpreted as the posterior odds of H0 and H1 given the data.

Bayes Ratio (H0/H1) = Prior Odds (H0,H1) * Bayes Factor (H0/H1)

However, if I increase the range of possible effect sizes for H1 and I am uncertain about the actual effect sizes, the a priori probability increases, just like my odds of winning increases when I disperse my bet on several possible outcomes (lottery numbers, horses in the Kentucky derby, or numbers in a roulette game).  Betting on effect sizes is no different and the prior odds in favor of H1 increase the more effect sizes I consider plausible.

I therefore propose to use the prior distribution of effect sizes to specify my uncertainty about what could happen in a study. If I think, the null-hypothesis is most likely, I can weight it more than other effect sizes (e.g., with a Cauchy or normal distribution centered at 0).   I can then use this distribution to compute (a) the prior odds of H0 and H1, and (b) the conditional probabilities of the observed test statistic (e.g., a t-value) given H0 and H1.

Instead of interpreting Bayes Factors directly, which is not Bayesian, and confuses conditional probabilities of data given hypothesis with conditional probabilities of hypotheses given data,  Bayes-Factors are multiplied with the prior odds, to get Bayes Ratios, which many Bayesians consider to be the answer to the real question researchers want to answer.  How much should I believe H0 or H1 after I collected data and computed a test-statistic like a t-value?

This approach is more principled and Bayesian than the use of Bayes Factors with arbitrary cut-off values that are easily misinterpreted as evidence for H0 or H1.

One reason why this approach may not have been used before is that H0 is often specified as a point-value (d = 0) and the a priori probability of a single point effect size is 0.  Thus, the prior odds (H0/H1) are zero and the Bayes Ratio is also zero.  This problem can be avoided by restricting H1 to a reasonably small range of effect sizes and by specifying the null-hypothesis as a small range of effect sizes around zero.  As a result, it becomes possible to obtain non-zero prior odds for H0 and to obtain interpretable Bayes Ratios.

The inferences based on Bayes Ratios are not only more principled than those based on Base Factors,  they are also more in line with inferences that one would draw on the basis of other methods that can be used to test H0 and H1 such as confidence intervals or Bayesian credibility intervals.

For example, imagine a researcher who wants to provide evidence for the null-hypothesis that there are no gender differences in intelligence.   The researcher decided a priori that small differences of less than 1.5 IQ points (0.1 SD) will be considered as sufficient to support the null-hypothesis. He collects data from 50 men and 50 women and finds a mean difference of 3 IQ points in one or the other direction (conveniently, it doesn’t matter in which direction).

The t-value with a standardized mean difference of d = 3/15d = .2, and sampling error of SE = 2/sqrt(100) = .2 is t = .2/2 = 1.  A t-value of 1 is not statistically significant. Thus, it is clear that the data do not provide evidence against H0 that there are no gender differences in intelligence.  However, do the data provide positive sufficient evidence for the null-hypothesis?   p-values are not designed to answer this question.  The 95%CI around the observed standardized effect size is -.19 to .59.  This confidence interval is wide. It includes 0, but it also includes d = .2 (a small effect size) and d = .5 (a moderate effect size), which would translate into a difference by 7.5 IQ points.  Based on this finding it would be questionable to interpret the data as support for the null-hypothesis.

With a default specification of the alternative hypothesis with a Cauchy distribution scaled to 1,  the Bayes-Factor (H0/H1) favors H0 over H1  4.95:1.   The most appropriate interpretation of this finding is that the prior odds should be updated by a factor of 5:1 in favor of H0, whatever these prior odds are.  However, following Jeffrey’s many users who compute Bayes-Factors interpret Bayes-Factors directly with reference to Jeffrey’s criterion values and a value greater than 3 can be and has been used to suggest that the data provide support for the null-hypothesis.

This interpretation ignores that the a priori distribution of effect sizes allocates only a small probability (p = .07) to H0 and a much larger area to H1 (p = .93).  When the Bayes Factor is combined with the prior odds (H0/H1) of .07/.93 = .075/1,   the resulting Bayes Ratio shows that support for H0 increased, but that it is still more likely that H1 is true than that H0 is true,   .075 * 4.95 = .37.   This conclusion is consistent with the finding that the 95%CI overlaps with the region of effect sizes for H0 (d = -.1, .1).

We can increase the prior odds of H0 by restricting the range of effect sizes that are plausible under H1.  For example, we can restrict effect sizes to 1 or we can set the scaling parameter of the Cauchy distribution to .5. This way, 50% of the distribution falls into the range between d = -.5 and .5.

The t-value and 95%CI remain unchanged because they do not require a specification of H1.  By cutting the range of effect sizes for H1 roughly in half (from scaling parameter 1 to .5), the Bayes-Factor in favor of H0 is also cut roughly in half and is no longer above the criterion value of 3, BF (H0/H1) = 2.88.

The change of the alternative hypothesis has the opposite effect on prior odds. The probability of H0 nearly doubled (p = .13) and the prior odds are now .13/.87 = .15.  The resulting Bayes Ratio in favor of H0 remains similar to the Bayes Ratio with the wider Cauchy distribution, Bayes Ratio = .15 * 2.88 = 0.45.  In fact, it actually is a bit stronger than the Bayes Ratio with the wider specification of effect sizes (BR (H0/H1) = .45.  However, both Bayes Ratios lead to the same conclusion that is also consistent with the observed effect size, d = .2, and the confidence interval around it, d = -.19 to d = .59.  That is, given the small sample size, the observed effect size provides insufficient information to draw any firm conclusions about H0 or H1. More data are required to decide empirically which hypothesis is more likely to be true.

The example used an arbitrary observed effect size of d = .2.  Evidently, effect sizes much larger than this would lead to the rejection of H0 with p-values, confidence intervals, Bayes Factor, or Bayes-Ratios.  A more interesting question is what the results would be like if the observed effect size would have provided maximum support for the null-hypothesis, which assumes an observed effect size of 0, which also produces a t-value of 0.   With the default prior of Cauchy(M=0,V=1), the Bayes-Factor in favor of H0 is 9.42, which is close to the next criterion value of BF > 10 that is sometimes used to stop data collection because the results are decisive.  However, the Bayes Ratio is still slightly in favor of H1, BR (H1/H0) = 1.42.  The 95%CI ranges from -.39 to .39 and overlaps with the criterion range of effect sizes in the range from -.1 to .1.   Thus, the Bayes Ratio shows that even an observed effect size of 0 in a sample of N = 100 provides insufficient evidence to infer that the null-hypothesis is true.

When we increase sample size to N = 2,000,  the 95%CI around d = 0 ranges from -.09 to .09.  This finding means that the data support the null-hypothesis and that we would make a mistake in our inferences that use the same approach in no more than 5% of our tests (not just those that provide evidence for H0, but all tests that use this approach).  The Bayes-Factor also favors H0 with a massive BF (H0/H1) = 711..27.   The Bayes-Ratio also favors H0, with a Bayes-Ratio of 53.35.   As Bayes-Ratios are the ratio of two complementary probabilities p(H0) + p(H1) = 1, we can compute the probability of H0 being true with the formula  BR(H0/H1) / (Br(H0/H1) + 1), which yields a probability of 98%.  We see how the Bayes-Ratio is consistent with the information provided by the confidence interval.  The long-run error frequency for inferring H0 from the data was less than 5% and the probability of H1 being true given the data is 1-.98 = .02.

Conclusion

Bayesian Hypothesis Testing has received increased interest among empirical psychologists, especially in situations when researchers aim to demonstrate the lack of an effect.  Increasingly, researchers use Bayes-Factors with criterion values to claim that their data provide evidence for the null-hypothesis.  This is wrong for three reasons.

First, it is impossible to test a hypothesis that is specified as one effect size out of an infinite number of alternative effect sizes.  Researchers appear to be confused that Bayes Factors in favor of H0 can be used to suggest that all other effect sizes are implausible. This is not the case because Bayes Factors do not compare H0 to all other effect sizes. They compare H0 to a composite hypotheses of all other effect sizes and Bayes Factors depend on the way the composite is created. Falsification of one composite does not ensure that the null-hypothesis is true (the only viable hypothesis still standing) because other composites can still fit the data better than H0.

Second, the use of Bayes-Factors with criterion values also suffers from the problem that it ignores the a priori odds of H0 and H1.  A full Bayesian inferences requires to take the prior odds into account and to compute posterior odds or Bayes Ratios.  The problem for the point-null hypothesis (d = 0) is that the prior odds for H0 over H1 is 0. The reason is that the prior distribution of effect sizes adds up to 1 (the true effect size has to be somewhere), leaving zero probability for d = 0.   It is possible to compute Bayes-Factors for d = 0 because Bayes-Factors use densities. For the computation of Bayes Factors the distinction between densities and probabilities is not important, but the for the computation of prior odds, the distinction is important.  A single effect size has a density on the Cauchy distribution, but it has zero probability.

The fundamental inferential problem of Bayes-Factors that compare H0:d=0 can be avoided by specifying H0 as a critical region around d=0.  It is then possible to compute prior odds based on the area under the curve for H0 and the area under the curve for H1. It is also possible to compute Bayes Factors for H0 and H1 when H0 and H1 are specified as complementary regions of effect sizes.  The two ratios can be multiplied to obtain a Bayes Ratio. Furthermore, Bayes Ratios can be used as the probability of H0 given the data and the probability of H1 given the data.  The results of this test are consistent with other approaches to the testing of regional null-hypothesis and they are robust to misspecifications of the alternative hypothesis that allocate to much weight to large effect sizes.   Thus, I recommend Bayes Ratios for principled Bayesian Hypothesis testing.

*************************************************************************

R-Code for the analyses reported in this post.

*************************************************************************

#######################
### set input
#######################

### What is the total sample size?
N = 2000

### How many groups?  One sample or two sample?
gr = 2

### what is the observed effect size
obs.es = 0

### Set the range for H0, H1 is defined as all other effect sizes outside this range
H0.range = c(-.1,.1)  #c(-.2,.2) # 0 for classic point null

### What is the limit for maximum effect size, d = 14 = r = .99
limit = 14

### What is the mode of the a priori distribution of effect sizes?
mode = 0

### What is the variability (SD for normal, scaling parameter for Cauchy) of the a priori distribution of effect sizes?
var = 1

### What is the shape of the a priori distribution of effect sizes
shape = “Cauchy”  # Uniform, Normal, Cauchy  Uniform needs limit

### End of Input
### R computes Likelihood ratios and Weighted Mean Likelihood Ratio (Bayes Factor)
prec = 100 #set precision, 100 is sufficient for 2 decimal
df = N-gr
se = gr/sqrt(N)
pop.es = mode
if (var > 0) pop.es = seq(-limit*prec,limit*prec)/prec
weights = 1
if (var > 0 & shape == “Cauchy”) weights = dcauchy(pop.es,mode,var)
if (var > 0 & shape == “Normal”) weights = dnorm(pop.es,mode,var)
if (var > 0 & shape == “Uniform”) weights = dunif(pop.es,-limit,limit)
H0.mat = cbind(0,1)
H1.mat = cbind(mode,1)
if (var > 0) H0.mat = cbind(pop.es,weights)[pop.es >= H0.range[1] & pop.es <= H0.range[2],]
if (var > 0) H1.mat = cbind(pop.es,weights)[pop.es < H0.range[1] | pop.es > H0.range[2],]
H0.mat = matrix(H0.mat,,2)
H1.mat = matrix(H1.mat,,2)
H0 = sum(dt(obs.es/se,df,H0.mat[,1]/se)*H0.mat[,2])/sum(H0.mat[,2])
H1 = sum(dt(obs.es/se,df,H1.mat[,1]/se)*H1.mat[,2])/sum(H1.mat[,2])
BF10 = H1/H0
BF01 = H0/H1
Pr.H0 = sum(H0.mat[,2]) / sum(weights)
Pr.H1 = sum(H1.mat[,2]) / sum(weights)
PriorOdds = Pr.H1/Pr.H0
Bayes.Ratio10 = PriorOdds*BF10
Bayes.Ratio01 = 1/Bayes.Ratio10
### R creates output file
text = c()
text[1] = paste0(‘The observed t-value with d = ‘,obs.es,’ and N = ‘,N,’ is t(‘,df,’) = ‘,round(obs.es/se,2))
text[2] = paste0(‘The 95% confidence interal is ‘,round(obs.es-1.96*se,2),’ to ‘,round(obs.es+1.96*se,2))
text[3] = paste0(‘Weighted Mean Density(H0:d >= ‘,H0.range[1],’ & <= ‘,H0.range[2],’) = ‘,round(H0,5))
text[4] = paste0(‘Weighted Mean Density(H1:d <= ‘,H0.range[1],’ | => ‘,H0.range[2],’) = ‘,round(H1,5))
text[5] = paste0(‘Weighted Mean Likelihood Ratio (Bayes Factor) H0/H1: ‘,round(BF01,2))
text[6] = paste0(‘Weighted Mean Likelihood Ratio (Bayes Factor) H1/H0: ‘,round(BF10,2))
text[7] = paste0(‘The a priori likelihood ratio of H1/H0 is ‘,round(Pr.H1,2),’/’,round(Pr.H0,2),’ = ‘,round(PriorOdds,2))
text[8] = paste0(‘The Bayes Ratio(H1/H0) (Prior Odds x Bayes Factor) is ‘,round(Bayes.Ratio10,2))
text[9] = paste0(‘The Bayes Ratio(H0/H1) (Prior Odds x Bayes Factor) is ‘,round(Bayes.Ratio01,2))
### print output
text

Subjective Bayesian T-Test Code

########################################################

rm(list=ls()) #will remove ALL objects

##############################################################
Bayes-Factor Calculations for T-tests
##############################################################

#Start of Settings

### Give a title for results output
Results.Title = ‘Normal(x,0,.5) N = 100 BS-Design, Obs.ES = 0′

### Criterion for Inference in Favor of H0, BF (H1/H0)
BF.crit.H0 = 1/3

### Criterion for Inference in Favor of H1
#set z.crit.H1 to Infinity to use Bayes-Factor, BF(H1/H0)
BF.crit.H1 = 3
z.crit.H1 = Inf

### Set Number of Groups
gr = 2

### Set Total Sample size
N = 100

### Set observed effect size
### for between-subject designs and one sample designs this is Cohen’s d
### for within-subject designs this is dz
obs.es = 0

### Set the mode of the alternative hypothesis
alt.mode = 0

### Set the variability of the alternative hypothesis
alt.var = .5

### Set the shape of the distribution of population effect sizes
alt.dist = 2  #1 = Cauchy; 2 = Normal

### Set the lower bound of population effect sizes
### Set to zero if there is zero probability to observe effects with the opposite sign
low = -3

### Set the upper bound of population effect sizes
### For example, set to 1, if you think effect sizes greater than 1 SD are unlikely
high = 3

### set the precision of density estimation (bigger takes longer)
precision = 100

### set the graphic resolution (higher resolution takes longer)
graphic.resolution = 20

### set limit for non-central t-values
nct.limit = 100

################################
# End of Settings
################################

# compute degrees of freedom
df = (N – gr)

# get range of population effect sizes
pop.es=seq(low,high,(1/precision))

# compute sampling error
se = gr/sqrt(N)

# limit population effect sizes based on non-central t-values
pop.es = pop.es[pop.es/se >= -nct.limit & pop.es/se <= nct.limit]

# function to get weights for Cauchy or Normal Distributions
get.weights=function(pop.es,alt.dist,p) {
if (alt.dist == 1) w = dcauchy(pop.es,alt.mode,alt.var)
if (alt.dist == 2) w = dnorm(pop.es,alt.mode,alt.var)
sum(w)
# get the scaling factor to scale weights to 1*precision
#scale = sum(w)/precision
# scale weights
#w = w / scale
return(w)
}

# get weights for population effect sizes
weights = get.weights(pop.es,alt.dist,precision)

#Plot Alternative Hypothesis
Title=”Alternative Hypothesis”
ymax=max(max(weights)*1.2,1)
plot(pop.es,weights,type=’l’,ylim=c(0,ymax),xlab=”Population Effect Size”,ylab=”Density”,main=Title,col=’blue’,lwd=3)
abline(v=0,col=’red’)

#create observations for plotting of prediction distributions
obs = seq(low,high,1/graphic.resolution)

# Get distribution for observed effect size assuming H1
H1.dist = as.numeric(lapply(obs, function(x) sum(dt(x/se,df,pop.es/se) * weights)/precision))

#Get Distribution for observed effect sizes assuming H0
H0.dist = dt(obs/se,df,0)

#Compute Bayes-Factors for Prediction Distribution of H0 and H1
BFs = H1.dist/H0.dist

#Compute z-scores (strength of evidence against H0)
z = qnorm(pt(obs/se,df,log.p=TRUE),log.p=TRUE)

# Compute H1 error rate rate
BFpos = BFs
BFpos[z < 0] = Inf
if (z.crit.H1 == Inf) z.crit.H1 = abs(z[which(abs(BFpos-BF.crit.H1) == min(abs(BFpos-BF.crit.H1)))])
ncz = qnorm(pt(pop.es/se,df,log.p=TRUE),log.p=TRUE)
weighted.power = sum(pnorm(abs(ncz),z.crit.H1)*weights)/sum(weights)
H1.error = 1-weighted.power

#Compute H0 Error Rate
z.crit.H0 = abs(z[which(abs(BFpos-BF.crit.H0) == min(abs(BFpos-BF.crit.H0)))])
H0.error = (1-pnorm(z.crit.H0))*2

# Get density for observed effect size assuming H0
Density.Obs.H0 = dt(obs.es,df,0)

# Get density for observed effect size assuming H1
Density.Obs.H1 = sum(dt(obs.es/se,df,pop.es/se) * weights)/precision

# Compute Bayes-Factor for observed effect size
BF.obs.es = Density.Obs.H1 / Density.Obs.H0

#Compute z-score for observed effect size
obs.z = qnorm(pt(obs.es/se,df,log.p=TRUE),log.p=TRUE)

#Show Results
ymax=max(H0.dist,H1.dist)*1.3
plot(type=’l’,z,H0.dist,ylim=c(0,ymax),xlab=”Strength of Evidence (z-value)”,ylab=”Density”,main=Results.Title,col=’black’,lwd=2)
par(new=TRUE)
plot(type=’l’,z,H1.dist,ylim=c(0,ymax),xlab=””,ylab=””,col=’blue’,lwd=2)
abline(v=obs.z,lty=2,lwd=2,col=’darkgreen’)
abline(v=-z.crit.H1,col=’blue’,lty=3)
abline(v=z.crit.H1,col=’blue’,lty=3)
abline(v=-z.crit.H0,col=’red’,lty=3)
abline(v=z.crit.H0,col=’red’,lty=3)
points(pch=19,c(obs.z,obs.z),c(Density.Obs.H0,Density.Obs.H1))
res = paste0(‘BF(H1/H0): ‘,format(round(BF.obs.es,3),nsmall=3))
text(min(z),ymax*.95,pos=4,res)
res = paste0(‘BF(H0/H1): ‘,format(round(1/BF.obs.es,3),nsmall=3))
text(min(z),ymax*.90,pos=4,res)
res = paste0(‘H1 Error Rate: ‘,format(round(H1.error,3),nsmall=3))
text(min(z),ymax*.80,pos=4,res)
res = paste0(‘H0 Error Rate: ‘,format(round(H0.error,3),nsmall=3))
text(min(z),ymax*.75,pos=4,res)

######################################################
### END OF Subjective Bayesian T-Test CODE
######################################################
### Thank you to Jeff Rouder for posting his code that got me started.
### http://jeffrouder.blogspot.ca/2016/01/what-priors-should-i-use-part-i.html

 

Wagenmakers’ Default Prior is Inconsistent with the Observed Results in Psychologial Research

Bayesian statistics is like all other statistics. A bunch of numbers are entered into a formula and the end result is another number.  The meaning of the number depends on the meaning of the numbers that enter the formula and the formulas that are used to transform them.

The input for a Bayesian inference is no different than the input for other statistical tests.  The input is information about an observed effect size and sampling error. The observed effect size is a function of the unknown population effect size and the unknown bias introduced by sampling error in a particular study.

Based on this information, frequentists compute p-values and some Bayesians compute a Bayes-Factor. The Bayes Factor expresses how compatible an observed test statistic (e.g., a t-value) is with one of two hypothesis. Typically, the observed t-value is compared to a distribution of t-values under the assumption that H0 is true (the population effect size is 0 and t-values are expected to follow a t-distribution centered over 0 and an alternative hypothesis. The alternative hypothesis assumes that the effect size is in a range from -infinity to infinity, which of course is true. To make this a workable alternative hypothesis, H1 assigns weights to these effect sizes. Effect sizes with bigger weights are assumed to be more likely than effect sizes with smaller weights. A weight of 0 would mean a priori that these effects cannot occur.

As Bayes-Factors depend on the weights attached to effect sizes, it is also important to realize that the support for H0 depends on the probability that the prior distribution was a reasonable distribution of probable effect sizes. It is always possible to get a Bayes-Factor that supports H0 with an unreasonable prior.  For example, an alternative hypothesis that assumes that an effect size is at least two standard deviations away from 0 will not be favored by data with an effect size of d = .5, and the BF will correctly favor H0 over this improbable alternative hypothesis.  This finding would not imply that the null-hypothesis is true. It only shows that the null-hypothesis is more compatible with the observed result than the alternative hypothesis. Thus, it is always necessary to specify and consider the nature of the alternative hypothesis to interpret Bayes-Factors.

Although the a priori probabilities of  H0 and H1 are both unknown, it is possible to test the plausibility of priors against actual data.  The reason is that observed effect sizes provide information about the plausible range of effect sizes. If most observed effect sizes are less than 1 standard deviation, it is not possible that most population effect sizes are greater than 1 standard deviation.  The reason is that sampling error is random and will lead to overestimation and underestimation of population effect sizes. Thus, if there were many population effect sizes greater than 1, one would also see many observed effect sizes greater than 1.

To my knowledge, proponents of Bayes-Factors have not attempted to validate their priors against actual data. This is especially problematic when priors are presented as defaults that require no further justification for a specification of H1.

In this post, I focus on Wagenmakers’ prior because Wagenmaker has been a prominent advocate of Bayes-Factors as an alternative approach to conventional null-hypothesis-significance testing.  Wagenmakers’ prior is a Cauchy distribution with a scaling factor of 1.  This scaling factor implies a 50% probability that effect sizes are larger than 1 standard deviation.  This prior was used to argue that Bem’s (2011) evidence for PSI was weak. It has also been used in many other articles to suggest that the data favor the null-hypothesis.  These articles fail to point out that the interpretation of Bayes-Factors in favor of H0 is only valid for Wagenmakers’ prior. A different prior could have produced different conclusions.  Thus, it is necessary to examine whether Wagenmakers’ prior is a plausible prior for psychological science.

Wagenmakers’ Prior and Replicability

A prior distribution of effect sizes makes assumption about population effect sizes. In combination with information about sample size, it is possible to compute non-centrality parameters, which are equivalent to the population effect size divided by sampling error.  For each non-centrality parameter it is possible to estimate power as the area under the curve of the non-central t-distribution on the right side of the criterion value that corresponds to alpha, typically .05 (two-tailed).   The assumed typical power is simply the weighted average of the power values for each non-centrality parameters.

Replicability is not identical to power for a set of studies with heterogeneous non-centrality parameters because studies with higher power are more likely to become significant. Thus, the set of studies that achieved significance has higher average power as the original set of studies.

Aside from power, the distribution of observed test statistics is also informative. Unlikely power which is bound at 1, the distribution of test-statistics is unlimited. Thus, unreasonable assumptions about the distribution of effect sizes are visible in a distribution of test statistics that does not match distributions of tests statistics in actual studies.  One problem is that test-statistics are not directly comparable for different sample sizes or statistical tests because non-central distributions vary as a function of degrees of freedom and the test being used (e.g., chi-square vs. t-test).  To solve this problem, it is possible to convert all test statistics into z-scores so that they are on a common metric.  In a heterogeneous set of studies, the sign of the effect provides no useful information because signs only have to be consistent in tests of the same population effect size. As a result, it is necessary to use absolute z-scores. These absolute z-scores can be interpreted as the strength of evidence against the null-hypothesis.

I used a sample size of N = 80 and assumed a between subject design. In this case, sampling error is defined as 2/sqrt(80) = .224.  A sample size of N = 80 is the median sample size in Psychological Science. It is also the total sample size that would be obtained in a 2 x 2 ANOVA with n = 20 per cell.  Power and replicability estimates would increase for within-subject designs and for studies with larger N. Between subject designs with smaller N would yield lower estimates.

I simulated effect sizes in the range from 0 to 4 standard deviations.  Effect sizes of 4 or larger are extremely rare. Excluding these extreme values means that power estimates underestimate power slightly, but the effect is negligible because Wagenmakers’ prior assigns low probabilities (weights) to these effect sizes.

For each possible effect size in the range from 0 to 4 (using a resolution of d = .001)  I computed the non-centrality parameter as d/se.  With N = 80, these non-centrality parameters define a non-central t-distribution with 78 degrees of freedom.

I computed the implied power to achieve a significant result with alpha = .05 (two-tailed) with the formula

power = pt(ncp,N-2,qt(1-.025,N-2))

The formula returns the area under the curve on the right side of the criterion value that corresponds to a two-tailed test with p = .05.

The mean of these power values is the average power of studies if all effect sizes were equally likely.  The value is 89%. This implies that in the long run, a random sample of studies drawn from this population of effect sizes is expected to produce 89% significant results.

However, Wagenmakers’ prior assumes that smaller effect sizes are more likely than larger effect sizes. Thus, it is necessary to compute the weighted average of power using Wagenmakes’ prior distribution as weights.  The weights were obtained using the density of a Cauchy distribution with a scaling factor of 1 for each effect size.

wagenmakers.weights = dcauchy(es,0,1)

The weighted average power was computed as the sum of the weighted power estimates divided by the sum of weights.  The weighted average power is 69%.  This estimate implies that Wagenmakers’ prior assumes that 69% of statistical tests produce a significant result, when the null-hypothesis is false.

Replicability is always higher than power because the subset of studies that produce a significant result has higher average power than the the full set of studies. Replicabilty for a set of studies with heterogeneous power is the sum of the squared power of individual studies divided by the sum of power.

Replicability = sum(power^2) / sum(power)

The unweighted estimate of replicabilty is 96%.   To obtain the replicability for Wagenmakers’ prior, the same weighting scheme as for power can be used for replicability.

Wagenmakers.Replicability = sum(weights * power^2) / sum(weights*power)

The formula shows that Wagenmakers’ prior implies a replicabilty of 89%.  We see that the weighting scheme has relatively little effect on the estimate of replicability because many of the studies with small effect sizes are expected to produce a non-significant result, whereas the large effect sizes often have power close to 1, which implies that they wil be significant in the original study and the replication study.

The success rate of replication studies is difficult to estimate. Cohen estimated that typical studies in psychology have 50% power to detect a medium effect size, d = .5.  This would imply that the actual success rate would be lower because in an unknown percentage of studies the null-hypothesis is true.  However, replicability would be higher because studies with higher power are more likely to be significant.  Given this uncertainty, I used a scenario with 50% replicability.  That is an unbiased sample of studies taken from psychological journals would produce 50% successful replications in an exact replication study of the original studies.  The following computations show the implications of a 50% success rate in replication studies for the proportion of hypothesis tests where the null hypothesis is true, p(H0).

The percentage of true null-hypothesis is a function of the success rate in replication study, weighted average power, and weighted replicability.

p(H0) = (weighted.average.power * (weighted.replicability – success.rate)) / (success.rate*.05 – success.rate*weighted.average.power – .05^2 + weighted.average.power*weighted.replicability)

To produce a success rate of 50% in replication studies with Wagenmakers’ prior when H1 is true (89% replicability), the percentage of true null-hypothesis has to be 92%.

The high percentage of true null-hypothesis (92%) also has implications for the implied false-positive rate (i.e., the percentage of significant results that are true null-hypothesis.

False Positive Rate =  (Type.1.Error *.05)  / (Type.1.Error * .05 +
(1-Type.1.Error) * Weighted.Average.Power)
For every 100 studies, there are 92 true null-hypothesis that produce 92*.05 = 4.6 false positive results. For the remaining 8 studies with a true effect, there are 8 * .67 = 5.4 true discoveries.  The false positive rate is 4.6 / (4.6 + 5.4) = 46%.  This means Wagenmakers prior assumes that a success rate of 50% in replication studies implies that nearly 50% of studies that replicate successfully are false-positives results that would not replicate in future replication studies.

Aside from these analytically derived predictions about power and replicability, Wagenmakers’ prior also makes predictions about the distribution of observed evidence in individual studies. As observed scores are influenced by sampling error, I used simulations to illustrate the effect of Wagenmakers’ prior on observed test statistics.

For the simulation I converted the non-central t-values into non-central z-scores and simulated sampling error with a standard normal distribution.  The simulation included 92% true null-hypotheses and 8% true H1 based on Wagenmaker’s prior.  As published results suffer from publication bias, I simulated publication bias by selecting only observed absolute z-scores greater than 1.96, which corresponds to the p < .05 (two-tailed) significance criterion.  The simulated data were submitted to a powergraph analysis that estimates power and replicability based on the distribution of absolute z-scores.

Figure 1 shows the results.   First, the estimation method slightly underestimated the actual replicability of 50% by 2 percentage points.  Despite this slight estimation error, the Figure accurately illustrates the implications of Wagenmakers’ prior for observed distributions of absolute z-scores.  The density function shows a steep decrease in the range of z-scores between 2 and 3, and a gentle slope for z-scores greater than 4 to 10 (values greater than 10 are not shown).

Powergraphs provide some information about the composition of the total density by dividing the total density into densities for power less than 20%, 20-50%, 50% to 85% and more than 85%. The red line (power < 20%) mostly determines the shape of the total density function for z-scores from 2 to 2.5, and most the remaining density is due to studies with more than 85% power starting with z-scores around 4.   Studies with power in the range between 20% and 85% contribute very little to the total density. Thus, the plot correctly reveals that Wagenmakers’ prior assumes that the roughly 50% average replicability is mostly due to studies with very low power (< 20%) and studies with very high power (> 85%).
Powergraph for Wagenmakers' Prior (N = 80)

Validation Study 1: Michael Nujiten’s Statcheck Data

There are a number of datasets that can be used to evaluate Wagenmakers’ prior. The first dataset is based on an automatic extraction of test statistics from psychological journals. I used Michael Nuijten’s dataset to ensure that I did not cheery-pick data and to allow other researchers to reproduce the results.

The main problem with automatically extracted test statistics is that the dataset does not distinguish between  theoretically important test statistics and other statistics, such as significance tests of manipulation checks.  It is also not possible to distinguish between between-subject and within-subject designs.  As a result, replicability estimates for this dataset will be higher than the simulation based on a between-subject design.

Powergraph for Michele Nuijten's StatCheck Data

 

Figure 2 shows all of the data, but only significant z-scores (z > 1.96) are used to estimate replicability and power. The most striking difference between Figure 1 and Figure 2 is the shape of the total density on the right side of the significance criterion.  In Figure 2 the slope is shallower. The difference is visible in the decomposition of the total density into densities for different power bands.  In Figure 1 most of the total density was accounted for by studies with less than 20% power and studies with more than 85% power.  In Figure 2, studies with power in the range between 20% and 85% account for the majority of studies with z-scores greater than 2.5 up to z-scores of 4.5.

The difference between Figure 1 and Figure 2 has direct implications for the interpretation of Bayes-Factors with t-values that correspond to z-scores in the range of just significant results. Given Wagenmakers’ prior, z-scores in this range mostly represent false-positive results. However, the real dataset suggests that some of these z-scores are the result of underpowered studies and publication bias. That is, in these studies the null-hypothesis is false, but the significant result will not replicate because these studies have low power.

Validation Study 2:  Open Science Collective Articles (Original Results)

The second dataset is based on the Open Science Collective (OSC) replication project.  The project aimed to replicate studies published in three major psychology journals in the year 2008.  The final number of articles that were selected for replication was 99. The project replicated one study per article, but articles often contained multiple studies.  I computed absolute z-scores for theoretically important tests from all studies of these 99 articles.  This analysis produced 294 test statistics that could be converted into absolute z-scores.

Powergraph for OSC Rep.Project Articles (all studies)
Figure 3 shows clear evidence of publication bias.  No sampling distribution can produce the steep increase in tests around the critical value for significance. This selection is not an artifact of my extraction, but an actual feature of published results in psychological journals (Sterling, 1959).

Given the small number of studies, the figure also contains bootstrapped 95% confidence intervals.  The 95% CI for the power estimate shows that the sample is too small to estimate power for all studies, including studies in the proverbial file drawer, based on the subset of studies that were published. However, the replicability estimate of 49% has a reasonably tight confidence interval ranging from 45% to 66%.

The shape of the density distribution in Figure 3 differs from the distribution in Figure 2 in two ways. Initially the slop is steeper in Figure 3, and there is less density in the tail with high z-scores.  Both aspects contribute to the lower estimate of replicability in Figure 3, suggesting that replicabilty of focal hypothesis tests is lower than replicabilty for all statistical tests.

Comparing Figure 3 and Figure 1 shows again that the powergraph based on Wagenmakers’ prior differs from the powergraph for real data. In this case, the discrepancy is even more notable because focal hypothesis tests rarely produce large z-scores (z > 6).

Validation Study 3:  Open Science Collective Articles (Replication Results)

At present, the only data that are somewhat representative of psychological research (at least of social and cognitive psychology) and that do not suffer from publication bias are the results from the replication studies of the OSC replication project.  Out of 97 significant results in original studies, 36 studies (37%) produced that produced a significant result in the original studies produced a significant result in the replication study.  After eliminating some replication studies (e.g., sample of replication study was considerably smaller), 88 studies remained.

Powergraph for OSC Replication Results (k = 88)Figure 4 shows the powergraph for the 88 studies. As there is no publication bias, estimates of power and replicability are based on non-significant and significant results.  Although the sample size is smaller, the estimate of power has a reasonably narrow confidence interval because the estimate includes non-significant results. Estimated power is only 31%. The 95% confidence interval includes the actual success rate of 40%, which shows that there is no evidence of publication bias.

A visual comparison of Figure 1 and Figure 4 shows again that real data diverge from the predicted pattern by Wagenmakers’ prior.  Real data show a greater contribution of power in the range between 20% and 85% to the total density, and large z-scores (z > 6) are relatively rare in real data.

Conclusion

Statisticians have noted that it is good practice to examine the assumptions underlying statistical tests. This blog post critically examines the assumptions underlying the use of Bayes-Factors with Wagenmakers’ prior.  The main finding is that Wagenmaker’s prior makes unreasonable assumptions about power, replicability, and the distribution of observed test-statistics with or without publication bias. The main problem from Wagenmakers’ prior is that it predicts too many statistical results with strong evidence against the null-hypothesis (z > 5, or the 5 sigma rule in physics).  To achieve reasonable predictions for success rates without publication bias (~50%), Wagenmakers’ prior has to assume that over 90% of statistical tests conducted in psychology test false hypothesis (i.e., predict an effect when H0 is true), and that the false-positive rate is close to 50%.

Implications

Bayesian statisticians have pointed out for a long time that the choice of a prior influences Bayes-Factors (Kass, 1993, p. 554).  It is therefore useful to carefully examine priors to assess the effect of priors on Bayesian inferences. Unreasonable priors will lead to unreasonable inferences.  This is also true for Wagenmakers’ prior.

The problem of using Bayes-Factors with Wagenmakers’ prior to test the null-hypothesis is apparent in a realistic scenario that assumes a moderate population effect size of d = .5 and a sample size of N = 80 in a between subject design. This study has a non-central t of 2.24 and 60% power to produce a significant result with p < .05, two-tailed.   I used R to simulate 10,000 test-statistics using the non-central t-distribution and then computed Bayes-Factors with Wagenmakers’ prior.

Figure 5 shows a histogram of log(BF). The log is being used because BF are ratios and have very skewed distributions.  The histogram shows that BF never favor the null-hypothesis with a BF of 10 in favor of H0 (1/10 in the histogram).  The reason is that even with Wagenmakers’ prior a sample size of N = 80 is too small to provide strong support for the null-hypothesis.  However, 21% of observed test statistics produce a Bayes-Factor less than 1/3, which is sometimes used as sufficient evidence to claim that the data support the null-hypothesis.  This means that the test has a 21% error rate to provide evidence for the null-hypothesis when the null-hypothesis is false.  A 21% error rate is 4 times larger than the 5% error rate in null-hypothesis significance testing. It is not clear why researchers should replace a statistical method with a 5% error rate for a false discovery of an effect with a 20% error rate of false discoveries of null effects.

Another 48% of the results produce Bayes-Factors that are considered inconclusive. This leaves 31% of results that favor H1 with a Bayes-Factor greater than 3, and only 17% of results produce a Bayes-Factor greater than 10.   This implies that even with the low standard of a BF > 3, the test has only 31% power to provide evidence for an effect that is present.

These results are not wrong because they correctly express the support that the observed data provide for H0 and H1.  The problem only occurs when the specification of H1 is ignored. Given Wagenmakers prior, it is much more likely that a t-value of 1 stems from the sampling distribution of H0 than from the sampling distribution of H1.  However, studies with 50% power when an effect is present are also much more likely to produce t-values of 1 than t-values of 6 or larger.   Thus, a different prior that is more consistent with the actual power of studies in psychology would produce different Bayes-Factors and reduce the percentage of false discoveries of null effects.  Thus, researchers who think Wagenmakers’ prior is not a realistic prior for their research domain should use a more suitable prior for their research domain.

HistogramBF

 

Counterarguments

Wagenmakers’ has ignored previous criticisms of his prior.  It is therefore not clear what counterarguments he would make.  Below, I raise some potential counterarguments that might be used to defend the use of Wagenmakers’ prior.

One counterargument could be that the prior is not very important because the influence of priors on Bayes-Factors decreases as sample sizes increase.  However, this argument ignores the fact that Bayes-Factors are often used to draw inferences from small samples. In addition, Kass (1993) pointed out that “a simple asymptotic analysis shows that even in large samples Bayes factors remain sensitive to the choice of prior” (p. 555).

Another counterargument could be that a bias in favor of H0 is desirable because it keeps the rate of false-positives low. The problem with this argument is that Bayesian statistics does not provide information about false-positive rates.  Moreover, the cost for reducing false-positives is an increase in the rate of false negatives; that is, either inconclusive results or false evidence for H0 when an effect is actually present.  Finally, the choice of the correct prior will minimize the overall amount of errors.  Thus, it should be desirable for researchers interested in Bayesian statistics to find the most appropriate priors in order to minimize the rate of false inferences.

A third counterargument could be that Wagenmakers’ prior expresses a state of maximum uncertainty, which can be considered a reasonable default when no data are available.  If one considers each study as a unique study, a default prior of maximum uncertainty would be a reasonable starting point.  In contrast, it may be questionable to treat a new study as a randomly drawn study from a sample of studies with different population effect sizes.  However, Wagenmakers’ prior does not express a state of maximum uncertainty and makes assumptions about the probability of observing very large effect sizes.  It does so without any justification for this expectation.  It therefore seems more reasonable to construct priors that are consistent with past studies and to evaluate priors against actual results of studies.

A fourth counterargument is that Bayes-Factors are superior because they can provide evidence for the null-hypothesis and the alternative hypothesis.  However, this is not correct. Bayes-Factors only provide relative support for the null-hypothesis relative to a specific alternative hypothesis.  Researchers who are interested in testing the null-hypothesis can do so using parameter estimation with confidence or credibility intervals. If the interval falls within a specified region around zero, it is possible to affirm the null-hypothesis with a specified level of certainty that is determined by the precision of the study to estimate the population effect size.  Thus, it is not necessary to use Bayes-Factors to test the null-hypothesis.

In conclusion, Bayesian statistics and other statistics are not right or wrong. They combine assumptions and data to draw inferences.  Untrustworthy data and wrong assumptions can lead to false conclusions.  It is therefore important to test the integrity of data (e.g., presence of publication bias) and to examine assumptions.  The uncritical use of Bayes-Factors with default assumptions is not good scientific practice and can lead to false conclusions just like the uncritical use of p-values can lead to false conclusions.

Power Analysis for Bayes-Factor: What is the Probability that a Study Produces an Informative Bayes-Factor?

Jacob Cohen has warned fellow psychologists about the problem of conducting studies with insufficient statistical power to demonstrate predicted effects in 1962. The problem is simple enough. An underpowered study has only a small chance to produce the correct result; that is, a statistically significant result when an effect is present.

Many researchers have ignored Cohen’s advice to conduct studies with at least 80% power, that is, an 80% probability to produce the correct result when an effect is present because they were willing to pay low odds. Rather than conducting a single powerful study with 80% power, it seemed less risky to conduct three underpowered studies with 30% power. The chances of getting a significant result are similar (the power to get a significant result in at least 1 out of 3 studies with 30% power is 66%). Moreover, the use of smaller samples is even less problematic if a study tests multiple hypotheses. With 80% power to detect a single effect, a study with two hypotheses has a 96% probability that at least one of the two effects will produce a significant result. Three studies allow for six hypotheses tests. With 30% power to detect at least one of the two effects in six attempts, power to obtain at least one significant result is 88%. Smaller samples also provide additional opportunities to increase power by increasing sample sizes until a significant result is obtained (optional stopping) or by eliminating outliers. The reason is that these questionable practices have larger effects on the results in smaller samples. Thus, for a long time researchers did not feel a need to conduct adequately powered studies because there was no shortage of significant results to report (Schimmack, 2012).

Psychologists have ignored the negative consequences of relying on underpowered studies to support their conclusions. The problem is that the reported p-values are no longer valid. A significant result that was obtained by conducting three studies no longer has a 5% chance to be a random event. By playing the sampling-error lottery three times, the probability of obtaining a significant result by chance alone is now 15%. By conducting three studies with two hypothesis tests, the probability of obtaining a significant result by chance alone is 30%. When researchers use questionable research practices, the probability of obtaining a significant result by chance can further increase. As a result, a significant result no longer provides strong statistical evidence that the result was not just a random event.

It would be easy to distinguish real effects from type-I errors (significant results when the null-hypothesis is true) by conducting replication studies. Even underpowered studies with 30% power will replicate in every third study. In contrast, when the null-hypothesis is true, type-I errors will replicate only in 1 out of 20 studies, when the criterion is set to 5%. This is what a 5% criterion means. There is only a 5% chance (1 out of 20) to get a significant result when the null-hypothesis is true. However, this self-correcting mechanism failed because psychologists considered failed replication studies as uninformative. The perverse logic was that failed replications are to be expected because studies have low power. After all, if a study has only 30% power, a non-significant result is more likely than a significant result. So, non-significant results in underpowered studies cannot be used to challenge a significant result in an underpowered study. By this perverse logic, even false hypothesis will only receive empirical support because only significant results will be reported, no matter whether an effect is present or not.

The perverse consequences of abusing statistical significance tests became apparent when Bem (2011) published 10 studies that appeared to demonstrate that people can anticipate random future events and that practicing for an exam after writing an exam can increase grades. These claims were so implausible that few researchers were willing to accept Bem’s claims despite his presentation of 9 significant results in 10 studies. Although the probability that this even occurred by chance alone is less than 1 in a billion, few researchers felt compelled to abandon the null-hypothesis that studying for an exam today can increase performance on yesterday’s exam.   In fact, most researchers knew all too well that these results could not be trusted because they were aware that published results are not an honest report of what happens in a lab. Thus, a much more plausible explanation for Bem’s incredible results was that he used questionable research practices to obtain significant results. Consistent with this hypothesis, closer inspection of Bem’s results shows statistical evidence that Bem used questionable research practices (Schimmack, 2012).

As the negative consequences of underpowered studies have become more apparent, interest in statistical power has increased. Computer programs make it easy to conduct power analysis for simple designs. However, so far power analysis has been limited to conventional statistical methods that use p-values and a criterion value to draw conclusions about the presence of an effect (Neyman-Pearson Significance Testing, NPST).

Some researchers have proposed Bayesian statistics as an alternative approach to hypothesis testing. As far as I know, these researchers have not provided tools for the planning of sample sizes. One reason is that Bayesian statistics can be used with optional stopping. That is, a study can be terminated early when a criterion value is reached. However, an optional stopping rule also needs a rule when data collection will be terminated in case the criterion value is not reached. It may sound appealing to be able to finish a study at any moment, but if this event is unlikely to occur in a reasonably sized sample, the study would produce an inconclusive result. Thus, even Bayesian statisticians may be interested in the effect of sample sizes on the ability to obtain a desired Bayes-Factor. Thus, I wrote some r-code to conduct power analysis for Bayes-Factors.

The code uses the Bayes-Factor package in r for the default Bayesian t-test (see also blog post on Replication-Index blog). The code is posted at the end of this blog. Here I present results for typical sample sizes in the between-subject design for effect sizes ranging from 0 (the null-hypothesis is true) to Cohen’s d = .5 (a moderate effect). Larger effect sizes are not reported because large effects are relatively easy to detect.

The first table shows the percentage of studies that meet a specified criterion value based on 10,000 simulations of a between-subject design. For Bayes-Factors the criterion values are 3 and 10. For p-values the criterion values are .05, .01, and .001. For Bayes-Factors, a higher number provides stronger support for a hypothesis. For p-values, lower values provide stronger support for a hypothesis. For p-values, percentages correspond to the power of a study. Bayesian statistics has no equivalent concept, but percentages can be used in the same way. If a researcher aims to provide empirical support for a hypothesis with a Bayes-Factor greater than 3 or 10, the table gives the probability of obtaining the desired outcome (success) as a function of the effect size and sample size.

d   n     N     3   10     .05 .01     .001
.5   20   40   17   06     31     11     02
.4   20   40   12   03     22     07     01
.3   20   40   07   02     14     04     00
.2   20   40   04   01     09     02     00
.1   20   40   02   00     06     01     00
.0   20   40   33   00     95     99   100

For an effect size of zero, the interpretation of results switches. Bayes-Factors of 1/3 or 1/10 are interpreted as evidence for the null-hypothesis. The table shows how often Bayes-Factors provide support for the null-hypothesis as a function of the effect size, which is zero, and sample size. For p-values, the percentage is 1 – p. That is, when the effect is zero, the p-value will correctly show a non-significant result with a probability of 1 – p and it will falsely reject the null-hypothesis with the specified type-I error.

Typically, researchers do not interpret non-significant results as evidence for the null-hypothesis. However, it is possible to interpret non-significant results in this way, but it is important to take the type-II error rate into account. Practically, it makes little difference whether a non-significant result is not interpreted or whether it is taken as evidence for the null-hypothesis with a high type-II error probability. To illustrate this consider a study with N = 40 (n = 20 per group) and an effect size of d = .2 (a small effect). As there is a small effect, the null-hypothesis is false. However, the power to detect this effect in a small sample is very low. With p = .05 as the criterion, power is only 9%. As a result, there is a 91% probability to end up with a non-significant result even though the null-hypothesis is false. This probability is only slightly lower than the probability to get a non-significant result when the null-hypothesis is true (95%). Even if the effect size were d = .5, a moderate effect, power is only 31% and the type-II error rate is 69%. With type-II error rates of this magnitude, it makes practically no difference whether a null-hypothesis is accepted with a warning that the type-II error rate is high or whether the non-significant result is simply not interpreted because it provides insufficient information about the presence or absence of small to moderate effects.

The main observation in Table 1 is that small samples provide insufficient information to distinguish between the null-hypothesis and small to moderate effects. Small studies with N = 40 are only meaningful to demonstrate the presence of moderate to large effects, but they have insufficient power to show effects and insufficient power to show the absence of effects. Even when the null-hypothesis is true, a Bayes-Factor of 3 is reached only 33% of the time. A Bayes-Factor of 10 is never reached because the sample size is too small to provide such strong evidence for the null-hypothesis when the null-hypothesis is true. Even more problematic is that a Bayes-Factor of 3 is reached only 17% of the time when a moderate effect is present. Thus, the most likely outcome in small samples is an inconclusive result unless a strong effect is present. This means that Bayes-Factors in these studies have the same problem as p-values. They can only provide evidence that an effect is present when a strong effect is present, but they cannot provide sufficient evidence for the null-hypothesis when the null-hypothesis is true.

d   n     N     3   10     .05 .01     .001
.5   50 100   49   29     68     43     16
.4   50 100   30   15     49     24     07
.3   50 100   34   18     56     32     12
.2   50 100   07   02     16     05     01
.1   50 100   03   01     08     02     00
.0   50 100   68   00     95     99   100

In Table 2 the sample size has been increased to N = 100 participants (n = 50 per cell). This is already a large sample size by past standards in social psychology. Moreover, in several articles Wagenmakers has implemented a stopping rule that terminates data collection at this point. The table shows that a sample size of N = 100 in a between-subject design has modest power to demonstrate even moderate effect sizes of d = .5 with a Bayes-Factor of 3 as a criterion (49%). In comparison, a traditional p-value of .05 would provide 68% power.

The main argument for using Bayesian statistics is that it can also provide evidence for the null-hypothesis. With a criterion value of BF = 3, the default test correctly favors the null-hypothesis 68% of the time (see last row of the table). However, the sample size is too small to produce Bayes-Factors greater than 10. In sum, the default-Bayesian t-test with N = 100 can be used to demonstrate the presence of a moderate to large effects and with a criterion value of 3 it can be used to provide evidence for the null-hypothesis when the null-hypothesis is true. However it cannot be used to demonstrate that provide evidence for small to moderate effects.

The Neyman-Pearson approach to significance testing would reveal this fact in terms of the type-I I error rates associated with non-significant results. Using the .05 criterion, a non-significant result would be interpreted as evidence for the null-hypothesis. This conclusion is correct in 95% of all tests when the null-hypothesis is actually true. This is higher than the 68% criterion for a Bayes-Factor of 3. However, the type-II error rates associated with this inference when the null-hypothesis is false are 32% for d = .5, 51% for d = .4, 44% for d = .3, 84% for d = .2, and 92% for d = .1. If we consider effect size of d = .2 as important enough to be detected (small effect size according to Cohen), the type-II error rate could be as high as 84%.

In sum, a sample size of N = 100 in a between-subject design is still insufficient to test for the presence of a moderate effect size (d = .5) with a reasonable chance to find it (80% power). Moreover, a non-significant result is unlikely to occur for moderate to large effect sizes, but the sample size is insufficient to discriminate accurately between the null-hypothesis and small to moderate effects. A Bayes-Factor greater than 3 in favor of the null-hypothesis is most likely to occur when the null-hypothesis is true, but it can also occur when a small effect is present (Simonsohn, 2015).

The next table increases the total sample size to 200 for a between-subject design. The pattern doesn’t change qualitatively. So the discussion will be brief and focus on the power of a study with 200 participants to provide evidence for small to moderate effects and to distinguish small to moderate effects from the null-hypothesis.

d   n     N     3   10     .05 .01     .001
.5 100 200   83   67     94     82     58
.4 100 200   60   41     80     59     31
.3 100 200   16   06     31     13     03
.2 100 200   13   06     29     12     03
.1 100 200   04   01     11     03     00
.0 100 200   80   00     95     95     95  

Using Cohen’s guideline of 80% success rate (power), a study with N = 200 participants has sufficient power to show a moderate effect of d = .5 with p = .05, p = .01, and Bayes-Factor = 3 as criterion values. For d = .4, only the criterion value of p = .05 has sufficient power. For all smaller effects, the sample size is still too small to have 80% power. A sample of N = 200 also provides 80% power to provide evidence for the null-hypothesis with a Bayes-Factor of 3. Power for a Bayes-Factor of 10 is still 0 because this value cannot be reached with N = 200. Finally, with N = 200, the type-II error rate for d = .5 is just shy of .05 (1 – .94 = .06). Thus, it is justified to conclude from a non-significant result with a 6% error rate that the true effect size cannot be moderate to large (d >= .5). However, type-II error rates for smaller effect sizes are too high to test the null-hypothesis against these effect sizes.

d   n     N     3   10     .05 .01     .001
.5 200 400   99   97   100     99     95
.4 200 400   92   82     98     92     75
.3 200 400   64   46     85     65     36
.2 200 400   27   14     52     28     10
.1 200 400   05   02     17     06     01
.0 200 400   87   00     95     99     95

The next sample size doubles the number of participants. The reason is that sampling error decreases in a log-function and large increases in sample sizes are needed to further decrease sampling error. A sample size of N = 200 yields a standard error of 2 / sqrt(200) = .14. (14/100 of a standard deviation). A sample size of N = 400 is needed to reduce this to .10 (2 / sqrt (400) = 2 / 20 = .10; 2/10 of a standard deviation).   This is the reason why it is so difficult to find small effects.

Even with N = 400, power is only sufficient to show effect sizes of .3 or greater with p = .05, or effect sizes of d = .4 with p = .01 or Bayes-Factor 3. Only d = .5 can be expected to meet the criterion p = .001 more than 80% of the time. Power for Bayes-Factors to show evidence for the null-hypothesis also hardly changed. It increased from 80% to 87% with Bayes-Factor = 3 as criterion. The chance to get a Bayes-Factor of 10 is still 0 because the sample size is too small to produce such extreme values. Using Neyman-Pearson’s approach with a 5% type-II error rate as criterion, it is possible to interpret non-significant results as evidence that the true effect size cannot be .4 or larger. With a 1% criterion it is possible to say that a moderate to large effect would produce a significant result 99% of the time and the null-hypothesis would produce a non-significant result 99% of the time.

Doubling the sample size to N = 800 reduces sampling error from SE = .1 to SE = .07.

d   n     N     3     10     .05   .01     .001
.5 400 800 100 100   100  100     100
.4 400 800 100   99   100  100       99
.3 400 800   94   86     99     95      82
.2 400 800   54   38     81     60      32
.1 400 800   09   04     17     06      01
.0 400 800   91   52     95     95      95

A sample size of N = 800 is sufficient to have 80% power to detect a small effect according to Cohen’s classification of effect sizes (d = .2) with p = .05 as criterion. Power to demonstrate a small effect with Bayes-Factor = 3 as criterion is only 54%. Power to demonstrate evidence for the null-hypothesis with Bayes-Factor = 3 as criterion increased only slightly from 87% to 91%, but a sample size of N = 100 is sufficient to produce Bayes-Factors greater than 10 in favor of the null-hypothesis 52% of the time. Thus, researchers who aim for this criterion value need to plan their studies with N = 800. Smaller samples cannot produce these values with the default Bayesian t-test. Following Neyman-Pearson, a non-significant result can be interpreted as evidence that the true effect cannot be larger than d = .3, with a type-II error rate of 1%.

Conclusion

A common argument in favor of Bayes-Factors has been that Bayes-Factors can be used to test the null-hypothesis, whereas p-values can only reject the null-hypothesis. There are two problems with this claim. First, it confuses Null-Significance-Testing (NHST) and Neyman-Pearson-Significance-Testing (NPST). NPST also allows researchers to accept the null-hypothesis. In fact, it makes it easier to accept the null-hypothesis because every non-significant result favors the null-hypothesis. Of course, this does not mean that all non-significant results show that the null-hypothesis is true. In NPST the error of falsely accepting the null-hypothesis depends on the amount of sampling error. The tables here make it possible to compare Bayes-Factors and NPST. No matter which statistical approach is being used, it is clear that meaningful evidence for the null-hypothesis requires rather large samples. The r-code below can be used to compute power for different criterion values, effect sizes, and sample sizes. Hopefully, this will help researchers to better plan sample sizes and to better understand Bayes-Factors that favor the null-hypothesis.

########################################################################
###                       R-Code for Power Analysis for Bayes-Factor and P-Values                ###
########################################################################

## setup
library(BayesFactor)         # Load BayesFactor package
rm(list = ls())                       # clear memory

## set parameters
nsim = 10000      #set number of simulations
es 1 favor effect)
BF10_crit = 3      #set criterion value for BF favoring effect (> 1 = favor null)
p_crit = .05          #set criterion value for two-tailed p-value (e.g., .05

## computations
Z <- matrix(rnorm(groups*n*nsim,mean=0,sd=1),nsim,groups*n)   # create observations
Z[,1:n] <- Z[,1:n] + es                                                                                                #add effect size
tt <- function(x) {                                                                                                       #compute t-statistic (t-test)
oes <- mean(x[1:n])                                                                                    #compute mean group 1
if (groups == 2) oes = oes – mean(x[(n+1):(2*n)])                                  #compute mean for 2 groups
oes <- oes / sd(x[1:n*groups])                                                                  #compute observed effect size
t <- abs(oes) / (groups / sqrt(n*groups))                                                 #compute t-value
}

t <- apply(Z,1,function(x) tt(x))                                                                                 #get t-values for all simulations
df <- t – t + n*groups-groups                                                                                    #get degrees of freedom
p2t <- (1 – pt(abs(t),df))*2                                                                                         #compute two-tailed p-value
getBF <- function(x) {                                                                                                 #function to get Bayes-Factor
t <- x[1]
df <- x[2]
bf <- exp(ttest.tstat(t,(df+2)/2,(df+2)/2,rscale=rsc)$bf)
}              # end of function to get Bayes-Factor

input = matrix(cbind(t,df),,2)                                                                  # combine t and df values
BF10 <- apply(input,1, function(x) getBF(x) )                                        # get BF10 for all simulations
powerBF10 = length(subset(BF10, BF10 > BF10_crit))/nsim*100        # % results support for effect
powerBF01 = length(subset(BF10, BF10 < 1/BF10))/nsim*100            # % results support for null
powerP = length(subset(p2t, p2t < .05))/nsim*100                                # % significant, p < p-criterion

##output of results
cat(
” Power to support effect with BF10 >”,BF10_crit,”: “,powerBF10,
“\n”,
“Power to support null with BF01 >”,BF01_crit,” : “,powerBF01,
“\n”,
“Power to show effect with p < “,p_crit,” : “,powerP,
“\n”)

Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers (2015) published the results of a preregistered adversarial collaboration. This article has been considered a model of conflict resolution among scientists.

The study examined the effect of eye-movements on memory. Drs. Nieuwenhuis and Slagter assume that horizontal eye-movements improve memory. Drs. Matzke, van Rijn, and Wagenmakers did not believe that horizontal-eye movements improve memory. That is, they assumed the null-hypothesis to be true. Van der Molen acted as a referee to resolve conflict about procedural questions (e.g., should some participants be excluded from analysis?).

The study was a between-subject design with three conditions: horizontal eye movements, vertical eye movements, and no eye movement.

The researchers collected data from 81 participants and agreed to exclude 2 participants, leaving 79 participants for analysis. As a result there were 27 or 26 participants per condition.

The hypothesis that horizontal eye-movements improve performance can be tested in several ways.

An overall F-test can compare the means of the three groups against the hypothesis that they are all equal. This test has low power because nobody predicted differences between vertical eye-movements and no eye-movements.

A second alternative is to compare the horizontal condition against the combined alternative groups. This can be done with a simple t-test. Given the directed hypothesis, a one-tailed test can be used.

Power analysis with the free software program GPower shows that this design has 21% power to reject the null-hypothesis with a small effect size (d = .2). Power for a moderate effect size (d = .5) is 68% and power for a large effect size (d = .8) is 95%.

Thus, the decisive study that was designed to solve the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis d = 0 against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null-hypothesis.

What does an effect size of d = .8 mean? It means that memory performance is boosted by .8 standard deviations. For example, if students take a multiple-choice exam with an average of 66% correct answers and a standard deviation of 15%, they could boost their performance by 12% points (15 * 0.8 = 12) from an average of 66% (C) to 78% (B+) by moving their eyes horizontally while thinking about a question.

The article makes no mention of power-analysis and the implicit assumption that the effect size has to be large to avoid biasing the experiment in favor of the critiques.

Instead the authors used Bayesian statistics; a type of statistics that most empirical psychologists understand even less than standard statistics. Bayesian statistics somehow magically appears to be able to draw inferences from small samples. The problem is that Bayesian statistics requires researchers to specify a clear alternative to the null-hypothesis. If the alternative is d = .8, small samples can be sufficient to decide whether an observed effect size is more consistent with d = 0 or d = .8. However, with more realistic assumptions about effect sizes, small samples are unable to reveal whether an observed effect size is more consistent with the null-hypothesis or a small to moderate effect.

Actual Results

So what where the actual results?

Condition                                          Mean     SD         

Horizontal Eye-Movements          10.88     4.32

Vertical Eye-Movements               12.96     5.89

No Eye Movements                       15.29     6.38      

The results provide no evidence for a benefit of horizontal eye movements. In a comparison of the two a priori theories (d = 0 vs. d > 0), the Bayes-Factor strongly favored the null-hypothesis. However, this does not mean that Bayesian statistics has magical powers. The reason was that the empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81).   A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.

Conclusion

In conclusion, a small study surprisingly showed a mean difference in the opposite prediction than previous studies had shown. This finding is noteworthy and shows that the effects of eye-movements on memory retrieval are poorly understood. As such, the results of this study are simply one more example of the replicability crisis in psychology.

However, it is unfortunate that this study is published as a model of conflict resolution, especially as the empirical results failed to resolve the conflict. A key aspect of a decisive study is to plan a study with adequate power to detect an effect.   As such, it is essential that proponents of a theory clearly specify the effect size of their predicted effect and that the decisive experiment matches type-I and type-II error. With the common 5% Type-I error this means that a decisive experiment must have 95% power (1 – type II error). Bayesian statistics does not provide a magical solution to the problem of too much sampling error in small samples.

Bayesian statisticians may ignore power analysis because it was developed in the context of null-hypothesis testing. However, Bayesian inferences are also influenced by sample size and studies with small samples will often produce inconclusive results. Thus, it is more important that psychologists change the way they collect data than to change the way they analyze their data. It is time to allocate more resources to fewer studies with less sampling error than to waste resources on many studies with large sampling error; or as Cohen said: Less is more.