Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers (2015) published the results of a preregistered adversarial collaboration. This article has been considered a model of conflict resolution among scientists.

The study examined the effect of eye movements on memory. Drs. Nieuwenhuis and Slagter assumed that horizontal eye movements improve memory. Drs. Matzke, van Rijn, and Wagenmakers did not believe that horizontal eye movements improve memory; that is, they assumed the null hypothesis to be true. Dr. van der Molen acted as a referee to resolve conflicts about procedural questions (e.g., should some participants be excluded from the analysis?).

The study was a between-subject design with three conditions: horizontal eye movements, vertical eye movements, and no eye movement.

The researchers collected data from 81 participants and agreed to exclude 2 participants, leaving 79 for analysis. This resulted in 26 or 27 participants per condition.

The hypothesis that horizontal eye-movements improve performance can be tested in several ways.

An overall F-test can compare the means of the three groups against the null hypothesis that they are all equal. This test has low power for the question at hand because nobody predicted a difference between the vertical eye-movement and no-eye-movement conditions.

A second alternative is to compare the horizontal condition against the other two conditions combined. This can be done with a simple t-test, and given the directional hypothesis, a one-tailed test can be used.

Power analysis with the free software program G*Power shows that this design has 21% power to reject the null hypothesis for a small effect size (d = .2). Power is 68% for a moderate effect size (d = .5) and 95% for a large effect size (d = .8).
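These numbers can be reproduced without G*Power. Below is a minimal sketch using scipy's noncentral t distribution, assuming the one-tailed two-sample t-test described above, with the horizontal condition (n = 26) compared against the other two conditions combined (n = 53):

```python
import numpy as np
from scipy import stats

def power_one_tailed_t(d, n1, n2, alpha=0.05):
    """Power of a one-tailed two-sample t-test for a true effect size d."""
    df = n1 + n2 - 2
    ncp = d * np.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha, df)     # one-tailed critical value
    return 1 - stats.nct.cdf(t_crit, df, ncp)

# horizontal condition (n = 26) vs. the two control conditions combined (n = 53)
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: power = {power_one_tailed_t(d, 26, 53):.2f}")
```

The results match the G*Power figures cited above (roughly 21%, 68%, and 95%).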

Thus, the decisive study that was designed to settle the dispute only has adequate power (95%) to test Drs. Matzke et al.’s hypothesis (d = 0) against the alternative hypothesis that d = .8. For all effect sizes between 0 and .8, the study was biased in favor of the null hypothesis.

What does an effect size of d = .8 mean? It means that memory performance is boosted by .8 standard deviations. For example, if students take a multiple-choice exam with an average of 66% correct answers and a standard deviation of 15 percentage points, they could boost their performance by 12 percentage points (15 × 0.8 = 12), from an average of 66% (C) to 78% (B+), by moving their eyes horizontally while thinking about a question.

The article makes no mention of power analysis, nor of the implicit assumption that the effect size has to be large to avoid biasing the experiment in favor of the critics.

Instead, the authors used Bayesian statistics, a type of statistics that most empirical psychologists understand even less than standard statistics. Bayesian statistics appears, almost magically, to be able to draw inferences from small samples. The problem is that Bayesian statistics requires researchers to specify a clear alternative to the null hypothesis. If the alternative is d = .8, small samples can be sufficient to decide whether an observed effect size is more consistent with d = 0 or d = .8. However, with more realistic assumptions about effect sizes, small samples cannot reveal whether an observed effect size is more consistent with the null hypothesis or with a small to moderate effect.
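To illustrate the point-alternative case (a hypothetical sketch, not the analysis used in the article): when both hypotheses are single points, the Bayes factor reduces to a likelihood ratio of the observed t statistic under the two noncentrality parameters, and it can indeed be decisive in a small sample. The group sizes (26 vs. 53) are taken from the design above; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def bf_point(t_obs, n1, n2, d_alt):
    """Likelihood ratio (Bayes factor) for H1: d = d_alt vs. H0: d = 0,
    based on the observed two-sample t statistic."""
    df = n1 + n2 - 2
    ncp = d_alt * np.sqrt(n1 * n2 / (n1 + n2))
    return stats.nct.pdf(t_obs, df, ncp) / stats.t.pdf(t_obs, df)

# a t value near 2 already favors d = .8 over d = 0 with these group sizes
print(bf_point(2.0, 26, 53, 0.8))   # BF > 1: data favor the large effect
print(bf_point(0.0, 26, 53, 0.8))   # BF < 1: data favor the null
```

With a vaguer alternative (a distribution of plausible small-to-moderate effects instead of a single point), the same t values yield much weaker evidence, which is the concern raised above.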

**Actual Results**

So what were the actual results?

| Condition | Mean | SD |
| --- | --- | --- |
| Horizontal Eye-Movements | 10.88 | 4.32 |
| Vertical Eye-Movements | 12.96 | 5.89 |
| No Eye Movements | 15.29 | 6.38 |

The results provide no evidence for a benefit of horizontal eye movements. In a comparison of the two a priori theories (d = 0 vs. d > 0), the Bayes-Factor strongly favored the null-hypothesis. However, this does not mean that Bayesian statistics has magical powers. The reason was that the empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81). A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.
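The reported effect size can be checked from the summary statistics above. A sketch (the group sizes of 26 per condition are an assumption based on the reported exclusions):

```python
import numpy as np

def cohens_d(m1, s1, m2, s2, n1, n2):
    """Cohen's d from summary statistics, using the pooled SD."""
    sp = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (m1 - m2) / sp

# horizontal vs. no-eye-movement condition
d = cohens_d(10.88, 4.32, 15.29, 6.38, 26, 26)
print(round(d, 2))  # ≈ -0.81, matching the value reported above
```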

**Conclusion**

In conclusion, a small study surprisingly showed a mean difference in the direction opposite to the prediction and to what previous studies had shown. This finding is noteworthy and shows that the effects of eye movements on memory retrieval are poorly understood. As such, the results of this study are simply one more example of the replicability crisis in psychology.

However, it is unfortunate that this study is published as a model of conflict resolution, especially as the empirical results failed to resolve the conflict. A key aspect of a decisive study is to plan it with adequate power to detect an effect. As such, it is essential that proponents of a theory clearly specify the effect size of their predicted effect and that the decisive experiment equates the Type I and Type II error rates. With the common 5% Type I error, this means that a decisive experiment must have 95% power (1 − Type II error). Bayesian statistics does not provide a magical solution to the problem of too much sampling error in small samples.
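To see what equating Type I and Type II error rates implies for sample size, here is a sketch assuming a one-tailed two-sample t-test with equal group sizes; the required n per group grows rapidly as the predicted effect shrinks:

```python
import numpy as np
from scipy import stats

def power_at_n(d, n, alpha=0.05):
    """Power of a one-tailed two-sample t-test with n per group."""
    df = 2 * n - 2
    ncp = d * np.sqrt(n / 2)
    return 1 - stats.nct.cdf(stats.t.ppf(1 - alpha, df), df, ncp)

def n_for_power(d, target=0.95):
    """Smallest per-group n reaching the target power."""
    n = 2
    while power_at_n(d, n) < target:
        n += 1
    return n

# 95% power (Type II error = Type I error = 5%)
for d in (0.8, 0.5, 0.2):
    print(f"d = {d}: n = {n_for_power(d)} per group")
```

For d = .8 roughly 35 per group suffice, but a small effect of d = .2 requires over 500 per group, far beyond the 26 or 27 per condition in this study.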

Bayesian statisticians may ignore power analysis because it was developed in the context of null-hypothesis testing. However, Bayesian inferences are also influenced by sample size, and studies with small samples will often produce inconclusive results. Thus, it is more important for psychologists to change the way they collect data than the way they analyze them. It is time to allocate more resources to fewer studies with less sampling error rather than waste resources on many studies with large sampling error; or, as Cohen said: less is more.

“However, this does not mean that Bayesian statistics has magical powers. The reason was that the empirical data actually showed a strong effect in the opposite direction, in that participants in the no-eye-movement condition had better performance than in the horizontal-eye-movement condition (d = -.81). A Bayes Factor for a two-tailed hypothesis or the reverse hypothesis would not have favored the null-hypothesis.”

Right, a two-tailed test (or a theory positing negative effects) might not have favored the null so much. But the whole point of this study was to see if horizontal eye movement *improves* performance. It may very well make it worse, but these authors have not made that theoretical claim, only that it improves scores. So if we are comparing a theory that says, “horizontal eye movement improves scores” (which is what some people believe) against a theory saying, “it has no effect on scores” (which is what other people believe), then the data fit the latter more than the former. Which is a valid conclusion from the data.

Saying the null isn’t supported so much when we compare it with a different (post-hoc) theory isn’t exactly relevant because that’s not the theory people have claimed explains reality. Model comparison only makes sense if the models are theoretically relevant.

I think it is important that the observed means show a strong effect in the opposite direction.

This explains why the Bayes-Factor showed strong support for the null-hypothesis (d = 0) vs. the predicted alternative (d > 0) even in a small sample. If the means had turned out as predicted by the null-hypothesis camp, the Bayes-Factor would not have favored their hypothesis so strongly.

I think the strong difference in the opposite direction shows that something strange happened in this experiment that neither group expected. In this case, researchers should first figure out what is going on in their experiment rather than publish the results as evidence for or against a particular theory. But then, who has the time to conduct research that can actually settle scientific disputes? More research is needed, forever.

Your power calculations, I suspect, are meaningless given that the stopping rule was “continue sampling until the BF reaches a particular level.” You need to account for that in the power calculations, because had a given number of participants ended in equivocal evidence, more participants would have been run. The power is much higher than what you’re calculating.

Regarding Dr. R’s comment, “I think the strong difference in the opposite direction shows that something strange happened in this experiment that neither group expected”: sometimes unlikely things happen even when there’s no effect. You’ve got to be cautious about over-interpreting the occasional strange t value. Tests should be driven by predictions; if they’re not, you just end up chasing wiggles in the data.

Dear Dr. Morey,

Thank you for your comment. I am not an expert in Bayesian statistics, and I am curious how Bayesian statisticians deal with the rather vague criterion that the study will continue until the BF reaches a particular, presumably decisive, threshold (e.g., BF = 10). What should researchers do when this criterion is not reached after N = 100, N = 200, N = 300, or N = 1,000? Does Bayesian statistics come with some warning (analogous to power analysis) about what sample size may be needed to reach the criterion? My concern is merely that we may see more inconclusive studies because researchers assume that Bayesian statistics alone can solve their problems, when often the main problem is insufficient empirical evidence.

Hi Dr. R,

I would say that the criterion BF>X is not vague at all, as long as it is specified in advance which Bayes factor is meant. What could happen is that one runs out of time, money, or patience, and decides to stop on those grounds, but it will be clear that that is what happened, since the evidence will not have reached the criterion. If this happens, then the null and alternative can still be weighed and an effect size computed that represents the knowledge given the data at hand (taking into account the remaining plausibility of the null, and shrinking the estimate to 0). The critical thing is that the (Bayesian) inference doesn’t depend on the stopping rule.

Bayesian statistics can be used “pre-experimentally”; the main difference between a power analysis and a Bayesian pre-experimental analysis is that the Bayesian can use a whole prior distribution, while power analysis is stuck with a point. But there is no reason one could not use, for instance, a simulation to obtain the expected distribution of the size of the Bayes factor and use that as a guide for picking sample sizes.
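A toy version of such a simulation might look like the following. This is only a sketch: it uses a simple point alternative (d = .5) as the prediction rather than the default prior distribution used in the actual study, and the sample size and simulation count are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def bf_point(t_obs, df, ncp):
    """Bayes factor for a point alternative vs. the null, from a t statistic."""
    return stats.nct.pdf(t_obs, df, ncp) / stats.t.pdf(t_obs, df)

def simulate_bfs(d_true, d_alt, n, n_sims=2000):
    """Distribution of Bayes factors with n per group when the true effect is d_true."""
    df = 2 * n - 2
    ncp_alt = d_alt * np.sqrt(n / 2)
    bfs = []
    for _ in range(n_sims):
        x = rng.normal(d_true, 1, n)   # treatment group (true effect d_true)
        y = rng.normal(0, 1, n)        # control group
        t_obs, _ = stats.ttest_ind(x, y)
        bfs.append(bf_point(t_obs, df, ncp_alt))
    return np.array(bfs)

# chance of reaching BF > 10 with n = 100 per group if d = .5 is true
bfs = simulate_bfs(d_true=0.5, d_alt=0.5, n=100)
print(f"P(BF > 10) ≈ {np.mean(bfs > 10):.2f}")
```

A researcher could run this before data collection to see whether a planned sample size has a reasonable chance of producing decisive evidence, which is essentially the Bayesian analogue of a power analysis discussed above.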

With regards to “insufficient empirical evidence”, we need a way to *assess* insufficient empirical evidence. Leaving aside for now other issues (whether the experiment was done well, whether the research conclusions follow from the experiment; all these impact strength of empirical evidence, but are not the topic here), Bayesian statistics’ very purpose is the quantification of the strength of empirical evidence. No one claims that Bayesian statistics solves all problems, but it does help quantify evidence (in a limited statistical sense). One cannot infer from a power calculation anything about the evidence in a data set.

Thank you for your response. I am still wondering what Bayesian statisticians would tell a post-doc or assistant professor who needs to publish about the probability that they will get a publishable result before they run out of time. I am sure it can be done, and any valuable information about projected sample sizes would be helpful for researchers who are interested in using Bayesian statistics. For example, a researcher may predict an effect with a d-value of .3 to .5 and want to test this prediction against the hypothesis that the d-value is −infinity to .1. Any values in the range between .1 and .3 would be deemed untestable because they would require too many resources. What is the probability of getting a BF of 10 or more with N = 100, if the hypothesis about the true effect size is correct?
