Baby Einstein: The Numbers Do Not Add Up

A small literature suggests that babies can add and subtract. Wynn (1992) showed 5-month-olds a Mickey Mouse doll, covered this toy, and placed another doll behind the cover to imply addition (1 + 1 = 2). A second group of infants saw two Mickey Mouse dolls that were covered, and then one Mickey Mouse was removed (2 – 1 = 1). When the cover was removed, either 1 or 2 Mickeys were visible. Infants looked longer at the incongruent display, suggesting that they expected 2 Mickeys in the addition scenario and one Mickey in the subtraction scenario.

Both studies produced just significant results: Study 1, t(30) = 2.078, p = .046 (two-tailed); Study 2, t(14) = 1.795, p = .047 (one-tailed). Post-2011, such just significant results raise a red flag about replicability.
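The two p-values can be checked from the reported t-values and degrees of freedom. The following is a minimal sketch that integrates the t-density numerically with the Python standard library; the function names are my own, and a statistics package would normally be used instead.

```python
from math import gamma, sqrt, pi

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > |t|), using Simpson's rule on [0, |t|]; steps must be even."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The density is symmetric, so the area above 0 is 0.5.
    return 0.5 - total * h / 3

p_study1 = 2 * t_upper_tail(2.078, 30)  # two-tailed, Study 1
p_study2 = t_upper_tail(1.795, 14)      # one-tailed, Study 2
print(f"Study 1: p = {p_study1:.3f}; Study 2: p = {p_study2:.3f}")
```

Both values land just below .05, which is what makes them "just significant."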

This study produced a small literature that was meta-analyzed by Christodoulou, Lac, and Moore (2017). The headline finding was that a random-effects meta-analysis showed a significant effect, d = .34, “suggesting that the phenomenon Wynn originally reported is reliable.”

The problem with effect-size meta-analysis is that effect sizes are inflated when published results are selected for significance. Christodoulou et al. (2017) examined the presence of publication bias using a variety of statistical tests that produced inconsistent results. The Incredibility Index showed that there were just as many significant results (k = 12) as one would predict based on median observed power (k = 11). Trim-and-fill suggested some bias, but the corrected effect size estimate would still be significant, d = .24. However, PEESE showed significant evidence of publication bias, and no significant effect after correcting for bias.

Christodoulou et al. (2017) dismiss the results obtained with PEESE that would suggest the findings are not robust.

For instance, the PET-PEESE has been criticized on grounds that it severely penalizes samples with a small N (Cunningham & Baumeister, 2016), is inappropriate for syntheses involving a limited number of studies (Cunningham & Baumeister, 2016), is sometimes inferior in performance compared to estimation methods that do not correct for publication bias (Reed, Florax, & Poot, 2015), and is premised on acceptance of the assumption that large sample sizes confer unbiased effect size estimates (Inzlicht, Gervais, & Berkman, 2015). Each of the other four tests used have been criticized on various grounds as well (e.g., Cunningham & Baumeister, 2016)

These arguments are not very convincing. Studies with larger samples produce more robust results than studies with smaller samples. Thus, placing a greater emphasis on larger samples is justified by the smaller sampling error in these studies. In fact, random effects meta-analysis gives too much weight to small samples. It is also noteworthy that Baumeister and Inzlicht are not unbiased statisticians. Their work has been criticized as unreliable using PEESE and their responses are at least partially motivated by defending their work.

I will demonstrate that the PEESE results are credible and that the other methods failed to reveal publication bias because effect-size meta-analyses fail to reveal the selection bias in original articles. For example, Wynn’s (1992) seminal finding was only significant with a one-sided test. However, the meta-analysis used a two-sided p-value of .055, which was coded as a non-significant result. This is a coding mistake because the result was used to reject the null-hypothesis with a different alpha level. A follow-up study by McCrink and Wynn (2004) reported a significant interaction effect with opposite effects for addition and subtraction, p = .016. However, the meta-analysis coded addition and subtraction separately, which produced one significant, p = .01, and one non-significant result, p = .504. The coding by subgroups is common in meta-analysis to conduct moderator analyses. However, this practice mutes the selection bias, which makes it more difficult to detect selection bias. Thus, bias tests need to be applied to the focal tests that supported authors’ main conclusions.

I recoded all 12 articles that reported 14 independent tests of the hypothesis that babies can add and subtract. I found only two articles that reported a failure to reject the null-hypothesis. Wakeley, Rivera, and Langer’s (2000) article is a rare example of an article in a major journal that reported a series of failed replication studies before 2011. “Unlike Wynn, we found no systematic evidence of either imprecise or precise adding and subtracting in young infants” (p. 1525). Moore and Cocas (2006) published two studies. Study 2 reported a non-significant result with an effect in the opposite direction. They clearly stated that this result failed to replicate Wynn’s results. “This test failed to reveal a reliable difference between the two groups’ fixation preferences, t(87) = -1.31, p = .09.” However, they continued to examine the data with an Analysis of Variance that produced a significant four-way interaction, F(1, 85) = 4.80, p = .031. If this result had been used as the focal test, there would be only 2 non-significant results. However, I coded the study as reporting a non-significant result. Thus, the success rate across 14 studies in 12 articles is 11/14 = 78.6%. Without Wakeley et al.’s exceptional report of replication failures, the success rate would have been 93%, which is the norm in psychology publications (Sterling, 1959; Sterling et al., 1995).

The mean observed power of the 14 studies was MOP = 57%. The binomial probability of obtaining 11 or more significant results in 14 studies with 57% power is p = .080. This shows significant bias with the typical alpha level of .10 for bias tests, which is used because these tests have low power in small samples.
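The binomial test above is easy to reproduce. The sketch below assumes that mean observed power is treated as the per-study probability of a significant result and uses the rounded value of .57 from the text, so the third decimal may differ slightly from the reported p = .080.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

mean_observed_power = 0.57  # rounded value reported in the text
p_bias = binom_tail(11, 14, mean_observed_power)
print(f"P(11 or more of 14 significant | power = .57) = {p_bias:.3f}")
```

With the rounded input the probability is about .08, below the .10 criterion used for bias tests.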

I also developed a more powerful bias test, based on the replicability index (Schimmack, 2016), that corrects for the inflation in the estimate of observed mean power. Simulation studies show that this method has higher power, while maintaining good type-I error rates. To correct for inflation, I subtract the difference between the success rate and observed mean power from the observed mean power (simulation studies show that the mean is superior to the median that was used in the 2016 manuscript). This yields a value of .57 – (.79 – .57) = .35. The binomial probability of obtaining 11 out of 14 significant results with just 35% power is p = .001. These results confirm the results obtained with PEESE that publication bias contributes to the evidence in favor of babies’ math abilities.
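The inflation correction is simple arithmetic followed by the same binomial test. This is a minimal sketch of the calculation as described in the paragraph above, using the rounded inputs (.79 and .57); the function names are mine and are not taken from any published code.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def inflation_corrected_power(success_rate, mean_observed_power):
    """Subtract the inflation (success rate minus observed power) from observed power."""
    inflation = success_rate - mean_observed_power
    return mean_observed_power - inflation

corrected = inflation_corrected_power(0.79, 0.57)  # .57 - (.79 - .57) = .35
p_corrected = binom_tail(11, 14, corrected)
print(f"corrected power = {corrected:.2f}, p = {p_corrected:.3f}")
```

With a corrected power of .35, observing 11 or more successes in 14 studies has a probability of about .001.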

To examine the credibility of the published literature, I submitted the 11 significant results to a z-curve analysis (Brunner & Schimmack, 2019). The z-curve analysis also confirms the presence of publication bias. Whereas the observed discovery rate is 79%, 95%CI = 57% to 100%, the expected discovery rate is only 6%, 95%CI = 5% to 31%. As the confidence intervals do not overlap, the difference is statistically significant. The expected replication rate is 15%. Thus, if the 11 studies could be replicated exactly, only 2 rather than 11 are expected to be significant again. The 95%CI included a value of 5%, which means that all studies could be false positives. This shows that the published studies do not provide empirical evidence to reject the null-hypothesis that babies cannot add or subtract.

Meta-analyses also have another drawback. They focus on results that are common across studies. However, subsequent studies are not mere replication studies. Several studies in this literature examined whether the effect is an artifact of the experimental procedure and showed that performance is altered by changing the experimental setup. These studies first replicate the original finding and then show that the effect can be attributed to other factors. Given the low power to replicate the effect, it is not clear how credible this evidence is. However, it does show that even if the effect were robust, it does not warrant the conclusion that infants can do math.


The problems with bias tests in standard meta-analysis are by no means unique to this article. It is well known that original articles publish nearly exclusively confirmatory evidence with success rates over 90%. However, meta-analyses often include a much larger number of non-significant results. This paradox is explained by the coding of original studies that produces non-significant results that were either not published or not the focus of an original article. This coding practice mutes the signal and makes it difficult to detect publication bias. This does not mean that the bias has disappeared. Thus, most published meta-analyses are useless because effect sizes are inflated to an unknown degree by selection for significance in the primary literature.

Francis's Audit of Multiple-Study Articles in Psychological Science in 2009-2012

Citation: Francis, G. (2014). The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180–1187. DOI 10.3758/s13423-014-0601-x


The Open Science Collaboration article in Science has over 1,000 citations (OSC, 2015). It showed that attempting to replicate results published in 2008 in three journals, including Psychological Science, produced more failures than successes (37% success rate). It also showed that failures outnumbered successes 3:1 in social psychology. It did not show or explain why most social psychological studies failed to replicate.

Since 2015, numerous explanations have been offered for the discovery that most published results in social psychology cannot be replicated: decline effect (Schooler), regression to the mean (Fiedler), incompetent replicators (Gilbert), sabotaging replication studies (Strack), contextual sensitivity (Van Bavel). Although these explanations are different, they share two common elements: (a) they are not supported by evidence, and (b) they are false.

A number of articles have proposed that the low replicability of results in social psychology is caused by questionable research practices (John et al., 2012). Accordingly, social psychologists often investigate small effects in between-subject experiments with small samples that have large sampling error. A low signal-to-noise ratio (effect size/sampling error) implies that these studies have a low probability of producing a significant result (i.e., low power and high type-II error probability). To boost power, researchers use a number of questionable research practices that inflate effect sizes. Thus, the published results provide the false impression that effect sizes are large and results are replicated, but actual replication attempts show that the effect sizes were inflated. The replicability project suggested that effect sizes are inflated by 100% (OSC, 2015).

In an important article, Francis (2014) provided clear evidence for the widespread use of questionable research practices for articles published from 2009-2012 (pre-crisis) in the journal Psychological Science. However, because this evidence does not fit the narrative that social psychology was a normal and honest science, this article is often omitted from review articles, like Nelson et al.’s (2018) ‘Psychology’s Renaissance’ that claims social psychologists never omitted non-significant results from publications (cf. Schimmack, 2019). Omitting disconfirming evidence from literature reviews is just another sign of questionable research practices that prioritize self-interest over truth. Given the influence that Annual Review articles hold, many readers may be unfamiliar with Francis’s important article that shows why replication attempts of articles published in Psychological Science often fail.

Francis (2014) “The frequency of excess success for articles in Psychological Science”

Francis (2014) used a statistical test to examine whether researchers used questionable research practices (QRPs). The test relies on the observation that the success rate (percentage of significant results) should match the mean power of studies in the long run (Brunner & Schimmack, 2019; Ioannidis & Trikalinos, 2007; Schimmack, 2012; Sterling et al., 1995). Statistical tests rely on the observed or post-hoc power as an estimate of true power. Thus, mean observed power is an estimate of the expected success rate that can be compared to the actual success rate in an article.

It has been known for a long time that the actual success rate in psychology articles is surprisingly high (Sterling et al., 1995). The success rate for multiple-study articles is often 100%. That is, psychologists rarely report studies where they made a prediction and the study returns a non-significant result. Some social psychologists even explicitly stated that it is common practice not to report these ‘uninformative’ studies (cf. Schimmack, 2019).

A success rate of 100% implies that studies required 99.9999% power (power is never 100%) to produce this result. It is unlikely that many studies published in Psychological Science have the high signal-to-noise ratios to justify these success rates. Indeed, when Francis applied his bias detection method to 44 articles that had sufficient results to use it, he found that 82% (36 out of 44) of these articles showed positive signs that questionable research practices were used with a 10% error rate. That is, his method would be expected to flag only about 4 or 5 articles by chance alone, but it flagged 36, indicating the use of questionable research practices. Moreover, this does not mean that the remaining 8 articles did not use questionable research practices. With only four studies, the test has modest power to detect questionable research practices when the bias is relatively small. Thus, the main conclusion is that most if not all multiple-study articles published in Psychological Science used questionable research practices to inflate effect sizes. As these inflated effect sizes cannot be reproduced, the effect sizes in replication studies will be lower and the signal-to-noise ratio will be smaller, producing non-significant results. It has been known since 1959 that this could happen (Sterling, 1959). However, the replicability project showed that it does happen (OSC, 2015) and Francis (2014) showed that excessive use of questionable research practices provides a plausible explanation for these replication failures. No review of the replication crisis is complete and honest without mentioning this fact.
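The logic of this kind of excess-success test can be illustrated with a few lines of arithmetic. This is a sketch of the general idea only, not Francis's exact procedure; the 50% power value and the four-study article are hypothetical illustration values.

```python
from math import comb, prod

def prob_all_significant(powers):
    """Probability that every study in a set is significant, given each study's power."""
    return prod(powers)

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical article: four studies, each with 50% power, all reported as significant.
p_all = prob_all_significant([0.5] * 4)  # 0.0625, below the .10 criterion

# With a 10% error rate, how many of 44 honest articles would be flagged by chance?
expected_false_alarms = 44 * 0.10
# Probability that chance alone flags 36 or more of 44 articles.
p_36_or_more = binom_tail(36, 44, 0.10)

print(f"P(4 of 4 significant | power = .50) = {p_all:.4f}")
print(f"expected flags by chance = {expected_false_alarms:.1f}")
print(f"P(36 or more flags by chance) = {p_36_or_more:.1e}")
```

Even a modest run of four successes is improbable with 50% power, and 36 flagged articles out of 44 is far beyond what a 10% error rate can produce.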

Limitations and Extension

One limitation of Francis’s approach and similar approaches like my Incredibility Index (Schimmack, 2012) is that p-values are based on two pieces of information, the effect size and sampling error (signal/noise ratio). This means that these tests can provide evidence for the use of questionable research practices when the number of studies is large and the effect size is small. It is well-known that p-values are more informative when they are accompanied by information about effect sizes. That is, it is not only important to know that questionable research practices were used, but also how much these questionable practices inflated effect sizes. Knowledge about the amount of inflation would also make it possible to estimate the true power of studies and use it as a predictor of the success rate in actual replication studies. Jerry Brunner and I have been working on a statistical method that is able to do this, called z-curve, and we validated the method with simulation studies (Brunner & Schimmack, 2019).

I coded the 195 studies in the 44 articles analyzed by Francis and subjected the results to a z-curve analysis. The results are shocking and much worse than the results for the studies in the replicability project that produced an expected replication rate of 61%. In contrast, the expected replication rate for multiple-study articles in Psychological Science is only 16%. Moreover, given the fairly large number of studies, the 95% confidence interval around this estimate is relatively narrow and includes 5% (chance level) and a maximum of 25%.

There is also clear evidence that QRPs were used in many, if not all, articles. Visual inspection shows a steep drop at the level of significance, and the only results that are not significant with p < .05 are results that are marginally significant with p < .10. Thus, the observed discovery rate of 93% is an underestimation and the articles claimed an amazing success rate of 100%.

Correcting for bias, the expected discovery rate is only 6%, which is just shy of 5%, which would imply that all published results are false positives. The upper limit for the 95% confidence interval around this estimate is 14%, which would imply that for every published significant result there are 6 studies with non-significant results if file-drawering were the only QRP that was used. Thus, we see not only that most articles reported results that were obtained with QRPs, we also see that massive use of QRPs was needed because many studies had very low power to produce significant results without QRPs.
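The file-drawer ratio follows directly from the discovery rate. As a minimal sketch, assuming that unreported non-significant studies are the only source of bias, a discovery rate of EDR implies (1 – EDR) / EDR non-significant results for every significant one (the function name is mine):

```python
def file_drawer_ratio(edr):
    """Non-significant studies per significant study implied by a discovery rate."""
    return (1 - edr) / edr

for edr in (0.14, 0.06):
    print(f"EDR = {edr:.0%}: {file_drawer_ratio(edr):.1f} non-significant per significant result")
```

At the upper limit of 14% the ratio is about 6 to 1; at the point estimate of 6% it is closer to 16 to 1.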


Social psychologists have used QRPs to produce impressive results that suggest all studies that tested a theory confirmed predictions. These results are not real. Like a magic show they give the impression that something amazing happened, when it is all smoke and mirrors. In reality, social psychologists never tested their theories because they simply failed to report results when the data did not support their predictions. This is not science. The 2010s have revealed that social psychological results in journals and textbooks cannot be trusted and that influential results cannot be replicated when the data are allowed to speak. Thus, for the most part, social psychology has not been an empirical science that used the scientific method to test and refine theories based on empirical evidence. The major discovery in the 2010s was to reveal this fact, and Francis’s analysis provided valuable evidence to reveal this fact. However, most social psychologists preferred to ignore this evidence. As Popper pointed out, this makes them truly ignorant, which he defined as “the unwillingness to acquire knowledge.” Unfortunately, even social psychologists who are trying to improve the field wilfully ignore Francis’s evidence that makes replication failures predictable and undermines the value of actual replication studies. Given the extent of QRPs, a more rational approach would be to dismiss all evidence that was published before 2012 and to invest resources in new research with open science practices. Actual replication failures were needed to confirm predictions made by bias tests that old studies cannot be trusted. The next decade should focus on using open science practices to produce robust and replicable findings that can provide the foundation for theories.

When DataColada kissed Fiske's ass to publish in Annual Review of Psychology

One of the worst articles about the decade of replication failures is the “Psychology’s Renaissance” article by the datacolada team (Leif Nelson, Joseph Simmons, & Uri Simonsohn).

This is not your typical Annual Review article that aims to give an overview of developments in the field. It is an opinion piece filled with bold claims that lack empirical evidence.

The worst claim is that p-hacking is so powerful that pretty much every study can be made to work.

Experiments that work are sent to a journal, whereas experiments that fail are sent to the file drawer (Rosenthal 1979). We believe that this “file-drawer explanation” is incorrect. Most failed studies are not missing. They are published in our journals, masquerading as successes.

We can all see that not publishing failed studies is a bit problematic. Even Bem’s famous manual for p-hackers warned that it is unethical to hide contradictory evidence. “The integrity of the scientific enterprise requires the reporting of disconfirming results” (Bem). Thus, the idea that researchers are sitting on a pile of failed studies that they failed to disclose makes psychologists look bad and we can’t have that in Fiske’s Annual Review of Psychology journal. Thus, psychologists must have been doing something that is not dishonest and can be sold as normal science.

“P-hacking is the only honest and practical way to consistently get underpowered studies to be statistically significant. Researchers did not learn from experience to increase their sample sizes precisely because their underpowered studies were not failing.” (p. 515).

This is utter nonsense. First, researchers have file-drawers of studies that did not work. Just ask them and they may tell you that they do.

“We did run multiple studies, some of which did not work, and some of which worked better than others. You may think that not reporting the less successful studies is wrong, but that is how the field works.” (Roy Baumeister, personal email communication)

Leading social psychologists Gilbert and Wilson provide an even more detailed account of their research practices that produce many non-significant results that are not reported (a.k.a. a file drawer). Their account has been preserved thanks to Greg Francis.

First, it’s important to be clear about what “publication bias” means. It doesn’t mean that anyone did anything wrong, improper, misleading, unethical, inappropriate, or illegal. Rather it refers to the well known fact that scientists in every field publish studies whose results tell them something interesting about the world, and don’t publish studies whose results tell them nothing. Francis uses sophisticated statistical tools to discover what everyone already knew—and what he could easily have discovered simply by asking us. Yes, of course we ran some studies on “consuming experience” that failed to show interesting effects and are not reported in our JESP paper. Let us be clear: We did not run the same study over and over again until it yielded significant results and then report only the study that “worked.” Doing so would be clearly unethical. Instead, like most researchers who are developing new methods, we did some preliminary studies that used different stimuli and different procedures and that showed no interesting effects. Why didn’t these studies show interesting effects? We’ll never know. Failed studies are often (though not always) inconclusive, which is why they are often (but not always) unpublishable. So yes, we had to mess around for a while to establish a paradigm that was sensitive and powerful enough to observe the effects that we had hypothesized. In one study we might have used foods that didn’t differ sufficiently in quality, in another we might have made the metronome tick too fast for people to chew along. Exactly how good a potato chip should be and exactly how fast a person can chew it are the kinds of mundane things that scientists have to figure out in preliminary testing, and they are the kinds of mundane things that scientists do not normally report in journals (but that they informally share with other scientists who work on similar phenomenon). 
Looking back at our old data files, it appears that in some cases we went hunting for potentially interesting mediators of our effect (i.e., variables that might make it larger or smaller) and although we replicated the effect, we didn’t succeed in making it larger or smaller. We don’t know why, which is why we don’t describe these blind alleys in our paper. All of this is the hum-drum ordinary stuff of day-to-day science.

Aside from this anecdotal evidence, the datacolada crew actually had access to empirical evidence in an article that they cite, but maybe never read. An important article in the 2010s reported a survey of research practices (John, Loewenstein, & Prelec, 2012). The survey asked about several questionable research practices, including not reporting entire studies that failed to support the main hypothesis.

Not reporting studies that “did not work” was the third most frequently used QRP. Unfortunately, this result contradicts datacolada’s claim that there are no studies in file-drawers and so they ignore this inconvenient empirical fact to tell their fairy tale of honest p-hackers that didn’t know better until 2011 when they published their famous “False Positive Psychology” article.

This is a cute story that isn’t supported by evidence, but that has never stopped psychologists from writing articles that advance their own career. The beauty of review articles is that you don’t even have to p-hack data. You just pick and choose citations or make claims without evidence. As long as the editor (Fiske) likes what you have to say, it will be published. Welcome to psychology’s renaissance; same bullshit as always.

Statistics Wars: Don't change alpha. Change the null-hypothesis!

The statistics wars go back all the way to Fisher, Pearson, and Neyman-Pearson(Jr), and there is no end in sight. I have no illusion that I will be able to end these debates, but at least I can offer a fresh perspective. Lately, statisticians and empirical researchers like me who dabble in statistics have been debating whether p-values should be banned and if they are not banned outright whether they should be compared to a criterion value of .05 or .005 or be chosen on an individual basis. Others have advocated the use of Bayes-Factors.

However, most of these proposals have focused on the traditional approach to test the null-hypothesis that the effect size is zero. Cohen (1994) called this the nil-hypothesis to emphasize that this is only one of many ways to specify the hypothesis that is to be rejected in order to provide evidence for a hypothesis.

For example, a nil-hypothesis is that the difference in the average height of men and women is exactly zero. Many statisticians have pointed out that a precise null-hypothesis is often wrong a priori and that little information is provided by rejecting it. The only way to make nil-hypothesis testing meaningful is to think about the nil-hypothesis as a boundary value that distinguishes two opposing hypotheses. One hypothesis is that men are taller than women and the other is that women are taller than men. When data allow rejecting the nil-hypothesis, the direction of the mean difference in the sample makes it possible to reject one of the two directional hypotheses. That is, if the sample mean height of men is higher than the sample mean height of women, the hypothesis that women are taller than men can be rejected.

However, the use of the nil-hypothesis as a boundary value does not solve another problem of nil-hypothesis testing. Namely, specifying the null-hypothesis as a point value makes it impossible to find evidence for it. That is, we could never show that men and women have the same height or the same intelligence or the same life-satisfaction. The reason is that the population difference will always be different from zero, even if this difference is too small to be practically meaningful. A related problem is that rejecting the nil-hypothesis provides no information about effect sizes. A significant result can be obtained with a large effect size and with a small effect size.

In conclusion, nil-hypothesis testing has a number of problems, and many criticisms of null-hypothesis testing are really criticisms of nil-hypothesis testing. A simple solution to the problem of nil-hypothesis testing is to change the null-hypothesis by specifying a minimal effect size that makes a finding theoretically or practically useful. Although this effect size can vary from research question to research question, Cohen’s criteria for standardized effect sizes can give some guidance about reasonable values for a minimal effect size. Using the example of mean differences, Cohen considered an effect size of d = .2 small, but meaningful. So, it makes sense to set a criterion for a minimum effect size somewhere between 0 and .2, and d = .1 seems a reasonable value.

We can even apply this criterion retrospectively to published studies with some interesting implications for the interpretation of published results. Shifting the null-hypothesis from d = 0 to abs(d) < .1, we are essentially raising the criterion value that a test statistic has to meet in order to be significant. Let me illustrate this first with a simple one-sample t-test with N = 100.

Conveniently, the sampling error for N = 100 is 1/sqrt(100) = .1. To achieve significance with alpha = .05 (two-tailed) and H0: d = 0, the test statistic has to be greater than t.crit = 1.98. However, if we change H0 to abs(d) < .1, the t-distribution is now centered at the t-value that is expected for an effect size of d = .1. The criterion value to get significance is now t.crit = 3.01. Thus, some published results that were able to reject the nil-hypothesis would be non-significant when the null-hypothesis specifies a range of values between d = -.1 and .1.
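The shift in the criterion value can be approximated without a non-central t-distribution. The sketch below uses a normal approximation (my simplification, not the calculation behind the values in the text): the critical value is the usual z-criterion plus the expected test statistic at the boundary effect size, d × sqrt(N). This is only reasonable for large N and slightly understates the exact t-based values.

```python
from math import sqrt
from statistics import NormalDist

def shifted_critical_value(d_min, n, alpha=0.05):
    """Approximate critical value for H0: abs(d) < d_min in a one-sample design.

    Normal approximation: central critical z plus the non-centrality d_min * sqrt(n).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit + d_min * sqrt(n)

for n in (100, 1000):
    print(f"N = {n}: approximate criterion = {shifted_critical_value(0.1, n):.2f}")
```

For N = 100 the approximation gives about 2.96, close to the exact t-based value of 3.01.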

If the null-hypothesis is specified in terms of standardized effect sizes, the critical values vary as a function of sample size. For example, with N = 10 the critical t-value is 2.67, with N = 100 it is 3.01, and with N = 1,000 it is 5.14. An alternative approach is to specify H0 in terms of a fixed test statistic which implies different effect sizes for the boundary value. For example, with t = 2.5, the effect sizes would be d = .06 with N = 10, d = .05 with N = 100, and d = .02 with N = 1000. This makes sense because researchers should use larger samples to test weaker effects. The example also shows that a t-value of 2.5 specifies a very narrow range of values around zero. However, the example was based on one-sample t-tests. For the typical comparison of two groups, a criterion value of 2.5 corresponds to an effect size of d = .1 with N = 100. So, while t = 2.5 is arbitrary, it is a meaningful value to test for statistical significance. With N = 100, t(98) = 2.5 corresponds to an alpha criterion of .014, which is a bit more stringent than .05, but not as strict as a criterion value of .005. With N = 100, alpha = .005 corresponds to a criterion value of t.crit = 2.87, which implies a boundary value of d = .17.

In conclusion, statistical significance depends on the specification of the null-hypothesis. While it is common to specify the null-hypothesis as an effect size of zero, this is neither necessary, nor ideal. An alternative approach is to (re)specify the null-hypothesis in terms of a minimum effect size that makes a finding theoretically interesting or practically important. If the population effect size is below this value, the results could also be used to show that a hypothesis is false. Examination of various effect sizes shows that criterion values in the range between 2 and 3 can be used to define reasonable boundary values that vary around a value of d = .1.

The problem with t-distributions is that they differ as a function of the degrees of freedom. To create a common metric it is possible to convert t-values into p-values and then to convert the p-values into z-scores. A z-score of 2.5 corresponds to a p-value of .01 (exact .0124) and an effect size of d = .13 with N = 100 in a between-subject design. This seems to be a reasonable criterion value to evaluate statistical significance when the null-hypothesis is defined as a range of smallish values around zero and alpha is .05.
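The conversion between two-sided p-values and absolute z-scores needs only the standard normal distribution. A minimal sketch with the Python standard library (the function names are my own):

```python
from statistics import NormalDist

_std_normal = NormalDist()

def z_to_two_sided_p(z):
    """Two-sided p-value for an absolute z-score."""
    return 2 * (1 - _std_normal.cdf(abs(z)))

def two_sided_p_to_z(p):
    """Absolute z-score that corresponds to a two-sided p-value."""
    return _std_normal.inv_cdf(1 - p / 2)

print(f"z = 2.5 -> p = {z_to_two_sided_p(2.5):.4f}")
print(f"p = .05 -> z = {two_sided_p_to_z(0.05):.2f}")
```

The first line reproduces the exact value of .0124 for z = 2.5; the second recovers the familiar criterion of 1.96 for alpha = .05.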

Shifting the significance criterion in this way can dramatically change the evaluation of published results, especially results that are just significant, p < .05 & p > .01. There have been concerns that many of these results have been obtained with questionable research practices that were used to reject the nil-hypothesis. However, these results would not be strong enough to reject the modified hypothesis that the population effect size exceeds a minimum value of theoretical or practical significance. Thus, no debates about the use of questionable research practices are needed. There is also no need to reduce the type-I error rate at the expense of increasing the type-II error rate. It can be simply noted that the evidence is insufficient to reject the hypothesis that the effect size is greater than zero but too small to be important. This would shift any debates towards discussion about effect sizes and proponents of theories would have to make clear which effect sizes they consider to be theoretically important. I believe that this would be more productive than quibbling over alpha levels.

To demonstrate the implications of redefining the null-hypothesis, I use the results of the replicability project (Open Science Collaboration, 2015). The first z-curve shows the traditional analysis for the nil-hypothesis and alpha = .05, which has z = 1.96 as the criterion value for statistical significance (red vertical line).

Figure 1 shows that 86 out of 90 studies reported a test-statistic that exceeded the criterion value of 1.96 for H0: d = 0, alpha = .05 (two-tailed). The other four studies met the criterion for marginal significance (alpha = .10, two-tailed or .05 one-tailed). The figure also shows that the distribution of observed z-scores is not consistent with random sampling error alone: there is a steep drop at z = 1.96. A comparison of the observed discovery rate (86/90, 96%) and the expected discovery rate (43%) shows evidence that the published results are selected from a larger set of studies/tests with non-significant results. Even the upper limit of the confidence interval around this estimate (71%) is well below the observed discovery rate, showing evidence of publication bias. Z-curve estimates that only 60% of the published results would reproduce a significant result in an actual replication attempt. The actual success rate for these studies was 39%.

Results look different when the null-hypothesis is changed to a range of effect sizes around zero that corresponds to a criterion value of z = 2.5. In addition to shifting the significance criterion, z-curve is now fitted only to studies that produced z-scores greater than 2.5. As questionable research practices have a particularly strong effect on the distribution of just significant results, the new estimates are less influenced by these practices.

Figure 2 shows the results. Most important, the observed discovery rate dropped from 96% to 61%, indicating that many of the original results provided just enough evidence to reject the nil-hypothesis, but not enough evidence to rule out even small effect sizes. The observed discovery rate is also more in line with the expected discovery rate. Thus, some of the missing non-significant results may have been published as just significant results. This is also implied by the greater frequency of results with z-scores between 2 and 2.5 than the model predicts (grey curve). However, the expected replication rate of 63% is still much higher than the actual replication rate with a criterion value of 2.5 (33%). Thus, other factors may contribute to the low success rate in the actual replication studies of the replicability project.


In conclusion, statisticians have been arguing about p-values, significance levels, and Bayes-Factors. Proponents of Bayes-Factors have argued that their approach is supreme because Bayes-Factors can provide evidence for the null-hypothesis. I argue that this is wrong because it is theoretically impossible to demonstrate that a population effect size is exactly zero or any other specific value. A better solution is to specify the null-hypothesis as a range of values that are too small to be meaningful. This makes it theoretically possible to demonstrate that a population effect size is above or below the boundary value. This approach can also be applied retrospectively to published studies. I illustrate this by defining the null-hypothesis as the region of effect sizes that is defined by the effect size that corresponds to a z-score of 2.5. While a z-score of 2.5 corresponds to p = .01 (two-tailed) for the nil-hypothesis, I use this criterion value to maintain an error rate of 5% and to change the null-hypothesis to a range of values around zero that becomes smaller as sample sizes increase.

As p-hacking is often used to just reject the nil-hypothesis, changing the null-hypothesis to a range of values around zero makes many ‘significant’ results non-significant. That is, the evidence is too weak to exclude even trivial effect sizes. This does not mean that the hypothesis is wrong or that the original authors p-hacked their data. However, it does mean that they can no longer point to their original results as empirical evidence. Rather, they have to conduct new studies with larger samples to demonstrate that they can reject the new null-hypothesis and that the predicted effect meets some minimal standard of practical or theoretical significance. With a clear criterion value for significance, authors also risk obtaining evidence that positively contradicts their predictions. Thus, the biggest improvement that arises from rethinking null-hypothesis testing is that authors have to specify effect sizes a priori and that studies can provide evidence for or against a theory. In contrast, computing Bayes-Factors in favor of the nil-hypothesis fails to achieve this goal because the nil-hypothesis is always wrong. The real question is only how wrong.

Bayes-Factors in Favor of the Nil-Hypothesis are Meaningless

Zoltan Dienes just published an article in the journal that is supposed to save psychological science: Advances in Methods and Practices in Psychological Science. It is a tutorial about Bayes-Factors, which are advocated by Dienes and others as a solution to alleged problems with null-hypothesis significance testing.

The advantage of Bayes-Factors is supposed to be their ability to provide evidence for the null-hypothesis, while NHST is supposed to be one-sided and able only to reject the null-hypothesis. The claim is that this led to the problem that authors only published articles with p-values less than .05.

“Significance testing is a tool that is commonly used for this purpose; however, nonsignificance is not itself evidence that something does not exist. On the other hand, a Bayes factor can provide a measure of evidence for a model of something existing versus a model of it not existing”

The problem with this attack on NHST is that it is false. The main reason why NHST is unable to provide evidence for the non-existence of an effect is that it is logically impossible to show with empirical data that something does not exist or that the difference between two populations is exactly zero. For this reason, it has been pointed out again and again that it is silly to test the nil-hypothesis that there is no effect or that a mean difference or correlation is exactly zero.

This does not mean that it is impossible to provide evidence for the absence of an effect. The solution is simply to specify a range of values that are sufficiently small to consider these differences negligible. Once the null-hypothesis is specified as a region of values, it becomes empirically testable with NHST or with Bayesian methods. However, neither NHST nor Bayesian methods can provide evidence for a point hypothesis, and the idea that Bayes-Factors can be used to do so is an illusion.

The real problem for demonstrations of the absence of an effect is that small samples with between-subject designs produce large regions of plausible values because small samples have large sampling errors. As a result, the mean differences or correlations move around considerably and it is difficult to say something about the effect size in the population. The population effect size may be within a region around zero (H0) or outside this region (H1), and the data cannot tell us which.

Let’s illustrate this with Dienes’ first example. “Theory A claims that autistic subjects will perform worse on a novel task than control subjects will. Theory B claims that the two groups will perform the same.” A researcher tests these two “theories” in a study with 30 participants in each group.

The statistical results that serve as the input for NHST or Bayesian statistics are the effect size, the sampling error, and the degrees of freedom.

The autistic group had a score of 8 percent with a sampling error of 6 percentage points. The 95%CI ranges from -4 to 20.

The control group had a score of 10 percent with a sampling error of 5 percentage points. The 95%CI ranges from 0 to 20.

Evidently, the confidence intervals overlap, but they also allow for large differences between the two populations from which these small samples were recruited.

A comparison of the two groups yields a standardized effect size of d = .05, se = .2, t = .05/.20 = 0.25. The 95%CI for the standardized mean difference between the two groups ranges from d = -.35 to .45, and includes values for a small negative (d = -.2) or a small positive effect (d = .2).
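The arithmetic behind these intervals can be retraced with a short Python sketch. It is only an illustration: the variable names are invented, and a rounded critical value of 2.0 is assumed in place of an exact t-quantile, which is what the intervals in the text imply. The last lines also compute the t-value directly from the two group scores, under the assumption that the groups are independent.

```python
import math

# Values from the example (percent scores); names are illustrative only.
mean_autistic, se_autistic = 8.0, 6.0
mean_control, se_control = 10.0, 5.0
CRIT = 2.0  # assumed rounded 95% critical value, not an exact t-quantile


def ci(estimate, se, crit=CRIT):
    """Symmetric confidence interval: estimate +/- crit * se."""
    return (estimate - crit * se, estimate + crit * se)


ci_autistic = ci(mean_autistic, se_autistic)
ci_control = ci(mean_control, se_control)
print(ci_autistic, ci_control)  # (-4.0, 20.0) (0.0, 20.0)

# Standardized effect size and sampling error as rounded in the text
d, se_d = 0.05, 0.20
ci_d = ci(d, se_d)
print(tuple(round(x, 2) for x in ci_d))  # (-0.35, 0.45)

# Cross-check: t-value from the raw scores, assuming independent groups
se_diff = math.sqrt(se_autistic ** 2 + se_control ** 2)
t_raw = (mean_control - mean_autistic) / se_diff
print(round(t_raw, 2))  # 0.26
```

The t-value from the raw scores is about 0.26, in line with the rounded value of 0.25 used in the text.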

Nevertheless, the default prior that is advocated by Wagenmakers and Rouder yields a Bayes-Factor of 0.27, which is below the arbitrary and low criterion of 1/3 that is used to claim that the data favor the model that claims there is absolutely no performance difference. It is hard to reconcile this claim with the 95%CI that allows for values as large as d = .4. However, to maintain the illusion that Bayes-Factors can miraculously provide evidence for the nil-hypothesis, Bayesian propaganda claims that confidence intervals are misleading. Even if we do not trust confidence intervals, we can ask how a study with a sampling error (se = .2) that is four times as large as the effect size (d = .05) can assure us that the true population effect size is 0. It cannot.

A standard NHST analysis produces an unimpressive p-value of .80. Everybody knows that this p-value cannot be used to claim that there is no effect, but few people know why this p-value is uninformative. First, it is uninformative because it uses d = 0 as the null-hypothesis. We can never prove that this hypothesis is true. However, we could set d = .2 as the lowest effect size that we consider a meaningful difference. Thus, we can compute the t-value for a one-sided test of whether the observed value of d = .05 is significantly below d = .2. This is standard NHST. We may also recognize that the sample size is rather small, adjust our alpha criterion accordingly, and allow for a 20% chance of falsely rejecting the null-hypothesis that the effect size is d = .2 or larger. As we are only expecting worse performance, this is a one-sided test.

pt(.05/.20, 58, .20/.20) gives us a p-value of .226 (with 30 participants in each group, the degrees of freedom are 58). This is still not good enough to reject the null-hypothesis that the true performance difference in the population is d = .2 or larger, even with the lenient alpha of .20. The problem is that the study with 30 participants per group in a between-subject design simply has too much sampling error to draw inferences about the population.
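For readers without R, the logic of the `pt` call in the preceding paragraph can be approximated in Python with the standard library alone. Note the assumption: this sketch replaces the non-central t-distribution with a normal distribution centered on the non-centrality parameter, which is only a rough stand-in, so the result differs slightly in the third decimal. The variable names are illustrative.

```python
from statistics import NormalDist

d_observed = 0.05  # observed standardized difference
d_boundary = 0.20  # smallest effect size considered meaningful
se = 0.20          # sampling error of d

t_observed = d_observed / se  # 0.25
ncp = d_boundary / se         # non-centrality parameter, 1.0

# Assumption: normal approximation to the non-central t-distribution.
# Probability of a t-value this low or lower if the true effect is d = .2
p_below = NormalDist(mu=ncp, sigma=1.0).cdf(t_observed)
print(round(p_below, 3))  # 0.227

alpha = 0.20
print(p_below < alpha)  # False: H0 (d >= .2) cannot be rejected
```

The approximation gives about .227, close to the .226 from the non-central t-distribution, and leads to the same decision.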

Thus, there are two reasons why psychologists rarely provide evidence for the absence of an effect. First, they always specify the null-hypothesis as a point value. This makes it impossible to provide evidence for the null-hypothesis. Second, the sampling error is typically too large to draw firm conclusions about the absence of an effect. What is the solution to improve psychological science? Theories need to be specified with some minimal effect size. For example, tests of ego-depletion, facial feedback, or terror management (to name just a few) need to make explicit predictions about effect sizes. If even small effects are considered theoretically meaningful, studies that aim to demonstrate these effects need to be powered accordingly. For example, to test an effect of d = .2 with an 80% chance of a successful outcome, if the theory is right, requires N = 788 participants. If this study were to produce a non-significant result, one would also be justified in inferring that the population effect size is trivial (d < .20) with an error probability of 20%. So, true tests of theories require specification of a boundary effect size that distinguishes meaningful effects from negligible ones. And theorists who claim that their theory is meaningful even if effect sizes are small (e.g., Greenwald’s predictive validity of IAT scores) have to pay the price and conduct studies that can detect these effects.
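The sample size requirement can be checked roughly with the textbook normal-approximation formula for a two-sided comparison of two independent groups. This is a sketch under that assumption, with a hypothetical function name; an exact calculation based on the t-distribution needs slightly more participants, which presumably accounts for the small gap to the N = 788 given above.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test.

    Uses the normal approximation n = 2 * ((z_alpha + z_power) / d)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


n = n_per_group(0.20)
total_n = 2 * n
print(n, total_n)
```

The approximation lands within a few participants of 788. The same function shows how quickly the requirement falls as the boundary effect size grows, for example with `n_per_group(0.50)`.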

In conclusion, how do we advance psychological science? With better data. Statisticians are getting paid for publishing statistics articles. They have been unhelpful in advancing statistics for the past one-hundred years in their in-fighting about finding the right statistical tool for inconclusive data (between-subject N = 30). Let them keep fighting, but let’s ignore them. We will only make progress by reducing sampling error so that we can see signals or the absence of signals clearly. And the only statistician you need to read is Jacob Cohen. The real enemy is not NHST or p-values, but sampling error.