# Statistics Wars: Don’t change alpha. Change the null-hypothesis!

The statistics wars go back all the way to Fisher, Pearson, and Neyman-Pearson (Jr.), and there is no end in sight. I have no illusion that I will be able to end these debates, but at least I can offer a fresh perspective. Lately, statisticians and empirical researchers like me who dabble in statistics have been debating whether p-values should be banned and, if they are not banned outright, whether they should be compared to a criterion value of .05 or .005 or to a value chosen on an individual basis. Others have advocated the use of Bayes-Factors.

However, most of these proposals have focused on the traditional approach to test the null-hypothesis that the effect size is zero. Cohen (1994) called this the nil-hypothesis to emphasize that this is only one of many ways to specify the hypothesis that is to be rejected in order to provide evidence for a hypothesis.

For example, a nil-hypothesis is that the difference in the average height of men and women is exactly zero. Many statisticians have pointed out that a precise null-hypothesis is often wrong a priori and that little information is provided by rejecting it. The only way to make nil-hypothesis testing meaningful is to think about the nil-hypothesis as a boundary value that distinguishes two opposing hypotheses. One hypothesis is that men are taller than women and the other is that women are taller than men. When data allow rejecting the nil-hypothesis, the direction of the mean difference in the sample makes it possible to reject one of the two directional hypotheses. That is, if the sample mean height of men is higher than the sample mean height of women, the hypothesis that women are taller than men can be rejected.

However, the use of the nil-hypothesis as a boundary value does not solve another problem of nil-hypothesis testing. Namely, specifying the null-hypothesis as a point value makes it impossible to find evidence for it. That is, we could never show that men and women have the same height or the same intelligence or the same life-satisfaction. The reason is that the population difference will always be different from zero, even if this difference is too small to be practically meaningful. A related problem is that rejecting the nil-hypothesis provides no information about effect sizes. A significant result can be obtained with a large effect size and with a small effect size.

In conclusion, nil-hypothesis testing has a number of problems, and many criticisms of null-hypothesis testing are really criticisms of nil-hypothesis testing. A simple solution to the problem of nil-hypothesis testing is to change the null-hypothesis by specifying a minimal effect size that makes a finding theoretically or practically useful. Although this effect size can vary from research question to research question, Cohen’s criteria for standardized effect sizes can give some guidance about reasonable values for a minimal effect size. Using the example of mean differences, Cohen considered an effect size of d = .2 small, but meaningful. So, it makes sense to set a criterion for a minimum effect size somewhere between 0 and .2, and d = .1 seems a reasonable value.

We can even apply this criterion retrospectively to published studies with some interesting implications for the interpretation of published results. By shifting the null-hypothesis from d = 0 to abs(d) < .1, we are essentially raising the criterion value that a test statistic has to meet in order to be significant. Let me illustrate this first with a simple one-sample t-test with N = 100.

Conveniently, the sampling error for N = 100 is 1/sqrt(100) = .1. To achieve significance with alpha = .05 (two-tailed) and H0: d = 0, the test statistic has to be greater than t.crit = 1.98. However, if we change H0 to abs(d) < .1, the t-distribution is now centered at the t-value that is expected for an effect size of d = .1. The criterion value to get significance is now t.crit = 3.01. Thus, some published results that were able to reject the nil-hypothesis would be non-significant when the null-hypothesis specifies a range of values from d = -.1 to .1.
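The shift in the criterion value can be approximated with a short calculation. The sketch below is a shortcut, not the computation behind the numbers above: it assumes a standard normal distribution instead of the t-distribution, so the criterion is simply the test statistic expected at the boundary (d0 times sqrt(N)) plus the usual two-tailed critical value. The function name `shifted_criterion` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def shifted_criterion(n, d0=0.1, alpha=0.05):
    """Approximate criterion value for H0: abs(d) < d0 in a one-sample design.

    Normal approximation (an assumption of this sketch): the test statistic
    expected at the boundary is d0 * sqrt(n), and the two-tailed critical
    value of the standard normal is added on top of it.
    """
    expected_at_boundary = d0 * sqrt(n)           # d0 divided by the sampling error
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    return expected_at_boundary + z_crit
```

With N = 100, `shifted_criterion(100, d0=0.0)` gives about 1.96 and `shifted_criterion(100, d0=0.1)` gives about 2.96, slightly below the t-based values of 1.98 and 3.01 reported above. The gap grows as samples get smaller, because the normal approximation ignores the heavier tails of the t-distribution.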

If the null-hypothesis is specified in terms of standardized effect sizes, the critical values vary as a function of sample size. For example, with N = 10 the critical t-value is 2.67, with N = 100 it is 3.01, and with N = 1,000 it is 5.14. An alternative approach is to specify H0 in terms of a fixed test statistic which implies different effect sizes for the boundary value. For example, with t = 2.5, the effect sizes would be d = .06 with N = 10, d = .05 with N = 100, and d = .02 with N = 1000. This makes sense because researchers should use larger samples to test weaker effects. The example also shows that a t-value of 2.5 specifies a very narrow range of values around zero. However, the example was based on one-sample t-tests. For the typical comparison of two groups, a criterion value of 2.5 corresponds to an effect size of d = .1 with N = 100. So, while t = 2.5 is arbitrary, it is a meaningful value to test for statistical significance. With N = 100, t(98) = 2.5 corresponds to an alpha criterion of .014, which is a bit more stringent than .05, but not as strict as a criterion value of .005. With N = 100, alpha = .005 corresponds to a criterion value of t.crit = 2.87, which implies a boundary value of d = .17.
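The correspondence between t(98) = 2.5 and an alpha criterion of .014 can be checked without a statistics package. The sketch below integrates the t density numerically with Simpson's rule; the helper names are hypothetical and the step count is an arbitrary choice. The boundary effect size is computed as the distance between the fixed criterion and the conventional critical value, multiplied by the sampling error (2/sqrt(N) for two equal groups). That shortcut is an assumption of this sketch and not necessarily how the values above were computed.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value via Simpson's rule on the interval [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t


def t_crit(alpha, df):
    """Two-tailed critical value, found by bisection on two_tailed_p."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if two_tailed_p(mid, df) > alpha:
            lo = mid  # p still too large, so the critical value is higher
        else:
            hi = mid
    return (lo + hi) / 2


def boundary_d_two_groups(criterion, n_total, alpha=0.05):
    """Assumed shortcut: (criterion - t_crit) times the sampling error of d."""
    se = 2 / math.sqrt(n_total)  # two equal groups of n_total / 2 each
    return (criterion - t_crit(alpha, n_total - 2)) * se
```

Under these assumptions, `two_tailed_p(2.5, 98)` comes out near .014 and `boundary_d_two_groups(2.5, 100)` near .10, in line with the two-group values above.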

In conclusion, statistical significance depends on the specification of the null-hypothesis. While it is common to specify the null-hypothesis as an effect size of zero, this is neither necessary nor ideal. An alternative approach is to (re)specify the null-hypothesis in terms of a minimum effect size that makes a finding theoretically interesting or practically important. If the population effect size is below this value, the results could also be used to show that a hypothesis is false. Examination of various effect sizes shows that criterion values in the range between 2 and 3 can be used to define reasonable boundary values that vary around a value of d = .1.

The problem with t-distributions is that they differ as a function of the degrees of freedom. To create a common metric it is possible to convert t-values into p-values and then to convert the p-values into z-scores. A z-score of 2.5 corresponds to a p-value of .01 (exact .0124) and an effect size of d = .13 with N = 100 in a between-subject design. This seems to be a reasonable criterion value to evaluate statistical significance when the null-hypothesis is defined as a range of smallish values around zero and alpha is .05.
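The conversion between two-tailed p-values and absolute z-scores can be sketched with Python's standard library. The function names are hypothetical; the p-value passed in would come from whatever test produced the original result.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_z(p_two_tailed):
    """Absolute z-score whose two-tailed p-value equals p_two_tailed."""
    return _STD_NORMAL.inv_cdf(1 - p_two_tailed / 2)


def z_to_p(z):
    """Two-tailed p-value for an absolute z-score."""
    return 2 * (1 - _STD_NORMAL.cdf(abs(z)))
```

For example, `z_to_p(2.5)` returns roughly .0124, and `p_to_z(0.014)`, the p-value for t(98) = 2.5, returns a z-score a little below 2.5. Extremely small p-values would need a more careful implementation, because `1 - p / 2` rounds to 1 in floating point.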

Shifting the significance criterion in this way can dramatically change the evaluation of published results, especially results that are just significant, p < .05 & p > .01. There have been concerns that many of these results have been obtained with questionable research practices that were used to reject the nil-hypothesis. However, these results would not be strong enough to reject the modified hypothesis that the population effect size exceeds a minimum value of theoretical or practical significance. Thus, no debates about the use of questionable research practices are needed. There is also no need to reduce the type-I error rate at the expense of increasing the type-II error rate. It can be simply noted that the evidence is insufficient to reject the hypothesis that the effect size is greater than zero but too small to be important. This would shift any debates towards discussion about effect sizes and proponents of theories would have to make clear which effect sizes they consider to be theoretically important. I believe that this would be more productive than quibbling over alpha levels.

To demonstrate the implications of redefining the null-hypothesis, I use the results of the replicability project (Open Science Collaboration, 2015). The first z-curve shows the traditional analysis for the nil-hypothesis and alpha = .05, which has z = 1.96 as the criterion value for statistical significance (red vertical line).

Figure 1 shows that 86 out of 90 studies reported a test-statistic that exceeded the criterion value of 1.96 for H0: d = 0, alpha = .05 (two-tailed). The other four studies met the criterion for marginal significance (alpha = .10, two-tailed or .05 one-tailed). The figure also shows that the distribution of observed z-scores is not consistent with random sampling error: there is a steep drop at z = 1.96. A comparison of the observed discovery rate (86/90, 96%) and the expected discovery rate (43%) shows evidence that the published results are selected from a larger set of studies/tests with non-significant results. Even the upper limit of the confidence interval around this estimate (71%) is well below the observed discovery rate, showing evidence of publication bias. Z-curve estimates that only 60% of the published results would reproduce a significant result in an actual replication attempt. The actual success rate for these studies was 39%.

Results look different when the null-hypothesis is changed to correspond to a range of effect sizes around zero that correspond to a criterion value of z = 2.5. Along with shifting the significance criterion, z-curve is also only fitted to studies that produced z-scores greater than 2.5. As questionable research practices have a particularly strong effect on the distribution of just significant results, the new estimates are less influenced by these practices.

Figure 2 shows the results. Most important, the observed discovery rate dropped from 96% to 61%, indicating that many of the original results provided just enough evidence to reject the nil-hypothesis, but not enough evidence to rule out even small effect sizes. The observed discovery rate is also more in line with the expected discovery rate. Thus, some of the missing non-significant results may have been published as just significant results. This is also implied by the greater frequency of results with z-scores between 2 and 2.5 than the model predicts (grey curve). However, the expected replication rate of 63% is still much higher than the actual replication rate with a criterion value of 2.5 (33%). Thus, other factors may contribute to the low success rate in the actual replication studies of the replicability project.

## Conclusion

In conclusion, statisticians have been arguing about p-values, significance levels, and Bayes-Factors. Proponents of Bayes-Factors have argued that their approach is superior because Bayes-Factors can provide evidence for the null-hypothesis. I argue that this is wrong because it is theoretically impossible to demonstrate that a population effect size is exactly zero or any other specific value. A better solution is to specify the null-hypothesis as a range of values that are too small to be meaningful. This makes it theoretically possible to demonstrate that a population effect size is above or below the boundary value. This approach can also be applied retrospectively to published studies. I illustrate this by defining the null-hypothesis as the region of effect sizes that is bounded by the effect size that corresponds to a z-score of 2.5. While a z-score of 2.5 corresponds to p = .01 (two-tailed) for the nil-hypothesis, I use this criterion value to maintain an error rate of 5% and to change the null-hypothesis to a range of values around zero that becomes smaller as sample sizes increase.

As p-hacking is often used to just reject the nil-hypothesis, changing the null-hypothesis to a range of values around zero makes many ‘significant’ results non-significant. That is, the evidence is too weak to exclude even trivial effect sizes. This does not mean that the hypothesis is wrong or that the original authors p-hacked their data. However, it does mean that they can no longer point to their original results as empirical evidence. Rather, they have to conduct new studies to demonstrate with larger samples that they can reject the new null-hypothesis and show that the predicted effect meets some minimal standard of practical or theoretical significance. With a clear criterion value for significance, authors also risk obtaining evidence that positively contradicts their predictions. Thus, the biggest improvement that arises from rethinking null-hypothesis testing is that authors have to specify effect sizes a priori and that studies can provide evidence for and against a hypothesis. In this way, changing the nil-hypothesis to a null-hypothesis with a non-zero value makes it possible to provide evidence for or against a theory. In contrast, computing Bayes-Factors in favor of the nil-hypothesis fails to achieve this goal because the nil-hypothesis is always wrong; the real question is only how wrong.
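The decision rule implied by a range null-hypothesis can be written down in a few lines. The sketch below is a simplified illustration rather than the procedure used for the z-curve analysis: it assumes a normal approximation, a known sampling error for d, and a boundary of d = .1. The function name and the return labels are invented for the example.

```python
from statistics import NormalDist


def range_null_decision(d_obs, se, d_min=0.1, alpha=0.05):
    """Classify an observed effect size against H0: abs(d) < d_min.

    Uses a normal-approximation confidence interval around d_obs
    (an assumption of this sketch).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = d_obs - z_crit * se, d_obs + z_crit * se
    if lower > d_min:
        return "meaningful positive effect"
    if upper < -d_min:
        return "meaningful negative effect"
    if -d_min < lower and upper < d_min:
        return "effect too small to matter"
    return "inconclusive"
```

With a sampling error of .1, an observed d = .5 is classified as a meaningful positive effect, whereas d = .25 is inconclusive even though it would reject the nil-hypothesis. A very precise estimate near zero (for example d = 0 with a sampling error of .02) falls entirely inside the range and counts as evidence against a meaningful effect.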

# Tukey 1991 explains Null-Hypothesis Testing in 8 Paragraphs

1. We need to distinguish regions of effect sizes and precise values. The value 0 is a precise value. All positive values or all negative values are regions of values.

2. The most common use of null-hypothesis testing is to test whether the point-null or nil-hypothesis (Cohen, 1994) is consistent with the data.

3. Tukey explains that this hypothesis is likely to be false all the time. “All we know about the world teaches us that the effect of A and B are always different”. Many critics of NHST have suggested that this makes it useless to test the nil-hypothesis because we already know that it is false (the prior probability of H0 being true is 0, no data can change this).

4. NHST becomes useful when we think about the null-hypothesis (no difference) as the boundary value that distinguishes two regions. We are really testing the direction of the mean difference (or the sign of a correlation coefficient). Once we can reject the nil-hypothesis (p < alpha) in a two-sided test, we are allowed to interpret the direction of the mean difference in a sample as the direction of the mean difference in the population (i.e., if we had studied all people from which the sample was drawn).

5. Some psychologists have criticized NHST because it can never provide evidence for the nil-hypothesis (Rouder, Wagenmakers). This criticism is based on a misunderstanding of NHST. Tukey explains we should never accept the nil-hypothesis because we can never provide empirical support FOR a precise effect size.

6. Once we have evidence that the nil-hypothesis is false and the effect is either positive or negative, we may ask follow-up questions about the size of an effect.

7. A good way to answer these questions is to conduct NHST with confidence intervals. If the confidence interval includes 0, we cannot draw inferences about the direction of the effect. However, if the confidence interval does not include 0, we can make inferences about the direction of an effect and the boundaries of the intervals provide information about plausible values for the smallest and the largest possible effect size.

8. In conclusion, we can think about two-sided tests as an efficient way of conducting two one-sided tests without inflating the type-I error probability. Rejecting the hypothesis that there is no effect is not interesting. Determining the direction of an effect is, and NHST is a useful tool for doing so.

9. I probably made things worse by paraphrasing Tukey. Therefore I also posted the relevant section of his article below.
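Points 7 and 8 above can be illustrated with a small sketch. It assumes a normal-approximation confidence interval and uses hypothetical names; it is not taken from Tukey's article.

```python
from statistics import NormalDist


def direction_from_ci(mean_diff, se, alpha=0.05):
    """Infer the sign of an effect from a two-sided confidence interval.

    Equivalent to two one-sided tests, each run at alpha / 2.
    Returns the inferred direction and the interval itself.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower, upper = mean_diff - z_crit * se, mean_diff + z_crit * se
    if lower > 0:
        direction = "positive"
    elif upper < 0:
        direction = "negative"
    else:
        direction = "undetermined"  # the interval includes 0
    return direction, (lower, upper)
```

The returned interval carries the follow-up information from point 7: its bounds are the smallest and largest effect sizes that remain plausible once the direction is settled.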

# A critique of Stroebe and Strack’s Article “The Alleged Crisis and the Illusion of Exact Replication”

The article by Stroebe and Strack (2014) [henceforth S&S] illustrates how experimental social psychologists responded to replication failures in the beginning of the replicability revolution. The response is a classic example of repressive coping: Houston, we do not have a problem. Even in 2014, problems with the way experimental social psychologists had conducted research for decades were obvious (Bem, 2011; Wagenmakers et al., 2011; John et al., 2012; Francis, 2012; Schimmack, 2012; Pashler & Wagenmakers, 2012). S&S’s article is an attempt to dismiss these concerns as misunderstandings and empirically unsupported criticism.

“In contrast to the prevalent sentiment, we will argue that the claim of a replicability crisis is greatly exaggerated” (p. 59).

Although the article was well received by prominent experimental social psychologists (see citations in appendix), future events proved S&S wrong and vindicated critics of research methods in experimental social psychology. Only a year later, the Open Science Collaboration (2015) reported that only 25% of studies in social psychology could be replicated successfully. A statistical analysis of focal hypothesis tests in social psychology suggests that roughly 50% of original studies could be replicated successfully if these studies were replicated exactly (Motyl et al., 2017). Ironically, one of S&S’s points is that exact replication studies are impossible. As a result, the 50% estimate is an optimistic estimate of the success rate for actual replication studies, suggesting that the actual replicability of published results in social psychology is less than 50%.

Thus, even if S&S had reasons to be skeptical about the extent of the replicability crisis in experimental social psychology, it is now clear that experimental social psychology has a serious replication problem. Many published findings in social psychology textbooks may not replicate and many theoretical claims in social psychology rest on shaky empirical foundations.

What explains the replication problem in experimental social psychology?  The main reason for replication failures is that social psychology journals mostly published significant results.  The selective publishing of significant results is called publication bias. Sterling pointed out that publication bias in psychology is rampant.  He found that psychology journals publish over 90% significant results (Sterling, 1959; Sterling et al., 1995).  Given new estimates that the actual success rate of studies in experimental social psychology is less than 50%, only publication bias can explain why journals publish over 90% results that confirm theoretical predictions.

It is not difficult to see that reporting only studies that confirm predictions undermines the purpose of empirical tests of theoretical predictions.  If studies that do not confirm predictions are hidden, it is impossible to obtain empirical evidence that a theory is wrong.  In short, for decades experimental social psychologists have engaged in a charade that pretends that theories are empirically tested, but publication bias ensured that theories would never fail.  This is rather similar to Volkswagen’s emission tests that were rigged to pass because emissions were never subjected to a real test.

In 2014, there were ample warning signs that publication bias and other dubious practices inflated the success rate in social psychology journals.  However, S&S claim that (a) there is no evidence for the use of questionable research practices and (b) that it is unclear which practices are questionable or not.

“Thus far, however, no solid data exist on the prevalence of such research practices in either social or any other area of psychology. In fact, the discipline still needs to reach an agreement about the conditions under which these practices are unacceptable” (p. 60).

Scientists like to hedge their statements so that they are immune to criticism. S&S may argue that the evidence in 2014 was not “solid” and surely there was and still is no agreement about good research practices. However, this is irrelevant. What is important is that success rates in social psychology journals were and still are inflated by suppressing disconfirming evidence and biasing empirical tests of theories in favor of positive outcomes.

Although S&S’s main claims are not based on empirical evidence, it is instructive to examine how they tried to shield published results and established theories from the harsh light of open replication studies that report results without selection for significance and subject social psychological theories to real empirical tests for the first time.

## Failed Replication of Between-Subject Priming Studies

S&S discuss failed replications of two famous priming studies in social psychology: Bargh’s elderly priming study and Dijksterhuis’s professor priming studies. Both seminal articles reported several successful tests of the prediction that a subtle priming manipulation would influence behavior without participants even noticing the priming effect. In 2012, Doyen et al. failed to replicate elderly priming. Shanks et al. (2013) failed to replicate professor priming effects and more recently a large registered replication report also provided no evidence for professor priming. For naïve readers it is surprising that original studies had a 100% success rate and replication studies had a 0% success rate. However, S&S are not surprised at all.

“as in most sciences, empirical findings cannot always be replicated” (p. 60).

Apparently, S&S know something that naïve readers do not know. The difference between naïve readers and experts in the field is that experts have access to unpublished information about failed replications in their own labs and in the labs of their colleagues. Only they know how hard it sometimes was to get the successful outcomes that were published. With the added advantage of insider knowledge, it makes perfect sense to expect replication failures, although maybe not a success rate of 0%.

The problem is that S&S give the impression that replication failures are to be expected, but this expectation cannot be based on the objective scientific record, which hardly ever reports results that contradict theoretical predictions. Replication failures occur all the time, but they remained unpublished. Doyen et al.’s and Shanks et al.’s articles only violated the code to publish only supportive evidence.

## Kahneman’s Train Wreck Letter

S&S also comment on Kahneman’s letter to Bargh that compared priming research to a train wreck.  In response S&S claim that

“priming is an entirely undisputed method that is widely used to test hypotheses about associative memory (e.g., Higgins, Rholes, & Jones, 1977; Meyer & Schvaneveldt, 1971; Tulving & Schacter, 1990).” (p. 60).

This argument does not stand the test of time.  Since S&S published their article researchers have distinguished more clearly between highly replicable priming effects in cognitive psychology with repeated measures and within-subject designs and difficult to replicate between-subject social priming studies with subtle priming manipulations and a single outcome measure (BS social priming).  With regards to BS social priming, it is unclear which of these effects can be replicated and leading social psychologists have been reluctant to demonstrate replicability of their famous studies by conducting self-replications as they were encouraged to do in Kahneman’s letter.

S&S also point to empirical evidence for robust priming effects.

“A meta-analysis of studies that investigated how trait primes influence impression formation identified 47 articles based on 6,833 participants and found overall effects to be statistically highly significant (DeCoster & Claypool, 2004).” (p. 60).

The problem with this evidence is that this meta-analysis did not take publication bias into account; in fact, it does not even mention publication bias as a possible problem. A meta-analysis of studies that were selected for significance is also biased by selection for significance.

Several years after Kahneman’s letter, it is widely agreed that past research on social priming is a train wreck. Kahneman published a popular book that celebrated social priming effects as a major scientific discovery in psychology. Nowadays, he agrees with critics that the existing evidence is not credible. It is also noteworthy that none of the researchers in this area have followed Kahneman’s advice to replicate their own findings to show the world that these effects are real.

## It is all a big misunderstanding

S&S suggest that “the claim of a replicability crisis in psychology is based on a major misunderstanding.” (p. 60).

Apparently, lay people, trained psychologists, and a Nobel laureate are mistaken in their interpretation of replication failures. S&S suggest that failed replications are unimportant.

“the myopic focus on “exact” replications neglects basic epistemological principles” (p. 60).

To make their argument, they introduce the notion of exact replications and suggest that exact replication studies are uninformative.

“a finding may be eminently reproducible and yet constitute a poor test of a theory.” (p. 60).

The problem with this line of argument is that we are supposed to assume that a finding is eminently reproducible, which probably means it has been successfully replicated many times. It seems sensible that further studies of gender differences in height are unnecessary to convince us that there is a gender difference in height. However, results in social psychology are not like gender differences in height. By S&S’s own account earlier, “empirical findings cannot always be replicated” (p. 60). And if journals only publish significant results, it remains unknown which results are eminently reproducible and which results are not. S&S ignore publication bias and pretend that the published record suggests that all findings in social psychology are eminently reproducible. Apparently, they would suggest that even Bem’s finding that people have supernatural abilities is eminently reproducible. These days, few social psychologists are willing to endorse this naïve interpretation of the scientific record as a credible body of empirical facts.

## Exact Replication Studies are Meaningful if they are Successful

Ironically, S&S next suggest that exact replication studies can be useful.

“Exact replications are also important when studies produce findings that are unexpected and only loosely connected to a theoretical framework. Thus, the fact that priming individuals with the stereotype of the elderly resulted in a reduction of walking speed was a finding that was unexpected. Furthermore, even though it was consistent with existing theoretical knowledge, there was no consensus about the processes that mediate the impact of the prime on walking speed. It was therefore important that Bargh et al. (1996) published an exact replication of their experiment in the same paper.”

“Similarly, Dijksterhuis and van Knippenberg (1998) conducted four studies in which they replicated the priming effects. Three of these studies contained conditions that were exact replications.”

“Because it is standard practice in publications of new effects, especially of effects that are surprising, to publish one or two exact replications, it is clearly more conducive to the advancement of psychological knowledge to conduct conceptual replications rather than attempting further duplications of the original study.”

Given these citations, it is problematic that S&S’s article is often cited to claim that exact replications are impossible or unnecessary. The argument that S&S are making here is rather different. They are suggesting that original articles already provide sufficient evidence that results in social psychology are eminently reproducible because original articles report multiple studies and some of these studies are often exact replication studies. At face value, S&S have a point. An honest series of statistically significant results makes it practically impossible that an effect is a false positive result (Schimmack, 2012). The problem is that multiple study articles are not honest reports of all replication attempts. Francis (2014) found that at least 80% of multiple study articles showed statistical evidence of questionable research practices. Given the pervasive influence of selection for significance, exact replication studies in original articles provide no information about the replicability of these results.

What made the failed replications by Doyen et al. and Shanks et al. so powerful was that these studies were the first real empirical tests of BS social priming effects because the authors were willing to report successes or failures. The problem for social psychology is that many textbook findings that were obtained with selection for significance cannot be reproduced in honest empirical tests of the predicted effects. This means that the original effects were either dramatically inflated or may not exist at all.

## Replication Studies are a Waste of Resources

S&S want readers to believe that replication studies are a waste of resources.

“Given that both research time and money are scarce resources, the large scale attempts at duplicating previous studies seem to us misguided” (p. 61).

This statement sounds a bit like a plea to spare social psychology from the embarrassment of actual empirical tests that reveal the true replicability of textbook findings. After all, according to S&S it is impossible to duplicate original studies (i.e., conduct exact replication studies) because replication studies differ in some way from original studies and may not reproduce the original results. So, none of the failed replication studies is an exact replication. Doyen et al. replicated Bargh’s New York City study in Belgium, and Shanks et al. replicated Dijksterhuis’s studies from the Netherlands in the United States. The finding that the original results could not be replicated does not imply that the original findings were false positives, but it does imply that these findings may be unique to some unspecified specifics of the original studies. This is noteworthy when original results are used in textbooks as evidence for general theories and not as historical accounts of what happened in one specific socio-cultural context during a specific historic period. As social situations and human behavior are never exact replications of the past, social psychological results need to be permanently replicated and doing so is not a waste of resources. Suggesting that replication is a waste of resources is like suggesting that measuring GDP or unemployment every year is a waste of resources because we can just use last year’s numbers.

As S&S ignore publication bias and selection for significance, they are also ignoring that publication bias leads to a massive waste of resources. First, running empirical tests of theories that are not reported is a waste of resources. Second, publishing only significant results is also a waste of resources because researchers design new studies based on the published record. When the published record is biased, many new studies will fail, just like airplanes that are designed based on flawed science would drop from the sky. Thus, a biased literature creates a massive waste of resources.

Ultimately, a science that publishes only significant results wastes all resources because the outcome of the published studies is a foregone conclusion: the prediction was supported, p < .05. Social psychologists might as well publish purely theoretical articles, just like philosophers in the old days used “thought experiments” to support their claims. An empirical science is only a real science if theoretical predictions are subjected to tests that can fail. By this simple criterion, experimental social psychology is not (yet) a science.

## Should Psychologists Conduct Exact Replications or Conceptual Replications?

Stroebe and Strack next cite Pashler and Harris (2012) to claim that critics of experimental social psychology have dismissed the value of so-called conceptual replications and generalization.

“The main criticism of conceptual replications is that they are less informative than exact replications (e.g., Pashler & Harris, 2012).”

Before I examine S&S’s counterargument, it is important to realize that S&S misrepresented, and maybe misunderstood, Pashler and Harris’s main point. Here is the relevant quote from Pashler and Harris’s article.

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling “pathological science” cases that embarrassed the natural sciences at several points in the 20th century (e.g., Polywater; Rousseau & Porto, 1970; and cold fusion; Taubes, 1993).

The problem for S&S is that they cannot address the problem of publication bias and therefore carefully avoid talking about it. As a result, they misrepresent Pashler and Harris’s critique of conceptual replications in combination with publication bias as a criticism of conceptual replication studies, which is absurd and not what Pashler and Harris intended to say or actually said. The following quote from their article makes this crystal clear.

However, what kept faith in cold fusion alive for some time (at least in the eyes of some onlookers) was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications). This suggests that one important hint that a controversial finding is pathological may arise when defenders of a controversial effect disavow the initial methods used to obtain an effect and rest their case entirely upon later studies conducted using other methods. Of course, productive research into real phenomena often yields more refined and better ways of producing effects. But what should inspire doubt is any situation where defenders present a phenomenon as a “moving target” in terms of where and how it is elicited (cf. Langmuir, 1953/1989). When this happens, it would seem sensible to ask, “If the finding is real and yet the methods used by the original investigators are not reproducible, then how were these investigators able to uncover a valid phenomenon with methods that do not work?” Again, the unavoidable conclusion is that a sound assessment of a controversial phenomenon should focus first and foremost on direct replications of the original reports and not on novel variations, each of which may introduce independent ambiguities.

I am confident that unbiased readers will recognize that Pashler and Harris did not suggest that conceptual replication studies are bad. Their main point is that a few successful conceptual replication studies can be used to keep theories alive in the face of a string of many replication failures. The problem is not that researchers conduct successful conceptual replication studies. The problem is the dismissal or outright hiding of disconfirming evidence from replication studies. S&S misconstrue Pashler and Harris’s claim to avoid addressing this real problem of ignoring and suppressing failed studies to support an attractive but false theory.

The illusion of exact replications.

S&S’s next argument is that replication studies are never exact.

If one accepts that the true purpose of replications is a (repeated) test of a theoretical hypothesis rather than an assessment of the reliability of a particular experimental procedure, a major problem of exact replications becomes apparent: Repeating a specific operationalization of a theoretical construct at a different point in time and/or with a different population of participants might not reflect the same theoretical construct that the same procedure operationalized in the original study.

The most important word in this quote is “might.” Ebbinghaus’s memory curve MIGHT not replicate today because he was his own subject. Bargh’s elderly priming study MIGHT not work today because Florida is no longer associated with the elderly, and Dijksterhuis’s priming study MIGHT no longer work because students no longer think that professors are smart or that Hooligans are dumb.

Just because there is no certainty in inductive inferences doesn’t mean we can just dismiss replication failures because something MIGHT have changed.  It is also possible that the published results MIGHT be false positives because significant results were obtained by chance, with QRPs, or outright fraud.  Most people think that outright fraud is unlikely, but the Stapel debacle showed that we cannot rule it out.  So, we can argue forever about hypothetical reasons why a particular study was successful or a failure. These arguments are futile and have nothing to do with scientific arguments and objective evaluation of facts.

This means that every study, whether it is a groundbreaking success or a replication failure, needs to be evaluated in terms of the objective scientific facts. There is no blanket immunity for seminal studies that protects them from disconfirming evidence. No study is an exact replication of another study. That is a truism and S&S’s article is often cited for this simple fact. It is as true as it is irrelevant to understanding the replication crisis in social psychology.

Exact Replications Are Often Uninformative

S&S contradict themselves in the use of the term exact replication.  First it is impossible to do exact replications, but then they are uninformative.  I agree with S&S that exact replication studies are impossible. So, we can simply drop the term “exact” and examine why S&S believe that some replication studies are uninformative.

First they give an elaborate, long and hypothetical explanation for Doyen et al.’s failure to replicate Bargh’s pair of elderly priming studies. After considering some possible explanations, they conclude

It is therefore possible that the priming procedure used in the Doyen et al. (2012) study failed in this respect, even though Doyen et al. faithfully replicated the priming procedure of Bargh et al. (1996).

Once more the realm of hypothetical conjectures has to rescue seminal findings. Just as it is possible that S&S are right it is also possible that Bargh faked his data. To be sure, I do not believe that he faked his data and I apologized for a Facebook comment that gave the wrong impression that I did. I am only raising this possibility here to make the point that everything is possible. Maybe Bargh just got lucky.  The probability of this is 1 out of 1,600 attempts (the probability to get the predicted effect with .05 two-tailed (!) twice is .025^2). Not very likely, but also not impossible.
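The 1-in-1,600 figure is easy to check. The following is a minimal sketch that assumes the two studies are independent and that, under a true null-hypothesis, only half of the two-tailed alpha of .05 falls in the predicted direction; the variable names are illustrative.

```python
# Chance of two false positives in the predicted direction under a true null.
alpha_two_tailed = 0.05

# Only half of a two-tailed alpha lies in the predicted direction.
p_one_study = alpha_two_tailed / 2

# Assumption: the two studies are independent, so the probabilities multiply.
p_both_studies = p_one_study ** 2
attempts_per_lucky_pair = 1 / p_both_studies

print(f"p(both significant in predicted direction) = {p_both_studies:.6f}")
print(f"about 1 in {attempts_per_lucky_pair:,.0f} attempts")
```

The calculation says nothing about which explanation is correct; it only shows how small the "pure luck" probability is when no selection or flexibility in the analysis is involved.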

No matter what the reason for the discrepancy between Bargh and Doyen’s findings is, the example does not support S&S’s claim that replication studies are uninformative. The failed replication raised concerns about the robustness of BS social priming studies and stimulated further investigation of the robustness of social priming effects. In the short span of six years, the scientific consensus about these effects has shifted dramatically, and the first publication of a failed replication is an important event in the history of social psychology.

S&S’s critique of Shanks et al.’s replication studies is even weaker. First, they have to admit that professor probably still primes intelligence more than soccer hooligans. To rescue the original finding S&S propose

“the priming manipulation might have failed to increase the cognitive representation of the concept “intelligence.”

S&S also think that

another LIKELY reason for their failure could be their selection of knowledge items.

Meanwhile a registered replication report with a design that was approved by Dijksterhuis failed to replicate the effect.  Although it is possible to come up with more possible reasons for these failures, real scientific creativity is revealed in creating experimental paradigms that produce replicable results, not in coming up with many post-hoc explanations for replication failures.

Ironically, S&S even agree with my criticism of their argument.

“To be sure, these possibilities are speculative”  (p. 62).

In contrast, S&S fail to consider the possibility that published significant results are false positives, even though there is actual evidence for publication bias. The strong bias against published failures may be rooted in a long history of dismissing unpublished failures that social psychologists routinely encounter in their own laboratory.  To avoid the self-awareness that hiding disconfirming evidence is unscientific, social psychologists made themselves believe that minute changes in experimental procedures can ruin a study (Stapel).  Unfortunately, a science that dismisses replication failures as procedural hiccups is fated to fail because it removed the mechanism that makes science self-correcting.

Failed Replications are Uninformative

S&S next suggest that “nonreplications are uninformative unless one can demonstrate that the theoretically relevant conditions were met” (p. 62).

This reverses the burden of proof.  Original researchers pride themselves on innovative ideas and groundbreaking discoveries.  Like famous rock stars, they are often not the best musicians, nor is it impossible for other musicians to play their songs. They get rewarded because they came up with something original. Take the Implicit Association Test as an example. The idea to use cognitive switching tasks to measure attitudes was original and Greenwald deserves recognition for inventing this task. The IAT did not revolutionize attitude research because only Tony Greenwald could get the effects. It did so because everybody, including my undergraduate students, could replicate the basic IAT effect.

However, let’s assume that the IAT effect could not have been replicated. Is it really the job of researchers who merely duplicated a study to figure out why it did not work and develop a theory under which circumstances an effect may occur or not?  I do not think so. Failed replications are informative even if there is no immediate explanation why the replication failed.  As Pashler and Harris’s cold fusion example shows there may not even be a satisfactory explanation after decades of research. Most probably, cold fusion never really worked and the successful outcome of the original study was a fluke or a problem of the experimental design.  Nevertheless, it was important to demonstrate that the original cold fusion study could not be replicated.  To ask for an explanation why replication studies fail is simply a way to make replication studies unattractive and to dismiss the results of studies that fail to produce the desired outcome.

Finally, S&S ignore that there is a simple explanation for replication failures in experimental social psychology: publication bias. If original studies have low statistical power (e.g., Bargh’s studies with N = 30) to detect small effects, only vastly inflated effect sizes reach significance. An open replication study without inflated effect sizes is unlikely to produce a successful outcome. Statistical analyses of original studies show that this explanation accounts for a large proportion of replication failures. Thus, publication bias provides one explanation for replication failures.
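A small simulation illustrates why selection for significance inflates effect sizes when power is low. This is only a sketch under stated assumptions: a two-group design with 15 participants per group (N = 30), a hypothetical true effect of d = 0.2, normally distributed scores, and a "publication" filter that keeps only results that are significant in the predicted direction. None of these numbers are taken from the original studies.

```python
import math
import random
import statistics

def simulate_study(true_d, n_per_group, rng):
    """Return (observed Cohen's d, t statistic) for one two-group study."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    # Pooled variance for equal group sizes is the mean of the two variances.
    pooled_var = (statistics.variance(control) + statistics.variance(treated)) / 2
    d = (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(pooled_var)
    # For equal group sizes, t = d * sqrt(n / 2).
    t = d * math.sqrt(n_per_group / 2)
    return d, t

TRUE_D = 0.2        # assumed small true effect (illustrative)
N_PER_GROUP = 15    # assumed split of N = 30 into two groups
T_CRIT = 2.048      # two-tailed .05 critical t value for df = 28

rng = random.Random(2012)
studies = [simulate_study(TRUE_D, N_PER_GROUP, rng) for _ in range(5000)]

# Publication filter: keep only significant results in the predicted direction.
published = [d for d, t in studies if t > T_CRIT]

power = len(published) / len(studies)
mean_published_d = statistics.mean(published)
print(f"share of studies that reach significance: {power:.2f}")
print(f"mean published d: {mean_published_d:.2f} (true d = {TRUE_D})")
```

With these settings a study can only be significant if its observed d exceeds roughly 0.75, so every published effect size is several times larger than the true effect. A replication study that is planned around the published effect size is therefore likely to fail even though the true effect is not zero.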

Conceptual Replication Studies are Informative

S&S cite Schmidt (2009) to argue that conceptual replication studies are informative.

With every difference that is introduced the confirmatory power of the replication increases, because we have shown that the phenomenon does not hinge on a particular operationalization but “generalizes to a larger area of application” (p. 93).

S&S continue

“An even more effective strategy to increase our trust in a theory is to test it using completely different manipulations.”

This is of course true as long as conceptual replication studies are successful. However, it is not clear why conceptual replication studies that for the first time try a completely different manipulation should be successful.  As I pointed out in my 2012 article, reading multiple-study articles with only successful conceptual replication studies is a bit like watching a magic show.

Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005). The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect.

I don’t know whether S&S are impressed by Bem’s article with 9 conceptual replication studies that successfully demonstrated supernatural abilities.  According to their line of arguments, they should be.  However, even most social psychologists found it impossible to accept that time-reversed subliminal priming works. Unfortunately, this also means that successful conceptual replication studies are meaningless if only successful results are published.  Once more, S&S cannot address this problem because they ignore the simple fact that selection for significance undermines the purpose of empirical research to test theoretical predictions.

Exact Replications Contribute Little to Scientific Knowledge

Without providing much evidence for their claims, S&S conclude

one reason why exact replications are not very interesting is that they contribute little to scientific knowledge.

Ironically, one year later Science published 100 replication studies with the only goal of estimating the replicability of psychology, with a focus on social psychology.  The article has already been cited 640 times, while S&S’s criticism of replication studies has been cited (only) 114 times.

Although the article did nothing else than to report the outcome of replication studies, it made a tremendous empirical contribution to psychology because it reported results of studies without the filter of publication bias. Suddenly the success rate plummeted from over 90% to 37% and for social psychology to 25%. While S&S could claim in 2014 that “Thus far, however, no solid data exist on the prevalence of such [questionable] research practices in either social or any other area of psychology,” the reproducibility project revealed that these practices dramatically inflated the percentage of successful studies reported in psychology journals.

The article has been celebrated by scientists in many disciplines as a heroic effort and a sign that psychologists are trying to improve their research practices. S&S may disagree, but I consider the reproducibility project a big contribution to scientific knowledge.

Why null findings are not always that informative

To fully appreciate the absurdity of S&S’s argument, I let them speak for themselves.

One reason is that not all null findings are interesting.  For example, just before his downfall, Stapel published an article on how disordered contexts promote stereotyping and discrimination. In this publication, Stapel and Lindenberg (2011) reported findings showing that litter or a broken-up sidewalk and an abandoned bicycle can increase social discrimination. These findings, which were later retracted, were judged to be sufficiently important and interesting to be published in the highly prestigious journal Science. Let us assume that Stapel had actually conducted the research described in this paper and failed to support his hypothesis. Such a null finding would have hardly merited publication in the Journal of Articles in Support of the Null Hypothesis. It would have been uninteresting for the same reason that made the positive result interesting, namely, that (a) nobody expected a relationship between disordered environments and prejudice and (b) there was no previous empirical evidence for such a relationship. Similarly, if Bargh et al. (1996) had found that priming participants with the stereotype of the elderly did not influence walking speed or if Dijksterhuis and van Knippenberg (1998) had reported that priming participants with “professor” did not improve their performance on a task of trivial pursuit, nobody would have been interested in their findings.

Notably, all of the examples are null-findings in original studies. Thus, they have absolutely no relevance for the importance of replication studies. As noted by Strack and Stroebe earlier

“Thus, null findings are interesting only if they contradict a central hypothesis derived from an established theory and/or are discrepant with a series of earlier studies.” (p. 65).

Bem (2011) reported 9 significant results to support unbelievable claims about supernatural abilities. However, several failed replication studies allowed psychologists to dismiss these findings and to ignore claims about time-reversed priming effects. So, while not all null-results are important, null-results in replication studies are important because they can correct false positive results in original articles. Without this correction mechanism, science loses its ability to correct itself.

Failed Replications Do Not Falsify Theories

S&S state that failed replications do not falsify theories

“The nonreplications published by Shanks and colleagues (2013) cannot be taken as a falsification of that theory, because their study does not explain why previous research was successful in replicating the original findings of Dijksterhuis and van Knippenberg (1998).” (p. 64).

I am unaware of any theory in psychology that has been falsified. The reason for this is not that failed replication studies are not informative. The reason is that theories have been protected by hiding failed replication studies until recently. Only in recent years have social psychologists started to contemplate the possibility that some theories in social psychology might be false.  The most prominent example is ego-depletion theory, which has been one of the first prominent theories that has been put under the microscope of open science without the protection of questionable research practices in recent years. While ego-depletion theory is not entirely dead, few people still believe in the simple theory that 20 Stroop trials deplete individuals’ will power.  Falsification is hard, but falsification without disconfirming evidence is impossible.

Inconsistent Evidence

S&S argue that replication failures have to be evaluated in the context of replication successes.

Even multiple failures to replicate an established finding would not result in a rejection of the original hypothesis, if there are also multiple studies that supported that hypothesis.

Earlier S&S wrote

in social psychology, as in most sciences, empirical findings cannot always be replicated (this was one of the reasons for the development of meta-analytic methods).

Indeed. Unless studies have very high statistical power, inconsistent results are inevitable, which is one reason why publishing only significant results is a sign of low credibility (Schimmack, 2012). Meta-analysis is the only way to make sense of these inconsistent findings. However, it is well known that publication bias makes meta-analytic results meaningless (e.g., meta-analyses show very strong evidence for supernatural abilities). Thus, it is important that all tests of a theoretical prediction are reported to produce meaningful meta-analyses. If social psychologists took S&S seriously and continued to suppress non-significant results because they are uninformative, meta-analyses would continue to provide biased results that support even false theories.
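To see how selective publication can make a meta-analysis support a false theory, consider the following sketch. It assumes a true effect of exactly zero, hypothetical sample sizes of 20, 40, or 80 per group, a normal approximation for the sampling distribution of each effect size estimate, and a filter that publishes only significant results in the predicted direction. The pooling function is a plain inverse-variance (fixed-effect) average.

```python
import math
import random

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted mean and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1 / sum(weights))

rng = random.Random(7)
TRUE_EFFECT = 0.0   # the null-hypothesis is true by construction

all_studies = []
for _ in range(2000):
    n_per_group = rng.choice([20, 40, 80])   # hypothetical sample sizes
    # Approximate standard error of a standardized mean difference near zero.
    se = math.sqrt(2 / n_per_group)
    estimate = rng.gauss(TRUE_EFFECT, se)
    all_studies.append((estimate, se))

# Publication filter: only significant results in the predicted direction survive.
published = [(e, se) for e, se in all_studies if e / se > 1.96]

pooled_all, se_all = fixed_effect_pool(*zip(*all_studies))
pooled_pub, se_pub = fixed_effect_pool(*zip(*published))
z_pub = pooled_pub / se_pub

print(f"all {len(all_studies)} studies: pooled d = {pooled_all:+.3f}")
print(f"{len(published)} published studies: pooled d = {pooled_pub:+.3f}, z = {z_pub:.1f}")
```

When all studies are included, the pooled estimate is close to zero. The meta-analysis of the published subset shows a sizeable and highly significant effect, although no effect exists in the simulated population.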

Failed Replications are Uninformative II

Sorry that this is getting really long. But S&S keep on making the same arguments and the editor of this article didn’t tell them to shorten the article. Here they repeat the argument that failed replications are uninformative.

One reason why null findings are not very interesting is because they tell us only that a finding could not be replicated but not why this was the case. This conflict can be resolved only if researchers develop a theory that could explain the inconsistency in findings.

A related claim is that failed replications never demonstrate that original findings were false because the inconsistency is always due to some third variable; a hidden moderator.

“Methodologically, however, nonreplications must be understood as interaction effects in that they suggest that the effect of the crucial influence depends on the idiosyncratic conditions under which the original experiment was conducted” (p. 64).

These statements reveal a fundamental misunderstanding of statistical inferences. A significant result never proves that the null-hypothesis is false. The inference that a real effect rather than sampling error caused the observed result can be a mistake. This mistake is called a false positive or a type-I error. S&S seem to believe that type-I errors do not exist. Accordingly, Bem’s significant results show real supernatural abilities. If this were the case, it would be meaningless to report statistical significance tests. The only possible error that could be made would be a false negative or type-II error; the theory makes the correct prediction, but a study failed to produce a significant result. And if theoretical predictions are always correct, it is also not necessary to subject theories to empirical tests, because these tests either correctly show that a prediction was confirmed or falsely fail to confirm a prediction.

S&S’s belief in published results has a religious quality.  Apparently we know nothing about the world, but once a significant result is published in a social psychology journal, ideally JPSP, it becomes a holy truth that defies any evidence that non-believers may produce under the misguided assumption that further inquiry is necessary. Elderly priming is real, amen.

More Confusing Nonsense

At some point, I was no longer surprised by S&S’s claims, but I did start to wonder about the reviewers and editors who allowed this manuscript to be published apparently with light or no editing.  Why would a self-respecting journal publish a sentence like this?

As a consequence, the mere coexistence of exact replications that are both successful and unsuccessful is likely to leave researchers helpless about what to conclude from such a pattern of outcomes.

Didn’t S&S claim that exact replication studies do not exist? Didn’t they tell readers that every inconsistent finding has to be interpreted as an interaction effect?  And where do they see inconsistent results if journals never publish non-significant results?

Aside from these inconsistencies, inconsistent results do not lead to a state of helpless paralysis. As S&S suggested themselves, researchers can conduct a meta-analysis. Are S&S suggesting that we need to spare researchers from inconsistent results to protect them from a state of helpless confusion? Is this their justification for publishing only significant results?

Even Massive Replication Failures in Registered Replication Reports are Uninformative

In response to the replication crisis, some psychologists started to invest time and resources in major replication studies called many lab studies or registered replication studies.  A single study was replicated in many labs.  The total sample size of many labs gives these studies high precision in estimating the average effect size and makes it even possible to demonstrate that an effect size is close to zero, which suggests that the null-hypothesis may be true.  These studies have failed to find evidence for classic social psychology findings, including Strack’s facial feedback studies. S&S suggest that even these results are uninformative.

Conducting exact replications in a registered and coordinated fashion by different laboratories does not remove the described shortcomings. This is also the case if exact replications are proposed as a means to estimate the “true size” of an effect. As the size of an experimental effect always depends on the specific error variance that is generated by the context, exact replications can assess only the efficiency of an intervention in a given situation but not the generalized strength of a causal influence.

Their argument does not make any sense to me.  First, it is not clear what S&S mean by “the size of an experimental effect always depends on the specific error variance.”  Neither unstandardized nor standardized effect sizes depend on the error variance. This is simple to see because error variance depends on the sample size and effect sizes do not depend on sample size.  So, it makes no sense to claim that effect sizes depend on error variance.
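The point about sample size can be illustrated with a short simulation. This sketch reads "error variance" as sampling error, as the paragraph above does, and assumes a hypothetical population mean difference of 0.5 with a standard deviation of 1 in both groups. It compares the effect size estimates from many small studies with those from many large studies.

```python
import random
import statistics

def mean_difference(true_diff, n_per_group, rng):
    """Observed mean difference between two groups with unit SD."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
    return statistics.mean(treated) - statistics.mean(control)

rng = random.Random(42)
TRUE_DIFF = 0.5   # hypothetical population effect

summary = {}
for n in (20, 320):
    # 500 simulated studies for each sample size.
    estimates = [mean_difference(TRUE_DIFF, n, rng) for _ in range(500)]
    summary[n] = (statistics.mean(estimates), statistics.stdev(estimates))
    print(f"n = {n:3d} per group: mean estimate = {summary[n][0]:.2f}, "
          f"SD of estimates = {summary[n][1]:.2f}")
```

The average estimate is about the same for both sample sizes, whereas the spread of the estimates shrinks as the sample size increases. Larger samples make estimates more precise; they do not change the effect size that is being estimated.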

Second, it is not clear what S&S mean by specific error variance that is generated by the context.  I simply cannot address this argument because the notion of context generated specific error variance is not a statistical construct and S&S do not explain what they are talking about.

Finally, it is not clear why meta-analysis of replication studies cannot be used to estimate the generalized strength of a causal influence, which I believe to mean “an effect size”?  Earlier S&S alluded to meta-analysis as a way to resolve inconsistencies in the literature, but now they seem to suggest that meta-analysis cannot be used.

If S&S really want to imply that meta-analyses are useless, it is unclear how they would make sense of inconsistent findings. The only viable solution seems to be to avoid inconsistencies by suppressing non-significant results in order to give the impression that every theory in social psychology is correct because theoretical predictions are always confirmed. Although this sounds absurd, it is the inevitable logical consequence of S&S’s claim that non-significant results are uninformative, even if over 20 labs independently and in combination failed to provide evidence for a theoretically predicted effect.

The Great History of Social Psychological Theories

S&S next present Über-social psychologist, Leon Festinger, as an example why theories are good and failed studies are bad.  The argument is that good theories make correct predictions, even if bad studies fail to show the effect.

“Although their theoretical analysis was valid, it took a decade before researchers were able to reliably replicate the findings reported by Festinger and Carlsmith (1959).”

As a former student, I was surprised by this statement because I had learned that Festinger’s theory was challenged by Bem’s theory and that social psychologists had been unable to resolve which of the two theories was correct.  Couldn’t some of these replication failures be explained by the fact that Festinger’s theory sometimes made the wrong prediction?

It is also not surprising that researchers had a hard time replicating Festinger and Carlsmith’s original findings. The reason is that the original study had low statistical power and replication failures are expected even if the theory is correct. Finally, I have been around social psychologists long enough to have heard some rumors about Festinger and Carlsmith’s original studies. Accordingly, some of Festinger’s graduate students also tried and failed to get the effect. Carlsmith was the ‘lucky’ one who got the effect, in one study p < .05, and he became the co-author of one of the most cited articles in the history of social psychology. Naturally, Festinger did not publish the failed studies of his other graduate students because surely they must have done something wrong. As I said, that is a rumor. Even if the rumor is not true, and Carlsmith got lucky on the first try, luck played a factor and nobody should expect that a study replicates simply because a single published study reported a p-value less than .05.

Failed Replications Did Not Influence Social Psychological Theories

Argument quality reaches a new low with the next argument against replication studies.

“If we look at the history of social psychology, theories have rarely been abandoned because of failed replications.”

This is true, but it reveals the lack of progress in theory development in social psychology rather than the futility of replication studies.  From an evolutionary perspective, theory development requires selection pressure, but publication bias protects bad theories from failure.

The short history of open science shows how weak social psychological theories are and that even the most basic predictions cannot be confirmed in open replication studies that do not selectively report significant results.  So, even if it is true that failed replications have played a minor role in the past of social psychology, they are going to play a much bigger role in the future of social psychology.

The Red Herring: Fraud

S&S imply that Roediger suggested using replication studies as a fraud detection tool.

if others had tried to replicate his [Stapel’s] work soon after its publication, his misdeeds might have been uncovered much more quickly

S&S dismiss this idea in part on the basis of Stroebe’s research on fraud detection.

To their own surprise, Stroebe and colleagues found that replications hardly played any role in the discovery of these fraud cases.

Now this is actually not surprising because failed replications were hardly ever published.  And if there is no variance in a predictor variable (significance), we cannot see a correlation between the predictor variable and an outcome (fraud).  Although failed replication studies may help to detect fraud in the future, this is neither their primary purpose, nor necessary to make replication studies valuable. Replication studies also do not bring world peace or bring an end to global warming.

For some inexplicable reason S&S continue to focus on fraud. For example, they also argue that meta-analyses are poor fraud detectors, which is as true as it is irrelevant.

They conclude their discussion with an observation by Stapel, who famously faked 50+ articles in social psychology journals.

As Stapel wrote in his autobiography, he was always pleased when his invented findings were replicated: “What seemed logical and was fantasized became true” (Stapel, 2012). Thus, neither can failures to replicate a research finding be used as indicators of fraud, nor can successful replications be invoked as indication that the original study was honestly conducted.

I am not sure why S&S spend so much time talking about fraud, but it is the only questionable research practice that they openly address.  In contrast, they do not discuss other questionable research practices, including suppressing failed studies, that are much more prevalent and much more important for the understanding of the replication crisis in social psychology than fraud.  The term “publication bias” is not mentioned once in the article. Sometimes what is hidden is more significant than what is being published.

Conclusion

The conclusion section correctly predicts that the results of the reproducibility project will make social psychology look bad and that social psychology will look worse than other areas of psychology.

But whereas it will certainly be useful to be informed about studies that are difficult to replicate, we are less confident about whether the investment of time and effort of the volunteers of the Open Science Collaboration is well spent on replicating studies published in three psychology journals. The result will be a reproducibility coefficient that will not be greatly informative, because of justified doubts about whether the “exact” replications succeeded in replicating the theoretical conditions realized in the original research.

As social psychologists, we are particularly concerned that one of the outcomes of this effort will be that results from our field will be perceived to be less “reproducible” than research in other areas of psychology. This is to be expected because for the reasons discussed earlier, attempts at “direct” replications of social psychological studies are less likely than exact replications of experiments in psychophysics to replicate the theoretical conditions that were established in the original study.

Although psychologists should not be complacent, there seem to be no reasons to panic the field into another crisis. Crises in psychology are not caused by methodological flaws but by the way people talk about them (Kruglanski & Stroebe, 2012).

S&S attribute the foreseen (how did they know?) bad outcome in the reproducibility project to the difficulty of replicating social psychological studies, but they fail to explain why social psychology journals publish as many successes as other disciplines.

The results of the reproducibility project provide an answer to this question.  Social psychologists use designs with less statistical power that have a lower chance of producing a significant result. Selection for significance ensures that the success rate is equally high in all areas of psychology, but lower power makes these successes less replicable.
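The logic of this argument can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: every study tests a true effect with a z-test, only significant results are published, and each published study is replicated once with the same design. The power values of .30 and .80 and the function name `simulate_field` are hypothetical choices for illustration; they are not estimates from the reproducibility project.

```python
import random
from statistics import NormalDist

# Two-sided significance criterion for alpha = .05 (about 1.96).
CRIT = NormalDist().inv_cdf(0.975)


def simulate_field(power, n_studies=20000, seed=1):
    """Run studies, publish only significant ones, then replicate the published ones."""
    rng = random.Random(seed)
    # Noncentrality that gives roughly the requested power
    # (the negligible probability in the opposite tail is ignored).
    ncp = CRIT + NormalDist().inv_cdf(power)
    published = 0
    replicated = 0
    for _ in range(n_studies):
        z_original = rng.gauss(ncp, 1.0)
        if abs(z_original) <= CRIT:
            continue  # non-significant result stays in the file drawer
        published += 1
        z_replication = rng.gauss(ncp, 1.0)  # same design, new sample
        if abs(z_replication) > CRIT:
            replicated += 1
    return {
        "studies_run": n_studies,
        "studies_published": published,
        # Equal to 1.0 by construction: only significant results are published.
        "journal_success_rate": 1.0 if published else 0.0,
        "replication_rate": replicated / published if published else 0.0,
    }


if __name__ == "__main__":
    for label, power in [("lower power", 0.30), ("higher power", 0.80)]:
        print(label, simulate_field(power))
```

In both hypothetical fields the journals show only successes, but the share of published findings that replicate stays close to the power of the original design.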

To avoid further embarrassments in an increasingly open science, social psychologists must improve the statistical power of their studies. Which social psychological theories will survive actual empirical tests in the new world of open science is unclear.  In this regard, I think it makes more sense to compare social psychology to a ship wreck than a train wreck.  Somewhere down on the floor of the ocean is some gold. But it will take some deep diving and many failed attempts to find it.  Good luck!

Appendix

S&S’s article was published in a “prestigious” psychology journal and has already garnered 114 citations. It ranks #21 in my importance rankings of articles in meta-psychology.  So, I was curious why the article gets cited.  The appendix lists 51 citing articles with the relevant citation and the reason for citing S&S’s article.   The table shows the reasons for citations in decreasing order of frequency.

S&S are most frequently cited for the claim that exact replications are impossible, followed by the reason for this claim, namely that effects in psychological research are sensitive to the unique context in which a study is conducted. The next two reasons for citing the article are that only conceptual replications (CR) test theories, whereas the results of exact replications (ER) are uninformative. The problem is that every study is a conceptual replication because exact replications are impossible. So, even if exact replications were uninformative, this claim has no practical relevance because there are no exact replications. Some articles cite S&S with no specific claim attached to the citation. Only two articles cite them for the claim that there is no replication crisis, and only one article cites S&S for the claim that there is no evidence about the prevalence of QRPs. In short, the article is mostly cited for the uncontroversial and inconsequential claim that exact replications are impossible and that effect sizes in psychological studies can vary as a function of unique features of a particular sample or study. This observation is inconsequential because it is unclear how unknown unique characteristics of studies influence results. The main implication of this observation is that study results will be more variable than we would expect from a set of exact replication studies. For this reason, meta-analysts often use random-effects models, because fixed-effects meta-analysis assumes that all studies are exact replications.
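The difference between the two meta-analytic models can be made concrete with a short sketch. It uses the common DerSimonian-Laird estimator of between-study variance, which is one of several possible estimators and is chosen here only for illustration. The function names and the effect sizes in the usage example are hypothetical.

```python
import math


def fixed_effect(effects, variances):
    """Inverse-variance weighted mean; assumes all studies share one true effect."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * y for w, y in zip(weights, effects)) / total
    return mean, math.sqrt(1.0 / total)


def random_effects_dl(effects, variances):
    """Random-effects mean with the DerSimonian-Laird estimate of tau^2."""
    k = len(effects)
    if k < 2:
        raise ValueError("need at least two studies to estimate heterogeneity")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fe_mean, _ = fixed_effect(effects, variances)
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(w * (y - fe_mean) ** 2 for w, y in zip(weights, effects))
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, floored at zero
    re_weights = [1.0 / (v + tau2) for v in variances]
    re_total = sum(re_weights)
    mean = sum(w * y for w, y in zip(re_weights, effects)) / re_total
    return mean, math.sqrt(1.0 / re_total), tau2


if __name__ == "__main__":
    # Hypothetical standardized effects from five studies with equal sampling variance.
    effects = [0.10, 0.50, 0.30, 0.70, -0.10]
    variances = [0.01] * 5
    print("fixed effect:  ", fixed_effect(effects, variances))
    print("random effects:", random_effects_dl(effects, variances))
```

When the studies vary more than sampling error alone would predict, the estimated between-study variance is positive and the random-effects standard error is larger than the fixed-effect one. When the studies behave like exact replications, the two models give the same answer.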

| Reason for citation | Number of citations |
|---|---|
| ER impossible | 11 |
| Contextual Sensitivity | 8 |
| CR test theory | 8 |
| ER uninformative | 7 |
| Mention | 6 |
| ER/CR Distinction | 2 |
| No replication crisis | 2 |
| Disagreement | 1 |
| CR Definition | 1 |
| ER informative | 1 |
| ER useful for applied research | 1 |
| ER cannot detect fraud | 1 |
| No evidence about prevalence of QRP | 1 |
| Contextual sensitivity greater in social psychology | 1 |

The list below gives the most influential citing articles and the relevant citation. I haven’t had time to do a content analysis, but the article is mostly cited to say that (a) exact replications are impossible, (b) conceptual replications are valuable, and (c) social psychological findings are harder to replicate. Few articles cite the article to claim that the replication crisis is overblown or that failed replications are uninformative. Thus, even though the article is cited a lot, it is not cited for the main points S&S tried to make. The high number of citations therefore does not mean that S&S’s claims have been widely accepted.

(Disagreement)
The value of replication studies.

Simons, DJ.
“In this commentary, I challenge these claims.”

(ER/CR Distinction)
Bilingualism and cognition.

Valian, V.
“A host of methodological issues should be resolved. One is whether the field should undertake exact replications, conceptual replications, or both, in order to determine the conditions under which effects are reliably obtained (Paap, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(Contextual Sensitivity)
Is Psychology Suffering From a Replication Crisis? What Does “Failure to Replicate” Really Mean?
Maxwell et al. (2015)
“A particular replication may fail to confirm the results of an original study for a variety of reasons, some of which may include intentional differences in procedures, measures, or samples as in a conceptual replication (Cesario, 2014; Simons, 2014; Stroebe & Strack, 2014).”

(ER impossible)
The Chicago face database: A free stimulus set of faces and norming data

Debbie S. Ma, Joshua Correll, & Bernd Wittenbrink.
“The CFD will also make it easier to conduct exact replications, because researchers can use the same stimuli employed by other researchers (but see Stroebe & Strack, 2014).”

(Contextual Sensitivity)
“Contextual sensitivity in scientific reproducibility”
Van Bavel et al. (2015)
“Many scientists have also argued that the failure to reproduce results might reflect contextual differences—often termed “hidden moderators”—between the original research and the replication attempt”

(Contextual Sensitivity)
Editorial Psychological Science

Lindsay
As Nosek and his coauthors made clear, even ideal replications of ideal studies are expected to fail some of the time (Francis, 2012), and failure to replicate a previously observed effect can arise from differences between the original and replication studies and hence do not necessarily indicate flaws in the original study (Maxwell, Lau, & Howard, 2015; Stroebe & Strack, 2014). Still, it seems likely that psychology journals have too often reported spurious effects arising from Type I errors (e.g., Francis, 2014).

(ER impossible)
Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science

Finkel et al. (2015).
“Nevertheless, many scholars believe that direct replications are impossible in the human sciences—S&S (2014) call them “an illusion”— because certain factors, such as a moment in historical time or the precise conditions under which a sample was obtained and tested, that may have contributed to a result can never be reproduced identically.”

Conceptualizing and evaluating the replication of research results
Fabrigar and Wegener (2016)
(CR test theory)
“Traditionally, the primary presumed strength of conceptual replications has been their ability to address issues of construct validity (e.g., Brewer & Crano, 2014; Schmidt, 2009; Stroebe & Strack, 2014).”

(ER impossible)
“First, it should be recognized that an exact replication in the strictest sense of the term can never be achieved as it will always be impossible to fully recreate the contextual factors and participant characteristics present in the original experiment (see Schmidt (2009); S&S (2014)).”

(Contextual Sensitivity)
“S&S (2014) have argued that there is good reason to expect that many traditional and contemporary experimental manipulations in social psychology would have different psychological properties and effects if used in contexts or populations different from the original experiments for which they were developed. For example, classic dissonance manipulations and fear manipulations or more contemporary priming procedures might work very differently if used in new contexts and/or populations. One could generate many additional examples beyond those mentioned by S&S.”

(ER impossible)
“Another important point illustrated by the above example is that the distinction between exact and conceptual replications is much more nebulous than many discussions of replication would suggest. Indeed, some critics of the exact/conceptual replication distinction have gone so far as to argue that the concept of exact replication is an “illusion” (Stroebe & Strack, 2014). Though we see some utility in the exact/conceptual distinction (especially regarding the goal of the researcher in the work), we agree with the sentiments expressed by S&S. Classifying studies on the basis of the exact/conceptual distinction is more difficult than is often appreciated, and the presumed strengths and weaknesses of the approaches are less straightforward than is often asserted or assumed.”

(Contextual Sensitivity)
“Furthermore, assuming that these failed replication experiments have used the same operationalizations of the independent and dependent variables, the most common inference drawn from such failures is that confidence in the existence of the originally demonstrated effect should be substantially undermined (e.g., see Francis (2012); Schimmack (2012)). Alternatively, a more optimistic interpretation of such failed replication experiments could be that the failed versus successful experiments differ as a function of one or more unknown moderators that regulate the emergence of the effect (e.g., Cesario, 2014; Stroebe & Strack, 2014).”

Replicating Studies in Which Samples of Participants Respond to Samples of Stimuli.
(CR Definition)
Westfall et al. (2015).
“Nevertheless, the original finding is considered to be conceptually replicated if it can be convincingly argued that the same theoretical constructs thought to account for the results of the original study also account for the results of the replication study (Stroebe & Strack, 2014). Conceptual replications are thus “replications” in the sense that they establish the reproducibility of theoretical interpretations.”

(Mention)
“Although establishing the generalizability of research findings is undoubtedly important work, it is not the focus of this article (for opposing viewpoints on the value of conceptual replications, see Pashler & Harris, 2012; Stroebe & Strack, 2014).”

Introduction to the Special Section on Advancing Our Methods and Practices
(Mention)
Ledgerwood, A.
We can and surely should debate which problems are most pressing and which solutions most suitable (e.g., Cesario, 2014; Fiedler, Kutzner, & Krueger, 2012; Murayama, Pekrun, & Fiedler, 2013; Stroebe & Strack, 2014). But at this point, most can agree that there are some real problems with the status quo.

***Theory Building, Replication, and Behavioral Priming: Where Do We Need to Go From Here?
Locke, EA
(ER impossible)
As can be inferred from Table 1, I believe that the now popular push toward “exact” replication (e.g., see Simons, 2014) is not the best way to go. Everyone agrees that literal replication is impossible (e.g., Stroebe & Strack, 2014), but let us assume it is as close as one can get. What has been achieved?

The War on Prevention: Bellicose Cancer Metaphors Hurt (Some) Prevention Intentions
(CR test theory)
David J. Hauser and Norbert Schwarz
“As noted in recent discussions (Stroebe & Strack, 2014), consistent effects of multiple operationalizations of a conceptual variable across diverse content domains are a crucial criterion for the robustness of a theoretical approach.”

ON THE OTHER SIDE OF THE MIRROR: PRIMING IN COGNITIVE AND SOCIAL PSYCHOLOGY
Doyen et al.
(CR test theory)
“In contrast, social psychologists assume that the primes activate culturally and situationally contextualized representations (e.g., stereotypes, social norms), meaning that they can vary over time and culture and across individuals. Hence, social psychologists have advocated the use of “conceptual replications” that reproduce an experiment by relying on different operationalizations of the concepts under investigation (Stroebe & Strack, 2014). For example, in a society in which old age is associated not with slowness but with, say, talkativeness, the outcome variable could be the number of words uttered by the subject at the end of the experiment rather than walking speed.”

***Welcome back Theory
Ap Dijksterhuis
(ER uninformative)
“it is unavoidable, and indeed, this commentary is also about replication—it is done against the background of something we had almost forgotten: theory! S&S (2014, this issue) argue that focusing on the replication of a phenomenon without any reference to underlying theoretical mechanisms is uninformative”

On the scientific superiority of conceptual replications for scientific progress
Christian S. Crandall, Jeffrey W. Sherman
(ER impossible)
But in matters of social psychology, one can never step in the same river twice—our phenomena rely on culture, language, socially primed knowledge and ideas, political events, the meaning of questions and phrases, and an ever-shifting experience of participant populations (Ramscar, 2015). At a certain level, then, all replications are “conceptual” (Stroebe & Strack, 2014), and the distinction between direct and conceptual replication is continuous rather than categorical (McGrath, 1981). Indeed, many direct replications turn out, in fact, to be conceptual replications. At the same time, it is clear that direct replications are based on an attempt to be as exact as possible, whereas conceptual replications are not.

***Are most published social psychological findings false?
Stroebe, W.
(ER uninformative)
“This near doubling of replication success after combining original and replication effects is puzzling. Because these replications were already highly powered, the increase is unlikely to be due to the greater power of a meta-analytic synthesis. The two most likely explanations are quality problems with the replications or publication bias in the original studies. An evaluation of the quality of the replications is beyond the scope of this review and should be left to the original authors of the replicated studies. However, the fact that all replications were exact rather than conceptual replications of the original studies is likely to account to some extent for the lower replication rate of social psychological studies (Stroebe & Strack, 2014). There is no evidence either to support or to reject the second explanation.”

(ER impossible)
“All four projects relied on exact replications, often using the material used in the original studies. However, as I argued earlier (Stroebe & Strack, 2014), even if an experimental manipulation exactly replicates the one used in the original study, it may not reflect the same theoretical variable.”

(CR test theory)
“Gergen’s argument has important implications for decisions about the appropriateness of conceptual compared to exact replication. The more a phenomenon is susceptible to historical change, the more conceptual replication rather than exact replication becomes appropriate (Stroebe & Strack, 2014).”

(CR test theory)
“Moonesinghe et al. (2007) argued that any true replication should be an exact replication, “a precise process where the exact same finding is reexamined in the same way”. However, conceptual replications are often more informative than exact replications, at least in studies that are testing theoretical predictions (Stroebe & Strack, 2014). Because conceptual replications operationalize independent and/or dependent variables in a different way, successful conceptual replications increase our trust in the predictive validity of our theory.”

There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance
Anderson & Maxwell
(Mention)
“It is important to note some caveats regarding direct (exact) versus conceptual replications. While direct replications were once avoided for lack of originality, authors have recently urged the field to take note of the benefits and importance of direct replication. According to Simons (2014), this type of replication is “the only way to verify the reliability of an effect” (p. 76). With respect to this recent emphasis, the current article will assume direct replication. However, despite the push toward direct replication, some have still touted the benefits of conceptual replication (Stroebe & Strack, 2014). Importantly, many of the points and analyses suggested in this paper may translate well to conceptual replication.”

Reconceptualizing replication as a sequence of different studies: A replication typology
Joachim Hüffmeier, Jens Mazei, Thomas Schultze
(ER impossible)
The first type of replication study in our typology encompasses exact replication studies conducted by the author(s) of an original finding. Whereas we must acknowledge that replications can never be “exact” in a literal sense in psychology (Cesario, 2014; Stroebe & Strack, 2014), exact replications are studies that aspire to be comparable to the original study in all aspects (Schmidt, 2009). Exact replications—at least those that are not based on questionable research practices such as the arbitrary exclusion of critical outliers, sampling or reporting biases (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)—serve the function of protecting against false positive effects (Type I errors) right from the start.

(ER informative)
“Thus, this replication constitutes a valuable contribution to the research process. In fact, already some time ago, Lykken (1968; see also Mummendey, 2012) recommended that all experiments should be replicated before publication. From our perspective, this recommendation applies in particular to new findings (i.e., previously uninvestigated theoretical relations), and there seems to be some consensus that new findings should be replicated at least once, especially when they were unexpected, surprising, or only loosely connected to existing theoretical models (Stroebe & Strack, 2014; see also Giner-Sorolla, 2012; Murayama et al., 2014).”

(Mention)
Although there is currently some debate about the epistemological value of close replication studies (e.g., Cesario, 2014; LeBel & Peters, 2011; Pashler & Harris, 2012; Simons, 2014; Stroebe & Strack, 2014), the possibility that each original finding can—in principal—be replicated by the scientific community represents a cornerstone of science (Kuhn, 1962; Popper, 1992).”

(CR test theory)
So far, we have presented “only” the conventional rationale used to stress the importance of close replications. Notably, however, we will now add another—and as we believe, logically necessary—point originally introduced by S&S (2014). This point protects close replications from being criticized (cf. Cesario, 2014; Stroebe & Strack, 2014; see also LeBel & Peters, 2011). Close replications can be informative only as long as they ensure that the theoretical processes investigated or at least invoked by the original study are shown to also operate in the replication study.

(CR test theory)
“The question of how to conduct a close replication that is maximally informative entails a number of methodological choices. It is important to both adhere to the original study proceedings (Brandt et al., 2014; Schmidt, 2009) and focus on and meticulously measure the underlying theoretical mechanisms that were shown or at least proposed in the original studies (Stroebe & Strack, 2014). In fact, replication attempts are most informative when they clearly demonstrate either that the theoretical processes have unfolded as expected or at which point in the process the expected results could no longer be observed (e.g., a process ranging from a treatment check to a manipulation check and [consecutive] mediator variables to the dependent variable). Taking these measures is crucial to rule out that a null finding is simply due to unsuccessful manipulations or changes in a manipulation’s meaning and impact over time (cf. Stroebe & Strack, 2014).”

(CR test theory)
“Conceptual replications in laboratory settings are the fourth type of replication study in our typology. In these replications, comparability to the original study is aspired to only in the aspects that are deemed theoretically relevant (Schmidt, 2009; Stroebe & Strack, 2014). In fact, most if not all aspects may differ as long as the theoretical processes that have been studied or at least invoked in the original study are also covered in a conceptual replication study in the laboratory.”

(ER useful for applied research)
“For instance, conceptual replications may be less important for applied disciplines that focus on clinical phenomena and interventions. Here, it is important to ensure that there is an impact of a specific intervention and that the related procedure does not hurt the members of the target population (e.g., Larzelere et al., 2015; Stroebe & Strack, 2014).”

From intrapsychic to ecological theories in social psychology: Outlines of a functional theory approach
Klaus Fiedler
(ER uninformative)
“Replicating an ill-understood finding is like repeating a complex sentence in an unknown language. Such a “replication” in the absence of deep understanding may appear funny, ridiculous, and embarrassing to a native speaker, who has full control over the foreign language. By analogy, blindly replicating or running new experiments on an ill-understood finding will rarely create real progress (cf. Stroebe & Strack, 2014).”

Into the wild: Field research can increase both replicability and real-world impact
Jon K. Maner
(CR test theory)
“Although studies relying on homogeneous samples of laboratory or online participants might be highly replicable when conducted again in a similar homogeneous sample of laboratory or online participants, this is not the key criterion (or at least not the only criterion) on which we should judge replicability (Westfall, Judd & Kenny, 2015; see also Brandt et al., 2014; Stroebe & Strack, 2014). Just as important is whether studies replicate in samples that include participants who reflect the larger and more diverse population.”

Romance, Risk, and Replication: Can Consumer Choices and Risk-Taking Be Primed by Mating Motives?
Shanks et al.
(ER impossible)
There is no such thing as an “exact” replication (Stroebe & Strack, 2014) and hence it must be acknowledged that the published studies (notwithstanding the evidence for p-hacking and/or publication bias) may have obtained genuine effects and that undetected moderator variables explain why the present studies failed to obtain priming.   Some of the experiments reported here differed in important ways from those on which they were modeled (although others were closer replications and even these failed to obtain evidence of reliable romantic priming).

(CR test theory)
“As S&S (2014) point out, what is crucial is not so much exact surface replication but rather identical operationalization of the theoretically relevant variables. In the present case, the crucial factors are the activation of romantic motives and the appropriate assessment of consumption, risk-taking, and other measures.”

A Duty to Describe: Better the Devil You Know Than the Devil You Don’t
Brown, Sacha D et al.
(Mention)
Ioannidis (2005) has been at the forefront of researchers identifying factors interfering with self-correction. He has claimed that journal editors selectively publish positive findings and discriminate against study replications, permitting errors in data and theory to enjoy a long half-life (see also Ferguson & Brannick, 2012; Ioannidis, 2008, 2012; Shadish, Doherty, & Montgomery, 1989; Stroebe & Strack, 2014). We contend there are other equally important, yet relatively unexplored, problems.

A Room with a Viewpoint Revisited: Descriptive Norms and Hotel Guests’ Towel Reuse Behavior
(Contextual Sensitivity)
Bohner, Gerd; Schlueter, Lena E.
On the other hand, our pilot participants’ estimates of towel reuse rates were generally well below 75%, so we may assume that the guests participating in our experiments did not perceive the normative messages as presenting a surprisingly low figure. In a more general sense, the issue of greatly diverging baselines points to conceptual issues in trying to devise a ‘‘direct’’ replication: Identical operationalizations simply may take on different meanings for people in different cultures.

***The empirical benefits of conceptual rigor: Systematic articulation of conceptual hypotheses can reduce the risk of non-replicable results (and facilitate novel discoveries too)
Mark Schaller
(Contextual Sensitivity)
Unless these subsequent studies employ methods that exactly replicate the idiosyncratic context in which the effect was originally detected, these studies are unlikely to replicate the effect. Indeed, because many psychologically important contextual variables may lie outside the awareness of researchers, even ostensibly “exact” replications may fail to create the conditions necessary for a fragile effect to emerge (Stroebe & Strack, 2014).

A Concise Set of Core Recommendations to Improve the Dependability of Psychological Research
David A. Lishner
(CR test theory)
The claim that direct replication produces more dependable findings across replicated studies than does conceptual replication seems contrary to conventional wisdom that conceptual replication is preferable to direct replication (Dijksterhuis, 2014; Neulip & Crandall, 1990, 1993a, 1993b; Stroebe & Strack, 2014).
(CR test theory)
However, most arguments advocating conceptual replication over direct replication are attempting to promote the advancement or refinement of theoretical understanding (see Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014). The argument is that successful conceptual replication demonstrates a hypothesis (and by extension the theory from which it derives) is able to make successful predictions even when one alters the sampled population, setting, operations, or data analytic approach. Such an outcome not only suggests the presence of an organizing principle, but also the quality of the constructs linked by the organizing principle (their theoretical meanings). Of course this argument assumes that the consistency across the replicated findings is not an artifact of data acquisition or data analytic approaches that differ among studies. The advantage of direct replication is that regardless of how flexible or creative one is in data acquisition or analysis, the approach is highly similar across replication studies. This duplication ensures that any false finding based on using a flexible approach is unlikely to be repeated multiple times.

(CR test theory)
Does this mean conceptual replication should be abandoned in favor of direct replication? No, absolutely not. Conceptual replication is essential for the theoretical advancement of psychological science (Dijksterhuis, 2014; Murayama et al., 2014; Stroebe & Strack, 2014), but only if dependability in findings via direct replication is first established (Cesario, 2014; Simons, 2014). Interestingly, in instances where one is able to conduct multiple studies for inclusion in a research report, one approach that can produce confidence in both dependability of findings and theoretical generalizability is to employ nested replications.

(ER cannot detect fraud)
A second advantage of direct replications is that they can protect against fraudulent findings (Schmidt, 2009), particularly when different research groups conduct direct replication studies of each other’s research. S&S (2014) make a compelling argument that direct replication is unlikely to prove useful in detection of fraudulent research. However, even if a fraudulent study remains unknown or undetected, its impact on the literature would be lessened when aggregated with nonfraudulent direct replication studies conducted by honest researchers.

***Does cleanliness influence moral judgments? Response effort moderates the effect of cleanliness priming on moral judgments.
Huang
(ER uninformative)
Indeed, behavioral priming effects in general have been the subject of increased scrutiny (see Cesario, 2014), and researchers have suggested different causes for failed replication, such as measurement and sampling errors (Stanley and Spence, 2014), variation in subject populations (Cesario, 2014), discrepancy in operationalizations (S&S, 2014), and unidentified moderators (Dijksterhuis, 2014).

UNDERSTANDING PRIMING EFFECTS IN SOCIAL PSYCHOLOGY: AN OVERVIEW AND INTEGRATION
Daniel C. Molden
(ER uninformative)
Therefore, some greater emphasis on direct replication in addition to conceptual replication is likely necessary to maximize what can be learned from further research on priming (but see Stroebe and Strack, 2014, for costs of overemphasizing direct replication as well).

On the automatic link between affect and tendencies to approach and avoid: Chen and Bargh (1999) revisited
Mark Rotteveel et al.
(No replication crisis)
Although opinions differ with regard to the extent of this “replication crisis” (e.g., Pashler and Harris, 2012; S&S, 2014), the scientific community seems to be shifting its focus more toward direct replication.

(ER uninformative)

Single-Paper Meta-Analysis: Benefits for Study Summary, Theory Testing, and Replicability
McShane and Bockenholt
(ER impossible)
The purpose of meta-analysis is to synthesize a set of studies of a common phenomenon. This task is complicated in behavioral research by the fact that behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999).

(ER impossible)
Further, because behavioral research studies can never be direct or exact replications of one another (Brandt et al. 2014; Fabrigar and Wegener 2016; Rosenthal 1991; S&S 2014; Tsang and Kwan 1999), our SPM methodology estimates and accounts for heterogeneity, which has been shown to be important in a wide variety of behavioral research settings (Hedges and Pigott 2001; Klein et al. 2014; Pigott 2012).

A Closer Look at Social Psychologists’ Silver Bullet: Inevitable and Evitable Side Effects of the Experimental Approach
Herbert Bless and Axel M. Burger
(ER/CR Distinction)
Given the above perspective, it becomes obvious that in the long run, conceptual replications can provide very fruitful answers because they address the question of whether the initially observed effects are potentially caused by some perhaps unknown aspects of the experimental procedure (for a discussion of conceptual versus direct replications, see e.g., Stroebe & Strack, 2014; see also Brandt et al., 2014; Cesario, 2014; Lykken, 1968; Schwarz & Strack, 2014). Whereas conceptual replications are adequate solutions for broadening the sample of situations (for examples, see Stroebe & Strack, 2014), the present perspective, in addition, emphasizes that it is important that the different conceptual replications do not share too much overlap in general aspects of the experiment (see also Schwartz, 2015, advocating for conceptual replications).

Men in red: A reexamination of the red-attractiveness effect
Vera M. Hesslinger, Lisa Goldbach, & Claus-Christian Carbon
(ER impossible)
As Brandt et al. (2014) pointed out, a replication in psychological research will never be absolutely exact or direct (see also, Stroebe & Strack, 2014), which is, of course, also the case in the present research.

***On the challenges of drawing conclusions from p-values just below 0.05
Daniel Lakens
(No evidence about prevalence of QRP)
In recent years, researchers have become more aware of how flexibility during the data-analysis can increase false positive results (e.g., Simmons, Nelson & Simonsohn, 2011). If the true Type 1 error rate is substantially inflated, for example because researchers analyze their data until a p-value smaller than 0.05 is observed, the robustness of scientific knowledge can substantially decrease. However, as Stroebe & Strack (2014, p. 60) have pointed out: ‘Thus far, however, no solid data exist on the prevalence of such research practices.’

***Does Merely Going Through the Same Moves Make for a ‘‘Direct’’ Replication? Concepts, Contexts, and Operationalizations
Norbert Schwarz and Fritz Strack
(Contextual Sensitivity)
In general, meaningful replications need to realize the psychological conditions of the original study. The easier option of merely running through technically identical procedures implies the assumption that psychological processes are context insensitive and independent of social, cultural, and historical differences (Cesario, 2014; Stroebe & Strack, 2014). Few social (let alone cross-cultural) psychologists would be willing to endorse this assumption with a straight face. If so, mere procedural equivalence is an insufficient criterion for assessing the quality of a replication.

The Replication Paradox: Combining Studies can Decrease Accuracy of Effect Size Estimates
Michèle B. Nuijten, Marcel A. L. M. van Assen, Coosje L. S. Veldkamp, and Jelte M. Wicherts
(ER uninformative)
Replications with nonsignificant results are easily dismissed with the argument that the replication might contain a confound that caused the null finding (Stroebe & Strack, 2014).

Retro-priming, priming, and double testing: psi and replication in a test-retest design
Rabeyron, T
(Mention)
Bem’s paper spawned numerous attempts to replicate it (see e.g., Galak et al., 2012; Bem et al., submitted) and reflections on the difficulty of direct replications in psychology (Ritchie et al., 2012). This aspect has been associated more generally with debates concerning the “decline effect” in science (Schooler, 2011) and a potential “replication crisis” (S&S, 2014) especially in the fields of psychology and medical sciences (De Winter and Happee, 2013).

Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Mark Rubin
(ER impossible)
Consequently, the Type I error rate remains constant if researchers simply repeat the same test over and over again using different samples that have been randomly drawn from the exact same population. However, this first situation is somewhat hypothetical and may even be regarded as impossible in the social sciences because populations of people change over time and location (e.g., Gergen, 1973; Iso-Ahola, 2017; Schneider, 2015; Serlin, 1987; Stroebe & Strack, 2014). Yesterday’s population of psychology undergraduate students from the University of Newcastle, Australia, will be a different population to today’s population of psychology undergraduate students from the University of Newcastle, Australia.

***Learning and the replicability of priming effects
Michael Ramscar
(ER uninformative)
In the limit, this means that in the absence of a means for objectively determining what the information that produces a priming effect is, and for determining that the same information is available to the population in a replication, all learned priming effects are scientifically unfalsifiable. (Which also means that in the absence of an account of what the relevant information is in a set of primes, and how it produces a specific effect, reports of a specific priming result — or failures to replicate it — are scientifically uninformative; see also Stroebe & Strack, 2014.)

***Evaluating Psychological Research Requires More Than Attention to the N: A Comment on Simonsohn’s (2015) “Small Telescopes”
Norbert Schwarz and Gerald L. Clore
(CR test theory)
Simonsohn’s decision to equate a conceptual variable (mood) with its manipulation (weather) is compatible with the logic of clinical trials, but not with the logic of theory testing. In clinical trials, which have inspired much of the replicability debate and its statistical focus, the operationalization (e.g., 10 mg of a drug) is itself the variable of interest; in theory testing, any given operationalization is merely one, usually imperfect, way to realize the conceptual variable. For this reason, theory tests are more compelling when the results of different operationalizations converge (Stroebe & Strack, 2014), thus ensuring, in the case in point, that it is not “the weather” but indeed participants’ (sometimes weather-induced) mood that drives the observed effect.

Internal conceptual replications do not increase independent replication success
Kunert, R
(Contextual Sensitivity)
According to the unknown moderator account of independent replication failure, successful internal replications should correlate with independent replication success. This account suggests that replication failure is due to the fact that psychological phenomena are highly context-dependent, and replicating seemingly irrelevant contexts (i.e. unknown moderators) is rare (e.g., Barrett, 2015; DGPS, 2015; Fleming Crim, 2015; see also Stroebe & Strack, 2014; for a critique, see Simons, 2014). For example, some psychological phenomenon may unknowingly be dependent on time of day.

(Contextual Sensitivity greater in social psychology)
When the chances of unknown moderator influences are greater and replicability is achieved (internal, conceptual replications), then the same should be true when chances are smaller (independent, direct replications). Second, the unknown moderator account is usually invoked for social psychological effects (e.g. Cesario, 2014; Stroebe & Strack, 2014). However, the lack of influence of internal replications on independent replication success is not limited to social psychology. Even for cognitive psychology a similar pattern appears to hold.

On Klatzky and Creswell (2014): Saving Social Priming Effects But Losing Science as We Know It?
Barry Schwartz
(ER uninformative)
The recent controversy over what counts as “replication” illustrates the power of this presumption. Does “conceptual replication” count? In one respect, conceptual replication is a real advance, as conceptual replication extends the generality of the phenomena that were initially discovered. But what if it fails? Is it because the phenomena are unreliable, because the conceptual equivalency that justified the new study was logically flawed, or because the conceptual replication has permitted the intrusion of extraneous variables that obscure the original phenomenon? This ambiguity has led some to argue that there is no substitute for strict replication (see Pashler & Harris, 2012; Simons, 2014, and Stroebe & Strack, 2014, for recent manifestations of this controversy). A significant reason for this view, however, is less a critique of the logic of conceptual replication than it is a comment on the sociology (or politics, or economics) of science. As Pashler and Harris (2012) point out, publication bias virtually guarantees that successful conceptual replications will be published whereas failed conceptual replications will live out their lives in a file drawer.  I think Pashler and Harris’ surmise is probably correct, but it is not an argument for strict replication so much as it is an argument for publication of failed conceptual replication.

Commentary and Rejoinder on Lynott et al. (2014)
Lawrence E. Williams
(CR test theory)
On the basis of their investigations, Lynott and colleagues (2014) conclude ‘‘there is no evidence that brief exposure to warm therapeutic packs induces greater prosocial responding than exposure to cold therapeutic packs’’ (p. 219). This conclusion, however, does not take into account other related data speaking to the connection between physical warmth and prosociality. There is a fuller body of evidence to be considered, in which both direct and conceptual replications are instructive. The former are useful if researchers particularly care about the validity of a specific phenomenon; the latter are useful if researchers particularly care about theory testing (Stroebe & Strack, 2014).

The State of Social and Personality Science: Rotten to the Core, Not So Bad, Getting Better, or Getting Worse?
(no replication crisis)
Motyl et al. (2017) “The claim of a replicability crisis is greatly exaggerated.” Wolfgang Stroebe and Fritz Strack, 2014

Harry T. Reis, Karisa Y. Lee
(ER impossible)
Much of the current debate, however, is focused narrowly on direct or exact replications—whether the findings of a given study, carried out in a particular way with certain specific operations, would be repeated. Although exact replications are surely desirable, the papers by Fabrigar and by Crandall and Sherman remind us that in an absolute sense they are fundamentally impossible in social–personality psychology (see also S&S, 2014).

Show me the money
(Contextual Sensitivity)
Of course, it is possible that additional factors, which varied or could have varied among our studies and previously published studies (e.g., participants’ attitudes toward money) or among the online studies and laboratory study in this article (e.g., participants’ level of distraction), might account for these apparent inconsistencies. We did not aim to conduct a direct replication of any specific past study, and therefore we encourage special care when using our findings to evaluate existing ones (Doyen, Klein, Simons, & Cleeremans, 2014; Stroebe & Strack, 2014).

***From Data to Truth in Psychological Science. A Personal Perspective.
Strack
(ER uninformative)
In their introduction to the 2016 volume of the Annual Review of Psychology, Susan Fiske, Dan Schacter, and Shelley Taylor point out that a replication failure is not a scientific problem but an opportunity to find limiting conditions and contextual effects. To allow non-replications to regain this constructive role, they must come with conclusions that enter and stimulate a critical debate. It is even better if replication studies are endowed with a hypothesis that relates to the state of the scientific discourse. To show that an effect occurs only under one but not under another condition is more informative than simply demonstrating noneffects (S&S, 2014). But this may require expertise and effort.

# Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis.” Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).
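The 62.5% figure can be reconstructed with a short sketch. It assumes a normal approximation in which the test statistic is symmetric around its true value, so half of all estimates overestimate the population effect, and with power above 50% every one of those overestimates is also significant. The helper name `p.overestimate` is ours and purely illustrative.

```r
# Sketch under a normal approximation: z ~ N(ncp, 1), significant if z > z.crit.
z.crit <- qnorm(.975)

# Probability that a significant estimate overestimates the true effect.
# With power <= 50% the critical value lies above the true value, so every
# significant estimate is an overestimate.
p.overestimate <- function(power) {
  ifelse(power <= .5, 1, .5 / power)
}

# Simulation check for 80% power
set.seed(1)
ncp <- z.crit + qnorm(.80)            # noncentrality that yields 80% power
z <- rnorm(200000, mean = ncp)        # simulated test statistics
sig <- z > z.crit                     # selection for significance
sim.overestimate <- mean(z[sig] > ncp)

p.overestimate(c(.3, .5, .8))
sim.overestimate
```

The analytic value for 80% power is .5 / .8 = .625, and the simulated share of significant results above the true value should land close to it.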

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584). We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than a perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always attenuates observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated. We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the ratio of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is inversely related to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.
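The attenuation step can be illustrated with a minimal sketch. It assumes the classical test theory formula (observed correlation = true correlation times the square root of the product of the two reliabilities) and a standard noncentral-t approximation of power, which may differ slightly from the formula in the R-Code supplement. The function names and the sample size of 200 are illustrative choices, not values from the commentary.

```r
# Attenuation of a correlation by unreliable measures (classical test theory).
# rel.x and rel.y are the reliabilities of X and Y.
attenuate <- function(true.r, rel.x, rel.y = rel.x) {
  true.r * sqrt(rel.x * rel.y)
}

# Approximate power of the two-sided t-test for a correlation,
# counting only the upper rejection region of a noncentral t distribution.
power.r <- function(r, n) {
  df <- n - 2
  ncp <- r * sqrt(df) / sqrt(1 - r^2)
  1 - pt(qt(.975, df), df, ncp)
}

rel <- c(1, .8, .6, .4)
obs.r <- attenuate(.15, rel)      # .15, .12, .09, .06
pow <- power.r(obs.r, n = 200)
round(cbind(rel, obs.r, pow), 3)
```

Under these assumptions, power falls with every drop in reliability because the attenuated population correlation shrinks while the sample size stays fixed.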

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significance is a monotonic function of true power. It is straightforward to transform inflated median observed power into median observed effect sizes. We applied this approach to Loken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from the original 50 to 3050 to a range of 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.
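The relationship between true power and median observed power can be sketched as a small function. It unpacks the one-line expression for `inf.obs.pow` in the R-Code supplement into named steps and relies on the same normal approximation. The function name `med.obs.power` is ours.

```r
z.crit <- qnorm(.975)

# Median observed power among significant results, given true power.
med.obs.power <- function(true.power) {
  # mean of the z-distribution implied by true power
  ncp.z <- z.crit + qnorm(true.power)
  # median z among significant results: the quantile halfway through the
  # upper `true.power` share of that distribution
  med.z <- qnorm(1 - true.power / 2, mean = ncp.z)
  # convert the median z-value back into (observed) power
  pnorm(med.z, mean = z.crit)
}

true.power <- seq(.05, .95, .05)
round(cbind(true.power, obs = med.obs.power(true.power)), 3)
```

With 50% true power the function returns 75%, and the output rises monotonically with true power while always staying above it, which is the pattern described for Figure 1.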

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis. Consistent with classical test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result. The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance. Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).
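One way to use the monotonic function for deflation is to invert it numerically. The sketch below is an assumption about how such a correction could be implemented, not code from the commentary. It recovers true power from median observed power with `uniroot`, and `deflate.power` is a hypothetical helper name.

```r
z.crit <- qnorm(.975)

# Median observed power after selection, as a function of true power
# (same normal approximation as the supplement).
med.obs.power <- function(true.power) {
  pnorm(qnorm(1 - true.power / 2, mean = z.crit + qnorm(true.power)),
        mean = z.crit)
}

# Numerically invert the monotonic function to recover true power.
deflate.power <- function(obs.power) {
  uniroot(function(p) med.obs.power(p) - obs.power,
          interval = c(1e-6, 1 - 1e-6), tol = 1e-10)$root
}

deflate.power(.75)   # close to .50
```

The search interval does not cover median observed power values only slightly above 50%, so the helper fails for such inputs. Turning the deflated power into a deflated effect size additionally requires the sample size, as in the supplement.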

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science, 355(6325), 584-585. doi: 10.1126/science.aal3618

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N - 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) - sqrt(N - 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = "Inflated Observed Effect Size"

title = "Effect of Selection for Significance on Observed Effect Size"

### line colors, one per reliability level

col = rainbow(length(rel))

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type="l",xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab="Sample Size",ylab="Median Observed Effect Size After Selection for Significance",lwd=3,main=title)

### legend entries, positioned relative to x.max so they stay inside the plot

segments(x0 = .60*x.max,y0 = y.max-.05-i*.02, x1 = .65*x.max,col=col[i], lwd=5)

text(.73*x.max,y.max-.05-i*.02,paste0("Rel = ",format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

# Klaus Fiedler “it is beyond the scope of this article to discuss whether publication bias actually exists”

Urban Dictionary: Waffle

The article by Murayama, Pekrun, and Fiedler (MPK) discusses the probability of false positive results (evidence for an effect when no effect is present, also known as type-I error) in multiple study articles. When researchers conduct a single study the nominal probability of obtaining a significant result without a real effect (a type-I error) is typically set to 5% (p < .05, two-tailed). Thus, for every significant result one would expect 19 non-significant results. A false-positive finding (type-I error) would be followed by several failed replications. Thus, replication studies can quickly correct false discoveries. Or so, one would like to believe. However, traditionally journals reported only significant results. Thus, false positive results remained uncorrected in the literature because failed replications were not published.

In the 1990s, experimental psychologists who run relatively cheap studies found a solution to this problem. Journals demanded that researchers replicate their findings in a series of studies that were then published in a single article.

MPK point out that the probability of a type-I error decreases exponentially as the number of studies increases. With two studies, the probability is less than 1% (.05 * .05 = .0025). It is easier to see the exponential effect in terms of ratios (1 out of 20, 1 out of 400, 1 out of 8000, etc.). In top journals of experimental social psychology, a typical article contains four studies. The probability that all four studies produce a type-I error is only 1 out of 160,000. The corresponding value on a standard normal distribution is z = 4.52, which means the strength of evidence is 4.5 standard deviations away from 0, which represents the absence of an effect. In particle physics a value of z = 5 is used to rule out false-positives. Thus, getting 4 out of 4 significant results in four independent tests of an effect provides strong evidence for an effect.
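The arithmetic in this paragraph can be reproduced with a few lines of R. The sketch assumes independent studies, a two-tailed alpha of .05 in each study, and a two-tailed conversion of the joint probability into a z-value. The variable names are ours.

```r
alpha <- .05
k <- 1:4                                   # number of studies in the article

p.all.false <- alpha^k                     # all k results are type-I errors
one.in <- 1 / p.all.false                  # 20, 400, 8000, 160000

# z-value that corresponds to the joint probability (two-tailed)
z.two.tailed <- qnorm(p.all.false / 2, lower.tail = FALSE)

data.frame(k, p.all.false, one.in, z = round(z.two.tailed, 2))
```

For four studies the joint probability is 1 out of 160,000, and the corresponding z-value rounds to 4.52, matching the value in the text.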

I am in full agreement with MPK and I made the same point in Schimmack (2012). The only difference is that I also point out that there is no difference between a series of 4 studies with small samples (e.g., n = 20 in 2 conditions for a total of N = 40) and a single study with the total number of participants (N = 160). A real effect will produce stronger evidence for an effect as sample size increases. Getting four significant results at the 5% level is not more impressive than getting a single significant result at the p < .00001 level.

However, the strength of evidence from multiple study articles depends on one crucial condition. This condition is so elementary and self-evident that it is not even mentioned in statistics. The condition is that a researcher honestly reports all results. Four significant results are only impressive when a researcher went into the lab, conducted four studies, and obtained significant results in all studies. Similarly, 4 free throws are only impressive when there were only 4 attempts. 4 out of 20 free-throws is not that impressive and 4 out of 80 attempts is horrible. Thus, the absolute number of successes is not important. What matters is the relative frequency of successes for all attempts that were made.
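A small binomial sketch shows why the number of attempts matters. It assumes independent studies, no true effect, and alpha = .05, and it borrows the attempt counts (4, 20, 80) from the free throw example. The helper `p.at.least` is a hypothetical name used only for illustration.

```r
# Probability of at least k significant results in n attempts
# when every null hypothesis is true.
p.at.least <- function(k, n, alpha = .05) {
  pbinom(k - 1, size = n, prob = alpha, lower.tail = FALSE)
}

attempts <- c(4, 20, 80)
p.hits <- sapply(attempts, function(n) p.at.least(4, n))

data.frame(attempts, p.four.or.more = signif(p.hits, 3))
```

With only four attempts, four significant results have the 1 in 160,000 chance probability discussed above. With 80 unreported attempts, four or more significant results are more likely than not even when no effect exists.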

Schimmack (2012) developed the incredibility index to examine whether a set of significant results is based on honest reporting or whether it was obtained by omitting non-significant results or by using questionable statistical practices to produce significant results. Evidence for dishonest reporting of results would undermine the credibility of the published results.

MPK have the following to say about dishonest reporting of results.

“On a related note, Francis (2012a, 2012b, 2012c, 2012d; see also Schimmack, 2012) recently published a series of analyses that indicated the prevalence of publication bias (i.e., file-drawer problem) in multi-study papers in the psychological literature.” (p. 111).   They also note that Francis used a related method to reveal that many multiple-study articles show statistical evidence of dishonest reporting. “Francis argued that there may be many cases in which the findings reported in multi-study papers are too good to be true” (p. 111).

In short, Schimmack and Francis argued that multiple study articles can be misleading because they provide the illusion of replicability (a researcher was able to demonstrate the effect again, and again, and again, therefore it must be a robust effect), but in reality it is not clear how robust the effect is because the results were not obtained in the way the studies are described in the article (first we did Study 1, then we did Study 2, etc. and voila all of the studies worked and showed the effect).

One objection to Schimmack and Francis would be to find a problem with their method of detecting bias. However, MPK do not comment on the method at all. They sidestep this issue when they write “it is beyond the scope of this article to discuss whether publication bias actually exists in these articles or how prevalent it is in general” (p. 111).

After sidestepping the issue, MPK are faced with a dilemma or paradox. Do multiple study articles strengthen the evidence because the combined type-I error probability decreases, or do multiple study articles weaken the evidence because of the increased probability that researchers did not report the results of their research program honestly? “Should multi-study findings be regarded as reliable or shaky evidence?” (p. 111).

MPK solve this paradox with a semantic trick. First, they point out that dishonest reporting has undesirable effects on effect size estimates.

“A publication bias, if it exists, leads to overestimation of effect sizes because some null findings are not reported (i.e., only studies with relatively large effect sizes that produce significant results are reported). The overestimation of effect sizes is problematic” (p. 111).

They do not explain why researchers should be allowed to omit studies with non-significant results from an article, given that this practice leads to the undesirable consequences of inflated effect sizes. Accurate estimates of effect sizes would be obtained if researchers published all of their results. In fact, Schimmack (2012) suggested that researchers report all results and then conduct a meta-analysis of their set of studies to examine how strong the evidence of a set of studies is. This meta-analysis would provide an unbiased measure of the true effect size and unbiased evidence about the probability that the results of all studies were obtained in the absence of an effect.

The semantic trick occurs when the authors suggest that dishonest reporting practices are only a problem for effect size estimates, but not for the question whether an effect actually exists.

“However, the presence of publication bias does not necessarily mean that the effect is absent (i.e., that the findings are falsely positive).” (p. 111) and “Publication bias simply means that the effect size is overestimated—it does not necessarily imply that the effect is not real (i.e., falsely positive).” (p. 112).

This statement is true because it is practically impossible to demonstrate false positives, which would require demonstrating that the true effect size is exactly 0.   The presence of bias does not warrant the conclusion that the effect size is zero and that reported results are false positives.

However, this is not the point of revealing dishonest practices. The point is that dishonest reporting of results undermines the credibility of the evidence that was used to claim that an effect exists. The issue is the lack of credible evidence for an effect, not credible evidence for the lack of an effect. These two statements are distinct and MPK use the truth of the second statement to suggest that we can ignore whether the first statement is true.

Finally, MPK present a scenario of a multiple study article with 8 studies that all produced significant results. They state that it is “unrealistic that as many as eight statistically significant results were produced by a non-existent effect” (p. 112).

This blue-eyed view of multiple study articles ignores the fact that the replication crisis in psychology was triggered by Bem’s (2011) infamous article that contained 9 out of 9 statistically significant results (one marginal result was attributed to methodological problems, see Schimmack, 2012, for details) that supposedly demonstrated humans’ ability to foresee the future and to influence the past (e.g., learning after a test increased performance on a test that was taken before learning for the test). Schimmack (2012) used this article to demonstrate how important it can be to evaluate the credibility of multiple study articles and the incredibility index predicted correctly that these results would not replicate. So, it is simply naïve to assume that articles with more studies automatically strengthen evidence for the existence of an effect and that 8 significant results cannot occur in the absence of a true effect (maybe MPK believe in ESP).

It is also not clear why researchers should wonder about the credibility of results in multiple study articles. A simple solution to the paradox is to report all results honestly. If an honest set of studies provides evidence for an effect, it is not clear why researchers would prefer to engage in dishonest reporting practices. MPK provide no explanation for these practices and make no recommendation to increase honesty in reporting of results as a simple solution to the replicability crisis in psychology.

They write, “the researcher may have conducted 10, or even 20, experiments until he/she obtained 8 successful experiments, but far more studies would have been needed had the effect not existed at all”. This is true, but we do not know how many studies a researcher conducted or what else a researcher did to the data unless all of this information is reported. If the combined evidence of 20 studies with 8 significant results shows that an effect is present, a researcher could just publish all 20 studies. What is the reason to hide over 50% of the evidence?

In the end, MPK assure readers that they “do not intend to defend underpowered studies” and they do suggest that “the most straightforward solution to this paradox is to conduct studies that have sufficient statistical power” (p. 112). I fully agree with these recommendations because powerful studies can provide real evidence for an effect and decrease the incentive to engage in dishonest practices.

It is discouraging that this article was published in a major review journal in social psychology. It is difficult to see how social psychology can regain trust if social psychologists believe they can simply continue to engage in dishonest reporting of results. Unfortunately, social psychologists continue to downplay the replication crisis and the shaky foundations of many textbook claims.

# When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).

Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce off the rim and go in (rattling in).

Is an NBA player with an 80% free throw percentage more likely to make a free throw that is all net or one that rattles in? A perfect free throw is more likely, because a free throw that rattles in could easily have bounced the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.

Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).

What does this have to do with science? A lot!

The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).

The concept of statistical power is similar to an NBA player’s free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.

Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume right now that a researcher conducts studies with 60% power. This means the researcher would be like an NBA player with a 60% free-throw average.

Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is, researchers have to make two free throws in a row. With 60% power, the probability to get two significant results in a row is only 36% (60% * 60%). Moreover, many of the free throws that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in, and 30% all net.

One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.

Imagine an NBA player could just go into a private room, shoot two free throws, and then report back how many he made, and that the outcome of these free throws determines who wins game 7 of the playoff finals. Would you trust the player to tell the truth?

If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.

It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.

Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.

Based on the theory of statistical power developed by Neyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.

As for made free throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .0501 would not have been significant (rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?

As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.

Imagine a scientist conducts studies with 80% power. The distribution of observed test-statistics (e.g. z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the percentage of results that rattled in versus studies with all-net results are 37 vs. 63%. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all-net and 75% rattle in.
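The split into misses, lucky successes, and clear successes can be approximated with a few lines of code. The following is a minimal Python sketch, not the code behind the numbers above. It assumes a two-tailed z-test, ignores the tiny chance of a significant result in the wrong direction, and uses a hypothetical helper name, `outcome_split`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def outcome_split(power, alpha=0.05, clear_alpha=0.005):
    """Return (miss, lucky, clear) probabilities for a two-tailed z-test.

    Assumption: significant results in the wrong direction are ignored,
    so power is treated as P(z > z_crit).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)         # about 1.96
    z_clear = STD_NORMAL.inv_cdf(1 - clear_alpha / 2)  # about 2.81
    # Expected z-score (noncentrality) that yields the requested power.
    ncp = z_crit + STD_NORMAL.inv_cdf(power)
    clear = 1 - STD_NORMAL.cdf(z_clear - ncp)  # p < clear_alpha
    lucky = power - clear                      # clear_alpha < p < alpha
    miss = 1 - power                           # p > alpha
    return miss, lucky, clear

for power in (0.80, 0.20):
    miss, lucky, clear = outcome_split(power)
    lucky_share = lucky / (lucky + clear)
    print(f"power={power:.2f} miss={miss:.2f} lucky={lucky:.2f} "
          f"clear={clear:.2f} lucky share of successes={lucky_share:.2f}")
```

The last column is the share of rattle-in results among the reported successes when all misses are hidden. Because the sketch does not round to 20/30/50 first, its shares can differ by a point or two from the rounded percentages in the text.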

One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the true power that maximizes the percentage of lucky outcomes is 66%. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.005 < p < .050), and 34% clear successes (p < .005).

For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).
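The maximum can be checked numerically. The sketch below is an illustration under the same assumptions as the test itself (two-tailed z-test, .05 and .005 as cut-offs); the grid search and the function name `lucky_probability` are my own choices for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(1 - 0.05 / 2)    # p = .05, two-tailed
Z_CLEAR = STD_NORMAL.inv_cdf(1 - 0.005 / 2)  # p = .005, two-tailed

def lucky_probability(power):
    """P(.005 < p < .05) for a z-test with the given true power."""
    ncp = Z_CRIT + STD_NORMAL.inv_cdf(power)
    clear = 1 - STD_NORMAL.cdf(Z_CLEAR - ncp)
    return power - clear

# Grid search over true power from 1% to 99% in steps of 0.1%.
grid = [i / 1000 for i in range(10, 991)]
best_power = max(grid, key=lucky_probability)
max_lucky = lucky_probability(best_power)

print(f"power that maximizes lucky outcomes: {best_power:.3f}")
print(f"maximum lucky probability: {max_lucky:.3f}")
print(f"two lucky results in a row: {max_lucky ** 2:.3f}")
print(f"three lucky results in a row: {max_lucky ** 3:.3f}")
```

Because the maximum is used, the probabilities for two or three lucky results in a row are upper bounds; any other true power makes such a run less likely.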

Example: Unconscious Priming Influences Behavior

I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).

The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.

Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%

The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.
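Observed power can be recovered from the reported z-scores. This is a small sketch that assumes the usual one-sided approximation for a two-tailed test with alpha = .05; the function name `observed_power` is hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # critical z for two-tailed alpha = .05

def observed_power(z):
    """Post-hoc power implied by an observed z-score."""
    return STD_NORMAL.cdf(z - Z_CRIT)

z_scores = {"Study 1": 2.66, "Study 2": 2.06}
powers = {name: observed_power(z) for name, z in z_scores.items()}
for name, value in powers.items():
    print(f"{name}: observed power = {value:.0%}")
print(f"median observed power = {median(powers.values()):.0%}")
```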

Although this study has been cited over 1,000 times, replication studies are rare.

One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.

Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.

The chance of obtaining three lucky successes in a row is only 3% (32% * 32% * 32%). Moreover, with a median power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.

The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.
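The arithmetic can be written out as a short function. This is a sketch of the calculation as described in this post (median observed power minus the inflation rate), applied to the three z-scores reported above; the function name `r_index` and the one-sided power approximation are assumptions of the example.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()
Z_CRIT = STD_NORMAL.inv_cdf(0.975)  # critical z for two-tailed alpha = .05

def r_index(z_scores):
    """Median observed power minus inflation (success rate - median power)."""
    powers = [STD_NORMAL.cdf(z - Z_CRIT) for z in z_scores]
    success_rate = sum(z > Z_CRIT for z in z_scores) / len(z_scores)
    median_power = median(powers)
    inflation = success_rate - median_power
    return median_power - inflation

# z-scores of Studies 1 to 3 as reported above.
print(f"R-Index = {r_index([2.66, 2.06, 2.35]):.0%}")
```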

This R-Index can be compared to several benchmarks.

An R-Index of 22% is consistent with the null-hypothesis being true and failed attempts are not reported.

An R-Index of 40% is consistent with 30% true power and all failed attempts are not reported.

It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al. 2011, Doyen et al., 2011).

In conclusion, it is unlikely that Dr. Bargh’s original results were the only studies that they conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.” With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect and conducting multiple underpowered studies reduces the credibility of successes because the probability of this outcome to occur even when an effect is present decreases with each study (Schimmack, 2012). Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky and it is likely that many other researchers tried to replicate this sensational finding, but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.

Dr. Bargh also failed to realize how lucky he was to obtain his results, in his response to a published failed-replication study by Doyen. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted. He should demonstrate that he can replicate the result himself.

In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study. NBA players have to make free-throws in front of a large audience for a free-throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently. Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.

Swish!

# The Association for Psychological Science Improves Success Rate from 95% to 100% by Dropping Hypothesis Testing: The Sample Mean is the Sample Mean, Type-I Error 0%

The editor of Psychological Science published an Editorial with the title “Business Not as Usual” (see also Observer interview and new Submission Guidelines). The new submission guidelines recommend the following statistical approach.

Effective January 2014, Psychological Science recommends the use of the “new statistics”—effect sizes, confidence intervals, and meta-analysis—to avoid problems associated with null-hypothesis significance testing (NHST). Authors are encouraged to consult this Psychological Science tutorial by Geoff Cumming, which shows why estimation and meta-analysis are more informative than NHST and how they foster development of a cumulative, quantitative discipline. Cumming has also prepared a video workshop on the new statistics that can be found here.

The editorial is a response to the current crisis in psychology that many findings cannot be replicated and the discovery that numerous articles in Psychological Science show clear evidence of reporting biases that lead to inflated false-positive rates and effect sizes (Francis, 2013).

The editorial is titled “Business not as usual.” So what is the radical response that will ensure increased replicability of results published in Psychological Science? One solution is to increase transparency and openness to discourage the use of deceptive research practices (e.g., not publishing undesirable results or selective reporting of dependent variables that showed desirable results). The other solution is to abandon null-hypothesis significance testing.

Problem of the Old Statistics: Researchers had to demonstrate that their empirical results could have occurred only with a 5% probability if there is no effect in the population.

Null-hypothesis testing has been the main method to relate theories to empirical data. An article typically first states a theory and then derives a theoretical prediction from the theory. The theoretical prediction is then used to design a study that can be used to test the theoretical prediction. The prediction is tested by computing the ratio of the effect size to sampling error (the signal-to-noise ratio). The next step is to determine the probability of obtaining the observed signal-to-noise ratio or an even more extreme one under the assumption that the true effect size is zero. If this probability is smaller than a criterion value, typically p < .05, the results are interpreted as evidence that the theoretical prediction is true. If the probability does not meet the criterion, the data are considered inconclusive.

However, non-significant results are irrelevant because Psychological Science is only interested in publishing research that supports innovative novel findings. Nobody wants to know that drinking fennel tea does not cure cancer, but everybody wants to know about a treatment that actually cures cancer. So, the main objective of statistical analyses was to provide empirical evidence for a predicted effect by demonstrating that an obtained result would occur only with a 5% probability if the hypothesis were false.

Solution to the problem of Significance Testing: Drop the Significance Criterion. Just report your sample mean and the 95% confidence interval around it.

Eich claims that “researchers have recognized,…, essential problems with NHST in general, and with dichotomous thinking (“significant” vs. “non-significant”) thinking it engenders in particular.” It is true that statisticians have been arguing about the best way to test theoretical predictions with empirical data. In fact, they are still arguing. Thus, it is interesting to examine how Psychological Science found a solution to the elusive problem of statistical inference. The answer is to avoid statistical inferences altogether and to avoid dichotomous thinking. Does fennel tea cure cancer? Maybe, 95%CI d = -.4 to d = +4. No need to test for statistical significance. No need to worry about inadequate sample sizes. Just do a study and report your sample means with a confidence interval. It is that easy to fix the problems of psychological science.

The problem is that every study produces a sample mean and a confidence interval. So, how do the editors of Psychological Science pick the 5% of submitted manuscripts that will be accepted for publication? Eich lists three criteria.

1. What will the reader of this article learn about psychology that he or she did not know (or could not have known) before?

The effect of manipulation X on dependent variable Y is d = .2, 95%CI = -.2 to .6. We can conclude from this result that it is unlikely that the manipulation leads to a moderate decrease or a strong increase in the dependent variable Y.

2. Why is that knowledge important for the field?

The finding that the experimental manipulation of Y in the laboratory is somewhat more likely to produce an increase than a decrease, but could also have no effect at all has important implications for public policy.

3. How are the claims made in the article justified by the methods used?

The claims made in this article are supported by the use of Cumming’s New Statistics. Based on a precision analysis, the sample size was N = 100 (n = 50 per condition) to achieve a precision of .4 standard deviations. The study was preregistered and the data are publicly available with the code to analyze the data (SPSS t-test groups x (1,2) / var y.).

If this sounds wrong to you and you are a member of APS, you may want to write to Eric Eich and ask for some better guidelines that can be used to evaluate whether a sample mean or two or three or four sample means should be published in your top journal.

# Power Analysis for Bayes-Factor: What is the Probability that a Study Produces an Informative Bayes-Factor?

In 1962, Jacob Cohen warned fellow psychologists about the problem of conducting studies with insufficient statistical power to demonstrate predicted effects. The problem is simple enough. An underpowered study has only a small chance to produce the correct result; that is, a statistically significant result when an effect is present.

Many researchers have ignored Cohen’s advice to conduct studies with at least 80% power, that is, an 80% probability to produce the correct result when an effect is present, because they were willing to play with low odds. Rather than conducting a single powerful study with 80% power, it seemed less risky to conduct three underpowered studies with 30% power. The chances of getting a significant result are similar (the power to get a significant result in at least 1 out of 3 studies with 30% power is 66%). Moreover, the use of smaller samples is even less problematic if a study tests multiple hypotheses. With 80% power to detect a single effect, a study with two hypotheses has a 96% probability that at least one of the two effects will produce a significant result. Three studies allow for six hypothesis tests. With 30% power for each of these six tests, the probability of obtaining at least one significant result is 88%. Smaller samples also provide additional opportunities to increase power by increasing sample sizes until a significant result is obtained (optional stopping) or by eliminating outliers. The reason is that these questionable practices have larger effects on the results in smaller samples. Thus, for a long time researchers did not feel a need to conduct adequately powered studies because there was no shortage of significant results to report (Schimmack, 2012).

Psychologists have ignored the negative consequences of relying on underpowered studies to support their conclusions. The problem is that the reported p-values are no longer valid. A significant result that was obtained by conducting three studies no longer has a 5% chance to be a random event. By playing the sampling-error lottery three times, the probability of obtaining a significant result by chance alone is now 15%. By conducting three studies with two hypothesis tests, the probability of obtaining a significant result by chance alone is 30%. When researchers use questionable research practices, the probability of obtaining a significant result by chance can further increase. As a result, a significant result no longer provides strong statistical evidence that the result was not just a random event.
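Both the power figures and the inflated type-I error rates follow from the same formula for "at least one significant result in k independent attempts". The sketch below assumes independent tests; the helper name `prob_at_least_one` is hypothetical. It also prints the simple additive figure (k times 5%), which matches the rounded 15% and 30% used above and is slightly higher than the exact value under independence.

```python
def prob_at_least_one(p_single, attempts):
    """Probability of at least one significant result in independent attempts."""
    return 1 - (1 - p_single) ** attempts

# An effect is present: chance of at least one significant result.
print(f"3 studies, 30% power each: {prob_at_least_one(0.30, 3):.1%}")
print(f"2 tests, 80% power each:   {prob_at_least_one(0.80, 2):.1%}")
print(f"6 tests, 30% power each:   {prob_at_least_one(0.30, 6):.1%}")

# No effect is present: chance of at least one false positive at alpha = .05.
for attempts in (3, 6):
    exact = prob_at_least_one(0.05, attempts)
    additive = 0.05 * attempts
    print(f"{attempts} tests at alpha = .05: exact {exact:.1%}, additive {additive:.0%}")
```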

It would be easy to distinguish real effects from type-I errors (significant results when the null-hypothesis is true) by conducting replication studies. Even underpowered studies with 30% power will replicate in every third study. In contrast, when the null-hypothesis is true, type-I errors will replicate only in 1 out of 20 studies, when the criterion is set to 5%. This is what a 5% criterion means. There is only a 5% chance (1 out of 20) to get a significant result when the null-hypothesis is true. However, this self-correcting mechanism failed because psychologists considered failed replication studies as uninformative. The perverse logic was that failed replications are to be expected because studies have low power. After all, if a study has only 30% power, a non-significant result is more likely than a significant result. So, non-significant results in underpowered studies cannot be used to challenge a significant result in an underpowered study. By this perverse logic, even false hypotheses will only receive empirical support because only significant results will be reported, no matter whether an effect is present or not.

The perverse consequences of abusing statistical significance tests became apparent when Bem (2011) published 10 studies that appeared to demonstrate that people can anticipate random future events and that practicing for an exam after writing an exam can increase grades. These claims were so implausible that few researchers were willing to accept Bem’s claims despite his presentation of 9 significant results in 10 studies. Although the probability that this event occurred by chance alone is less than 1 in a billion, few researchers felt compelled to abandon the null-hypothesis that studying for an exam today cannot increase performance on yesterday’s exam. In fact, most researchers knew all too well that these results could not be trusted because they were aware that published results are not an honest report of what happens in a lab. Thus, a much more plausible explanation for Bem’s incredible results was that he used questionable research practices to obtain significant results. Consistent with this hypothesis, closer inspection of Bem’s results shows statistical evidence that Bem used questionable research practices (Schimmack, 2012).
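The "less than 1 in a billion" figure can be checked with a binomial tail probability. This is a rough sketch that assumes ten independent studies, each with a 5% chance of a significant result when the null-hypothesis is true; the function name `binom_tail` is hypothetical.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of at least 9 significant results in 10 studies by luck alone.
p_chance = binom_tail(9, 10, 0.05)
print(f"P(at least 9 of 10 significant | null true) = {p_chance:.2e}")
print(f"less than 1 in a billion: {p_chance < 1e-9}")
```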

As the negative consequences of underpowered studies have become more apparent, interest in statistical power has increased. Computer programs make it easy to conduct power analysis for simple designs. However, so far power analysis has been limited to conventional statistical methods that use p-values and a criterion value to draw conclusions about the presence of an effect (Neyman-Pearson Significance Testing, NPST).

Some researchers have proposed Bayesian statistics as an alternative approach to hypothesis testing. As far as I know, these researchers have not provided tools for the planning of sample sizes. One reason is that Bayesian statistics can be used with optional stopping. That is, a study can be terminated early when a criterion value is reached. However, an optional stopping rule also needs a rule when data collection will be terminated in case the criterion value is not reached. It may sound appealing to be able to finish a study at any moment, but if this event is unlikely to occur in a reasonably sized sample, the study would produce an inconclusive result. Thus, even Bayesian statisticians may be interested in the effect of sample sizes on the ability to obtain a desired Bayes-Factor. For this reason, I wrote some R code to conduct power analysis for Bayes-Factors.

The code uses the BayesFactor package in R for the default Bayesian t-test (see also blog post on Replication-Index blog). The code is posted at the end of this blog. Here I present results for typical sample sizes in the between-subject design for effect sizes ranging from 0 (the null-hypothesis is true) to Cohen’s d = .5 (a moderate effect). Larger effect sizes are not reported because large effects are relatively easy to detect.

The first table shows the percentage of studies that meet a specified criterion value based on 10,000 simulations of a between-subject design. For Bayes-Factors the criterion values are 3 and 10. For p-values the criterion values are .05, .01, and .001. For Bayes-Factors, a higher number provides stronger support for a hypothesis. For p-values, lower values provide stronger support for a hypothesis. For p-values, percentages correspond to the power of a study. Bayesian statistics has no equivalent concept, but percentages can be used in the same way. If a researcher aims to provide empirical support for a hypothesis with a Bayes-Factor greater than 3 or 10, the table gives the probability of obtaining the desired outcome (success) as a function of the effect size and sample size.

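| d | n per group | N total | BF > 3 | BF > 10 | p < .05 | p < .01 | p < .001 |
|----|----|----|----|----|----|----|----|
| .5 | 20 | 40 | 17 | 06 | 31 | 11 | 02 |
| .4 | 20 | 40 | 12 | 03 | 22 | 07 | 01 |
| .3 | 20 | 40 | 07 | 02 | 14 | 04 | 00 |
| .2 | 20 | 40 | 04 | 01 | 09 | 02 | 00 |
| .1 | 20 | 40 | 02 | 00 | 06 | 01 | 00 |
| .0 | 20 | 40 | 33 | 00 | 95 | 99 | 100 |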

For an effect size of zero, the interpretation of results switches. Bayes-Factors of 1/3 or 1/10 are interpreted as evidence for the null-hypothesis. The table shows how often Bayes-Factors provide support for the null-hypothesis as a function of the effect size, which is zero, and sample size. For p-values, the percentage is 1 minus the criterion value. That is, when the effect is zero, the test will correctly produce a non-significant result with a probability of 1 minus the type-I error rate, and it will falsely reject the null-hypothesis at the specified type-I error rate.

Typically, researchers do not interpret non-significant results as evidence for the null-hypothesis. However, it is possible to interpret non-significant results in this way, but it is important to take the type-II error rate into account. Practically, it makes little difference whether a non-significant result is not interpreted or whether it is taken as evidence for the null-hypothesis with a high type-II error probability. To illustrate this consider a study with N = 40 (n = 20 per group) and an effect size of d = .2 (a small effect). As there is a small effect, the null-hypothesis is false. However, the power to detect this effect in a small sample is very low. With p = .05 as the criterion, power is only 9%. As a result, there is a 91% probability to end up with a non-significant result even though the null-hypothesis is false. This probability is only slightly lower than the probability to get a non-significant result when the null-hypothesis is true (95%). Even if the effect size were d = .5, a moderate effect, power is only 31% and the type-II error rate is 69%. With type-II error rates of this magnitude, it makes practically no difference whether a null-hypothesis is accepted with a warning that the type-II error rate is high or whether the non-significant result is simply not interpreted because it provides insufficient information about the presence or absence of small to moderate effects.
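The logic of the p-value columns can be illustrated with a small Monte Carlo simulation. The sketch below is not the R code used for the tables and does not compute Bayes-Factors. It only simulates the p < .05 criterion for a between-subject design with normally distributed scores, using fewer replications than the 10,000 behind the tables. The critical value 2.024 (two-tailed, df = 38) is supplied by hand, and the function names are hypothetical.

```python
import random

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def simulate_power(d, n, t_crit, reps=4000, seed=1):
    """Share of simulated two-group studies with |t| > t_crit.

    d: true standardized mean difference, n: participants per group,
    t_crit: two-tailed critical t for df = 2n - 2 (supplied by the caller).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(d, 1.0) for _ in range(n)]
        m0, v0 = mean_and_var(control)
        m1, v1 = mean_and_var(treated)
        # Standard error of the mean difference with pooled variance.
        se = ((v0 + v1) / 2 * 2 / n) ** 0.5
        hits += abs((m1 - m0) / se) > t_crit
    return hits / reps

T_CRIT_DF38 = 2.024  # two-tailed .05 critical value for df = 38
for d in (0.0, 0.2, 0.5):
    share = simulate_power(d, 20, T_CRIT_DF38)
    print(f"d = {d:.1f}, n = 20 per group: share significant = {share:.2f}")
```

With d = 0 the share of significant results estimates the type-I error rate; with d > 0 it estimates power, and 1 minus that share is the type-II error rate discussed above.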

The main observation in Table 1 is that small samples provide insufficient information to distinguish between the null-hypothesis and small to moderate effects. Small studies with N = 40 are only meaningful to demonstrate the presence of moderate to large effects, but they have insufficient power to show effects and insufficient power to show the absence of effects. Even when the null-hypothesis is true, a Bayes-Factor of 3 is reached only 33% of the time. A Bayes-Factor of 10 is never reached because the sample size is too small to provide such strong evidence for the null-hypothesis when the null-hypothesis is true. Even more problematic is that a Bayes-Factor of 3 is reached only 17% of the time when a moderate effect is present. Thus, the most likely outcome in small samples is an inconclusive result unless a strong effect is present. This means that Bayes-Factors in these studies have the same problem as p-values. They can only provide evidence that an effect is present when a strong effect is present, but they cannot provide sufficient evidence for the null-hypothesis when the null-hypothesis is true.

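| d | n per group | N total | BF > 3 | BF > 10 | p < .05 | p < .01 | p < .001 |
|----|----|----|----|----|----|----|----|
| .5 | 50 | 100 | 49 | 29 | 68 | 43 | 16 |
| .4 | 50 | 100 | 30 | 15 | 49 | 24 | 07 |
| .3 | 50 | 100 | 34 | 18 | 56 | 32 | 12 |
| .2 | 50 | 100 | 07 | 02 | 16 | 05 | 01 |
| .1 | 50 | 100 | 03 | 01 | 08 | 02 | 00 |
| .0 | 50 | 100 | 68 | 00 | 95 | 99 | 100 |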

In Table 2 the sample size has been increased to N = 100 participants (n = 50 per cell). This is already a large sample size by past standards in social psychology. Moreover, in several articles Wagenmakers has implemented a stopping rule that terminates data collection at this point. The table shows that a sample size of N = 100 in a between-subject design has modest power to demonstrate even moderate effect sizes of d = .5 with a Bayes-Factor of 3 as a criterion (49%). In comparison, a traditional p-value of .05 would provide 68% power.

The main argument for using Bayesian statistics is that it can also provide evidence for the null-hypothesis. With a criterion value of BF = 3, the default test correctly favors the null-hypothesis 68% of the time (see last row of the table). However, the sample size is too small to produce Bayes-Factors greater than 10. In sum, the default-Bayesian t-test with N = 100 can be used to demonstrate the presence of moderate to large effects, and with a criterion value of 3 it can be used to provide evidence for the null-hypothesis when the null-hypothesis is true. However, it cannot be used to provide evidence for small to moderate effects.

The Neyman-Pearson approach to significance testing would reveal this fact in terms of the type-II error rates associated with non-significant results. Using the .05 criterion, a non-significant result would be interpreted as evidence for the null-hypothesis. This conclusion is correct in 95% of all tests when the null-hypothesis is actually true. This is higher than the 68% criterion for a Bayes-Factor of 3. However, the type-II error rates associated with this inference when the null-hypothesis is false are 32% for d = .5, 51% for d = .4, 44% for d = .3, 84% for d = .2, and 92% for d = .1. If we consider an effect size of d = .2 as important enough to be detected (small effect size according to Cohen), the type-II error rate could be as high as 84%.

In sum, a sample size of N = 100 in a between-subject design is still insufficient to test for the presence of a moderate effect size (d = .5) with a reasonable chance to find it (80% power). Moreover, a non-significant result is unlikely to occur for moderate to large effect sizes, but the sample size is insufficient to discriminate accurately between the null-hypothesis and small to moderate effects. A Bayes-Factor greater than 3 in favor of the null-hypothesis is most likely to occur when the null-hypothesis is true, but it can also occur when a small effect is present (Simonsohn, 2015).

The next table increases the total sample size to 200 for a between-subject design. The pattern doesn’t change qualitatively. So the discussion will be brief and focus on the power of a study with 200 participants to provide evidence for small to moderate effects and to distinguish small to moderate effects from the null-hypothesis.

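| d | n per group | N total | BF > 3 | BF > 10 | p < .05 | p < .01 | p < .001 |
|----|----|----|----|----|----|----|----|
| .5 | 100 | 200 | 83 | 67 | 94 | 82 | 58 |
| .4 | 100 | 200 | 60 | 41 | 80 | 59 | 31 |
| .3 | 100 | 200 | 16 | 06 | 31 | 13 | 03 |
| .2 | 100 | 200 | 13 | 06 | 29 | 12 | 03 |
| .1 | 100 | 200 | 04 | 01 | 11 | 03 | 00 |
| .0 | 100 | 200 | 80 | 00 | 95 | 95 | 95 |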

Using Cohen’s guideline of 80% success rate (power), a study with N = 200 participants has sufficient power to show a moderate effect of d = .5 with p = .05, p = .01, and Bayes-Factor = 3 as criterion values. For d = .4, only the criterion value of p = .05 has sufficient power. For all smaller effects, the sample size is still too small to have 80% power. A sample of N = 200 also provides 80% power to provide evidence for the null-hypothesis with a Bayes-Factor of 3. Power for a Bayes-Factor of 10 is still 0 because this value cannot be reached with N = 200. Finally, with N = 200, the type-II error rate for d = .5 is just shy of .05 (1 – .94 = .06). Thus, it is justified to conclude from a non-significant result with a 6% error rate that the true effect size cannot be moderate to large (d >= .5). However, type-II error rates for smaller effect sizes are too high to test the null-hypothesis against these effect sizes.

d     n     N    BF=3  BF=10  p=.05  p=.01  p=.001
.5   200   400    99     97    100     99      95
.4   200   400    92     82     98     92      75
.3   200   400    64     46     85     65      36
.2   200   400    27     14     52     28      10
.1   200   400    05     02     17     06      01
.0   200   400    87     00     95     99      95

The next sample size doubles the number of participants. The reason is that sampling error decreases only with the square root of the sample size, so large increases in sample size are needed to further decrease sampling error. A sample size of N = 200 yields a standard error of 2 / sqrt(200) = .14 (14/100 of a standard deviation). A sample size of N = 400 is needed to reduce this to .10 (2 / sqrt(400) = 2 / 20 = .10; 1/10 of a standard deviation). This is the reason why it is so difficult to find small effects.
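The arithmetic can be checked with a few lines of R. This is only a sketch: it assumes, as the text does, that the standard error of a standardized mean difference in a between-subject design is approximately 2 / sqrt(N), and the helper name se_d is made up for illustration.

```r
# Approximate standard error of Cohen's d in a between-subject design
# with total sample size N and equal cell sizes (formula used in the text).
se_d <- function(N) 2 / sqrt(N)   # se_d is a hypothetical helper name

N <- c(100, 200, 400, 800)
se_table <- data.frame(N = N, SE = round(se_d(N), 2))
print(se_table)

# Each doubling of N multiplies the standard error by the same factor
ratio_per_doubling <- se_d(2 * N) / se_d(N)
print(ratio_per_doubling)
```

Each doubling shrinks the standard error by the factor 1 / sqrt(2), which is why halving the standard error requires four times as many participants.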

Even with N = 400, power is only sufficient to show effect sizes of .3 or greater with p = .05, or effect sizes of d = .4 with p = .01 or Bayes-Factor 3. Only d = .5 can be expected to meet the criterion p = .001 more than 80% of the time. Power for Bayes-Factors to show evidence for the null-hypothesis also hardly changed. It increased from 80% to 87% with Bayes-Factor = 3 as criterion. The chance to get a Bayes-Factor of 10 is still 0 because the sample size is too small to produce such extreme values. Using Neyman-Pearson’s approach with a 5% type-II error rate as criterion, it is possible to interpret non-significant results as evidence that the true effect size cannot be .4 or larger. With a 1% criterion it is possible to say that a moderate to large effect would produce a significant result 99% of the time and the null-hypothesis would produce a non-significant result 99% of the time.

Doubling the sample size to N = 800 reduces sampling error from SE = .1 to SE = .07.

d     n     N    BF=3  BF=10  p=.05  p=.01  p=.001
.5   400   800   100    100    100    100     100
.4   400   800   100     99    100    100      99
.3   400   800    94     86     99     95      82
.2   400   800    54     38     81     60      32
.1   400   800    09     04     17     06      01
.0   400   800    91     52     95     95      95

A sample size of N = 800 is sufficient to have 80% power to detect a small effect according to Cohen’s classification of effect sizes (d = .2) with p = .05 as criterion. Power to demonstrate a small effect with Bayes-Factor = 3 as criterion is only 54%. Power to demonstrate evidence for the null-hypothesis with Bayes-Factor = 3 as criterion increased only slightly from 87% to 91%, but a sample size of N = 800 is sufficient to produce Bayes-Factors greater than 10 in favor of the null-hypothesis 52% of the time. Thus, researchers who aim for this criterion value need to plan their studies with N = 800. Smaller samples cannot produce these values with the default Bayesian t-test. Following Neyman-Pearson, a non-significant result can be interpreted as evidence that the true effect cannot be larger than d = .3, with a type-II error rate of 1%.

Conclusion

A common argument in favor of Bayes-Factors has been that Bayes-Factors can be used to test the null-hypothesis, whereas p-values can only reject the null-hypothesis. There are two problems with this claim. First, it confuses Null-Hypothesis-Significance-Testing (NHST) and Neyman-Pearson-Significance-Testing (NPST). NPST also allows researchers to accept the null-hypothesis. In fact, it makes it easier to accept the null-hypothesis because every non-significant result favors the null-hypothesis. Of course, this does not mean that all non-significant results show that the null-hypothesis is true. In NPST the error of falsely accepting the null-hypothesis depends on the amount of sampling error. The tables here make it possible to compare Bayes-Factors and NPST. Second, no matter which statistical approach is being used, it is clear that meaningful evidence for the null-hypothesis requires rather large samples. The R code below can be used to compute power for different criterion values, effect sizes, and sample sizes. Hopefully, this will help researchers to better plan sample sizes and to better understand Bayes-Factors that favor the null-hypothesis.

########################################################################
###      R-Code for Power Analysis for Bayes-Factor and P-Values     ###
########################################################################

## setup
rm(list = ls())            # clear memory
library(BayesFactor)       # provides ttest.tstat()

## set parameters
## NOTE: the lines that set es, n, groups, rsc, and BF01_crit were garbled in this
## listing; the values assigned to them below are assumptions, not the original settings.
nsim = 10000               # set number of simulations
es = .5                    # set effect size (d); assumed value
n = 50                     # set sample size per group; assumed value
groups = 2                 # 1 = one-sample t-test, 2 = two-sample t-test; assumed value
rsc = sqrt(2)/2            # set scale of the Cauchy prior; assumed value
BF01_crit = 3              # set criterion value for BF favoring null; assumed value
BF10_crit = 3              # set criterion value for BF favoring effect
p_crit = .05               # set criterion value for two-tailed p-value (e.g., .05)

## computations
Z <- matrix(rnorm(groups*n*nsim, mean=0, sd=1), nsim, groups*n)   # create observations
Z[,1:n] <- Z[,1:n] + es                                           # add effect size to group 1
tt <- function(x) {                                               # compute t-statistic (t-test)
  oes <- mean(x[1:n])                                             # mean of group 1
  v <- var(x[1:n])                                                # variance of group 1
  if (groups == 2) {
    oes <- oes - mean(x[(n+1):(2*n)])                             # mean difference for 2 groups
    v <- (v + var(x[(n+1):(2*n)])) / 2                            # pooled within-group variance
  }
  oes <- oes / sqrt(v)                                            # observed effect size
  abs(oes) / (groups / sqrt(n*groups))                            # return absolute t-value
}

t <- apply(Z, 1, tt)                                              # t-values for all simulations
df <- rep(n*groups - groups, nsim)                                # degrees of freedom
p2t <- (1 - pt(abs(t), df))*2                                     # two-tailed p-values
getBF <- function(t) {                                            # function to get Bayes-Factor
  n2 <- if (groups == 2) n else 0                                 # n2 = 0 requests a one-sample test
  exp(ttest.tstat(t, n, n2, rscale=rsc)$bf)                       # ttest.tstat returns log(BF10)
}

BF10 <- sapply(t, getBF)                                          # BF10 for all simulations
powerBF10 = sum(BF10 > BF10_crit)/nsim*100                        # % results supporting the effect
powerBF01 = sum(BF10 < 1/BF01_crit)/nsim*100                      # % results supporting the null
powerP = sum(p2t < p_crit)/nsim*100                               # % significant, p < p-criterion

## output of results
cat(
" Power to support effect with BF10 >", BF10_crit, ": ", powerBF10,
"\n",
"Power to support null with BF01 >", BF01_crit, " : ", powerBF01,
"\n",
"Power to show effect with p < ", p_crit, " : ", powerP,
"\n")

# A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

Cumming (2014) wrote an article “The New Statistics: Why and How” that was published in the prestigious journal Psychological Science.   On his website, Cumming uses this article to promote his book “Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.”

The article clearly states the conflict of interest. “The author declared that he earns royalties on his book (Cumming, 2012) that is referred to in this article.” Readers are therefore warned that the article may at least inadvertently give an overly positive account of the new statistics and an overly negative account of the old statistics. After all, why would anybody buy a book about new statistics when the old statistics are working just fine?

This blog post critically examines Cumming’s claim that his “new statistics” can solve endemic problems in psychological research that have created a replication crisis and that the old statistics are the cause of this crisis.

Like many other statisticians who are using the current replication crisis as an opportunity to sell their statistical approach, Cumming blames null-hypothesis significance testing (NHST) for the low credibility of research articles in Psychological Science (Francis, 2013).

In a nutshell, null-hypothesis significance testing entails five steps. First, researchers conduct a study that yields an observed effect size. Second, the sampling error of the design is estimated. Third, the ratio of the observed effect size and sampling error (signal-to-noise ratio) is computed to create a test-statistic (t, F, chi-square). Fourth, the test-statistic is used to compute the probability of obtaining the observed test-statistic or a larger one under the assumption that the true effect size in the population is zero (there is no effect or systematic relationship). The last step is to compare this probability (p-value) to a criterion value. If the p-value is less than the criterion value (typically 5%), the null-hypothesis is rejected and it is concluded that an effect was present.
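The five steps can be traced in a few lines of R. The sketch below uses simulated data that are purely hypothetical (two groups of 50 with a true difference of half a standard deviation) and assumes equal variances; it is an illustration, not an analysis from the article under review.

```r
# Hypothetical data, simulated only for illustration
set.seed(1)
n <- 50
treatment <- rnorm(n, mean = 0.5, sd = 1)
control   <- rnorm(n, mean = 0.0, sd = 1)

# Step 1: observed effect size (raw mean difference)
diff <- mean(treatment) - mean(control)

# Step 2: sampling error (standard error of the difference, pooled variance)
sp2 <- (var(treatment) + var(control)) / 2
se  <- sqrt(sp2 * 2 / n)

# Step 3: test statistic (signal-to-noise ratio)
t_value <- diff / se

# Step 4: two-tailed p-value under the hypothesis of a zero effect
df <- 2 * n - 2
p_value <- 2 * pt(-abs(t_value), df)

# Step 5: compare the p-value to the criterion value
alpha <- .05
reject_null <- p_value < alpha

# The built-in test reproduces the same t and p
check <- t.test(treatment, control, var.equal = TRUE)
print(c(t = t_value, p = p_value, reject = reject_null))
```

The manual computation and t.test() agree, which shows that the p-value carries no information beyond the effect size, its standard error, and the degrees of freedom.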

Cumming (2014) claims that we need a new way to analyze data because there is “renewed recognition of the severe flaws of null-hypothesis significance testing (NHST)” (p. 7). His new statistical approach has “no place for NHST” (p. 7). His advice is to “whenever possible, avoid using statistical significance or p values” (p. 8).

So what is wrong with NHST?

The first argument against NHST is that Ioannidis (2005) wrote an influential article with the eye-catching title “Why most published research findings are false” and most research articles use NHST to draw inferences from the observed results. Thus, NHST seems to be a flawed method because it produces mostly false results. The problem with this argument is that Ioannidis (2005) did not provide empirical evidence that most research findings are false, nor is this a particularly credible claim for all areas of science that use NHST, including particle physics.

The second argument against NHST is that researchers can use questionable research practices to produce significant results. This is not really a criticism of NHST, because researchers under pressure to publish are motivated to meet any criteria that are used to select articles for publication. A simple solution to this problem would be to publish all submitted articles in a single journal. As a result, there would be no competition for limited publication space in more prestigious journals. However, better studies would be cited more often and researchers will present their results in ways that lead to more citations. It is also difficult to see how psychology can improve its credibility by lowering standards for publication. A better solution would be to ensure that researchers are honestly reporting their results and report credible evidence that can provide a solid empirical foundation for theories of human behavior.

Cumming agrees. “To ensure integrity of the literature, we must report all research conducted to a reasonable standard, and reporting must be full and accurate” (p. 9). If a researcher conducted five studies with only a 20% chance to get a significant result and honestly reported all five studies, p-values would provide meaningful information about the strength of the evidence, namely most p-values would be non-significant and show that the evidence is weak. Moreover, post-hoc power analysis would reveal that the studies had indeed low power to test a theoretical prediction. Thus, I agree with Cumming that honesty and research integrity are important, but I see no reason to abandon NHST as a systematic way to draw inferences from a sample about the population because researchers have failed to disclose non-significant results in the past.

Cumming then cites a chapter by Kline (2014) that “provided an excellent summary of the deep flaws in NHST and how we use it” (p. 11). Apparently, the summary is so excellent that readers are better off reading the actual chapter because Cumming does not explain what these deep flaws are. He then observes that “very few defenses of NHST have been attempted” (p. 11). He doesn’t even list a single reference. Here is one by a statistician: “In defence of p-values” (Murtaugh, 2014). In a response, Gelman agrees that the problem is more with the way p-values are used rather than with the p-value and NHST per se.

Cumming then states a single problem of NHST, namely that it forces researchers to make a dichotomous decision. If the signal-to-noise ratio is above a criterion value, the null-hypothesis is rejected and it is concluded that an effect is present. If the signal-to-noise ratio is below the criterion value, the null-hypothesis is not rejected. If Cumming has a problem with decision making, it would be possible to simply report the signal-to-noise ratio or simply to report the effect size that was observed in a sample. For example, mortality in an experimental Ebola drug trial was 90% in the control condition and 80% in the experimental condition. As this is the only evidence, it is not necessary to compute sampling error, signal-to-noise ratios, or p-values. Given all of the available evidence, the drug seems to improve survival rates. But wait. Now a dichotomous decision is made based on the observed mean difference and there is no information about the probability that the results in the drug trial generalize to the population. Maybe the finding was a chance finding and the drug actually increases mortality. Should we really make a life-and-death decision based on the fact that 8 out of 10 patients died in one condition and 9 out of 10 patients died in the other condition?
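To see how little a difference of 8 versus 9 deaths out of 10 says about the population, the hypothetical trial can be run through a significance test. The sketch below uses Fisher’s exact test, which is my choice for such small cell counts and not a test discussed in the article under review.

```r
# Deaths and survivors in the hypothetical trial from the text:
# 9 of 10 died in the control condition, 8 of 10 in the drug condition.
trial <- matrix(c(9, 1,
                  8, 2),
                nrow = 2, byrow = TRUE,
                dimnames = list(condition = c("control", "drug"),
                                outcome = c("died", "survived")))

# Fisher's exact test is suited to small cell counts
result <- fisher.test(trial)
print(result$p.value)    # far above any conventional criterion value
print(result$conf.int)   # confidence interval for the odds ratio
```

The confidence interval for the odds ratio includes 1, so these data are compatible with a drug that helps, does nothing, or harms.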

Even in a theoretical research context decisions have to be made. Editors need to decide whether they accept or reject a submitted manuscript and readers of published studies need to decide whether they want to incorporate new theoretical claims in their theories or whether they want to conduct follow-up studies that build on a published finding. It may not be helpful to have a fixed 5% criterion, but some objective information about the probability of drawing the right or wrong conclusions seems useful.

Based on this rather unconvincing critique of p-values, Cumming (2014) recommends that “the best policy is, whenever possible, not to use NHST at all” (p. 12).

So what is better than NHST?

Cumming then explains how his new statistics overcome the flaws of NHST. The solution is simple. What is astonishing about this new statistic is that it uses the exact same components as NHST, namely the observed effect size and sampling error.

NHST uses the ratio of the effect size and sampling error. When the ratio reaches a value of 2, p-values reach the criterion value of .05 and are considered sufficient to reject the null-hypothesis.

The new statistical approach is to multiply the standard error by a factor of 2 and to add and subtract this value from the observed mean. The interval from the lower value to the higher value is called a confidence interval. The factor of 2 was chosen to obtain a 95% confidence interval. However, drawing a confidence interval alone is not sufficient to draw conclusions from the data. Whether we describe the results in terms of a ratio, .5/.2 = 2.5, or in terms of a 95%CI = .5 +/- .4 or CI = .1 to .9, is not a qualitative difference. These are simply different ways to provide information about the effect size and sampling error. Moreover, it is arbitrary to multiply the standard error by a factor of 2. It would also be possible to multiply it by a factor of 1, 3, or 5. A factor of 2 is used to obtain a 95% confidence interval rather than a 20%, 50%, 80%, or 99% confidence interval. A 95% confidence interval is commonly used because it corresponds to a 5% error rate (100 - 95 = 5!). A 95% confidence interval is as arbitrary as a p-value of .05.
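The equivalence is easy to verify. The sketch below takes the effect size of .5 and the standard error of .2 from the ratio in the text and uses the rounded factor of 2; the normal quantile in the last lines is added only to show where that factor comes from.

```r
d  <- .5   # observed effect size (from the ratio in the text)
se <- .2   # standard error (from the ratio in the text)

ratio <- d / se               # signal-to-noise ratio
ci    <- d + c(-2, 2) * se    # approximate 95% confidence interval

print(ratio)
print(ci)

# Both descriptions lead to the same decision about zero
significant_by_ratio <- ratio > 2
zero_excluded_by_ci  <- ci[1] > 0
print(c(significant_by_ratio, zero_excluded_by_ci))

# The factor of 2 is a rounded normal quantile; other confidence
# levels simply use other multipliers.
levels <- c(.50, .80, .95, .99)
multipliers <- qnorm(1 - (1 - levels) / 2)
print(round(multipliers, 2))
```

Choosing a different multiplier changes the width of the interval in exactly the same way that choosing a different criterion value changes the significance decision.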

So, how can a p-value be fundamentally wrong and how can a confidence interval be the solution to all problems if they provide the same information about effect size and sampling error? In particular how do confidence intervals solve the main problem of making inferences from an observed mean in a sample about the mean in a population?

To sell confidence intervals, Cumming uses a seductive example.

“I suggest that, once freed from the requirement to report p values, we may appreciate how simple, natural, and informative it is to report that “support for Proposition X is 53%, with a 95% CI of [51, 55],” and then interpret those point and interval estimates in practical terms” (p 14).

Support for proposition X is a rather unusual dependent variable in psychology. However, let us assume that Cumming refers to an opinion poll among psychologists on whether NHST should be abandoned. The response format is a simple yes/no format. The average in the sample is 53%. The null-hypothesis is 50%. The observed mean of 53% in the sample shows more responses in favor of the proposition. To compute a significance test or to compute a confidence interval, we need to know the standard error. The confidence interval ranges from 51% to 55%. As the 95% confidence interval is defined by the observed mean plus/minus two standard errors, it is easy to see that the standard error is SE = (53-51)/2 = 1% or .01. The formula for the standard error in a one-sample test with a dichotomous dependent variable is sqrt(p * (1-p) / n). Solving for n yields a sample size of N = 2,491. This is not surprising because public opinion polls often use large samples to predict election outcomes because small samples would not be informative. Thus, Cumming’s example shows how easy it is to draw inferences from confidence intervals when sample sizes are large and confidence intervals are tight. However, it is unrealistic to assume that psychologists can and will conduct every study with samples of N = 1,000. Thus, the real question is how useful confidence intervals are in a typical research context, when researchers do not have sufficient resources to collect data from hundreds of participants for a single hypothesis test.
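The implied sample size follows directly from the standard error of a proportion, sqrt(p(1-p)/n), solved for n. A minimal check in R, using only the numbers from the quoted example and the factor of 2 for the 95% interval:

```r
p  <- .53                 # observed proportion in favor of the proposition
se <- (.53 - .51) / 2     # half-width of the interval divided by the factor of 2

# SE = sqrt(p * (1 - p) / n), solved for n
n <- p * (1 - p) / se^2
print(round(n))
```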

For example, sampling error for a between-subject design with N = 100 (n = 50 per cell) is SE = 2 / sqrt(100) = .2. Thus, the lower and upper limit of the 95%CI are 4/10 of a standard deviation away from the observed mean and the full width of the confidence interval covers 8/10th of a standard deviation. If the true effect size is small to moderate (d = .3) and a researcher happens to obtain the true effect size in a sample, the confidence interval would range from d = -.1 to d = .7. Does this result support the presence of a positive effect in the population? Should this finding be published? Should this finding be reported in newspaper articles as evidence for a positive effect? To answer this question, it is necessary to have a decision criterion.

One way to answer this question is to compute the signal-to-noise ratio, .3/.2 = 1.5, and to compute the probability that the positive effect in the sample could have occurred just by chance, t(98) = .3/.2 = 1.5, p = .14 (two-tailed). Given this probability, we might want to see stronger evidence. Moreover, a researcher is unlikely to be happy with this result. Evidently, it would have been better to conduct a study that could have provided stronger evidence for the predicted effect, say a confidence interval of d = .25 to .35, but that would have required a sample size of about N = 6,400 participants (SE = .05 / 2 = .025; N = (2 / .025)^2).
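Both numbers in this example can be recomputed in R. The sketch assumes the approximation SE = 2 / sqrt(N) used throughout this post and the factor of 2 for the 95% interval; the function name n_for_half_width is made up for illustration.

```r
d  <- .3
N  <- 100
se <- 2 / sqrt(N)                  # approximate standard error of d

t_value <- d / se
p_value <- 2 * pt(-abs(t_value), df = N - 2)   # two-tailed p-value
print(round(c(t = t_value, p = p_value), 3))

# Total N needed so that the approximate 95% CI is d +/- half_width.
# half_width = 2 * SE and SE = 2 / sqrt(N), so N = (4 / half_width)^2.
n_for_half_width <- function(half_width) (4 / half_width)^2   # hypothetical helper

print(n_for_half_width(.05))   # interval from d - .05 to d + .05
print(n_for_half_width(.40))   # the interval obtained with N = 100
```

Because the required N grows with the inverse square of the half-width, an interval that is eight times narrower needs a sample that is 64 times larger.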

A wide confidence interval can also suggest that more evidence is needed, but the important question is how much more evidence is needed and how narrow a confidence interval should be before it can give confidence in a result. NHST provides a simple answer to this question. The evidence should be strong enough to reject the null-hypothesis with a specified error rate. Cumming’s new statistics provides no answer to the important question. The new statistics is descriptive, whereas NHST is an inferential statistic. As long as researchers merely want to describe their data, they can report their results in several ways, including reporting of confidence intervals, but when they want to draw conclusions from their data to support theoretical claims, it is necessary to specify what information constitutes sufficient empirical evidence.

One solution to this dilemma is to use confidence intervals to test the null-hypothesis. If the 95% confidence interval does not include 0, the ratio of effect size / sampling error is greater than 2 and the p-value would be less than .05. This is the main reason why many statistics programs report 95% confidence intervals rather than 33%CI or 66%CI. However, the use of 95% confidence intervals to test significance is hardly a new statistical approach that justifies the proclamation of a new statistic that will save empirical scientists from NHST. It is NHST! Not surprisingly, Cumming states that “this is my least preferred way to interpret a confidence interval” (p. 17).

However, he does not explain how researchers should interpret a 95% confidence interval that does include zero. Instead, he thinks it is not necessary to make a decision. “We should not lapse back into dichotomous thinking by attaching any particular importance to whether a value of interest lies just inside or just outside our CI.”

Does an experimental treatment for Ebola work? CI = -.3 to .8. Let’s try it? Or let’s do nothing and do more studies forever? The benefit of avoiding making any decisions is that one can never make a mistake. The cost is that one can also never claim that an empirical claim is supported by evidence. Anybody who is worried about dichotomous thinking might ponder the fact that modern information processing is built on the simple dichotomy of 0/1 bits of information and that it is common practice to decide the fate of undergraduate students on the basis of scoring multiple choice tests in terms of True or False answers.

In my opinion, the solution to the credibility crisis in psychology is not to move away from dichotomous thinking, but to obtain better data that provide more conclusive evidence about theoretical predictions. A simple way to do this is to reduce sampling error. As sampling error decreases, confidence intervals get smaller and are less likely to include zero when an effect is present, and the signal-to-noise ratio increases so that p-values get smaller and smaller when an effect is present. Thus, less sampling error also means fewer decision errors.

The question is how small should sampling error be to reduce decision error and at what point are resources being wasted because the signal-to-noise ratio is clear enough to make a decision.

Power Analysis

Cumming does not distinguish between Fisher’s and Neyman-Pearson’s use of p-values. The main difference is that Fisher advocated the use of p-values without strict criterion values for significance testing. This approach would treat p-values just like confidence intervals as continuous statistics that do not imply an inference. A p-value of .03 is significant with a criterion value of .05, but it is not significant with a criterion value of .01.

Neyman-Pearson introduced the concept of a fixed criterion value to draw conclusions from observed data. A criterion value of p = .05 has a clear interpretation. It means that a test of 1,000 null-hypotheses is expected to produce about 50 significant results (type-I errors). A lower error rate can be achieved by lowering the criterion value (p < .01 or p < .001).
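The long-run interpretation of the criterion value can be illustrated with a small simulation. The design below (1,000 two-group studies with 50 observations per group, all drawn from the same population) is an assumption chosen only for illustration.

```r
set.seed(123)
n_tests <- 1000
alpha   <- .05

# Both groups come from the same population, so every
# null-hypothesis is true by construction.
p_values <- replicate(n_tests, t.test(rnorm(50), rnorm(50))$p.value)

type1_errors <- sum(p_values < alpha)
print(type1_errors)        # observed number of type-I errors
print(n_tests * alpha)     # expected number under the criterion value

# A stricter criterion value lowers the error count
print(sum(p_values < .01))
```

The observed count fluctuates around the expected value of 50 from one simulation run to the next, but the long-run rate is fixed by the criterion value.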

Importantly, Neyman-Pearson also considered the alternative problem that the p-value may fail to reach the critical value when an effect is actually present. They called this probability the type-II error. Unfortunately, social scientists have ignored this aspect of Neyman-Pearson Significance Testing (NPST). Researchers can avoid making type-II errors by reducing sampling error. The reason is that a reduction of sampling error increases the signal-to-noise ratio.

For example, the following p-values were obtained from simulating studies with 95% power. The graph only shows p-values greater than .001 to make the distribution of p-values more prominent. As a result, 62.5% of the data are missing because these p-values are below .001. The histogram of p-values has been popularized by Simonsohn et al. (2013) as a p-curve. The p-curve shows that p-values are heavily skewed towards low p-values. Thus, the studies provide consistent evidence that an effect is present, even though p-values can vary dramatically from one study (p = .0001) to the next (p = .02). The variability of p-values is not a problem for NPST as long as the p-values lead to the same conclusion because the magnitude of a p-value is not important in Neyman-Pearson hypothesis testing.

The next graph shows p-values for studies with 20% power. P-values vary just as much, but now the variation covers both sides of the significance criterion, p = .05. As a result, the evidence is often inconclusive and 80% of studies fail to reject the false null-hypothesis.

R-Code
set.seed(nchar("Cumming'sDancingP-Values"))               # seed for reproducibility
power = .20
low_limit = .000
up_limit = .10
z <- rnorm(2500, qnorm(.975,0,1) + qnorm(power,0,1), 1)   # simulated z-scores for the chosen power
p <- (1 - pnorm(abs(z),0,1))*2                            # two-tailed p-values
hist(p, breaks=1000, freq=F, ylim=c(0,100), xlim=c(low_limit,up_limit))
abline(v=.05, col="red")
percent_below_lower_limit = length(subset(p, p < low_limit))/length(p)
percent_below_lower_limit

If a study is designed to test a qualitative prediction (an experimental manipulation leads to an increase on an observed measure), power analysis can be used to plan a study so that it has a high probability of providing evidence for the hypothesis if the hypothesis is true. It does not matter whether the hypothesis is tested with p-values or with confidence intervals by showing that the confidence interval does not include zero.

Thus, power analysis seems useful even for the new statistics. However, Cumming is “ambivalent about statistical power” (p. 23). First, he argues that it has “no place when we use the new statistics” (p. 23), presumably because the new statistics never make dichotomous decisions.

Cumming’s next argument against power is that power is a function of the type-I error criterion. If the type-I error probability is set to 5% and power is only 33% (e.g., d = .5, between-group design N = 40), it is possible to increase power by increasing the type-I error probability. If the type-I error rate is set to 50%, power is 80%. Cumming thinks that this is an argument against power as a statistical concept, but raising alpha to 50% is equivalent to reducing the width of the confidence interval by computing a 50% confidence interval rather than a 95% confidence interval. Moreover, researchers who adjust alpha to 50% are essentially saying that the null-hypothesis would produce a significant result in every other study. If an editor finds this acceptable and wants to publish the results, neither power analysis nor the reported results are problematic. It is true that there was a good chance to get a significant result when a moderate effect is present (d = .5, 80% probability) and when no effect is present (d = 0, 50% probability). Power analysis provides accurate information about the type-I and type-II error rates. In contrast, the new statistics provides no information about error rates in decision making because it is merely descriptive and does not make decisions.
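The trade-off between the two error rates can be reproduced with power.t.test() from base R. The sketch assumes a two-sample, two-sided t-test with 20 participants per cell; the wrapper name power_at is made up for illustration.

```r
# Between-subject design, n = 20 per cell (N = 40), true effect d = .5
power_at <- function(alpha) {      # power_at is a hypothetical helper name
  power.t.test(n = 20, delta = .5, sd = 1, sig.level = alpha,
               type = "two.sample", alternative = "two.sided")$power
}

alphas <- c(.05, .50)
powers <- sapply(alphas, power_at)
print(data.frame(type1_error = alphas,
                 power = round(powers, 2),
                 type2_error = round(1 - powers, 2)))
```

The output makes both error rates explicit for each criterion value, which is exactly the information a reader needs to judge whether a design with a 50% type-I error rate is acceptable.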

Cumming then points out that “power calculations have traditionally been expected [by granting agencies], but these can be fudged” (p. 23). The problem with fudging power analysis is that the requested grant money may be sufficient to conduct the study, but insufficient to produce a significant result. For example, a researcher may be optimistic and expect a strong effect, d = .80, when the true effect size is only a small effect, d = .20. The researcher conducts a study with N = 52 participants to achieve 80% power. In reality the study has only 11% power and the researcher is likely to end up with a non-significant result. In the new statistics world this is apparently not a problem because the researcher can report the results with a wide confidence interval that includes zero, but it is not clear why a granting agency should fund studies that cannot even provide information about the direction of an effect in the population.
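The cost of an overly optimistic effect size estimate can be computed in the same way. This sketch again assumes a two-sample, two-sided t-test with alpha = .05 and uses power.t.test() from base R.

```r
# Sample size planned under the optimistic assumption d = .8
planned <- power.t.test(delta = .8, sd = 1, sig.level = .05, power = .80)
n_per_cell <- ceiling(planned$n)   # participants per cell, rounded up

# Power of that same design if the true effect is only d = .2
actual_power <- power.t.test(n = n_per_cell, delta = .2, sd = 1,
                             sig.level = .05)$power

print(c(n_per_cell = n_per_cell,
        N = 2 * n_per_cell,
        actual_power = round(actual_power, 2)))
```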

Cumming then points out that “one problem is that we never know true power, the probability that our experiment will yield a statistically significant result, because we do not know the true effect size; that is why we are doing the experiment!” (p. 24). The exclamation mark indicates that this is the final nail in the coffin of power analysis. Power analysis is useless because it makes assumptions about effect sizes when we can just do an experiment to observe the effect size. It is that easy in the world of new statistics. The problem is that we do not know the true effect sizes after an experiment either. We never know the true effect size because we can never determine a population parameter, just like we can never prove the null-hypothesis. It is only possible to estimate population parameters. However, before we estimate a population parameter, we may simply want to know whether an effect exists at all. Power analysis can help in planning studies so that the sample mean shows the same sign as the population mean with a specified error rate.

Determining Sample Sizes in the New Statistics

Although Cumming does not find power analysis useful, he gives some information about sample sizes. Studies should be planned to have a specified level of precision. Cumming gives an example for a between-subject design with n = 50 per cell (N = 100). He chose to present confidence intervals for unstandardized coefficients. In this case, there is no fixed value for the width of the confidence interval because the sampling variance influences the standard error. However, for standardized coefficients like Cohen’s d, sampling variance will produce variation in standardized coefficients, while the standard error is constant. The standard error is simply 2 / sqrt (N), which equals SE = .2 for N = 100. This value needs to be multiplied by 2 to get the confidence interval, and the 95%CI = d +/- .4.   Thus, it is known before the study is conducted that the confidence interval will span 8/10 of a standard deviation and that an observed effect size of d > .4 is needed to exclude 0 from the confidence interval and to state with 95% confidence that the observed effect size would not have occurred if the true effect size were 0 or in the opposite direction.

The problem is that Cumming provides no guidelines about the level of precision that a researcher should achieve. Is 8/10 of a standard deviation precise enough? Should researchers aim for 1/10 of a standard deviation? So when he suggests that funding agencies should focus on precision, it is not clear what criterion should be used to fund research.

One obvious criterion would be to ensure that precision is sufficient to exclude zero so that the results can be used to state that the direction of the observed effect is the same as the direction of the effect in the population that a researcher wants to generalize to. However, as soon as effect sizes are used in the planning of the precision of a study, precision planning is equivalent to power analysis. Thus, the main novel aspect of the new statistics is to ignore effect sizes in the planning of studies, but without providing guidelines about desirable levels of precision. Researchers should be aware that N = 100 in a between-subject design gives a confidence interval that spans 8/10 of a standard deviation. Is that precise enough?

Problem of Questionable Research Practices, Publication Bias, and Multiple Testing

A major problem for any statistical method is the assumption that random sampling error is the only source of error. However, the current replication crisis has demonstrated that reported results are also systematically biased. A major challenge for any statistical approach, old or new, is to deal effectively with systematically biased data.

It is impossible to detect bias in a single study. However, when more than one study is available, it becomes possible to examine whether the reported data are consistent with the statistical assumption that each sample is an independent sample and that the results in each sample are a function of the true effect size and random sampling error. In other words, there is no systematic error that biases the results. Numerous statistical methods have been developed to examine whether data are biased or not.

Cumming (2014) does not mention a single method for detecting bias (Funnel Plot, Egger regression, Test of Excessive Significance, Incredibility-Index, P-Curve, Test of Insufficient Variance, Replicability-Index, P-Uniform). He merely mentions a visual inspection of forest plots and suggests that “if for example, a set of studies is distinctly too homogeneous – it shows distinctly less bouncing around than we would expect from sampling variability… we can suspect selection or distortion of some kind” (p. 23). However, he provides no criteria that explain how variability of observed effect sizes should be compared against predicted variability and how the presence of bias influences the interpretation of a meta-analysis. Thus, he concludes that “even so [biases may exist], meta-analysis can give the best estimates justified by research to date, as well as the best guidance for practitioners” (p. 23). Thus, the new statistics would suggest that extrasensory perception is real because a meta-analysis of Bem’s (2011) infamous Journal of Personality and Social Psychology article shows an effect with a tight confidence interval that does not include zero. In contrast, other researchers have demonstrated with old statistical tools and with the help of post-hoc power that Bem’s results are not credible (Francis, 2012; Schimmack, 2012).
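The comparison of observed against predicted variability that Cumming leaves unspecified can be made concrete. The sketch below is in the spirit of the Test of Insufficient Variance listed above, but it is a simplified Monte Carlo illustration under my own assumptions, not the published procedure: it assumes that results have been converted to z-scores, that independent z-scores have a sampling variance of about 1, and it uses made-up z-scores rather than real data.

```python
import random
from statistics import variance


def insufficient_variance(z_scores, n_sim=20000, seed=1):
    """Estimate how often k independent z-scores vary as little as observed.

    Assumption: without bias, each z-score has sampling variance 1, so the
    sample variance of k z-scores should be close to 1 (or larger if the
    true effects differ). The probability is estimated by simulation.
    """
    k = len(z_scores)
    observed_var = variance(z_scores)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        simulated = [rng.gauss(0.0, 1.0) for _ in range(k)]
        if variance(simulated) <= observed_var:
            hits += 1
    return observed_var, hits / n_sim


# Hypothetical z-scores from ten studies that all just cleared z = 1.96.
hypothetical_z = [2.1, 2.3, 2.0, 2.5, 2.2, 2.05, 2.4, 2.15, 2.6, 2.25]
observed_var, p_low_variance = insufficient_variance(hypothetical_z)
print(f"variance of z-scores = {observed_var:.3f}, "
      f"estimated probability of so little variance = {p_low_variance:.4f}")
```

A variance far below 1 is the numerical counterpart of “distinctly less bouncing around than we would expect,” and the simulated probability gives a criterion that visual inspection of a forest plot does not.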

Research Integrity

Cumming also advocates research integrity. His first point is that psychological science should “promote research integrity: (a) a public research literature that is complete and trustworthy and (b) ethical practice, including full and accurate reporting of research” (p. 8). However, his own article falls short of this ideal. His article does not provide a complete, balanced, and objective account of the statistical literature. Rather, Cumming (2014) cherry-picks references that support his claims and does not cite references that are inconvenient for his claims. I give one clear example of bias in his literature review.

He cites Ioannidis’s 2005 paper to argue that p-values and NHST are flawed and should be abandoned. However, he does not cite Ioannidis and Trikalinos (2007). This article introduces a statistical approach that can detect biases in meta-analysis by comparing the success rate (percentage of significant results) to the observed power of the studies. As power determines the success rate in an honest set of studies, a success rate that is higher than the power reveals publication bias. Cumming not only fails to mention this article; he also warns readers “beware of any power statement that does not state an ES; do not use post hoc power.” Without further elaboration, this would imply that readers should ignore evidence for bias with the Test of Excessive Significance because it relies on post-hoc power. To support this warning, he cites Hoenig and Heisey (2001), claiming that “post hoc power can often take almost any value, so it is likely to be misleading” (p. 24). This statement is misleading because post-hoc power is no different from any other statistic that is influenced by sampling error. In fact, Hoenig and Heisey (2001) show that post-hoc power in a single study is monotonically related to p-values. Their main point is that post-hoc power provides no other information than p-values. However, like p-values, post-hoc power becomes more informative, the higher it is. A study with 99% post-hoc power is likely to be a high powered study, just like extremely low p-values, p < .0001, are unlikely to be obtained in low powered studies or in studies when the null-hypothesis is true. So, post-hoc power is informative when it is high. Cumming (2014) further ignores that variability of post-hoc power estimates decreases in a meta-analysis of post-hoc power and that post-hoc power has been used successfully to reveal bias in published articles (Francis, 2012; Schimmack, 2012).
Thus, his statement that researchers should ignore post-hoc power analyses is not supported by an unbiased review of the literature, and his article does not provide a complete and trustworthy account of the public research literature.
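The logic of comparing the success rate to observed power can be sketched in a few lines. This is a simplified illustration, not the exact procedure of Ioannidis and Trikalinos (2007): it assumes two-sided tests at alpha = .05, treats each result as a z-score, uses the mean of the post-hoc power values as the expected success rate, and runs on made-up z-scores.

```python
import math
from statistics import NormalDist, mean

ND = NormalDist()
Z_CRIT = ND.inv_cdf(0.975)  # two-sided alpha = .05


def observed_power(z):
    """Post-hoc power: power if the true effect equaled the observed z-score."""
    z = abs(z)
    return (1 - ND.cdf(Z_CRIT - z)) + ND.cdf(-Z_CRIT - z)


def excess_significance(z_scores):
    """Compare the success rate with the mean observed power.

    Returns the success rate, the mean observed power, and the binomial
    probability of at least as many significant results if every study
    had the mean observed power.
    """
    k = len(z_scores)
    successes = sum(abs(z) > Z_CRIT for z in z_scores)
    mean_power = mean(observed_power(z) for z in z_scores)
    p_at_least = sum(
        math.comb(k, i) * mean_power**i * (1 - mean_power) ** (k - i)
        for i in range(successes, k + 1)
    )
    return successes / k, mean_power, p_at_least


# Hypothetical z-scores from ten studies, all significant.
hypothetical_z = [2.1, 2.3, 2.0, 2.5, 2.2, 2.05, 2.4, 2.15, 2.6, 2.25]
success_rate, mean_power, p_at_least = excess_significance(hypothetical_z)
print(f"success rate = {success_rate:.2f}, mean observed power = {mean_power:.2f}, "
      f"P(this many successes) = {p_at_least:.4f}")
```

In this made-up example, every study is significant although the average post-hoc power is only about 60%. Averaging over studies is what reduces the sampling error that makes post-hoc power noisy in a single study.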

Conclusion

I cannot recommend Cumming’s new statistics. I routinely report confidence intervals in my empirical articles, but I do not consider them as a new statistical tool. In my opinion, the root cause of the credibility crisis is that researchers conduct underpowered studies that have a low chance to produce the predicted effect and then use questionable research practices to boost power and to hide non-significant results that could not be salvaged. A simple solution to this problem is to conduct more powerful studies that can produce significant results when the predicted effect exists. I do not claim that this is a new insight. Rather, Jacob Cohen has tried his whole life to educate psychologists about the importance of statistical power.

Here is what Jacob Cohen had to say about the new statistics in 1994 using time-travel to comment on Cumming’s article 20 years later.

“Everyone knows” that confidence intervals contain all the information to be found in significance tests and much more. They not only reveal the status of the trivial nil hypothesis but also about the status of non-nil null hypotheses and thus help remind researchers about the possible operation of the crud factor. Yet they are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large! But their sheer size should move us toward improving our measurement by seeking to reduce the unreliable and invalid part of the variance in our measures (as Student himself recommended almost a century ago). Also, their width provides us with the analogue of power analysis in significance testing—larger sample sizes reduce the size of confidence intervals as they increase the statistical power of NHST” (p. 1002).

If you are looking for a book on statistics, I recommend Cohen’s old statistics over Cumming’s new statistics, p < .05.

Conflict of Interest: I do not have a book to sell (yet), but I strongly believe that power analysis is an important tool for all scientists who have to deal with uncontrollable variance in their data. Therefore, I am strongly opposed to Cumming’s push for a new statistics that provides no guidelines for researchers on how they can optimize the use of their resources to obtain credible evidence for effects that actually exist and no guidelines on how science can correct false positive results.