It was relatively quiet on academic twitter while most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the Data Colada crew published a blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and in an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher, as well as about social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).
The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity and may end a career (e.g., Stapel). However, there are many other reasons to be suspicious of the credibility of Dan Ariely’s published results and those of many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest behavior.
Arguably, the most damaging finding for social psychology was that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base rate of successful replications, any given result published in a social psychology journal is more likely to fail to replicate than to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.
Questionable Research Practices
The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script, with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way, at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).
This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behavior. Instead, it is explained by selection for (statistical) significance. That is, when a study produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they report this analysis and do not mention that other ways failed to show the effect. Selection for significance has many names, such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end result of questionable practices and of fraud can be the same: published results are falsely presented as scientifically proven or validated when they have not actually been subjected to a real empirical test.
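The inflation that comes from analyzing data in multiple ways can be illustrated with a small simulation (my sketch, not code from any of the cited articles): if researchers report whichever of three dependent variables happens to reach p < .05, the effective false positive rate roughly triples.

```python
# Minimal simulation: selecting the dependent variable that "works"
# inflates the false positive rate well above the nominal alpha of .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_dvs, n_per_group = 2000, 3, 20

false_positives = 0
for _ in range(n_experiments):
    significant = False
    for _ in range(n_dvs):
        # The null hypothesis is true: both groups come from the same distribution.
        a = rng.normal(size=n_per_group)
        b = rng.normal(size=n_per_group)
        if stats.ttest_ind(a, b).pvalue < .05:
            significant = True  # report only the DV that "worked"
    false_positives += significant

rate = false_positives / n_experiments
# Nominal alpha is .05, but with 3 independent DVs the effective rate
# is roughly 1 - .95**3, i.e. about .14.
```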
Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered dishonest and punished with retractions of articles or even job loss, QRPs are tolerated. This creates the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (“studies show …”). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, yet do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).
The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve 2.0, to examine the credibility of results published in Dan Ariely’s articles.
To examine the credibility of results published in Dan Ariely’s articles, I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he had published 47 articles that were cited at least 47 times. I also included the 48th article, which was cited 47 times. I focus on highly cited articles because dishonest reporting of results is more harmful if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.
For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence, which is an assumption of the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers, including Dan Ariely, can do their own coding and verify my results.
Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. Ninety-two of these z-scores were statistically significant and were used for the z-curve analysis.
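The conversion from reported test statistics to absolute z-scores can be sketched as follows (a minimal illustration of the standard conversion, not the actual coding spreadsheet):

```python
from scipy import stats

def z_from_t(t, df):
    """Convert a t-value to an absolute z-score via its two-sided p-value."""
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
    return stats.norm.isf(p / 2)     # absolute z-score with the same p-value

def z_from_f(f, df1, df2):
    """Convert an F-value to an absolute z-score via its p-value."""
    p = stats.f.sf(f, df1, df2)
    return stats.norm.isf(p / 2)

z = z_from_t(2.5, df=30)  # any z above 1.96 is significant at p < .05, two-sided
```

For F-tests with one numerator degree of freedom, F = t², so both functions give the same z-score for the same underlying result.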
The key results of the z-curve analysis are captured in Figure 1.
Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96, which corresponds to p < .05), there are very few z-scores just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100%, or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.
A formal test of selection for significance compares the observed discovery rate (95% of z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.
Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, whereas fraud can produce extremely high z-scores. The fraudulent study that was flagged by Data Colada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2011). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear losing his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumeister is 11% (Baumeister Audit).
The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and (b) replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.
Z-curve makes it possible to do so because the EDR estimate can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., they now fall on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).
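For readers who want to check the arithmetic: the false discovery risk reported by z-curve is based on Soric’s upper bound, which can be computed directly from a discovery rate and alpha (a sketch of that one formula; the published estimates also carry confidence intervals from the full model):

```python
def false_discovery_risk(edr, alpha=0.05):
    """Soric's upper bound on the false discovery rate,
    given a discovery rate (EDR) and the significance criterion alpha."""
    return (1 / edr - 1) * (alpha / (1 - alpha))

fdr = false_discovery_risk(edr=0.15, alpha=0.05)  # roughly .30, i.e. up to 30%
```

Lowering alpha shrinks the bound for a given discovery rate, which is why re-running the analysis with alpha = .01 produces a much lower false discovery risk.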
With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.
The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.
In Figure 3 only z-scores greater than 2.58 (p < .01; on the right side of the dotted blue line) are used to fit the model, using alpha = .05 (the red vertical line at z = 1.96) as the criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.
The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this method to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.
Citation: Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487
In 2011 I wrote a manuscript in response to Bem’s (2011) unbelievable and flawed evidence for extroverts’ supernatural abilities. It took nearly two years for the manuscript to get published in Psychological Methods. While I was proud to have published in this prestigious journal without formal training in statistics or a grasp of Greek notation, I now realize that Psychological Methods was not the best outlet for the article, which may explain why even some established replication revolutionaries do not know it (comment: “I read your blog, but I didn’t know about this article”). So, I decided to publish an abridged (it is still long), lightly edited (I have learned a few things since 2011), and commented (comments are in […]) version here.
I also learned a few things about titles. So the revised version has a new title.
Finally, I can now disregard the request from the editor, Scott Maxwell, on behalf of reviewer Daryl Bem, to change the name of my statistical index from magic index to incredibility index (the advantage of publishing without the credentials and censorship of peer review).
For readers not familiar with experimental social psychology, it is also important to understand what a multiple-study article is. Most sciences are happy with one empirical study per article. However, social psychologists didn’t trust the results of a single study with p < .05. Therefore, they wanted to see internal conceptual replications of phenomena. Magically, Bem was able to provide evidence for supernatural abilities in not just 1 or 2 or 3 studies, but 8 conceptual replication studies with 9 successful tests. The chance that all 9 statistical tests produced false positive results is smaller than the chance of finding spurious evidence for the Higgs boson particle, which was a big discovery in physics. So, readers in 2011 had a difficult choice to make: either supernatural phenomena are real or multiple-study articles are unreal. My article shows that the latter is likely to be true, as did an article by Greg Francis.
Aside from Alcock’s demonstration of a nearly perfect negative correlation between effect sizes and sample sizes and my demonstration of insufficient variance in Bem’s p-values, Francis’s article and my article remain the only articles that question the validity of Bem’s original findings. Other articles have shown that the results cannot be replicated, but I showed that the original results were already too good to be true. This blog post explains how I did it.
Why most multiple-study articles are false: An Introduction to the Magic Index
(the article formerly known as “The Ironic Effect of Significant Results on the Credibility of Multiple-Study Articles”)
Cohen (1962) pointed out the importance of statistical power for psychology as a science, but statistical power of studies has not increased, while the number of studies in a single article has increased. It has been overlooked that multiple studies with modest power have a high probability of producing nonsignificant results because power decreases as a function of the number of statistical tests that are being conducted (Maxwell, 2004). The discrepancy between the expected number of significant results and the actual number of significant results in multiple-study articles undermines the credibility of the reported
results, and it is likely that questionable research practices have contributed to the reporting of too many significant results (Sterling, 1959). The problem of low power in multiple-study articles is illustrated using Bem’s (2011) article on extrasensory perception and Gailliot et al.’s (2007) article on glucose and self-regulation. I conclude with several recommendations that can increase the credibility of scientific evidence in psychological journals. One major recommendation is to pay more attention to the power of studies to produce positive results without the help of questionable research practices and to request that authors justify sample sizes with a priori predictions of effect sizes. It is also important to publish replication studies with nonsignificant results if these studies have high power to replicate a published finding.
Less is more, except of course for sample size. (Cohen, 1990, p. 1304)
In 2011, the prestigious Journal of Personality and Social Psychology published an article that provided empirical support for extrasensory perception (ESP; Bem, 2011). The publication of this controversial article created vigorous debates in psychology
departments, the media, and science blogs. In response to this debate, the acting editor and the editor-in-chief felt compelled to write an editorial accompanying the article. The editors defended their decision to publish the article by noting that Bem’s (2011) studies were performed according to standard scientific practices in the field of experimental psychology and that it would seem inappropriate to apply a different standard to studies of ESP (Judd & Gawronski, 2011).
Others took a less sanguine view. They saw the publication of Bem’s (2011) article as a sign that the scientific standards guiding publication decisions are flawed and that Bem’s article served as a glaring example of these flaws (Wagenmakers, Wetzels, Borsboom,
& van der Maas, 2011). In a nutshell, Wagenmakers et al. (2011) argued that the standard statistical model in psychology is biased against the null hypothesis; that is, only findings that are statistically significant are submitted and accepted for publication.
This bias leads to the publication of too many positive (i.e., statistically significant) results. The observation that scientific journals, not only those in psychology,
publish too many statistically significant results is by no means novel. In a seminal article, Sterling (1959) noted that selective reporting of statistically significant results can produce literatures that “consist in substantial part of false conclusions” (p.
Three decades later, Sterling, Rosenbaum, and Weinkam (1995) observed that the “practice leading to publication bias have [sic] not changed over a period of 30 years” (p. 108). Recent articles indicate that publication bias remains a problem in psychological
journals (Fiedler, 2011; John, Loewenstein, & Prelec, 2012; Kerr, 1998; Simmons, Nelson, & Simonsohn, 2011; Strube, 2006; Vul, Harris, Winkielman, & Pashler, 2009; Yarkoni, 2010).
Other sciences have the same problem (Yong, 2012). For example, medical journals have seen an increase in the percentage of retracted articles (Steen, 2011a, 2011b), and there is the concern that a vast number of published findings may be false (Ioannidis, 2005).
However, a recent comparison of different scientific disciplines suggested that the bias is stronger in psychology than in some of the older and harder scientific disciplines at the top of a hierarchy of sciences (Fanelli, 2010).
It is important that psychologists use the current crisis as an opportunity to fix problems in the way research is being conducted and reported. The proliferation of eye-catching claims based on biased or fake data can have severe negative consequences for a
science. A New Yorker article warned the public that “all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable” (Lehrer, 2010, p. 1).
If students who read psychology textbooks and the general public lose trust in the credibility of psychological science, psychology loses its relevance because
objective empirical data are the only feature that distinguishes psychological science from other approaches to the understanding of human nature and behavior. It is therefore hard to exaggerate the seriousness of doubts about the credibility of research findings published in psychological journals.
In an influential article, Kerr (1998) discussed one source of bias, namely, hypothesizing after the results are known (HARKing). The practice of HARKing may be attributed to the
high costs of conducting a study that produces a nonsignificant result that cannot be published. To avoid this negative outcome, researchers can design more complex studies that test multiple hypotheses. Chances increase that at least one of the hypotheses
will be supported, if only because Type I error increases (Maxwell, 2004). As noted by Wagenmakers et al. (2011), generations of graduate students were explicitly advised that this questionable research practice is how they should write scientific manuscripts.
It is possible that Kerr’s (1998) article undermined the credibility of single-study articles and added to the appeal of multiple-study articles (Diener, 1998; Ledgerwood & Sherman, 2012). After all, it is difficult to generate predictions for significant effects
that are inconsistent across studies. Another advantage is that the requirement of multiple significant results essentially lowers the chances of a Type I error, that is, the probability of falsely rejecting the null hypothesis. For a set of five independent studies,
the requirement to demonstrate five significant replications essentially shifts the probability of a Type I error from p < .05 for a single study to p < .0000003 (i.e., .05^5) for a set of five studies.
This is approximately the same stringent criterion that is being used in particle physics to claim a true discovery (Castelvecchi, 2011). It has been overlooked, however, that researchers have to pay a price to meet more stringent criteria of credibility. To demonstrate significance at a more stringent criterion of significance, it is
necessary to increase sample sizes to reduce the probability of making a Type II error (failing to reject a false null hypothesis). This probability is called beta. The complementary probability (1 – beta) is called power. Thus, maintaining high statistical power to demonstrate an effect with a more stringent alpha level requires an
increase in sample sizes, just as physicists had to build a bigger collider to have a chance to find evidence for smaller particles like the Higgs boson particle.
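This trade-off can be made concrete with a normal-approximation power calculation (a rough sketch; exact sample sizes depend on the design and the test used): five significant results at p < .05 imply a joint Type I error of .05^5, and demonstrating an effect at that alpha level in a single study requires a much larger sample.

```python
from scipy import stats

def n_per_group(d, power=0.80, alpha=0.05):
    # Approximate per-group n for a two-sided, two-sample comparison
    # (normal approximation to the t-test).
    z_a = stats.norm.isf(alpha / 2)
    z_b = stats.norm.isf(1 - power)
    return 2 * ((z_a + z_b) / d) ** 2

alpha_joint = 0.05 ** 5                   # ~.0000003, the joint Type I error above
n_standard = n_per_group(d=0.5)           # roughly 63 per group at alpha = .05
n_stringent = n_per_group(d=0.5, alpha=alpha_joint)  # roughly 284 per group
```

For a moderate effect (d = .5), meeting the five-study criterion in a single study requires more than four times the sample size, which is the price of stringency the text refers to.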
Yet there is no evidence that psychologists are using bigger samples to meet more stringent demands of replicability (Cohen, 1992; Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). This raises the question of how researchers are able to replicate findings in multiple-study articles despite modest power to demonstrate significant effects even within a single study. Researchers can use questionable research
practices (e.g., snooping, not reporting failed studies, dropping dependent variables, etc.; Simmons et al., 2011; Strube, 2006) to dramatically increase the chances of obtaining a false-positive result. Moreover, a survey of researchers indicated that these
practices are common (John et al., 2012), and the prevalence of these practices has raised concerns about the credibility of psychology as a science (Yong, 2012).
An implicit assumption in the field appears to be that the solution to these problems is to further increase the number of positive replication studies that need to be presented to ensure scientific credibility (Ledgerwood & Sherman, 2012). However, the assumption that many replications with significant results provide strong evidence for a hypothesis is an illusion that is akin to the Texas sharpshooter fallacy (Milloy, 1995). Imagine a Texan farmer named Joe. One day he invites you to his farm and shows you a target with nine shots in the bull’s-eye and one shot just outside the bull’s-eye. You are impressed by his shooting abilities until you find out that he cannot repeat this performance when you challenge him to do it again.
[So far, well-known Texan sharpshooters in experimental social psychology have carefully avoided demonstrating their sharp shooting abilities in open replication studies to avoid the embarrassment of not being able to do it again].
Over some beers, Joe tells you that he first fired 10 shots at the barn and then drew the targets after the shots were fired. One problem in science is that reading a research
article is a bit like visiting Joe’s farm. Readers only see the final results, without knowing how the final results were created. Is Joe a sharpshooter who drew a target and then fired 10 shots at the target? Or was the target drawn after the fact? The reason why multiple-study articles are akin to a Texan sharpshooter is that psychological studies have modest power (Cohen, 1962; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). Assuming
60% power for a single study, the probability of obtaining 10 significant results in 10 studies is less than 1% (.6^10 = 0.6%).
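Total power is simply the per-study power raised to the number of studies, assuming the studies are independent:

```python
def total_power(power, k):
    """Probability that all k independent studies produce significant results."""
    return power ** k

p_all_ten = total_power(0.6, 10)  # roughly 0.006, i.e. the 0.6% above
declining = [round(total_power(0.8, k), 2) for k in (1, 2, 5, 10)]
# Even with 80% power per study, total power drops quickly: [0.8, 0.64, 0.33, 0.11]
```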
I call the probability of obtaining only significant results in a set of studies total power. Total power parallels Maxwell’s (2004) concept of all-pair power for multiple comparisons in analysis-of-variance designs. Figure 1 illustrates how total power decreases with the number of studies that are being conducted. Eventually, it becomes extremely unlikely that a set of studies produces only significant results. This is especially true if a single study has modest power. When total power is low, it is incredible that a set
of studies yielded only significant results. To avoid the problem of incredible results, researchers would have to increase the power of studies in multiple-study articles.
Table 1 shows how the power of individual studies has to be adjusted to maintain 80% total power for a set of studies. For example, to have 80% total power for five replications, the power of each study has to increase to 96%.
Table 1 also shows the sample sizes required to achieve 80% total power, assuming a simple between-group design, an alpha level of .05 (two-tailed), and Cohen’s
(1992) guidelines for a small (d = .2), moderate (d = .5), and strong (d = .8) effect.
[To demonstrate a small effect 7 times would require more than 10,000 participants.]
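The logic behind Table 1 can be sketched with the same kind of normal approximation (the table’s exact values may differ slightly from this rough calculation):

```python
from scipy import stats

def required_power(k, total_power=0.80):
    # Per-study power needed so that all k independent studies
    # are significant with probability total_power.
    return total_power ** (1 / k)

def n_per_group(d, power, alpha=0.05):
    # Approximate per-group n, normal approximation to a two-sided t-test.
    z_a = stats.norm.isf(alpha / 2)
    z_b = stats.norm.isf(1 - power)
    return 2 * ((z_a + z_b) / d) ** 2

power_5 = required_power(5)  # roughly .96, as stated above for five replications
k, d = 7, 0.2
n_total = 2 * n_per_group(d, required_power(k)) * k  # over 10,000 participants
```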
In sum, my main proposition is that psychologists have falsely assumed that increasing the number of replications within an article increases the credibility of psychological science. The problem with this practice is that a truly programmatic set of multiple studies
is very costly and few researchers are able to conduct multiple studies with adequate power to achieve significant results in all replication attempts. Thus, multiple-study articles have intensified the pressure to use questionable research methods to compensate for low total power and may have weakened rather than strengthened the credibility of psychological science.
[I believe this is one reason why the replication crisis has hit experimental social psychology the hardest. Other psychologists could use HARKing to tell a false story about a single study, but experimental social psychologists had to manipulate the data to get significance all the time. Experimental cognitive psychologists also have multiple study articles, but they tend to use more powerful within-subject designs, which makes it more credible to get significant results multiple times. The multiple study BS design made it impossible to do so, which resulted in the publication of BS results.]
What Is the Allure of Multiple-Study Articles?
One apparent advantage of multiple-study articles is to provide stronger evidence against the null hypothesis (Ledgerwood & Sherman, 2012). However, the number of studies is irrelevant because the strength of the empirical evidence is a function of the
total sample size rather than the number of studies. The main reason why aggregation across studies reduces randomness as a possible explanation for observed mean differences (or correlations) is that p values decrease with increasing sample size. The
number of studies is mostly irrelevant. A study with 1,000 participants has as much power to reject the null hypothesis as a meta-analysis of 10 studies with 100 participants if it is reasonable to assume a common effect size for the 10 studies. If true effect sizes vary across studies, power decreases because a random-effects model may be more appropriate (Schmidt, 2010; but see Bonett, 2009). Moreover, the most logical approach to reduce concerns about Type I error is to use more stringent criteria for significance (Mudge, Baker, Edge, & Houlahan, 2012). For controversial or very important research findings, the significance level could be set to p < .001 or, as in particle physics, to p < .0000005.
[Ironically, five years later we have a debate about p < .05 versus p < .005, without even thinking about p < .0000005 or any mention that even a pair of studies with p < .05 in each study effectively have an alpha less than p < .005, namely .0025 to be exact.]
It is therefore misleading to suggest that multiple-study articles are more credible than single-study articles. A brief report with a large sample (N = 1,000) provides more credible evidence than a multiple-study article with five small studies (N = 40 each, total N = 200).
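This point is easy to verify with a normal-approximation power calculation: under a common effect size, fixed-effect pooling of 10 studies with N = 100 yields the same standard error, and hence the same power, as one study with N = 1,000 (a sketch assuming equal group sizes):

```python
from math import sqrt
from scipy import stats

def power_two_sample(d, n_total, alpha=0.05):
    """Approximate power of a two-sided two-sample test with equal groups.
    Under a fixed-effect model, pooling k studies of size n is equivalent
    to a single study with n_total = k * n."""
    ncp = d * sqrt(n_total / 4)       # noncentrality of the pooled z-test
    z_crit = stats.norm.isf(alpha / 2)
    return stats.norm.sf(z_crit - ncp)

one_big = power_two_sample(d=0.2, n_total=1000)   # one study with N = 1,000
meta = power_two_sample(d=0.2, n_total=10 * 100)  # 10 studies with N = 100 each
# one_big equals meta: the strength of evidence depends on the total N,
# not on the number of studies it is divided into.
```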
The main appeal of multiple-study articles seems to be that they can address other concerns (Ledgerwood & Sherman, 2012). For example, one advantage of multiple studies could be to test the results across samples from diverse populations (Henrich, Heine, & Norenzayan, 2010). However, many multiple-study articles are based on samples drawn from a narrowly defined population (typically, students at the local university). If researchers were concerned about generalizability across a wider range of individuals, multiple-study articles should examine different populations. However, it is not clear why it would be advantageous to conduct multiple independent studies with different populations. To compare populations, it would be preferable to use the same procedures and to analyze the data within a single statistical model with population as a potential moderating factor. Moreover, moderator tests often have low power. Thus, a single study with a large sample and moderator variables is more informative than articles that report separate analyses with small samples drawn from different populations.
Another attraction of multiple-study articles appears to be the ability to provide strong evidence for a hypothesis by means of slightly different procedures. However, even here, single studies can be as good as multiple-study articles. For example, replication across different dependent variables in different studies may mask the fact that studies included multiple dependent variables and researchers picked dependent variables that produced significant results (Simmons et al., 2011). In this case, it seems preferable to
demonstrate generalizability across dependent variables by including multiple dependent variables within a single study and reporting the results for all dependent variables.
One advantage of a multimethod assessment in a single study is that the power to
demonstrate an effect increases for two reasons. First, while some dependent variables may produce nonsignificant results in separate small studies due to low power (Maxwell, 2004), they may all show significant effects in a single study with the total sample size
of the smaller studies. Second, it is possible to increase power further by constraining coefficients for each dependent variable or by using a latent-variable measurement model to test whether the effect is significant across dependent variables rather than for each one independently.
Multiple-study articles are most common in experimental psychology to demonstrate the robustness of a phenomenon using slightly different experimental manipulations. For example, Bem (2011) used a variety of paradigms to examine ESP. Demonstrating
a phenomenon in several different ways can show that a finding is not limited to very specific experimental conditions. Analogously, if Joe can hit the bull’s-eye nine times from different angles, with different guns, and in different light conditions, Joe
truly must be a sharpshooter. However, the variation of experimental procedures also introduces more opportunities for biases (Ioannidis, 2005).
[This is my take down of social psychologists’ claim that multiple conceptual replications test theories, Stroebe & Strack, 2004]
The reason is that variation of experimental procedures allows researchers to discount null findings. Namely, it is possible to attribute nonsignificant results to problems with the experimental procedure rather than to the absence of an effect. In this way, empirical studies no longer test theoretical hypotheses because they can only produce two results: Either they support the theory (p < .05) or the manipulation did not work (p > .05). It is therefore worrisome that Bem noted that “like most social psychological experiments, the experiments reported here required extensive pilot testing” (Bem, 2011, p. 421). If Joe is a sharpshooter, who can hit the bull’s-eye from different angles and with different guns, why does he need extensive training before he can perform the critical shot?
The freedom of researchers to discount null findings leads to the paradox that conceptual replications across multiple studies give the impression that an effect is robust followed by warnings that experimental findings may not replicate because they depend “on subtle and unknown factors” (Bem, 2011, p. 422).
If experimental results were highly context dependent, it would be difficult to explain how studies reported in research articles nearly always produce the expected results. One possible explanation for this paradox is that sampling error in small samples creates the illusion that effect sizes vary systematically, although most of the variation is random. Researchers then pick studies that randomly produced inflated effect sizes and may further inflate them by using questionable research methods to achieve significance (Simmons et al., 2011).
[I was polite when I said “may”. This appears to be exactly what Bem did to get his supernatural effects.]
The final set of studies that worked is then published and gives a false sense of the effect size and replicability of the effect (you should see the other side of Joe’s barn). This may explain why research findings initially seem so impressive, but when other researchers try to build on these seemingly robust findings, it becomes increasingly uncertain whether a phenomenon exists at all (Ioannidis, 2005; Lehrer, 2010).
At this point, a lot of resources have been wasted without providing credible evidence for an effect.
[And then Stroebe and Strack in 2014 suggest that real replication studies that let the data determine the outcome are a waste of resources.]
To increase the credibility of reported findings, it would be better to use all of the resources for one powerful study. For example, the main dependent variable in Bem’s (2011) study of ESP was the percentage of correct predictions of future events.
Rather than testing this ability 10 times with N = 100 participants each, it would have been possible to test the main effect of ESP in a single study with 10 variations of experimental procedures and use the experimental conditions as a moderating factor. By testing one
main effect of ESP in a single study with N = 1,000, power would be greater than 99.9% to demonstrate an effect with Bem’s a priori effect size.
At the same time, the power to demonstrate significant moderating effects would be much lower. Thus, the study would lead to the conclusion that ESP does exist but that it is unclear whether the effect size varies as a function of the actual experimental
paradigm. This question could then be examined in follow-up studies with more powerful tests of moderating factors.
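For readers who want to check this claim, here is a minimal power calculation using the normal approximation. The effect size d = 0.2 is my own placeholder for a small ESP effect, not a value taken from Bem's article, and the 1.96 cutoff assumes a two-sided test at alpha = .05:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_one_sample(d, n, z_crit=1.96):
    """Approximate two-sided power of a one-sample design (normal approximation)."""
    return phi(d * sqrt(n) - z_crit)

# d = 0.2 is my assumption for a small effect; it is not taken from the text.
p_pooled = power_one_sample(0.2, 1000)  # one pooled study, N = 1,000
p_single = power_one_sample(0.2, 100)   # one of ten studies, N = 100
print(p_pooled > 0.999)   # True: power exceeds 99.9%
print(round(p_single, 2))  # 0.52: roughly a coin flip per small study
```

Under this assumption, pooling all participants into one study turns ten coin-flip tests into one near-certain test of the main effect.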
In conclusion, it is true that a programmatic set of studies is superior to a brief article that reports a single study if both articles have the same total power to produce significant results (Ledgerwood & Sherman, 2012). However, once researchers use questionable research practices to make up for insufficient total power, multiple-study articles lose their main advantage over single-study articles, namely, to demonstrate generalizability across different experimental manipulations or other extraneous factors.
Moreover, the demand for multiple studies counteracts the demand for more
powerful studies (Cohen, 1962; Maxwell, 2004; Rossi, 1990) because limited resources (e.g., subject pool of PSY100 students) can only be used to increase sample size in one study or to conduct more studies with small samples.
It is therefore likely that the demand for multiple studies within a single article has eroded rather than strengthened the credibility of published research findings
(Steen, 2011a, 2011b), and it is problematic to suggest that multiple-study articles solve the problem that journals publish too many positive results (Ledgerwood & Sherman, 2012). Ironically, the reverse may be true because multiple-study articles provide a
false sense of credibility.
Joe the Magician: How Many Significant Results Are Too Many?
Most people enjoy a good magic show. It is fascinating to see something and to know at the same time that it cannot be real. Imagine that Joe is a well-known magician. In front of a large audience, he fires nine shots from impossible angles, blindfolded, and seemingly through the body of an assistant, who miraculously does not bleed. You cannot figure out how Joe pulled off the stunt, but you know it was a stunt. Similarly, seeing Joe hit the bull’s-eye 1,000 times in a row raises concerns about his abilities as a sharpshooter and suggests that some magic is contributing to this miraculous performance. Magic is fun, but it is not science.
[Before Bem’s article appeared, Steve Heine gave a talk at the University of Toronto where he presented multiple studies with manipulations of absurdity (absurdity like Monty Python’s “Biggles: Pioneer Air Fighter”; cf. Proulx, Heine, & Vohs, PSPB, 2010). Each absurd manipulation was successful. I didn’t have my magic index then, but I did understand the logic of Sterling et al.’s (1995) argument. So, I did ask whether there were also manipulations that did not work and the answer was affirmative. It was rude at the time to ask about a file drawer before 2011, but a recent twitter discussion suggests that it wouldn’t be rude in 2018. Times are changing.]
The problem is that some articles in psychological journals appear to be more magical than one would expect on the basis of the normative model of science (Kerr, 1998). To increase the credibility of published results, it would be desirable to have a diagnostic tool that can distinguish between credible research findings and those that are likely to be based on questionable research practices. Such a tool would also help to
counteract the illusion that multiple-study articles are superior to single-study articles without leading to the erroneous reverse conclusion that single-study articles are more trustworthy.
[I need to explain why I targeted multiple-study articles in particular. Even the personality section of JPSP started to demand multiple studies because they created the illusion of being more rigorous, e.g., the crazy glucose article was published in that section. At that time, I was still trying to publish as many articles as possible in JPSP and I was not able to compete with crazy science.]
Articles should be evaluated on the basis of their total power to demonstrate consistent evidence for an effect. As such, a single-study article with 80% (total) power is superior to a multiple-study article with 20% total power, but a multiple-study article with 80% total power is superior to a single-study article with 80% power.
The Magic Index (formerly known as the Incredibility Index)
The idea to use power analysis to examine bias in favor of theoretically predicted effects and against the null hypothesis was introduced by Sterling et al. (1995). Ioannidis and Trikalinos (2007) provided a more detailed discussion of this approach for the detection of bias in meta-analyses. Ioannidis and Trikalinos’s exploratory test estimates the probability of the number of reported significant results given the average power of the reported studies. Low p values indicate that there are too many significant results, suggesting that questionable research methods contributed to the reported results. In contrast, the inverse inference is not justified because high p values do not justify the inference that questionable research practices did not contribute to the results. To emphasize this asymmetry in inferential strength, I suggest reversing the exploratory test, focusing on the probability of obtaining more nonsignificant results than were reported in a multiple-study article and calling this index the magic index.
Higher values indicate that there is a surprising lack of nonsignificant results (a.k.a., shots that missed the bull’s eye). The higher the magic index is, the more incredible the observed outcome becomes.
Too many significant results could be due to faking, fudging, or fortune. Thus, the statistical demonstration that a set of reported findings is magical does not prove that questionable research methods contributed to the results in a multiple-study article. However, even when questionable research methods did not contribute to the results, the published results are still likely to be biased because fortune helped to inflate effect sizes and produce more significant results than total power justifies.
Computation of the Magic Index
To understand the basic logic of the M-index, it is helpful to consider a concrete example. Imagine a multiple-study article with 10 studies with an average observed effect size of d = .5 and 84 participants in each study (42 in two conditions, total N = 840) and all studies producing a significant result. At first sight, these 10 studies seem to provide strong support against the null hypothesis. However, a post hoc power analysis with the average effect size of d = .5 as estimate of the true effect size reveals that each study had
only 60% power to obtain a significant result. That is, even if the true effect size were d = .5, only six out of 10 studies should have produced a significant result.
The M-index quantifies the probability of the actual outcome (10 out of 10 significant results) given the expected value (six out of 10 significant results) using binomial
probability theory. From the perspective of binomial probability theory, the scenario
is analogous to an urn problem with replacement with six green balls (significant) and four red balls (nonsignificant). The binomial probability of drawing at least one red ball in 10 independent draws is 99.4% (Stat Trek, 2012).
That is, 994 out of 1,000 multiple-study articles with 10 studies and 60% average power
should have produced at least one nonsignificant result in one of the 10 studies. It is therefore incredible if an article reports 10 significant results because only six out of 1,000 attempts would have produced this outcome simply due to chance alone.
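The arithmetic behind this example can be reproduced in a few lines of Python (a sketch using the normal approximation and a 1.96 critical value, so the per-study power comes out slightly above the exact t-test value of 60% used in the text):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate two-sided power with n participants per group (normal approximation)."""
    return phi(d * sqrt(n_per_group / 2) - z_crit)

# The normal approximation gives ~.63; the text's exact t-test value is 60%.
print(round(power_two_sample(0.5, 42), 2))

# Binomial logic of the urn example, using the text's 60% power figure:
power = 0.60
print(round(1 - power ** 10, 3))  # 0.994: chance of at least one nonsignificant study
print(round(power ** 10, 3))      # 0.006: chance that all 10 studies succeed
```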
[I now realize that observed power of 60% would imply that the null-hypothesis is true because observed power is also inflated by selecting for significance. As 50% observed power is needed to achieve significance and chance cannot produce the same observed power each time, the minimum observed power is 62%!]
One of the main problems for power analysis in general and the computation of the M-index in particular is that the true effect size is unknown and has to be estimated. There are three basic approaches to the estimation of true effect sizes. In rare cases, researchers provide explicit a priori assumptions about effect sizes (Bem, 2011). In this situation, it seems most appropriate to use an author’s stated assumptions about effect sizes to compute power with the sample sizes of each study. A second approach is to average reported effect sizes, either by simply computing the mean value or by weighting effect sizes by their sample sizes. Averaging addresses the problem that post hoc effect size estimates of single studies tend to have large confidence intervals; the confidence intervals shrink when effect sizes are aggregated across
studies. However, this approach has two drawbacks. First, averaging of effect sizes makes strong assumptions about the sampling of studies and the distribution of effect sizes (Bonett, 2009). Second, this approach assumes that all studies have the same effect
size, which is unlikely if a set of studies used different manipulations and dependent variables to demonstrate the generalizability of an effect. Ioannidis and Trikalinos (2007) were careful to warn readers that “genuine heterogeneity may be mistaken for bias” (p.
[I did not know about Ioannidis and Trikalinos’s (2007) article when I wrote the first draft. Maybe that is a good thing because I might have followed their approach. However, my approach is different from their approach and solves the problem of pooling effect sizes. Claiming that my method is the same as Trikalinos’s method is like confusing random effects meta-analysis with fixed-effect meta-analysis]
To avoid the problems of average effect sizes, it is promising to consider a third option. Rather than pooling effect sizes, it is possible to conduct post hoc power analysis for each study. Although each post hoc power estimate is associated with considerable sampling error, sampling errors tend to cancel each other out, and the M-index for a set of studies becomes more accurate without having to assume equal effect sizes in all studies.
Unfortunately, this does not guarantee that the M-index is unbiased because power is a nonlinear function of effect sizes. Yuan and Maxwell (2005) examined the implications of this nonlinear relationship. They found that post hoc power estimates may be inflated, especially in small samples where observed effect sizes vary widely around the true effect size. Thus, the M-index is conservative when power is low and magic had to be used to create significant results.
In sum, it is possible to use reported effect sizes to compute post hoc power and to use post hoc power estimates to determine the probability of obtaining a significant result. The post hoc power values can be averaged and used as the probability for a successful
outcome. It is then possible to use binomial probability theory to determine the probability that a set of studies would have produced equal or more nonsignificant results than were actually reported. This probability is [now] called the M-index.
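As a sketch of this procedure, the function below computes the probability of obtaining more nonsignificant results than were reported from a list of per-study power estimates. Instead of averaging the powers and applying a binomial, it computes the exact distribution of the number of nonsignificant results (a Poisson-binomial, computed by dynamic programming); with equal powers the two approaches coincide:

```python
def m_index(powers, nonsig_reported=0):
    """Probability of obtaining MORE nonsignificant results than reported,
    given each study's power. dist[k] = P(exactly k nonsignificant results)."""
    dist = [1.0]
    for p in powers:
        q = 1 - p  # probability that this study misses significance
        new = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            new[k] += mass * p      # study significant: count unchanged
            new[k + 1] += mass * q  # study nonsignificant: count + 1
        dist = new
    return sum(dist[nonsig_reported + 1:])

# With 10 studies at 60% power and zero reported failures,
# this reduces to 1 - .6**10, about .994.
print(round(m_index([0.6] * 10), 3))
```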
[Meanwhile, I have learned that it is much easier to compute observed power based on reported test statistics like t, F, and chi-square values because observed power is determined by these statistics.]
Example 1: Extrasensory Perception (Bem, 2011)
I use Bem’s (2011) article as an example because it may have been a tipping point for the current scientific paradigm in psychology (Wagenmakers et al., 2011).
[I am still waiting for EJ to return the favor and cite my work.]
The editors explicitly justified the publication of Bem’s article on the grounds that it was subjected to a rigorous review process, suggesting that it met current standards of scientific practice (Judd & Gawronski, 2011). In addition, the editors hoped that the publication of Bem’s article and Wagenmakers et al.’s (2011) critique would stimulate “critical further thoughts about appropriate methods in research on social cognition and attitudes” (Judd & Gawronski, 2011, p. 406).
A first step in the computation of the M-index is to define the set of effects that are being examined. This may seem trivial when the M-index is used to evaluate the credibility of results in a single article, but multiple-study articles contain many results and it is not always obvious that all results should be included in the analysis (Maxwell, 2004).
[Same here. Maxwell accepted my article, but apparently doesn’t think it is useful to cite when he writes about the replication crisis.]
[deleted minute details about Bem’s study here.]
Another decision concerns the number of hypotheses that should be examined. Just as multiple studies reduce total power, tests of multiple hypotheses within a single study also reduce total power (Maxwell, 2004). Francis (2012b) decided to focus only on the
hypothesis that ESP exists, that is, that the average individual can foresee the future. However, Bem (2011) also made predictions about individual differences in ESP. Therefore, I used all 19 effects reported in Table 7 (11 ESP effects and eight personality effects).
[I deleted the section that explains alternative approaches that rely on effect sizes rather than observed power here.]
I used G*Power 3.1.2 to obtain post hoc power on the basis of effect sizes and sample sizes (Faul, Erdfelder, Buchner, & Lang, 2009).
The M-index is more powerful when a set of studies contains only significant results. In this special case, the M-index is simply the complement of total power, that is, one minus the probability that all studies produce significant results.
[An article by Fabrigar and Wegener misrepresents my article and confuses the M-Index with total power. When articles do report non-significant result and honestly report them as failures to reject the null-hypothesis (not marginal significance), it is necessary to compute the binomial probability to get the M-Index.]
[Again, I deleted minute computations for Bem’s results.]
Using the highest power estimates produces a total Magic-Index of 99.97% for Bem’s 17 results. Thus, it is unlikely that Bem (2011) conducted 10 studies, ran 19 statistical tests of planned hypotheses, and obtained 14 statistically significant results.
Yet the editors felt compelled to publish the manuscript because “we can only take the author at his word that his data are in fact genuine and that the reported findings have not been taken from a larger set of unpublished studies showing null effects” (Judd & Gawronski, 2011, p. 406).
[It is well known that authors excluded disconfirming evidence and that editors sometimes even asked authors to engage in this questionable research practice. However, this quote implies that the editors asked Bem about failed studies and that he assured them that there are no failed studies, which may have been necessary to publish these magical results in JPSP. If Bem did not disclose failed studies on request and these studies exist, it would violate even the lax ethical standards of the time that mostly operated on a “don’t ask don’t tell” basis. ]
The M-index provides quantitative information about the credibility of this assumption and would have provided the editors with objective information to guide their decision. More importantly, awareness about total power could have helped Bem to plan fewer studies with higher total power to provide more credible evidence for his hypotheses.
Example 2: Sugar High—When Rewards Undermine Self-Control
Bem’s (2011) article is exceptional in that it examined a controversial phenomenon. I used another nine-study article that was published in the prestigious Journal of Personality and Social Psychology to demonstrate that low total power is also a problem
for articles that elicit less skepticism because they investigate less controversial hypotheses. Gailliot et al. (2007) examined the relation between blood glucose levels and self-regulation. I chose this article because it has attracted a lot of attention (142 citations in Web of Science as of May 2012; an average of 24 citations per year) and it is possible to evaluate the replicability of the original findings on the basis of subsequent studies by other researchers (Dvorak & Simons, 2009; Kurzban, 2010).
[If anybody needs evidence that citation counts are a silly indicator of quality, here it is: the article has been cited 80 times in 2014, 64 times in 2015, 63 times in 2016, and 61 times in 2017. A good reason to retract it, if JPSP and APA cares about science and not just impact factors.]
Sample sizes were modest, ranging from N = 12 to 102. Four studies had sample sizes of N < 20, which Simmons et al. (2011) considered to require special justification. The total N is 359 participants. Table 1 shows that this total sample
size is sufficient to have 80% total power for four large effects or two moderate effects and is insufficient to demonstrate a [single] small effect. Notably, Table 4 shows that all nine reported studies produced significant results.
The M-Index for these 9 studies was greater than 99%. This indicates that from a statistical point of view, Bem’s (2011) evidence for ESP is more credible than Gailliot et al.’s (2007) evidence for a role of blood glucose in self-regulation.
A more powerful replication study with N = 180 participants provides more conclusive evidence (Dvorak & Simons, 2009). This study actually replicated Gailliot et al.’s (2007) findings in Study 1. At the same time, the study failed to replicate the results for Studies 3–6 in the original article. Dvorak and Simons (2009) did not report the correlation, but the authors were kind enough to provide this information. The correlation was not significant in the experimental group, r(90) = .10, and the control group, r(90) =
.03. Even in the total sample, it did not reach significance, r(180) = .11. It is therefore extremely likely that the original correlations were inflated because a study with a sample of N = 90 has 99.9% power to produce a significant effect if the true effect
size is r = .5. Thus, Dvorak and Simons’s results confirm the prediction of the M-index that the strong correlations in the original article are incredible.
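This power claim is easy to verify with the Fisher z approximation for correlations (my choice of method; the text does not specify how the 99.9% figure was computed):

```python
from math import atanh, erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_correlation(r, n, z_crit=1.96):
    """Approximate two-sided power to detect a correlation r with n pairs (Fisher z)."""
    return phi(atanh(r) * sqrt(n - 3) - z_crit)

print(round(power_correlation(0.5, 90), 4))  # ~ .999: a true r = .5 almost never misses
print(round(power_correlation(0.1, 90), 2))  # ~ .15: a true r = .1 usually misses
```

The second line also shows why the replication's nonsignificant r = .10 is exactly what a weak true effect would produce in a sample of this size.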
In conclusion, Gailliot et al. (2007) had limited resources to examine the role of blood glucose in self-regulation. By attempting replications in nine studies, they did not provide strong evidence for their theory. Rather, the results are incredible and difficult to replicate, presumably because the original studies yielded inflated effect sizes. A better solution would have been to test the three hypotheses in a single study with a large sample. This approach also makes it possible to test additional hypotheses, such as mediation (Dvorak & Simons, 2009). Thus, Example 2 illustrates that
a single powerful study is more informative than several small studies.
Fifty years ago, Cohen (1962) made a fundamental contribution to psychology by emphasizing the importance of statistical power to produce strong evidence for theoretically predicted effects. He also noted that most studies at that time had only sufficient power to provide evidence for strong effects. Fifty years later, power
analysis remains neglected. The prevalence of studies with insufficient power hampers scientific progress in two ways. First, there are too many Type II errors that are often falsely interpreted as evidence for the null hypothesis (Maxwell, 2004). Second, there
are too many false-positive results (Sterling, 1959; Sterling et al., 1995). Replication across multiple studies within a single article has been considered a solution to these problems (Ledgerwood & Sherman, 2012). The main contribution of this article is to point out that multiple-study articles do not provide more credible evidence simply because they report more statistically significant results. Given the modest power of individual studies, it is even less credible that researchers were able to replicate results repeatedly in a series of studies than that they obtained a significant effect in a single study.
The demonstration that multiple-study articles often report incredible results might help to reduce the allure of multiple-study articles (Francis, 2012a, 2012b). This is not to say that multiple-study articles are intrinsically flawed or that single-study articles are superior. However, more studies are only superior if total power is held constant, yet limited resources create a trade-off between the number of studies and total power of a set of studies.
To maintain credibility, it is better to maximize total power rather than the number of studies. In this regard, it is encouraging that some editors no longer consider the number of studies as a selection criterion for publication (Smith, 2012).
[Over the past years, I have been disappointed by many psychologists that I admired or respected. I loved ER Smith’s work on exemplar models that influenced my dissertation work on frequency estimation of emotion. In 2012, I was hopeful that he would make real changes, but my replicability rankings show that nothing changed during his term as editor of the JPSP section that published Bem’s article. Five wasted years and nobody can say he couldn’t have known better.]
Subsequently, I first discuss the puzzling question of why power continues to be ignored despite the crucial importance of power to obtain significant results without the help of questionable research methods. I then discuss the importance of paying more attention to total power to increase the credibility of psychology as a science. Due to space limitations, I will not repeat many other valuable suggestions that have been made to improve the current scientific model (Schooler, 2011; Simmons et al., 2011; Spellman, 2012; Wagenmakers et al., 2011).
In my discussion, I will refer to Bem’s (2011) and Gailliot et al.’s (2007) articles, but it should be clear that these articles merely exemplify flaws of the current scientific
paradigm in psychology.
Why Do Researchers Continue to Ignore Power?
Maxwell (2004) proposed that researchers ignore power because they can use a shotgun approach. That is, if Joe sprays the barn with bullets, he is likely to hit the bull’s-eye at least once. For example, experimental psychologists may use complex factorial
designs that test multiple main effects and interactions to obtain at
least one significant effect (Maxwell, 2004).
Psychologists who work with many variables can test a large number of correlations
to find a significant one (Kerr, 1998). Although studies with small samples have modest power to detect all significant effects (low total power), they have high power to detect at least one significant effect (Maxwell, 2004).
The shotgun model is unlikely to explain incredible results in multiple-study articles because the pattern of results in a set of studies has to be consistent. This has been seen as the main strength of multiple-study articles (Ledgerwood & Sherman, 2012).
However, low total power in multiple-study articles makes it improbable that all studies produce significant results and increases the pressure on researchers to use questionable research methods to comply with the questionable selection criterion that
manuscripts should report only significant results.
A simple solution to this problem would be to increase total power to avoid
having to use questionable research methods. It is therefore even more puzzling why the requirement of multiple studies has not resulted in an increase in power.
One possible explanation is that researchers do not care about effect sizes. Researchers may not consider it unethical to use questionable research methods that inflate effect sizes as long as they are convinced that the sign of the reported effect is consistent
with the sign of the true effect. For example, the theory that implicit attitudes are malleable is supported by a positive effect of experimental manipulations on the implicit association test, no matter whether the effect size is d = .8 (Dasgupta & Greenwald,
2001) or d = .08 (Joy-Gaba & Nosek, 2010), and the influence of blood glucose levels on self-control is supported by a strong correlation of r = .6 (Gailliot et al., 2007) and a weak correlation of r = .1 (Dvorak & Simons, 2009).
The problem is that in the real world, effect sizes matter. For example, it matters whether exercising for 20 minutes twice a week leads to a weight loss of one
pound or 10 pounds. Unbiased estimates of effect sizes are also important for the integrity of the field. Initial publications with stunning and inflated effect sizes produce underpowered replication studies even if subsequent researchers use a priori power analysis.
As failed replications are difficult to publish, inflated effect sizes are persistent and can bias estimates of true effect sizes in meta-analyses. Failed replication studies in file drawers also waste valuable resources (Spellman, 2012).
In comparison to one small (N = 40) published study with an inflated effect size and
nine replication studies with nonsignificant replications in file drawers (N = 360), it would have been better to pool the resources of all 10 studies for one strong test of an important hypothesis (N = 400).
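The trade-off can be made concrete with the same kind of power calculation. Assuming, purely for illustration, a true effect of d = 0.3 in a two-group design:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate two-sided power with n participants per group (normal approximation)."""
    return phi(d * sqrt(n_per_group / 2) - z_crit)

d = 0.3  # assumed true effect size, not a value from the text
print(round(power_two_sample(d, 200), 2))  # one pooled N = 400 study: ~.85
print(round(power_two_sample(d, 20), 2))   # each N = 40 study: ~.16
```

Under this assumption, each small study is a lottery ticket with a one-in-six chance of success, whereas the single pooled study is a credible test of the hypothesis.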
A related explanation is that true effect sizes are often likely to be small to moderate and that researchers may not have sufficient resources for unbiased tests of their hypotheses. As a result, they have to rely on fortune (Wegner, 1992) or questionable research
methods (Simmons et al., 2011; Vul et al., 2009) to report inflated observed effect sizes that reach statistical significance in small samples.
Another explanation is that researchers prefer small samples to large samples because small samples have less power. When publications do not report effect sizes, sample sizes become an imperfect indicator of effect sizes because only strong effects
reach significance in small samples. This has led to the flawed perception that effect sizes in large samples have no practical significance because even effects without practical significance can reach statistical significance (cf. Royall, 1986). This line of
reasoning is fundamentally flawed and confounds credibility of scientific evidence with effect sizes.
The most probable and banal explanation for ignoring power is poor statistical training at the undergraduate and graduate levels. Discussions with colleagues and graduate students suggest that power analysis is mentioned, but without a sense of importance.
[I have been preaching about power for years in my department and it became a running joke for students to mention power in their presentations without having any effect on research practices until 2011. Fortunately, Bem unintentionally made it possible to convince some colleagues that power is important.]
Research articles also reinforce the impression that power analysis is not important as sample sizes vary seemingly at random from study to study or article to article. As a result, most researchers probably do not know how risky their studies are and how lucky they are when they do get significant and inflated effects.
I hope that this article will change this and that readers take total power into account when they read the next article with five or more studies and 10 or more significant results and wonder whether they have witnessed a sharpshooter or have seen a magic show.
Finally, it is possible that researchers ignore power simply because they follow current practices in the field. Few scientists are surprised that published findings are too good to be true. Indeed, a common response to presentations of this work has been that the M-index only shows the obvious. Everybody knows that researchers use a number of questionable research practices to increase their chances of reporting significant results, and a high percentage of researchers admit to using these practices, presumably
because they do not consider them to be questionable (John et al., 2012).
[Even in 2014, Stroebe and Strack claim that it is not clear which practices should be considered questionable, whereas my undergraduate students have no problem realizing that hiding failed studies undermines the purpose of doing an empirical study in the first place.]
The benign view of current practices is that successful studies provide all of the relevant information. Nobody wants to know about all the failed attempts of alchemists to turn base metals into gold, but everybody would want to know about a process that
actually achieves this goal. However, this logic rests on the assumption that successful studies were really successful and that unsuccessful studies were really flawed. Given the modest power of studies, this conclusion is rarely justified (Maxwell, 2004).
To improve the status of psychological science, it will be important to elevate the scientific standards of the field. Rather than pointing to limited resources as an excuse,
researchers should allocate resources more wisely (spend less money on underpowered studies) and conduct more relevant research that can attract more funding. I think it would be a mistake to excuse the use of questionable research practices by pointing out that false discoveries in psychological research have less dramatic consequences than drugs with little benefits, huge costs, and potential side effects.
Therefore, I disagree with Bem’s (2000) view that psychologists should “err on the side of discovery” (p. 5).
[Yup, he wrote that in a chapter that was used to train graduate students in social psychology in the art of magic.]
Recommendations for Improvement
Use Power in the Evaluation of Manuscripts
Granting agencies often ask that researchers plan studies with adequate power (Fritz & MacKinnon, 2007). However, power analysis is ignored when researchers report their results. The reason is probably that (a priori) power analysis is only seen as a way to ensure that a study produces a significant result. Once a significant finding has been found, low power no longer seems to be a problem; after all, a significant effect was found (in one condition, for male participants, after excluding two outliers).
One way to improve psychological science is to require researchers to justify sample sizes in the method section. For multiple-study articles, researchers should be asked to compute total power.
[This is something nobody has even started to discuss. Although there are more and more (often questionable) a priori power calculations in articles, they tend to aim for 80% power for a single hypothesis test, yet these articles often report multiple studies or multiple hypothesis tests. The total power to get two significant results with 80% power for each test is only 64%.]
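The arithmetic is simple: with independent studies, total power is the product of the individual power values. A minimal sketch, assuming 80% per-study power for illustration:

```python
# Total power: probability that ALL studies in a multiple-study article
# reach significance, assuming independent studies with equal power.
power_per_study = 0.80  # assumed per-study power, for illustration

for k in (1, 2, 5, 10):
    total_power = power_per_study ** k
    print(f"{k:2d} studies: total power = {total_power:.2f}")
```

With two studies total power drops to .64, with five to .33, and with ten to barely .11, which is why a long string of exclusively significant results should raise eyebrows rather than applause.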
If a set of studies has only 80% total power, researchers should also explain how they would deal with the possible outcome of a nonsignificant result. Maybe it would change the perception of research contributions when a research article reports 10 significant
results, although power was only sufficient to obtain six. Implementing this policy would be simple. Thus, it is up to editors to realize the importance of statistical power and to make power an evaluation criterion in the review process (Cohen, 1992).
Implementing this policy could change the hierarchy of psychological
journals. Top journals would no longer be the journals with the most inflated effect sizes but, rather, the journals with the most powerful studies and the most credible scientific evidence.
[Based on this idea, I started developing my replicability rankings of journals. They show that impact factors still do not take replicability into account.]
Reward Effort Rather Than Number of Significant Results
Another recommendation is to pay more attention to the total effort that went into an empirical study rather than the number of significant p values. The requirement to have multiple studies with no guidelines about power encourages a frantic empiricism in
which researchers will conduct as many cheap and easy studies as possible to find a set of significant results.
[And if power is taken into account, researchers now do six cheap MTurk studies. Although this is better than six questionable studies, it does not correct the problem that good research often requires a lot of resources.]
It is simply too costly for researchers to invest in studies with observation of real behaviors, high ecological validity, or longitudinal assessments that take
time and may produce a nonsignificant result.
Given the current environmental pressures, a low-quality/high-quantity strategy is
more adaptive and will ensure survival (publish or perish) and reproductive success (more graduate students who pursue a low-quality/high-quantity strategy).
[It doesn’t help to become a meta-psychologist. Which smart undergraduate student would risk the prospect of a career by becoming a meta-psychologist?]
A common misperception is that multiple-study articles should be rewarded because they required more effort than a single study. However, the number of studies is often a function of the difficulty of conducting research. It is therefore extremely problematic to
assume that multiple studies are more valuable than single studies.
A single longitudinal study can be costly but can answer questions that multiple cross-sectional studies cannot answer. For example, one of the most important developments in psychological measurement has been the development of the implicit association test
(Greenwald, McGhee, & Schwartz, 1998). A widespread belief about the implicit association test is that it measures implicit attitudes that are more stable than explicit attitudes (Gawronski, 2009), but there exist hardly any longitudinal studies of the stability of implicit attitudes.
[I haven’t checked but I don’t think this has changed much. Cross-sectional MTurk studies can still produce sexier results than a study that simply estimates the stability of the same measure over time. Social psychologists tend to be impatient creatures (e.g., Bem).]
A simple way to change the incentive structure in the field is to undermine the false belief that multiple-study articles are better than single-study articles. Often multiple studies are better combined into a single study. For example, one article published four studies that were identical “except that the exposure duration—suboptimal (4 ms)
or optimal (1 s)—of both the initial exposure phase and the subsequent priming phase was orthogonally varied” (Murphy, Zajonc, & Monahan, 1995, p. 589). In other words, the four studies were four conditions of a 2 x 2 design. It would have been more efficient and
informative to combine the information of all studies in a single study. In fact, after reporting each study individually, the authors reported the results of a combined analysis. “When all four studies are entered into a single analysis, a clear pattern emerges” (Murphy et al., 1995, p. 600). Although this article may be the most extreme example of unnecessary multiplicity, other multiple-study articles could also be more informative by reducing the number of studies in a single article.
Apparently, readers of scientific articles are aware of the limited information gain provided by multiple-study articles because citation counts show that multiple-study articles do not have more impact than single-study articles (Haslam et al., 2008). Thus, editors should avoid using number of studies as a criterion for accepting articles.
Allow Publication of Nonsignificant Results
The main point of the M-index is to alert researchers, reviewers, editors, and readers of scientific articles that a series of studies that produced only significant results is neither a cause for celebration nor strong evidence for the demonstration of a scientific discovery; at least not without a power analysis that shows the results are credible.
Given the typical power of psychological studies, nonsignificant findings should be obtained regularly, and the absence of nonsignificant results raises concerns about the credibility of published research findings.
Most of the time, biases may be benign and simply produce inflated effect sizes, but occasionally, it is possible that biases may have more serious consequences (e.g.,
demonstrate phenomena that do not exist).
A perfectly planned set of five studies, where each study has 80% power, is expected to produce one nonsignificant result. It is not clear why editors sometimes ask researchers to remove studies with nonsignificant results. Science is not a beauty contest, and a
nonsignificant result is not a blemish.
This wisdom is captured in the Japanese concept of wabi-sabi, in which beautiful objects are designed to have a superficial imperfection as a reminder that nothing is perfect. On the basis of this conception of beauty, a truly perfect set of studies is one that echoes the imperfection of reality by including failed studies or studies that did not produce significant results.
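The expectation of one nonsignificant result among five studies follows directly from the binomial distribution; a short sketch of the full distribution of significant results, again assuming 80% power per study as in the example above:

```python
from math import comb

k, power = 5, 0.80  # five studies, each with 80% power (as in the text)

# expected number of nonsignificant results: k * (1 - power)
expected_nonsig = k * (1 - power)
print(f"expected nonsignificant results: {expected_nonsig:.1f}")

# full binomial distribution of the number of significant results
for s in range(k + 1):
    p = comb(k, s) * power**s * (1 - power)**(k - s)
    print(f"P({s} of {k} significant) = {p:.3f}")
```

Even under these perfectly planned conditions, the chance that all five studies reach significance is only about one in three; an all-significant set of five is the exception, not the norm.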
Even if these studies are not reported in great detail, it might be useful to describe failed studies and explain how they informed the development of studies that produced significant results. Another possibility is to honestly report that a study failed to produce a significant result with a sample size that provided 80% power and that the researcher then added more participants to increase power to 95%. This is different from snooping (looking at the data until a significant result has been found), especially if it is stated clearly that the sample size was increased because the effect was not significant with the originally planned sample size and the significance test has been adjusted to take into account that two significance tests were performed.
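The distinction between disclosed two-stage testing and snooping can be made concrete with a small simulation. Under the null hypothesis, testing once after an initial sample and again after adding participants inflates the Type I error rate above the nominal 5% unless the per-look criterion is adjusted; the Pocock-style critical value of 2.178 for two equally informative looks is one standard adjustment. The stage sizes below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n1, n2, sims = 100, 100, 20000  # arbitrary stage sizes, many simulated "articles"

# data under the null hypothesis (true mean = 0)
x1 = rng.standard_normal((sims, n1))
x2 = rng.standard_normal((sims, n2))

z1 = x1.mean(axis=1) * np.sqrt(n1)                      # z test after stage 1
z2 = np.concatenate([x1, x2], axis=1).mean(axis=1) * np.sqrt(n1 + n2)  # after stage 2

# naive: test at the nominal two-sided .05 criterion after each look (snooping)
naive = ((np.abs(z1) > 1.96) | (np.abs(z2) > 1.96)).mean()
# adjusted: Pocock critical value for two looks keeps overall alpha near .05
adjusted = ((np.abs(z1) > 2.178) | (np.abs(z2) > 2.178)).mean()

print(f"naive two-look alpha:    {naive:.3f}")
print(f"adjusted two-look alpha: {adjusted:.3f}")
```

With these settings the naive procedure rejects a true null in roughly 8% of simulations, while the adjusted criterion stays near the nominal 5%, which is exactly the correction the honest two-stage report described above would apply.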
The M-index rewards honest reporting of results because reporting of null findings renders the number of significant results more consistent with the total power of the studies. In contrast, a high M-index can undermine the allure of articles that report more significant results than the power of the studies warrants. In this
way, post-hoc power analysis could have the beneficial effect that researchers finally start paying more attention to a priori power.
Limited resources may make it difficult to achieve high total power. When total power is modest, it becomes important to report nonsignificant results. One way to report nonsignificant results would be to limit detailed discussion to successful studies but to
include studies with nonsignificant results in a meta-analysis. For example, Bem (2011) reported a meta-analysis of all studies covered in the article. However, he also mentioned several pilot studies and a smaller study that failed to produce a significant
result. To reduce bias and increase credibility, pilot studies or other failed studies could be included in a meta-analysis at the end of a multiple-study article. The meta-analysis could show that the effect is significant across an unbiased sample of studies that produced significant and nonsignificant results.
This overall effect is functionally equivalent to the test of the hypothesis in a single
study with high power. Importantly, the meta-analysis is only credible if it includes nonsignificant results.
[Since then, several articles have proposed meta-analyses and given tutorials on mini-meta-analyses without citing my article, without clarifying that these meta-analyses are only useful if all evidence is included, and without clarifying that bias tests like the M-Index can reveal whether all relevant evidence was included.]
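A minimal sketch of such an internal meta-analysis, using a fixed-effect, inverse-variance model; the effect sizes and sample sizes below are hypothetical, chosen to mix significant and nonsignificant studies:

```python
from math import sqrt, erfc

# hypothetical study results (Cohen's d, total N); values are invented
# for illustration and mix significant and nonsignificant outcomes
studies = [(0.45, 80), (0.10, 60), (0.30, 120), (0.05, 100)]

num = den = 0.0
for d, n in studies:
    v = 4 / n + d**2 / (2 * n)  # approximate sampling variance of d
    num += d / v                # inverse-variance weighted sum
    den += 1 / v                # sum of weights

d_pooled = num / den
se = sqrt(1 / den)
z = d_pooled / se
p = erfc(abs(z) / sqrt(2))      # two-sided p value from the normal distribution

print(f"pooled d = {d_pooled:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Pooling all four studies, including the two weak ones, yields one significant high-powered test of the overall effect; quietly dropping the nonsignificant studies would instead inflate the pooled estimate, which is precisely the bias the M-index is designed to detect.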
It is also important that top journals publish failed replication studies. The reason is that top journals are partially responsible for the contribution of questionable research practices to published research findings. These journals look for novel and groundbreaking studies that will garner many citations to solidify their position
as top journals. As everywhere else (e.g., investing), the higher payoff comes with a higher risk. In this case, the risk is publishing false results. Moreover, the incentives for researchers to get published in top journals or get tenure at Ivy League universities
increase the probability that questionable research practices contribute
to articles in the top journals (Ledford, 2010). Stapel faked data to get a publication in Science, not to get a publication in Psychological Reports.
There are positive signs that some journal editors are recognizing their responsibility for publication bias (Dirnagl & Lauritzen, 2010). The medical journal Journal of Cerebral Blood Flow and Metabolism created a section that allows researchers to publish studies with disconfirmatory evidence so that this evidence is published in the same journal. One major advantage of having this section in top journals is that it may change the evaluation criteria of journal editors toward a more careful assessment of Type I error when they accept a manuscript for publication. After all, it would be quite embarrassing to publish numerous articles that erred on the side of discovery if subsequent issues reveal that these discoveries were illusory.
[After some pressure from social media, JPSP did publish failed replications of Bem, and it now has a replication section (online only). Maybe somebody can dig up some failed replications of glucose studies, I know they exist, or do one more study to publish in JPSP that, just like ESP, glucose is a myth.]
It could also reduce the use of questionable research practices by researchers eager to publish in prestigious journals if there were a higher likelihood that the same journal will publish failed replications by independent researchers. It might also motivate more researchers to conduct rigorous replication studies if they could bet against a finding and hope to get a publication in a prestigious journal.
The M-index can be helpful in putting pressure on editors and journals to curb the proliferation of false-positive results because it can be used to evaluate editors and journals in terms of the credibility of the results that are published in these journals.
As everybody knows, the value of a brand rests on trust, and it is easy to destroy this value when consumers lose that trust. Journals that continue to publish incredible results and suppress contradictory replication studies are not going to survive, especially given the fact that the Internet provides an opportunity for authors of repressed replication studies to get their findings out (Spellman, 2012).
[I wrote this in the third revision when I thought the editor would not want to see the manuscript again.]
[I deleted the section where I pick on Ritchie’s failed replications of Bem because three small studies with N = 50 are underpowered and can be dismissed as false positives. Replication studies should have at least the sample size of the original studies, which was N = 100 for most of Bem’s studies.]
Another solution would be to ignore p values altogether and to focus more on effect sizes and confidence intervals (Cumming & Finch, 2001). Although it is impossible to demonstrate that the true effect size is exactly zero, it is possible to estimate
true effect sizes with very narrow confidence intervals. For example, a sample of N = 1,100 participants would be sufficient to demonstrate that the true effect size of ESP is zero with a narrow confidence interval of plus or minus .05.
If an even more stringent criterion is required to claim a null effect, sample sizes would have to increase further, but there is no theoretical limit to the precision of effect size estimates. No matter whether the focus is on p values or confidence intervals, Cohen’s recommendation that bigger is better, at least for sample sizes, remains true because large samples are needed to obtain narrow confidence intervals (Goodman & Berlin, 1994).
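The required sample size follows from the sampling error of the effect size estimate. For a correlation near zero, the standard error on the Fisher-z scale is approximately 1/sqrt(N − 3), so interval width shrinks with the square root of N. A sketch (with N = 1,100, a 90% interval has a half-width of about .05, consistent with the ±.05 figure above; a 95% interval would give about .06):

```python
from math import sqrt

def ci_half_width(n, z_crit=1.645):
    """Approximate CI half-width for a correlation near zero.

    Uses SE ~ 1 / sqrt(n - 3) from the Fisher z transformation;
    z_crit = 1.645 gives a 90% interval, 1.96 a 95% interval.
    """
    return z_crit / sqrt(n - 3)

for n in (100, 500, 1100, 5000):
    print(f"N = {n:5d}: 90% CI half-width = {ci_half_width(n):.3f}")
```

The quadratic cost of precision is visible here: halving the half-width requires roughly four times as many participants, which is why narrow confidence intervals demand the large samples Cohen advocated.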
Changing paradigms is a slow process. It took decades to unsettle the stronghold of behaviorism as the main paradigm in psychology. Despite Cohen’s (1962) important contribution to the field 50 years ago and repeated warnings about the problems of underpowered studies, power analysis remains neglected (Maxwell, 2004; Rossi, 1990; Sedlmeier & Gigerenzer, 1989). I hope the M-index can make a small contribution toward the goal of improving the scientific standards of psychology as a science.
Bem’s (2011) article is not going to be a dagger in the heart of questionable research practices, but it may become the historic marker of a paradigm shift.
There are positive signs in the literature on meta-analysis (Sutton & Higgins, 2008), the search for better statistical methods (Wagenmakers, 2007), the call for more
open access to data (Schooler, 2011), changes in publication practices of journals (Dirnagl & Lauritzen, 2010), and increasing awareness of the damage caused by questionable research practices (Francis, 2012a, 2012b; John et al., 2012; Kerr, 1998; Simmons
et al., 2011) to be hopeful that a paradigm shift may be underway.
[Another sad story. I did not understand Wagenmakers’s use of Bayesian methods at the time and I honestly thought this work might make a positive contribution. However, in retrospect I realize that Wagenmakers is more interested in selling his statistical approach at any cost and disregards criticisms of his approach that have become evident in recent years. And, yes, I do understand how the method works and why it will not solve the replication crisis (see commentary by Carlsson et al., 2017, in Psychological Science).]
Even the Stapel debacle (Heatherton, 2010), where a prominent psychologist admitted to faking data, may have a healthy effect on the field.
[Heatherton emailed me and I thought he was going to congratulate me on my nice article or thank me for citing him, but he was mainly concerned that quoting him in the context of Stapel might give the impression that he committed fraud.]
After all, faking increases Type I error by 100% and is clearly considered unethical. If questionable research practices can increase Type I error by up to 60% (Simmons et al., 2011), it becomes difficult to maintain that these widely used practices are questionable but not unethical.
[I guess I was a bit optimistic here. Apparently, you can hide as many studies as you want, but you cannot change one data point because that is fraud.]
During the reign of a paradigm, it is hard to imagine that things will ever change. However, for most contemporary psychologists, it is also hard to imagine that there was a time when psychology was dominated by animal research and reinforcement schedules. Older psychologists may have learned that the only constant in life is change.
[Again, too optimistic. Apparently, many old social psychologists still believe things will remain the same as they always were. Insert head in the sand cartoon here.]
I have been fortunate enough to witness historic moments of change such as the falling of the Berlin Wall in 1989 and the end of behaviorism when Skinner gave his last speech at the convention of the American Psychological Association in 1990. In front of a packed auditorium, Skinner compared cognitivism to creationism. There was dead silence, made more audible by a handful of grey-haired members in the audience who applauded.
[Only I didn’t realize that research in 1990 had other problems. Nowadays I still think that Skinner was just another professor with a big ego and some published #me_too allegations to his name, but he was right in his concerns about (social) cognitivism as not much more scientific than creationism.]
I can only hope to live long enough to see the time when Cohen’s valuable contribution to psychological science will gain the prominence that it deserves. A better understanding of the need for power will not solve all problems, but it will go a long way toward improving the quality of empirical studies and the credibility of results published in psychological journals. Learning about power not only empowers researchers to conduct studies that can show real effects without the help of questionable research practices but also empowers them to be critical consumers of published research findings.
Knowledge about power is power.
Bem, D. J. (2000). Writing an empirical article. In R. J. Sternberg (Ed.), Guide
to publishing in psychological journals (pp. 3–16). Cambridge, England:
Cambridge University Press. doi:10.1017/CBO9780511807862.002
Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous
retroactive influences on cognition and affect. Journal of Personality
and Social Psychology, 100, 407–425. doi:10.1037/a0021524
Bonett, D. G. (2009). Meta-analytic interval estimation for standardized
and unstandardized mean differences. Psychological Methods, 14, 225–
Cohen, J. (1962). Statistical power of abnormal–social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1990). Things I have learned (so far). American Psychologist,
45, 1304–1312. doi:10.1037/0003-066X.45.12.1304
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic
attitudes: Combating automatic prejudice with images of admired and
disliked individuals. Journal of Personality and Social Psychology, 81,
Diener, E. (1998). Editorial. Journal of Personality and Social Psychology,
74, 5–6. doi:10.1037/h0092824
Dirnagl, U., & Lauritzen, M. (2010). Fighting publication bias: Introducing
the Negative Results section. Journal of Cerebral Blood Flow and
Metabolism, 30, 1263–1264. doi:10.1038/jcbfm.2010.51
Dvorak, R. D., & Simons, J. S. (2009). Moderation of resource depletion
in the self-control strength model: Differing effects of two modes of
self-control. Personality and Social Psychology Bulletin, 35, 572–583.
Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, 28, 1–11.
Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS One, 5, Article e10068. doi:10.1371/journal.pone.0010068
Faul, F., Erdfelder, E., Buchner, A., & Lang, A. G. (2009). Statistical power analyses using G*Power 3.1: Tests for correlation and regression analyses. Behavior Research Methods, 41, 1149–1160.
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39, 175–191.
Fiedler, K. (2011). Voodoo correlations are everywhere—not only in neuroscience. Perspectives on Psychological Science, 6, 163–171.
Francis, G. (2012a). The same old New Look: Publication bias in a study
of wishful seeing. i-Perception, 3, 176–178. doi:10.1068/i0519ic
Francis, G. (2012b). Too good to be true: Publication bias in two prominent
studies from experimental psychology. Psychonomic Bulletin & Review,
19, 151–156. doi:10.3758/s13423-012-0227-9
Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect the mediated effect. Psychological Science, 18, 233–239.
Gailliot, M. T., Baumeister, R. F., DeWall, C. N., Maner, J. K., Plant, E. A., Tice, D. M., & Schmeichel, B. J. (2007). Self-control relies on glucose as a limited energy source: Willpower is more than a metaphor. Journal of Personality and Social Psychology, 92, 325–336.
Gawronski, B. (2009). Ten frequently asked questions about implicit measures and their frequently supposed, but not entirely correct answers. Canadian Psychology/Psychologie canadienne, 50, 141–150.
Goodman, S. N., & Berlin, J. A. (1994). The use of predicted confidence
intervals when planning experiments and the misuse of power when
interpreting results. Annals of Internal Medicine, 121, 200–206.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464–1480.
Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J.,
& Wilson, S. (2008). What makes an article influential? Predicting
impact in social and personality psychology. Scientometrics, 76, 169–
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33, 61–83.
Ioannidis, J. P. A. (2005). Why most published research findings are false.
PLoS Medicine, 2(8), Article e124. doi:10.1371/journal.pmed.0020124
Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence
of questionable research practices with incentives for truth telling.
Psychological Science, 23, 524–532. doi:10.1177/0956797611430953
Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability
of implicit racial evaluations. Social Psychology, 41, 137–146.
Judd, C. M., & Gawronski, B. (2011). Editorial comment. Journal of Personality and Social Psychology, 100, 406. doi:10.1037/a0022789
Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.
Kurzban, R. (2010). Does the brain consume additional glucose during
self-control tasks? Evolutionary Psychology, 8, 244–259.
Ledford, H. (2010, August 17). Harvard probe kept under wraps. Nature,
466, 908–909. doi:10.1038/466908a
Ledgerwood, A., & Sherman, J. W. (2012). Short, sweet, and problematic?
The rise of the short report in psychological science. Perspectives on Psychological Science, 7, 60–66. doi:10.1177/1745691611427304
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological
research: Causes, consequences, and remedies. Psychological Methods, 9, 147–163. doi:10.1037/1082-989X.9.2.147
Milloy, J. S. (1995). Science without sense: The risky business of public
health research. Washington, DC: Cato Institute.
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS One, 7(2), Article e32734. doi:10.1371/journal.pone.0032734
Murphy, S. T., Zajonc, R. B., & Monahan, J. L. (1995). Additivity of nonconscious affect: Combined effects of priming and exposure. Journal of Personality and Social Psychology, 69, 589–602.
Ritchie, S. J., Wiseman, R., & French, C. C. (2012a). Failing the future: Three unsuccessful attempts to replicate Bem’s “retroactive facilitation of recall” effect. PLoS One, 7(3), Article e33423. doi:10.1371/journal.pone.0033423
Rossi, J. S. (1990). Statistical power of psychological research: What have
we gained in 20 years? Journal of Consulting and Clinical Psychology,
58, 646–656. doi:10.1037/0022-006X.58.5.646
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. American Statistician, 40, 313–315.
Schmidt, F. (2010). Detecting and correcting the lies that data tell. Perspectives on Psychological Science, 5, 233–242.
Schooler, J. (2011, February 23). Unpublished results hide the decline
effect. Nature, 470, 437. doi:10.1038/470437a
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105, 309–316.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.
Smith, E. R. (2012). Editorial. Journal of Personality and Social Psychology,
102, 1–3. doi:10.1037/a0026676
Spellman, B. A. (2012). Introduction to the special section: Data, data,
everywhere . . . especially in my file drawer. Perspectives on Psychological
Science, 7, 58–59. doi:10.1177/1745691611432124
Steen, R. G. (2011a). Retractions in the scientific literature: Do authors
deliberately commit research fraud? Journal of Medical Ethics, 37,
Steen, R. G. (2011b). Retractions in the scientific literature: Is the incidence
of research fraud increasing? Journal of Medical Ethics, 37,
Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance— or vice versa. Journal of the American Statistical Association, 54(285), 30–34. doi:10.2307/ 2282137
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice-versa. American Statistician, 49, 108–112. doi:10.2307/2684823
Strube, M. J. (2006). SNOOP: A program for demonstrating the consequences
of premature and repeated null hypothesis testing. Behavior
Research Methods, 38, 24–27. doi:10.3758/BF03192746
Sutton, A. J., & Higgins, J. P. I. (2008). Recent developments in metaanalysis.
Statistics in Medicine, 27, 625–650. doi:10.1002/sim.2934
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274–290.
Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.
Wagenmakers, E. J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J.
(2011). Why psychologists must change the way they analyze their data:
The case of psi: Comment on Bem (2011). Journal of Personality and
Social Psychology, 100, 426–432. doi:10.1037/a0022790
Wegner, D. M. (1992). The premature demise of the solo experiment.
Personality and Social Psychology Bulletin, 18, 504–508.
Yarkoni, T. (2009). Big correlations in little studies: Inflated fMRI correlations reflect low statistical power—Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294–298.
Yong, E. (2012, May 16). Bad copy. Nature, 485, 298–300.
Yuan, K. H., & Maxwell, S. (2005). On the post hoc power in testing mean
differences. Journal of Educational and Behavioral Statistics, 30, 141–
Received May 30, 2011
Revision received June 18, 2012
Accepted June 25, 2012
Further Revised February 18, 2018
Thanks to social media, geography is no longer a barrier for scientific discourse. However, language is still a barrier. Fortunately, I understand German and I can respond to the official statement of the board of the German Psychological Association (DGPs), which was posted on the DGPs website (in German).
On September 1, 2015, Prof. Dr. Andrea Abele-Brehm, Prof. Dr. Mario Gollwitzer, and Prof. Dr. Fritz Strack published an official response to the results of the OSF-Replication Project – Psychology (in German) that was distributed to public media in order to correct potentially negative impressions about psychology as a science.
Numerous members of DGPs felt that this official statement did not express their views and noticed that members were not consulted about the official response of their organization. In response to this criticism, DGPs opened a moderated discussion page, where members could post their personal views (mostly in German).
On October 6, 2015, the board closed the discussion page and posted some final words (Schlussbeitrag). In this blog, I provide a critical commentary on these final words.
BOARD’S RESPONSE TO COMMENTS
The board members provide a summary of the core insights and arguments of the discussion from their (personal/official) perspective.
"We would now like to summarize below what we consider the central insights and arguments of the various forum contributions, and to make clear which preliminary conclusions we on the board draw from them."
1. 68% success rate?
The first official statement suggested that the replication project showed a success rate of 68%. This number is based on significance in a meta-analysis that pools the original and replication study. Critics pointed out that this approach is problematic because the replication project showed clearly that the original effect sizes were inflated (on average by 100%). Thus, the meta-analysis is biased and the 68% number is inflated.
In response to this criticism, the DGPs board states that “68% is the maximum [größtmöglich] optimistic estimate.” I think the term “biased and statistically flawed estimate” is a more accurate description of this estimate. It is common practice to consider fail-safe-N or to correct meta-analysis for publication bias. When there is clear evidence of bias, it is unscientific to report the biased estimate. This would be like saying that the maximum optimistic estimate of global warming is that global warming does not exist. This is probably a true statement about the most optimistic estimate, but not a scientific estimate of the actual global warming that has been taking place. There is no place for optimism in science. Optimism is a bias and the aim of science is to remove bias. If DGPs wants to represent scientific psychology, the board should post what they consider the most accurate estimate of replicability in the OSF-project.
2. The widely cited 36% estimate is perceived as negative.
The board members then justify the publication of the maximally optimistic estimate as a strategy to counteract negative perceptions of psychology as a science in response to the finding that only 36% of results were replicated. The board members felt that these negative responses misrepresent the OSF-project and psychology as a scientific discipline.
"This does justice neither to the project of the Open Science Collaboration nor to our discipline as a whole. We should, however, be pioneers among the affected sciences in constructively overcoming the crisis."
However, reporting the dismal 36% replication rate of the OSF-replication project is not a criticism of the OSF-project. Rather, it assumes that the OSF-replication project was a rigorous and successful attempt to provide an estimate of the typical replicability of results published in top psychology journals. The outcome could have been 70% or 35%. The quality of the project does not depend on the result. The result is also not a negatively biased perception of psychology as a science. It is an objective scientific estimate of the probability that a reported significant result in a journal would produce a significant result again in a replication study. Whether 36% is acceptable or not can be debated, but it seems problematic to post a maximally optimistic estimate to counteract negative implications of an objective estimate.
3. Is 36% replicability good or bad?
Next, the board ponders the implications of the 36% success rate. “How should we evaluate this number?” The board members do not know. According to their official conclusion, this question is complex, as the divergent contributions on the discussion page suggest.
"In the Science article, the relative frequency of statistically significant effects in the replication studies was reported as 36%. How should this number be evaluated? The forum contributions by Roland Deutsch, Klaus Fiedler, Moritz Heene (see also Heene & Schimmack), and Frank Renkewitz make clear how complex the answer to this question is."
To help the board members understand the number, I can give a brief explanation of replicability. Although there are several ways to define replicability, one plausible definition of replicability is to equate it with statistical power. Statistical power is the probability that a study will produce a significant result. A study with 80% power has an 80% probability of producing a significant result. For a set of 100 studies, one would expect roughly 80 significant results and 20 non-significant results. For 100 studies with 36% power, one would expect roughly 36 significant results and 64 non-significant results. If researchers published all studies, the percentage of published significant results would provide an unbiased estimate of the typical power of studies. However, it is well known that significant results are more likely to be written up, submitted for publication, and accepted for publication. These reporting biases explain why psychology journals report over 90% significant results, although the actual power of studies is less than 90%.
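The relationship between power and expected success rates can be illustrated with a small simulation (a sketch; the function name and the 1,000-run setup are illustrative, not from the original text):

```python
import random

random.seed(1)

def run_studies(power, n_studies=100, n_sims=1000):
    """Simulate n_studies studies that each produce a significant result
    (p < .05) with probability equal to their statistical power; return
    the average number of significant results across simulations."""
    total = sum(sum(random.random() < power for _ in range(n_studies))
                for _ in range(n_sims))
    return total / n_sims

print(run_studies(0.80))  # roughly 80 significant results out of 100
print(run_studies(0.36))  # roughly 36 significant results out of 100
```

Without selective reporting, the published success rate would track power in exactly this way; the gap between 36% power and a 90%+ published success rate is what reporting biases have to absorb.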
In 1962, Jacob Cohen provided the first attempt to estimate replicability of psychological results. His analysis suggested that psychological studies have approximately 50% power. He suggested that psychologists should increase power to 80% to provide robust evidence for effects and to avoid wasting resources on studies that cannot detect small, but practically important effects. For the next 50 years, psychologists ignored Cohen’s warning that most studies are underpowered, despite repeated reminders and no signs of improvement, including reminders by prominent German psychologists like Gerd Gigerenzer, director of a Max Planck Institute (Sedlmeier & Gigerenzer, 1989; Maxwell, 2004; Schimmack, 2012).
The 36% success rate for an unbiased set of 100 replication studies suggests that the actual power of published studies in psychology journals is 36%. The power of all studies conducted is even lower because the p < .05 selection criterion favors studies with higher power. Does the board think 36% power is an acceptable amount of power?
4. Psychologists should improve replicability in the future
On a positive note, the board members suggest that, after careful deliberation, psychologists need to improve replicability so that it can be demonstrated in a few years that replicability has increased.
"After careful discussion among our members, we must take measures (at journals, in departments, at funding organizations, etc.) that can increase the replication rate over time."
The board members do not mention a simple solution to the replicability problem that was advocated over 50 years ago by Jacob Cohen. To increase replicability, psychologists have to think about the strength of the effects that they are investigating, and they have to conduct studies that have a realistic chance to distinguish these effects from variation due to random error. This often means investing more resources (larger samples, repeated trials, etc.) in a single study. Unfortunately, the leaders of German psychologists appear to be unaware of this important and simple solution to the replication crisis. They neither mention power as a cause of the problem, nor do they recommend increasing power to increase replicability in the future.
5. Do the Results Reveal Fraud?
The DGPs board members then discuss the possibility that the OSF-reproducibility results reveal fraud, like the fraud committed by Stapel. The board points out that the OSF-results do not imply that psychologists commit fraud because failed replications can occur for various reasons.
"Many media outlets (and also some colleagues from our own field) mention the findings of the Science study in the same breath as the fraud scandals that have shaken our field in recent years. In our opinion, this association is problematic: it suggests that the low replication rate is due to methodologically questionable behavior by the authors of the original studies."
It is true that the OSF-results do not reveal fraud. However, the board members confuse fraud with questionable research practices. Fraud is defined as fabricating data that were never collected. Only one of the 100 studies in the OSF-replication project (by Jens Förster, a former student of Fritz Strack, one of the board members) is currently being investigated for fraud by the University of Amsterdam. Despite very strong results in the original study, it failed to replicate.
The more relevant question is how much questionable research practices contributed to the results. Questionable research practices are practices where data are collected, but statistical results are selectively reported: studies, conditions, dependent variables, or data points that do not produce significant results are excluded from the results that are submitted for publication. It has been known for over 50 years that these practices produce a discrepancy between the actual power of studies and the rate of significant results that are published in psychology journals (Sterling, 1959).
Recent statistical developments have made it possible to estimate the true power of studies after correcting for publication bias. Based on these calculations, the true power of the original studies in the OSF-project was only 50%. Thus a large portion of the discrepancy between nearly 100% reported significant results and a replication success rate of 36% is explained by publication bias (see R-Index blogs for social psychology and cognitive psychology).
Other factors may contribute to the discrepancy between the statistical prediction that the replication success rate would be 50% and the actual success rate of 36%. Nevertheless, the lion’s share of the discrepancy can be explained by the questionable practice of reporting only evidence that supports a hypothesis that a researcher wants to support. This motivated bias undermines the very foundations of science. Unfortunately, the board ignores this implication of the OSF results.
6. What can we do?
The board members have no answer to this important question. In the past four years, numerous articles have been published that have made suggestions for how psychology can improve its credibility as a science. Yet, the DGPs board seems to be unaware of these suggestions or unable to comment on these proposals.
"This brings us to the question that occupies us most as a scientific society and will continue to occupy us. On the one hand, we need careful self-reflection about the role of replications in our field, about the meaning of the latest Science study and of the further Center for Open Science projects that are currently in press or being analyzed (such as the Many Labs studies), and about the limits of our methods and paradigms."
The time for more discussion has passed. After 50 years of ignoring Jacob Cohen’s recommendation to increase statistical power, it is time for action. If psychologists are serious about replicability, they have to increase the power of their studies.
The board then discusses the possibility of measuring and publishing replication rates at the level of departments or individual scientists. They are not in favor of such initiatives, but they provide no argument for their position.
"Databases of successful and failed replications can, of course, also be analyzed at the level of departments or even individual researchers (who has the highest replication rate, who the lowest?). More sensible than such analyses are initiatives like those currently implemented (among others) at LMU München (see the contribution by Schönbrodt and colleagues)."
The question is why replicability should not be measured and used to evaluate researchers. If the board really valued replicability and wanted to increase replicability in a few years, wouldn’t it be helpful to have a measure of replicability and to reward departments or researchers who invest more resources in high powered studies that can produce significant results without the need to hide disconfirming evidence in file-drawers? A measure of replicability is also needed because current quantitative measures of scientific success are one of the reasons for the replicability crisis. The most successful researchers are those who publish the most significant results, no matter how these results were obtained (with the exception of fraud). To change this unscientific practice of significance chasing, it is necessary to have an alternative indicator of scientific quality that reflects how significant results were obtained.
The board makes some vague concluding remarks that are not worthwhile repeating here. So let me conclude with my own remarks.
The response of the DGPs board is superficial and does not engage with the actual arguments that were exchanged on the discussion page. Moreover, it ignores some solid scientific insights into the causes of the replicability crisis and it makes no concrete suggestions how German psychologists should change their behaviors to improve the credibility of psychology as a science. Not once do they point out that the results of the OSF-project were predictable based on the well-known fact that psychological studies are underpowered and that failed studies are hidden in file-drawers.
I received my education in Germany all the way to a Ph.D. at the Free University in Berlin. I had several important professors and mentors who educated me about philosophy of science and research methods (Rainer Reisenzein, Hubert Feger, Hans Westmeyer, Wolfgang Schönpflug). I was a member of DGPs for many years. I do not believe that the opinions of the board members represent a general consensus among German psychologists. I hope that many German psychologists recognize the importance of replicability and are motivated to change the way psychologists conduct research. As I am no longer a member of DGPs, I have no direct influence on the organization, but I hope that the next election will bring in board members who promote open science, transparency, and above all scientific integrity.
The OSF-Reproducibility Project (Psychology) aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.
An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).
In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. As a result, between-subject designs require large samples to study small effects, or they can only be used to study large effects.
One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject studies by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).
Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project across all areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. This post examines the replicability of results in cognitive psychology. The results for social psychology are posted here.
The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 42 replications were classified as cognitive studies. I excluded Reynolds and Bresner because the original finding was not significant. I excluded C Janiszewski, D Uy (doi:10.1111/j.1467-9280.2008.02057.x) because it examined the anchor effect, which I consider to be social psychology. Finally, I excluded two studies with children as participants because this research falls into developmental psychology (E Nurmsoo, P Bloom; V Lobue, JS DeLoache).
I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values, and the two-tailed p-values were converted into absolute z-scores using the inverse of the standard normal distribution, z = norm.inverse(1 − p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).
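The p-to-z conversion can be written out in a few lines (a sketch; `p_to_z` is a hypothetical helper name, and only the standard normal quantile function is assumed):

```python
from statistics import NormalDist

def p_to_z(p_two_tailed):
    """Convert a two-tailed p-value into an absolute z-score:
    z = Phi^{-1}(1 - p/2), where Phi is the standard normal CDF."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)

print(round(p_to_z(0.05), 2))  # 1.96, the two-tailed significance criterion
print(round(p_to_z(0.01), 2))  # 2.58
```

This mapping is what puts heterogeneous test statistics (t, F, chi-square) on the common z-score metric used in the power analyses below.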
Estimated power was 75%. This finding reveals the typical presence of publication bias because the actual success rate of 100% is too high given the power of the studies. Based on this estimate, one would expect that only 75% of the 38 findings (k = 29) would produce a significant result in a set of 38 exact replication studies with the same design and sample size.
The Figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated and suggests a file-drawer of missing studies with non-significant results. However, the mode of the curve (its highest point) is projected to be on the right side of the significance criterion (z = 1.96, p = .05, two-tailed), which suggests that more than 50% of results should replicate. Given the absence of reliable data in the range from 0 to 1.96, the data make it impossible to estimate the exact distribution in this region, but the gentle decline of z-scores on the right side of the significance criterion suggests that the file-drawer is relatively small.
Sample sizes of the replication studies were based on power analysis with the reported effect sizes. The problem with this approach is that the reported effect sizes are inflated and provide an inflated estimate of true power. With a true power estimate of 75%, the inflated power estimates were above 80% and often over 90%. As a result, many replication studies used the same sample size, and some even used a smaller sample size because the original study appeared to be overpowered (the sample size was much larger than needed). The median sample size was N = 32 for both the original and the replication studies, but differences in sample size for individual studies make it difficult to compare the success rate of the original studies with that of the replication studies. Therefore, I adjusted the z-scores of the replication studies to match the z-scores that would have been obtained with the original sample sizes. Based on the post-hoc-power analysis above, I predicted that 75% of the replication studies would produce a significant result (k = 29). I also had posted predictions for individual studies based on a more comprehensive assessment of each article. The success rate for my a priori predictions was 69% (k = 27).
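The article does not spell out the adjustment formula, but because z-scores grow roughly with the square root of the sample size, an adjustment consistent with the description can be sketched as follows (`adjust_z` is a hypothetical helper; the constant-effect-size assumption is mine):

```python
import math

def adjust_z(z_rep, n_rep, n_orig):
    """Rescale a replication z-score to the z-score expected with the
    original sample size, using the approximation that z grows with the
    square root of N (assumes the effect size is the same in both studies)."""
    return z_rep * math.sqrt(n_orig / n_rep)

# A z of 1.20 obtained with N = 20 projects to a larger z with N = 32:
print(round(adjust_z(1.20, 20, 32), 2))  # 1.52
```

Rescaling in this way puts original and replication results on a comparable footing before counting successes.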
The actual replication rate based on adjusted z-scores was 63% (k = 22), although 3 studies produced only p-values between .05 and .06 after the adjustment was applied. If these studies were not counted, the success rate would have been 50% (19/38). This finding suggests that post-hoc power analysis overestimates true power by 10% to 25%. However, it is also possible that some of the replication studies failed to reproduce the exact experimental conditions of the original studies, which would lower the probability of obtaining a significant result. Moreover, the number of studies is very small and the discrepancy may simply be due to random sampling error. The important result is that post-hoc power curves correctly predict that the success rate in a replication study will be lower than the actual success rate because it corrects for the effect of publication bias. It also correctly predicts that a substantial number of studies will be successfully replicated, which they were. In comparison, post-hoc power analysis of social psychology predicted only 35% of successful replications and only 8% successfully replicated. Thus, post-hoc power analysis correctly predicts that results in cognitive psychology are more replicable than results in social psychology.
The next figure shows the post-hoc-power curve for the sample-size corrected z-scores of the replication studies.
The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 53% for the heterogeneous model that fits the data better than a homogeneous model. The shape of the distribution suggests that several of the non-significant results are type-II errors; that is, the studies had insufficient statistical power to demonstrate a real effect.
I also conducted a power analysis that was limited to the non-significant results. The estimated average power was 22%. This power is a mixture of true power in different studies and may contain some cases of true false positives (power = .05), but the existing data are insufficient to determine whether results are true false positives or whether a small effect is present and sample sizes were too small to detect it. Again, it is noteworthy that the same analysis for social psychology produced an estimate of 5%, which suggests that most of the non-significant results in social psychology are true false positives (the null-effect is true).
Below I discuss my predictions of individual studies.
Eight studies reported an effect with a z-score greater than 4 (4 sigma), and I predicted that all of the 4-sigma effects would replicate. 7 out of 8 effects were successfully replicated (D Ganor-Stern, J Tzelgov; JI Campbell, ND Robert; M Bassok, SF Pedigo, AT Oskarsson; PA White; E Vul, H Pashler; E Vul, M Nieuwenstein, N Kanwisher; J Winawer, AC Huk, L Boroditsky). The only exception was CP Beaman, I Neath, AM Surprenant (DOI: 10.1037/0278-7322.214.171.124). It is noteworthy that the sample size of the original study was N = 99 and the sample size of the replication study was N = 14. Even with an adjusted z-score the study produced a non-significant result (p = .19). However, small samples produce less reliable results and it would be interesting to examine whether the result would become significant with an actual sample of 99 participants.
Based on more detailed analysis of individual articles, I predicted that an additional 19 studies would replicate. However, 9 out of these 19 studies were not successfully replicated. Thus, my predictions of additional successful replications are just at chance level, given the overall success rate of 50%.
Based on more detailed analysis of individual articles, I predicted that 11 studies would not replicate. However, 5 out of these 11 studies were successfully replicated. Thus, my predictions of failed replications are just at chance level, given the overall success rate of 50%.
In short, my only rule that successfully predicted replicability of individual studies was the 4-sigma rule that predicts that all findings with a z-score greater than 4 will replicate.
In conclusion, a replicability of 50-60% is consistent with Cohen’s (1962) suggestion that typical studies in psychology have 60% power. Post-hoc power analysis slightly overestimated the replicability of published findings despite its ability to correct for publication bias. Future research needs to examine the sources that lead to a discrepancy between predicted and realized success rate. It is possible that some of this discrepancy is due to moderating factors. Although a replicability of 50-60% is not as catastrophic as the results for social psychology with estimates in the range from 8-35%, cognitive psychologists should aim to increase the replicability of published results. Given the widespread use of powerful within-subject designs, this is easily achieved by a modest increase in sample sizes from currently 30 participants to 50 participants, which would increase power from 60% to 80%.
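The claimed gain from N = 30 to N = 50 can be checked with a normal-approximation power calculation (a sketch; the effect size of d ≈ .40 is back-calculated from the stated 60% power at N = 30 and is not reported in the text):

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def power_within(d, n, alpha=0.05):
    """Approximate two-tailed power of a within-subject (one-sample)
    t-test via the normal approximation: Phi(d * sqrt(n) - z_crit)."""
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(d * sqrt(n) - z_crit)

d = 0.404  # effect size implied by ~60% power at N = 30 (assumption)
print(round(power_within(d, 30), 2))  # about 0.60
print(round(power_within(d, 50), 2))  # about 0.82
```

The calculation supports the concluding claim: for a typical within-subject effect, moving from 30 to 50 participants lifts power from roughly 60% to roughly 80%.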
A type-I error occurs when the null-hypothesis (i.e., the effect size is zero) is rejected even though it is true; the type-I error rate is the probability of this outcome.
A type-II error occurs when the null-hypothesis is not rejected even though it is false (i.e., there is an effect); the type-II error rate is the probability of this outcome.
A common application of statistics is to provide empirical evidence for a theoretically predicted relationship between two variables (cause-effect or covariation). The results of an empirical study can produce two outcomes. Either the result is statistically significant or it is not statistically significant. Statistically significant results are interpreted as support for a theoretically predicted effect.
Statistically non-significant results are difficult to interpret because the prediction may be false (the null-hypothesis is true) or a type-II error occurred (the theoretical prediction is correct, but the results fail to provide sufficient evidence for it).
To avoid type-II errors, researchers can design studies that reduce the type-II error probability. The probability of avoiding a type-II error when a predicted effect exists is called power. It could also be called the probability of success because a significant result can be used to provide empirical support for a hypothesis.
Ideally, researchers would want to maximize power to avoid type-II errors. However, powerful studies require more resources. Thus, researchers face a trade-off between the allocation of resources and the probability of obtaining a statistically significant result.
Jacob Cohen dedicated a large portion of his career to help researchers with the task of planning studies that can produce a successful result, if the theoretical prediction is true. He suggested that researchers should plan studies to have 80% power. With 80% power, the type-II error rate is still 20%, which means that 1 out of 5 studies in which a theoretical prediction is true would fail to produce a statistically significant result.
Cohen (1962) examined the typical effect sizes in psychology and found that the typical effect size for the mean difference between two groups (e.g., men and women or experimental vs. control group) is about half a standard deviation. The standardized effect size measure is called Cohen’s d in his honor. Based on his review of the literature, Cohen suggested that an effect size of d = .2 is small, d = .5 moderate, and d = .8 large. Importantly, a statistically small effect size can have huge practical importance. Thus, these labels should not be used to make claims about the practical importance of effects. The main purpose of these labels is that researchers can better plan their studies. If researchers expect a large effect (d = .8), they need a relatively small sample to have high power. If researchers expect a small effect (d = .2), they need a large sample to have high power. Cohen (1992) provided information about effect sizes and sample sizes for different statistical tests (chi-square, correlation, ANOVA, etc.).
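Cohen's sample-size guidance can be approximated with a short calculation (a sketch using the normal approximation to the two-sample t-test; exact t-based values are slightly larger):

```python
from math import ceil
from statistics import NormalDist

nd = NormalDist()

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate sample size per group for a two-group mean comparison
    (normal approximation to the two-sample t-test, two-tailed):
    n = 2 * ((z_alpha/2 + z_power) / d)^2."""
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    return ceil(2 * (z / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # small effects demand far larger samples
```

For 80% power this yields roughly 393, 63, and 25 participants per group for small, medium, and large effects, which illustrates why underpowered studies of small effects are so common.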
Cohen (1962) conducted a meta-analysis of studies published in a prominent psychology journal. Based on the typical effect size and sample size in these studies, Cohen estimated that the average power in studies is about 60%. Importantly, this also means that the typical power to detect small effects is less than 60%. Thus, many studies in psychology have low power and a high type-II error probability. As a result, one would expect that journals often report that studies failed to support theoretical predictions. However, the success rate in psychological journals is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). There are two explanations for discrepancies between the reported success rate and the success probability (power) in psychology. One explanation is that researchers conduct multiple studies and only report successful studies. The other studies remain unreported in a proverbial file-drawer (Rosenthal, 1979). The other explanation is that researchers use questionable research practices to produce significant results in a study (John, Loewenstein, & Prelec, 2012). Both practices have undesirable consequences for the credibility and replicability of published results in psychological journals.
A simple solution to the problem would be to increase the statistical power of studies. If the power of studies in psychology were over 90%, a success rate of 90% would be justified by the actual probability of obtaining significant results. However, meta-analyses and method articles have repeatedly pointed out that psychologists do not consider statistical power in the planning of their studies and that studies continue to be underpowered (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989).
One reason for the persistent neglect of power could be that researchers have no awareness of the typical power of their studies. This could happen because observed power in a single study is an imperfect indicator of true power (Yuan & Maxwell, 2005). If a study produced a significant result, the observed power is at least 50%, even if the true power is only 30%. Even if the null-hypothesis is true, and researchers publish only type-I errors, observed power is dramatically inflated to 62%, when the true power is only 5% (the type-I error rate). Thus, Cohen’s estimate of 60% power is not very reassuring.
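The inflation of observed power can be demonstrated with a simulation (a sketch; `observed_power` is a hypothetical helper that uses the normal approximation, so the exact value differs slightly from t-based calculations):

```python
import random
from statistics import NormalDist, mean

random.seed(2)
nd = NormalDist()
z_crit = nd.inv_cdf(0.975)  # 1.96

def observed_power(z):
    """Post-hoc 'observed power': treat the observed |z| as the true
    noncentrality and compute two-tailed power (normal approximation)."""
    return nd.cdf(abs(z) - z_crit) + nd.cdf(-abs(z) - z_crit)

# Simulate studies in which the null-hypothesis is true (true power = .05),
# then look only at the ones that happened to reach significance:
zs = [random.gauss(0, 1) for _ in range(200_000)]
significant = [z for z in zs if abs(z) > z_crit]
print(mean(observed_power(z) for z in significant))  # roughly .6, not .05
```

Even though every simulated study is a type-I error, the average observed power among the significant ones lands near the dramatically inflated value described in the text.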
Over the past years, Schimmack and Brunner have developed a method to estimate power for sets of studies with heterogeneous designs, sample sizes, and effect sizes. A technical report is in preparation. The basic logic of this approach is to convert the results of all statistical tests into z-scores using the one-tailed p-value of a statistical test. The z-scores provide a common metric for observed statistical results. The standard normal distribution predicts the distribution of observed z-scores for a fixed value of true power. However, for heterogeneous sets of studies the distribution of z-scores is a mixture of standard normal distributions with different weights attached to various power values. To illustrate this method, the histograms of z-scores below show simulated data with 100,000 observations with varying levels of true power: 20% null-hypotheses being true (5% power), 20% of studies with 33% power, 20% of studies with 50% power, 20% of studies with 66% power, and 20% of studies with 80% power.
The plot shows the distribution of absolute z-scores (there are no negative effect sizes). The plot is limited to z-scores below 6 (N = 99,985 out of 100,000). Z-scores above 6 standard deviations from zero are extremely unlikely to occur by chance. Even with a conservative estimate of effect size (lower bound of the 95% confidence interval), observed power is well above 99%. Moreover, particle physics uses Z = 5 as a criterion to claim a discovery (e.g., the discovery of the Higgs boson). Thus, z-scores above 6 can be expected to be highly replicable effects.
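The mixture simulation described above can be reproduced with standard-library tools (a sketch; the mapping from power to noncentrality uses the normal approximation, and the helper names are mine):

```python
import random
from statistics import NormalDist

random.seed(3)
nd = NormalDist()
z_crit = nd.inv_cdf(0.975)  # 1.96

def ncp_for_power(power):
    """Noncentrality that yields a given two-tailed power (normal
    approximation); power = .05 corresponds to a true null (ncp = 0)."""
    return 0.0 if power <= 0.05 else z_crit + nd.inv_cdf(power)

power_levels = [0.05, 0.33, 0.50, 0.66, 0.80]  # equal 20% weights
zs = [abs(random.gauss(ncp_for_power(p), 1))
      for p in power_levels
      for _ in range(20_000)]  # 100,000 observations in total

share_sig = sum(z > z_crit for z in zs) / len(zs)
print(round(share_sig, 2))  # close to the average true power of ~.47
```

Because each study's z-score is normal around its noncentrality, the share of significant results in the full (unselected) mixture simply recovers the average true power of the five conditions.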
Z-scores below 1.96 (the vertical dotted red line) are not significant by the standard criterion (p < .05, two-tailed). These values are excluded from the calculation of power because these results are either not reported or not interpreted as evidence for an effect. It is still important to realize that the true power of all experiments would be lower if these studies were included because many of the non-significant results are produced by studies with 33% power. These non-significant results create two problems: researchers wasted resources on studies with inconclusive results, and readers may be tempted to misinterpret these results as evidence that an effect does not exist (e.g., a drug does not have side effects) when an effect is actually present. In practice, it is difficult to estimate power for non-significant results because the size of the file-drawer is difficult to estimate.
It is possible to estimate power for any range of z-scores, but I prefer the range of z-scores from 2 (just significant) to 4. A z-score of 4 has a 95% confidence interval that ranges from 2 to 6. Thus, even if the observed effect size is inflated, there is still a high chance that a replication study would produce a significant result (Z > 2). Thus, all z-scores greater than 4 can be treated as cases with 100% power. The plot also shows that conclusions are unlikely to change by using a wider range of z-scores because most of the significant results correspond to z-scores between 2 and 4 (89%).
The typical power of studies is estimated based on the distribution of z-scores between 2 and 4. A steep decrease from left to right suggests low power. A steep increase suggests high power. If the peak (mode) of the distribution were centered over Z = 2.8, the data would conform to Cohen’s recommendation to have 80% power.
Using the known distribution of power to estimate power in the critical range gives a power estimate of 61%. A simpler model that assumes a fixed power value for all studies produces a slightly inflated estimate of 63%. Although the heterogeneous model is correct, the plot shows that the homogeneous model provides a reasonable approximation when estimates are limited to a narrow range of Z-scores. Thus, I used the homogeneous model to estimate the typical power of significant results reported in psychological journals.
The results presented below are based on an ongoing project that examines power in psychological journals (see the results section for the list of journals included so far). The set of journals does not include journals that primarily publish reviews and meta-analyses or clinical and applied journals. The data analysis is limited to the years 2009 to 2015 to provide information about the typical power in contemporary research. Results regarding historic trends will be reported in a forthcoming article.
I downloaded pdf files of all articles published in the selected journals and converted the pdf files to text files. I then extracted all t-tests and F-tests that were reported in the text of the results section searching for t(df) or F(df1,df2). All t and F statistics were converted into one-tailed p-values and then converted into z-scores.
The plot above shows the results based on 218,698 t and F tests reported between 2009 and 2015 in the selected psychology journals. Unlike the simulated data, the plot shows a steep drop for z-scores just below the threshold of significance (z = 1.96). This drop is due to the tendency not to publish or report non-significant results. The heterogeneous model uses the distribution of non-significant results to estimate the size of the file-drawer (unpublished non-significant results). However, for the present purpose the size of the file-drawer is irrelevant because power is estimated only for significant results for Z-scores between 2 and 4.
The green line shows the best-fitting estimate for the homogeneous model. The red curve shows the fit of the heterogeneous model. The heterogeneous model does a much better job of fitting the long tail of highly significant results, but for the critical interval of z-scores between 2 and 4, the two models provide similar estimates of power (55% homogeneous & 53% heterogeneous). If the range is extended to z-scores between 2 and 6, power estimates diverge (82% homogeneous, 61% heterogeneous). The plot indicates that the heterogeneous model fits the data better and that the 61% estimate is a better estimate of true power for significant results in this range. Thus, the results are in line with Cohen's (1962) estimate that psychological studies average 60% power.
The distribution of z-scores between 2 and 4 was used to estimate the average power separately for each journal. As power is the probability of obtaining a significant result, this measure estimates the replicability of results published in a particular journal if researchers reproduced the studies under identical conditions with the same sample size (exact replication). Thus, even though the selection criterion ensured that all tests produced a significant result (100% success rate), the replication rate is expected to be only about 50%, even if the replication studies successfully reproduce the conditions of the published studies. The table below shows the replicability ranking of the journals, the replicability score, and a grade. Journals are graded based on a scheme that is similar to grading schemes for undergraduate students (below 50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A).
The average value in 2010-2014 is 57 (D+). The average value in 2015 is 58 (D+). The correlation between the values in 2010-2014 and those in 2015 is r = .66. These findings show that the replicability scores are reliable and that journals differ systematically in the power of published studies.
The main limitation of the method is that it focuses on t and F-tests. The results might change when other statistics are included in the analysis. The next goal is to incorporate correlations and regression coefficients.
The second limitation is that the analysis does not discriminate between primary hypothesis tests and secondary analyses. For example, an article may find a significant main effect for gender, but the critical test is whether gender interacts with an experimental manipulation. It is possible that some journals have lower scores because they report more secondary analyses with lower power. To address this issue, it will be necessary to code articles in terms of the importance of each statistical test.
The ranking for 2015 is based on the currently available data and may change when more data become available. Readers should also avoid interpreting small differences in replicability scores as these scores are likely to fluctuate. However, the strong correlation over time suggests that there are meaningful differences in the replicability and credibility of published results across journals.
This article provides objective information about the replicability of published findings in psychology journals. None of the journals reaches Cohen’s recommended level of 80% replicability. Average replicability is just about 50%. This finding is largely consistent with Cohen’s analysis of power over 50 years ago. The publication of the first replicability analysis by journal should provide an incentive to editors to increase the reputation of their journal by paying more attention to the quality of the published data. In this regard, it is noteworthy that replicability scores diverge from traditional indicators of journal prestige such as impact factors. Ideally, the impact of an empirical article should be aligned with the replicability of the empirical results. Thus, the replicability index may also help researchers to base their own research on credible results that are published in journals with a high replicability score and to avoid incredible results that are published in journals with a low replicability score. Ultimately, I can only hope that journals will start competing with each other for a top spot in the replicability rankings and as a by-product increase the replicability of published findings and the credibility of psychological science.
Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).
Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce off the rim and go in (rattling in).
What is the probability that an NBA player with an 80% free throw percentage makes a free throw that is all net or rattles in? It is more likely that an NBA player with an 80% free throw average makes a perfect free throw because a free throw that rattles in could easily have bounced the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.
Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).
What does this have to do with science? A lot!
The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).
The concept of statistical power is similar to an NBA player’s free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.
Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume right now that a researcher conducts studies with 60% power. This means, researchers would be like NBA players with a 60% free-throw average.
Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is, researchers have to make two free throws in a row. With 60% power, the probability to get two significant results in a row is only 36% (60% * 60%). Moreover, many of the free throws that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in, and 30% all net.
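Under the normal model used later in this post (rattle-in = p between .050 and .005, all net = p < .005, two-tailed), these percentages can be derived directly. The helper below is my own sketch of that calculation.

```python
# Split study outcomes into miss / rattle-in / all-net for a given power.
from scipy.stats import norm

CRIT_05 = norm.ppf(1 - 0.05 / 2)    # 1.96
CRIT_005 = norm.ppf(1 - 0.005 / 2)  # ~2.81

def outcome_split(power):
    ncp = CRIT_05 - norm.ppf(1 - power)       # noncentrality for this power
    miss = norm.cdf(CRIT_05 - ncp)            # p > .05
    all_net = 1 - norm.cdf(CRIT_005 - ncp)    # p < .005
    return miss, 1 - miss - all_net, all_net  # miss, rattle-in, all-net

print([round(x, 2) for x in outcome_split(0.60)])  # close to the 40/30/30 split
print(round(0.60 * 0.60, 2))                       # 0.36: two successes in a row
```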
One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.
Imagine an NBA player could just go into a private room, throw two free throws and then report back how many free throws he made and the outcome of these free throws determine who wins game 7 in the playoff finals. Would you trust the player to tell the truth?
If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.
It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.
Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.
Based on the theory of statistical power developed by Neyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.
As with made free throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .0501 would not have been significant (it rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?
As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.
Imagine a scientist conducts studies with 80% power. The distribution of observed test statistics (e.g., z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the split among reported results is 37% rattling in vs. 63% all net. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all net and 75% rattle in.
One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the optimal true power to maximize the percentage of lucky outcomes is 66% power. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.050 < p < .005), and 34% clear successes (p < .005).
For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).
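These numbers can be verified numerically. The grid search below is a sketch of the optimization under the one-tailed normal model used throughout; the grid resolution is my choice.

```python
# Find the power level that maximizes the probability of a "lucky"
# significant result (.005 < p < .05, two-tailed).
import numpy as np
from scipy.stats import norm

CRIT_05 = norm.ppf(0.975)    # 1.96
CRIT_005 = norm.ppf(0.9975)  # ~2.81

ncp = np.linspace(0.0, 6.0, 2001)              # grid of noncentralities
power = 1 - norm.cdf(CRIT_05 - ncp)            # P(p < .05), one-tailed model
lucky = norm.cdf(CRIT_005 - ncp) - norm.cdf(CRIT_05 - ncp)

best = lucky.argmax()
print(round(power[best], 2))        # power that maximizes luck (~66%)
print(round(lucky[best], 3))        # maximum single-study luck (~32-33%)
print(round(lucky[best] ** 2, 3))   # two lucky results in a row (~10%)
```

The maximizing noncentrality sits halfway between the two critical values, which is why power of about 66% yields the largest rattle-in probability.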
Example: Unconscious Priming Influences Behavior
I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).
The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.
Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%
The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.
Although this study has been cited over 1,000 times, replication studies are rare.
One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.
Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.
The chance of obtaining three lucky successes in a row is only 3% (32% × 32% × 32%). Moreover, with a median power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.
The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.
This R-Index can be compared to several benchmarks.
An R-Index of 22% is consistent with the null-hypothesis being true and failed attempts are not reported.
An R-Index of 40% is consistent with 30% true power and all failed attempts are not reported.
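The R-Index arithmetic for the three studies above can be reproduced in a few lines. This is a sketch that computes observed power directly from each reported z-score, not the author's code.

```python
# R-Index for the three reported priming results (z = 2.66, 2.06, 2.35).
from statistics import median
from scipy.stats import norm

z_scores = [2.66, 2.06, 2.35]                  # studies 1-3 above
obs_power = [norm.cdf(z - 1.96) for z in z_scores]

med = median(obs_power)                        # median observed power, ~.65
success_rate = 1.0                             # 3 of 3 reported results significant
inflation = success_rate - med                 # ~.35
r_index = med - inflation                      # ~.30
print(round(r_index, 2))
```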
It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al. 2011, Doyen et al., 2011).
In conclusion, it is unlikely that Dr. Bargh’s original results were the only studies that they conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.” With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect and conducting multiple underpowered studies reduces the credibility of successes because the probability of this outcome to occur even when an effect is present decreases with each study (Schimmack, 2012). Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky and it is likely that many other researchers tried to replicate this sensational finding, but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.
Dr. Bargh also failed to realize how lucky he was to obtain his results, in his response to a published failed-replication study by Doyen. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted. He should demonstrate that he can replicate the result himself.
In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study. NBA players have to make free-throws in front of a large audience for a free-throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently. Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.
Lay people, undergraduate students, and textbook authors have a simple model of science. Researchers develop theories that explain observable phenomena. These theories are based on exploratory research or deduced from existing theories. Based on a theory, researchers make novel predictions that can be subjected to empirical tests. The gold-standard for an empirical test is an experiment, but when experiments are impractical, quasi-experiments or correlational designs may be used. The minimal design examines whether two variables are related to each other. In an experiment, a relation exists when an experimentally created variation produces variation in observations on a variable of interest. In a correlational study, a relation exists when two variables covary with each other. When empirical results show the expected covariation, the results are considered supportive of a theory and the theory lives another day. When the expected covariation is not observed, the theory is challenged. If repeated attempts fail to show the expected effect, researchers start developing a new theory that is more consistent with the existing evidence. In this model of science, all scientists are only motivated by the goal to build a theory that is most consistent with a robust set of empirical findings.
The Challenge of Probabilistic Predictions and Findings
I distinguish two types of science; the distinction maps onto the distinction between hard and soft sciences, but I think the key difference is whether theories are used to test deterministic relationships (i.e., relationships that hold in virtually every test of the phenomenon) or probabilistic relationships, where a phenomenon may be observed only some of the time. An example of deterministic science is chemistry, where combining hydrogen and oxygen atoms to form H2O reliably produces an explosion and water. An example of probabilistic science is a classic memory experiment, where more recent information is more likely to be remembered than more remote information, but memory is not deterministic and remote information is sometimes remembered better than recent information. A unique challenge for probabilistic science is the interpretation of empirical evidence, because it is possible to make two errors in the interpretation of empirical results. These errors are called type-I and type-II errors.
Type-I errors refer to the error that the data show a theoretically predicted result when the prediction is false.
Type-II errors refer to the error that the data do not show a theoretically predicted result when the prediction is correct.
There are many reasons why a particular study may produce misleading results. Most prominently, a study may have failed to control (experimentally or statistically) for confounding factors. Another reason could be that a manipulation failed or a measure failed to measure the intended construct. Aside from these practical problems in conducting an empirical study, type-I and type-II errors can still emerge even in the most carefully conducted study with perfect measures. The reason is that empirical results in tests of probabilistic hypotheses are influenced by factors that are not under the control of the experimenter. These causal factors are sometimes called random error, sampling error, or random sampling error. The main purpose of inferential statistics is to deal with type-I and type-II errors that are caused by random error. It is also possible to conduct statistical analyses without drawing conclusions from the results; these statistics are often called descriptive statistics. For example, it is possible to compute and report the mean and standard deviation of a measure, the mean difference between two groups, or the correlation between two variables in a sample. As long as these results are merely reported, they simply describe an empirical fact. They also do not test a theoretical hypothesis because scientific theories cannot make predictions about empirical results in a specific sample. Type-I or type-II errors occur when the empirical results are used to draw inferences about results in future studies, in the population, or about the truth of theoretical predictions.
Three Approaches to the Problem of Probabilistic Science
In the world of probabilities, there is no certainty, but there are different degrees of uncertainty. As the strength of empirical evidence increases, it becomes less likely that researchers make type-I or type-II errors. The main aim of inferential statistics is to provide objective and quantitative information about the probability that empirical data provide the correct information about the hypothesis; that is to avoid making a type-I or type-II error.
Statisticians have developed three schools of thought: Fisherian, Neyman-Pearson, and Bayesian statistics. The problem is that contemporary proponents of these approaches are still fighting about the right approach. As a prominent statistician noted, “the effect on statistics of having three (actually more) warring factions… has not been good for our professional image” (Berger, 2003, p. 4). He goes on to note that statisticians have failed to make “a concerted professional effort to provide the scientific world with a unified testing methodology.”
For applied statisticians the distinction between Fisher and Neyman-Pearson is of relatively little practical concern because both approaches rely on the null-hypothesis and p-values. Statistics textbooks often present a hybrid model of both approaches. The Fisherian approach is to treat p-values as a measure of the strength of evidence against the null-hypothesis. As p-values approach zero, it becomes less and less likely that the null-hypothesis is true. For example, imagine a researcher computes the correlation between height and weight in a sample of N = 10 participants. The correlation is r = .50. Given the small sample size, this extreme deviation from the null-hypothesis could still have occurred by chance. As the sample size increases, random factors can produce only smaller and smaller deviations from zero, and an observed correlation of r = .50 becomes less and less likely to have occurred as a result of random sampling error (oversampling tall and heavy participants and undersampling short and lightweight ones).
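The example can be made concrete with the standard t-test for a correlation coefficient; this illustration is mine, not part of the original text.

```python
# p-value of an observed correlation r as a function of sample size.
from math import sqrt
from scipy.stats import t as t_dist

def p_value_for_r(r, n):
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t-statistic for a correlation
    return 2 * t_dist.sf(abs(t), n - 2)      # two-tailed p-value

print(round(p_value_for_r(0.50, 10), 3))   # not significant with N = 10
print(round(p_value_for_r(0.50, 100), 8))  # overwhelming evidence with N = 100
```

The same r = .50 that could easily be a chance finding with N = 10 becomes extremely strong evidence against the null-hypothesis with N = 100.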
The main problem for Fisher’s approach is that it provides no guidelines about the size of a p-value that should be used to reject the null-hypothesis (there is no correlation) and thereby confirm the alternative (there is a correlation). Thus, p-values provide a quantitative measure of evidence against the null-hypothesis, but they do not provide a decision rule for how strong the evidence should be to conclude that the null-hypothesis is false. As such, one might argue that Fisher’s approach is not an inferential statistical approach because it does not spell out how researchers should interpret p-values. Without a decision rule, a p-value is just an objective statistic like a sample mean or standard deviation.
Neyman-Pearson solved the problem of inference by introducing a criterion value. The most common criterion value is p = .05. When the strength of the evidence against the null-hypothesis leads to a p-value less than .05, the null-hypothesis is rejected. When the p-value is above the criterion, the null-hypothesis is accepted. According to Berger (2003), Neyman-Pearson also advocated computing and reporting type-I and type-II error probabilities. Evidently, this suggestion has not been adopted in applied research, especially with regard to type-II error probabilities. The main reason for not adopting Neyman-Pearson’s recommendation is that the type-II error rate depends on an a priori assumption about the size of an effect. However, many hypotheses in the probabilistic sciences make only diffuse, qualitative predictions (e.g., height will be positively correlated with weight, but the correlation may range anywhere from r = .1 to .8). Applied researchers saw little value in computing type-II error rates that are based on subjective assumptions about the strength of an effect. Instead, they adopted the criterion approach of Neyman-Pearson, but they used the criterion only to decide that the null-hypothesis is false when the evidence was strong enough to reject it (p < .05). In contrast, when the evidence was not strong enough to reject the null-hypothesis, the results were considered inconclusive: the null-hypothesis could be true or the results were a type-II error. It was not important to determine whether the null-hypothesis was true or not because researchers were more interested in demonstrating causal relationships (a drug is effective) than in showing that something does not have an effect (a drug is not effective). By never ruling in favor of the null-hypothesis, researchers could never make a type-II error in the classical sense of falsely accepting the null-hypothesis.
In this context, the term type-II error assumed a new meaning. A type-II error now meant that the study had insufficient statistical power to demonstrate that the null-hypothesis was false. A study with more statistical power might be able to produce a p-value less than .05 and demonstrate that the null-hypothesis is false.
The appeal of the hybrid approach was that the criterion provided meaningful information about the type-I error and that the type-II error rate was zero because results were never interpreted as favoring the null-hypothesis. The problem of this approach is that it can never lead to the conclusion that an effect is not present. For example, it is only possible to demonstrate gender differences, but it is never possible to demonstrate that men and women do not differ from each other. The main problem with this one-sided testing approach was that non-significant results seemed unimportant because they were inconclusive and it seemed more important to report conclusive, significant results than inconclusive and insignificant results. However, if only significant results are reported, it is no longer clear how many of these significant results might be type-I errors (Sterling, 1959). If only significant results are reported, the literature will be biased and can contain an undetermined amount of type-I errors (false evidence for an effect when the null-hypothesis is true). However, this is not a problem of p-values. It is a problem of not reporting studies that failed to provide support for a hypothesis, which is needed to reveal type-I errors. As type-I errors would occur only at a rate of 1 out of 20, honest reporting of all studies would quickly reveal which significant results are type-I errors.
The Bayesian tradition is not a unified approach to statistical inference. The main common element of Bayesian statistics is to criticize p-values because they do not provide information about the probability that a hypothesis is true; p(H1|D). Bayesians argue that empirical scientists misinterpret p-values as estimates of the probability that a hypothesis is true, when they quantify merely the probability that the data could have been produced without an effect. The main aim of Bayesian statistics is to use the Bayes Theorem to obtain an estimate of p(H1|D) from the empirically observed data.
One piece of information needed for this estimate is the probability of the observed test statistic when the null-hypothesis is true, p(D|H0). This probability is closely related to p-values. Whereas the Bayesian p(D|H0) is the probability of obtaining a particular test statistic (e.g., a z-score of 1.65), p-values quantify the probability of obtaining a test statistic greater than the observed one (one-sided: p[z > 1.65] = .05; for the two-sided case, p[|z| > 1.96] = .05).
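The distinction between a tail probability (the p-value) and the probability of a particular test statistic can be checked numerically (a small illustrative sketch; the specific z-values follow the example above):

```python
from scipy.stats import norm

# one-sided tail probability: P(z > 1.65) is approximately .05
print(round(1 - norm.cdf(1.65), 3))

# two-sided tail probability: P(|z| > 1.96) is approximately .05
print(round(2 * (1 - norm.cdf(1.96)), 3))

# the height of the density at z = 1.65 is a different quantity,
# closer in spirit to the Bayesian p(D|H0) for that particular statistic
print(round(norm.pdf(1.65), 3))
```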
The problem is that estimating the probability that a hypothesis is true given an empirical result requires three more quantities that are not determined by the observed data: the prior probability that the null-hypothesis is true, p(H0), the prior probability that the alternative hypothesis is true, p(H1), and the probability that the data would have been observed if the alternative hypothesis is true, p(D|H1). One approach to this problem of unknowns is to use prior knowledge or empirical data to estimate these parameters. However, for many empirical studies there is very little reliable a priori information that could be used for this purpose.
A group of Bayesian psychologists has advocated an objective Bayesian approach to deal with the problem of unknown parameters in Bayes' Theorem (Wagenmakers et al., 2011). To deal with the problem that p(D|H1) is unknown, the authors advocate a default a priori probability distribution of effect sizes. The next step is to compute the ratio of p(D|H0) and p(D|H1). This ratio is called the Bayes-Factor. It follows from Bayes' Theorem that the posterior odds of the null-hypothesis, p(H0|D)/p(H1|D), increase as the Bayes-Factor, p(D|H0)/p(D|H1), increases, and decrease as the Bayes-Factor decreases. To quantify the posterior probabilities themselves, one would need to make assumptions about p(H0) and p(H1), but even without such assumptions it is clear that the ratio p(H0|D)/p(H1|D) is proportional to p(D|H0)/p(D|H1).
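The odds form of Bayes' Theorem can be sketched in a few lines (an illustrative example; the helper function and the specific numbers are hypothetical):

```python
def posterior_prob_h0(bf01, prior_h0=0.5):
    """Posterior probability of H0, given BF01 = p(D|H0)/p(D|H1) and a
    prior probability p(H0); p(H1) = 1 - p(H0). Hypothetical helper."""
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = bf01 * prior_odds      # Bayes' Theorem in odds form
    return posterior_odds / (1 + posterior_odds)

# with equal priors, BF01 = 3 implies p(H0|D) = .75;
# with a skeptical prior of p(H0) = .8, the same data imply p(H0|D) ~ .92
print(posterior_prob_h0(3), posterior_prob_h0(3, prior_h0=0.8))
```

The example illustrates the point in the text: the Bayes-Factor fixes the ratio of the posterior probabilities, but the probabilities themselves depend on the unknown priors.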
Bayes-Factors have two limitations. First, like p-values, Bayes-Factors alone are insufficient for inferential statistics because they only quantify the relative evidence for two competing hypotheses. It is not clear at which point the results of a study should be interpreted as evidence for one of the two hypotheses. For example, is a Bayes-Factor of 1.1, 2.5, 3, 10, or 100 sufficient to conclude that the null-hypothesis is true? The second problem is that the default function may not adequately characterize the alternative hypothesis. In this regard, Bayesian statistics have the same problem as Neyman-Pearson's approach, which required a priori assumptions about the effect size in order to compute type-II error rates. In Bayesian statistics, the a priori distribution of effect sizes influences the Bayes-Factor.
In response to the first problem, Bayesians often rely on conventional criterion values to make decisions based on empirical data. Commonly used criterion values are Bayes-Factors of 3 or 10. A decision rule is clearly implemented in Bayesian studies with optional stopping, where a Bayes-Factor of 10 or greater is used to justify terminating a study early. Bayes-Factors with a decision criterion create a new problem in that it is now possible to obtain inconclusive results as well as results that favor the null-hypothesis. As a result, there are now two types of type-II errors. Some type-II errors occur when the Bayes-Factor meets the criterion to accept the null-hypothesis although the null-hypothesis is false. Other type-II errors occur when the null-hypothesis is false and the data are inconclusive.
So far, Bayesian statisticians have not examined type-II error rates, arguing that Bayes-Factors do not require researchers to make decisions. However, without clear decision rules, Bayes-Factors are not very appealing to applied scientists because researchers, reviewers, editors, and readers need some rational criterion for decisions about publication and the planning of future studies. The best way to provide this information would be to examine how often Bayes-Factors of a certain magnitude lead to false conclusions; that is, to determine the type-I error rate and both kinds of type-II error rates (false acceptance of the null-hypothesis and inconclusive results) that are associated with a Bayes-Factor of a certain magnitude. This question has not been systematically examined.
The Bayesian Default T-Test
As noted above, there is no unified Bayesian approach to statistical inference. Thus, it is impossible to make general statements about Bayesian statistics. Here I focus on the statistical properties of the default Bayesian t-test (Rouder, Speckman, Sun, Morey, & Iverson, 2009). Most prominently, this test was used to demonstrate the superiority of Bayes-Factors over p-values with Bem’s (2011) controversial set of studies that seemed to support extrasensory perception.
The authors provide an R-package with a function that computes Bayes-Factors based on the observed t-statistic and the degrees of freedom. It is noteworthy that the Bayes-Factor is fully determined by the t-value, the degrees of freedom, and a default scaling parameter for the prior distribution. As t-values and degrees of freedom are also used to compute p-values, Bayes-Factors and p-values are closely related to each other. The main difference is that p-values have a constant meaning across sample sizes. That is, p = .04 has the same meaning in studies with N = 10, 100, or 1000 participants. In contrast, the Bayes-Factor for the same t-value changes as a function of sample size.
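This dependence on t, df, and the scaling parameter can be illustrated with a reimplementation of the default Bayes-Factor in Python (an illustrative sketch based on the published JZS formula, not the authors' R code; the BayesFactor R-package remains the authoritative implementation):

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n1, n2, r=0.707):
    """Illustrative sketch of the default (JZS) Bayes-Factor for an
    independent-groups t-test (Rouder et al., 2009). The Cauchy(0, r)
    prior on the effect size corresponds to an inverse-gamma(1/2, r^2/2)
    prior on the variance-ratio parameter g."""
    N = n1 * n2 / (n1 + n2)          # effective sample size
    df = n1 + n2 - 2
    # marginal likelihood under H0 (constant factors cancel in the ratio)
    log_m0 = -(df + 1) / 2 * np.log1p(t**2 / df)

    def integrand(g):
        # likelihood averaged over g, computed in log space for stability
        log_lik = (-0.5 * np.log1p(N * g)
                   - (df + 1) / 2 * np.log1p(t**2 / ((1 + N * g) * df)))
        log_prior = (np.log(r) - 0.5 * np.log(2 * np.pi)
                     - 1.5 * np.log(g) - r**2 / (2 * g))
        return np.exp(log_lik + log_prior)

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / np.exp(log_m0)

# for a fixed t-value, the Bayes-Factor changes with sample size
for n in (20, 50, 1000):
    print(n, round(jzs_bf10(1.0, n, n), 3))
```

For a fixed t = 1, the Bayes-Factor in favor of H1 shrinks as the sample size grows, which is exactly the sample-size dependence described in the quote below.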
“With smaller sample sizes that are insufficient to differentiate between approximate and exact invariances, the Bayes factors allows researchers to gain evidence for the null. This evidence may be interpreted as support for at least an approximate invariance. In very large samples, however, the Bayes factor allows for the discovery of small perturbations that negate the existence of an exact invariance.” (Rouder et al., 2009, p. 233)
This means that the same population effect size can produce three different outcomes depending on sample size; it may show evidence in favor of the null-hypothesis with a small sample size, it may show inconclusive results with a moderate sample size, and it may show evidence for the alternative hypothesis with a large sample size.
The ability to compute Bayes-Factors and p-values from t-values also implies that for a fixed sample size, p-values can be directly transformed into Bayes-Factors and vice versa. This makes it easy to directly compare the inferences that can be drawn from observed t-values for different p-values and Bayes-Factors.
The simulations used the default setting of a Cauchy distribution with a scale parameter of .707.
The x-axis shows potential effect sizes. The y-axis shows the weight attached to different effect sizes. The Cauchy distribution is centered over zero, giving the highest weight to an effect size of d = 0. As effect sizes increase, weights decrease. However, even effect sizes greater than d = .8 (a strong effect; Cohen, 1988) still have notable weights, and the distribution includes effect sizes above d = 2. It is important to keep in mind that Bayes-Factors express the relative strength of evidence for or against the null-hypothesis relative to the weighted average of the effect sizes implied by the default function. Thus, a Bayes-Factor can favor the null-hypothesis when the population effect size is small, because a small effect size is inconsistent with a prior distribution that assigns considerable weight to strong effect sizes.
The next figure shows Bayes-Factors as a function of p-values for an independent-groups t-test with n = 50 per condition. The black line shows the Bayes-Factor for H1 over H0. The red line shows the Bayes-Factor for H0 over H1. I show both ratios because I find it easier to compare Bayes-Factors greater than 1 than Bayes-Factors less than 1. The two lines cross when BF = 1, which is the point where the data favor both hypotheses equally.
The graph shows the monotonic relationship between Bayes-Factors and p-values. As p-values decrease, BF10 (favoring H1 over H0, black) increases. As p-values increase, BF01 (favoring H0 over H1, red) also increases. However, the shapes of the two curves are rather different. As p-values decrease, the black line stays flat for a long time; around p = .2 it begins to rise. It reaches a value of 3 just below a p-value of .05 (marked by the green line) and then increases quickly. This suggests that a Bayes-Factor of 3 corresponds roughly to a p-value of .05, whereas a Bayes-Factor of 10 would correspond to a more stringent p-value. The red curve has a different shape. Starting from the left, it rises rather quickly and then slows down as p-values move towards 1. BF01 crosses the red dotted line marking BF = 3 at around p = .3, but it never reaches a factor of 10 in favor of the null-hypothesis. Thus, using a criterion of BF = 3, p-values higher than .3 would be interpreted as evidence in favor of the null-hypothesis.
The next figure shows the same plot for different sample sizes.
The graph shows how the Bayes-Factor of H0 over H1 (red line) increases as a function of sample size. It also reaches the critical value of BF = 3 earlier and earlier. With n = 1000 in each group (total N = 2000) the default Bayesian test is very likely to produce strong evidence in favor of either H1 or H0.
The responsiveness of BF01 to sample size makes sense. As sample size increases, statistical power to detect smaller and smaller effects also increases. In the limit a study with an infinite sample size has 100% power. That means, when the whole population has been studied and the effect size is zero, the null-hypothesis has been proven. However, even the smallest deviation from zero in the population will refute the null-hypothesis because sampling error is zero and the observed effect size is different from zero.
The graph also shows that Bayes-Factors and p-values provide approximately the same information when H1 is true. Statistical decisions based on BF10 or p-values lead to the same conclusion for matching criterion values. The standard criterion of p = .05 corresponds approximately to BF10 = 3 and BF10 = 10 corresponds roughly to p = .005. Thus, Bayes-Factors are not less likely to produce type-I errors than p-values because they reflect the same information, namely how unlikely it is that the deviation from zero in the sample is simply due to chance.
The main difference between Bayes-Factors and p-values arises in the interpretation of non-significant results (p > .05, BF10 &lt; 3). The classic Neyman-Pearson approach would treat all non-significant results as evidence for the null-hypothesis, but would also try to quantify the type-II error rate (Berger, 2003). The Fisher-Neyman-Pearson hybrid approach treats all non-significant results as inconclusive and never decides in favor of the null-hypothesis. The default Bayesian t-test distinguishes between inconclusive results and those that favor the null-hypothesis. To distinguish between these two conclusions, it is necessary to postulate a criterion value. Using the same criterion that is used to rule in favor of the alternative hypothesis (p = .05 ~ BF10 = 3), a BF01 > 3 is a reasonable criterion to decide in favor of the null-hypothesis. Moreover, a more stringent criterion would not be useful in small samples, because BF01 can never reach values of 10 or higher. Thus, in small samples, the conclusion would always be the same as in the standard approach that treats all non-significant results as inconclusive.
Power, Type-I, and Type-II Error Rates of the Default Bayesian t-Test with BF = 3 as the Criterion Value
As demonstrated in the previous section, the results of a default Bayesian t-test depend on the amount of sampling error, which is fully determined by sample size in a between-subject design. The previous results also showed that the default Bayesian t-test has modest power to rule in favor of the null-hypothesis in small samples.
For the first simulation, I used a sample size of n = 50 per group (N = 100). The reason is that Wagenmakers and colleagues have conducted several pre-registered replication studies with a stopping rule when sample size reaches N = 100. The simulation examines how often a default t-test with 100 participants can correctly identify the null-hypothesis when the null-hypothesis is true. The criterion value was set to BF01 = 3. As the previous graph showed, this implies that any observed p-value from approximately p = .30 to 1 is considered to be evidence in favor of the null-hypothesis. The simulation with 10,000 t-tests produced 6,927 Bayes-Factors (BF01) greater than 3. This result is to be expected because p-values follow a uniform distribution when the null-hypothesis is true. Therefore, the p-value that corresponds to BF01 = 3 determines the rate of decisions in favor of the null. With p = .30 as the criterion value that corresponds to BF01 = 3, 70% of the p-values fall in the range from .30 to 1. 70% power may be deemed sufficient.
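The 70% figure follows directly from the uniform distribution of p-values under the null-hypothesis, which can be verified by simulation (an illustrative Python sketch of the same logic, using 5,000 simulated studies):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 5,000 simulated studies with n = 50 per group and a true effect of zero
g1 = rng.normal(size=(5_000, 50))
g2 = rng.normal(size=(5_000, 50))
p = stats.ttest_ind(g1, g2, axis=1).pvalue

# p-values are uniform under H0, so ~70% fall in the BF01 > 3 region (p > .30)
share_null = (p > .30).mean()
print(round(share_null, 3))
```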
The next question is how the default Bayesian t-test behaves when the null-hypothesis is false. The answer to this question depends on the actual effect size. I conducted three simulation studies. The first simulation examined effect sizes in the moderate to large range (d = .5 to .8). Effect sizes were uniformly distributed. With a uniform distribution of effect sizes, true power ranges from 70% to 97% with an average power of 87% for the traditional criterion value of p = .05 (two-tailed). Consistent with this power analysis, the simulation produced 8704 significant results. Using the BF10 = 3 criterion, the simulation produced 7405 results that favored the alternative hypothesis with a Bayes-Factor greater than 3. The power is slightly lower than for p=.05 because BF = 3 is a slightly stricter criterion. More important, the power of the test to show support for the alternative is about equal to the power to support the null-hypothesis; 74% vs. 70%, respectively.
The next simulation examined effect sizes in the small to moderate range (d = .2 to .5). Power ranges from 17% to 70% with an average power of 42%. Consistent with this prediction, the simulation study with 10,000 t-tests produced 4072 significant results with p &lt; .05 as criterion. With the somewhat stricter criterion of BF = 3, it produced only 2,434 results that favored the alternative hypothesis with BF > 3. More problematic is the finding that it favored the null-hypothesis (BF01 > 3) nearly as often, namely 2405 times. This means that in a between-subject design with 100 participants and a criterion value of BF = 3, the study has about 25% power to demonstrate that an effect is present, it will produce inconclusive results in 50% of all cases, and it will falsely support the null-hypothesis in 25% of all cases.
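These power values can be reproduced directly from the noncentral t-distribution (an illustrative Python sketch; `power_two_sample` is a hypothetical helper, not a function from the original analyses):

```python
import numpy as np
from scipy import stats

def power_two_sample(d, n, alpha=.05):
    """Two-sided power of an independent-groups t-test with n per group
    (illustrative helper based on the noncentral t-distribution)."""
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, nc)
            + stats.nct.cdf(-t_crit, df, nc))

d_grid = np.linspace(.2, .5, 301)            # uniform range of effect sizes
avg_power = power_two_sample(d_grid, 50).mean()
print(round(float(power_two_sample(.2, 50)), 2),   # ~.17
      round(float(power_two_sample(.5, 50)), 2),   # ~.70
      round(float(avg_power), 2))                  # ~.42
```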
Things get even worse when the true effect size is very small (d > 0, d < .2). In this case, power ranges from just over .05, the type-I error rate, to just under 17% for d = .2. The average power is just 8%. Consistent with this prediction, the simulation produced only 823 out of 10,000 significant results with the traditional p = .05 criterion. The stricter BF = 3 criterion favored the alternative hypothesis in only 289 out of 10,000 cases with a BF greater than 3. However, BF01 exceeded a value of 3 in 6201 cases. The remaining 3519 cases produced inconclusive results. In this case, the Bayes-Factor favored the null-hypothesis when it was actually false. The rate of false decisions in favor of the null-hypothesis is nearly as high as the power of the test to correctly identify the null-hypothesis (62% vs. 70%).
The previous analyses indicate that Bayes-Factors produce meaningful results when the power to detect an effect is high, but that Bayes-Factors are at risk of falsely favoring the null-hypothesis when power is low. The next simulation directly examined the relationship between power and Bayes-Factors. The simulation used effect sizes in the range from d = .001 to d = .8 with N = 100. This creates a range of power from 5% to 97% with an average power of 51%.
In this figure, red data points show BF01 and blue data points show BF10. The right side of the figure shows that high-powered studies provide meaningful information about the population effect size as BF10 tend to be above the criterion value of 3 and BF01 are very rarely above the criterion value of 3. In contrast, on the left side, the results are misleading because most of the blue data points are below the criterion value of 3 and many BF01 data points are above the criterion value of BF = 3.
What about the probability of the data when the default alternative hypothesis is true?
A Bayes-Factor is defined as the ratio of two probabilities: the probability of the data when the null-hypothesis is true and the probability of the data when the null-hypothesis is false. As such, Bayes-Factors combine information about two hypotheses, but it can be informative to examine each hypothesis separately. What is the probability of the data when the null-hypothesis is true, and what is the probability of the data when the alternative hypothesis is true? To examine this, I computed p(D|H1) by dividing the p-values by BF01 (using BF01 = p(D|H0)/p(D|H1), with the p-value standing in for p(D|H0)) for t-values in the range from 0 to 5.
As Bayes-Factors are sensitive to sample size (degrees of freedom), I repeated the analysis with N = 40 (n = 20), N = 100 (n = 50), and N = 200 (n = 100).
The most noteworthy aspect of the figure is that p-values (the black line, p(D|H0)) are much more sensitive to changes in t-values than the probabilities of the data given the alternative hypothesis (yellow N = 40, orange N = 100, red N = 200). The reason is the diffuse nature of the alternative hypothesis. It always includes a hypothesis that predicts the test statistic, but it also includes many other hypotheses that make other predictions. This makes the relationship between the observed test statistic, t, and the probability of t given the diffuse alternative hypothesis rather flat. The figure also shows that p(D|H0) and p(D|H1) both decrease monotonically as t-values increase. The reason is that the default prior distribution has its mode over 0. Thus, it also predicts that an effect size of 0 is the most likely outcome. It is therefore not a real alternative hypothesis that predicts a different effect size. It is merely a function that has a more muted relationship to the observed t-values. As a result, it is less compatible with low t-values and more compatible with high t-values than the steeper function for the point-null hypothesis.
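A direct way to compute both quantities (instead of dividing p-values by BF01) is to average the noncentral t-density over the default Cauchy prior (an illustrative Python sketch; the prior is truncated at |d| = 6, where the likelihood is essentially zero for the t-values considered here):

```python
import numpy as np
from scipy import integrate, stats

def likelihoods(t, n=50, r=0.707):
    """p(D|H0) and p(D|H1) for an observed t-value (n per group).
    H1 is the default Cauchy(0, r) prior on the effect size d,
    truncated at |d| = 6 for numerical convenience."""
    df = 2 * n - 2
    sqrt_N = np.sqrt(n / 2)          # sqrt of the effective sample size
    p_h0 = stats.t.pdf(t, df)
    # p(D|H1): average the noncentral-t density over the prior on d
    p_h1, _ = integrate.quad(
        lambda d: stats.nct.pdf(t, df, d * sqrt_N) * stats.cauchy.pdf(d, scale=r),
        -6, 6)
    return p_h0, p_h1

for t in (0, 1, 2, 3):
    h0, h1 = likelihoods(t)
    print(t, round(h0, 4), round(h1, 4))
```

Between t = 0 and t = 3, p(D|H0) drops steeply while p(D|H1) changes far less, which is the flatness described above; their ratio is the Bayes-Factor.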
Do we need Bayes-Factors to Provide Evidence in Favor of the Null-Hypothesis?
A common criticism of p-values is that they can only provide evidence against the null-hypothesis, but that they can never demonstrate that the null-hypothesis is true. Bayes-Factors have been advocated as a solution to this alleged problem. However, most researchers are not interested in testing the null-hypothesis. They want to demonstrate that a relationship exists. There are many reasons why a study may fail to produce the expected effect. However, when the predicted effect emerges, p-values can be used to rule out (with a fixed error probability) that the effect emerged simply as a result of chance alone.
Nevertheless, non-Bayesian statistics could also be used to examine whether a null-hypothesis is true without the need to construct diffuse priors or to compare the null-hypothesis to an alternative hypothesis. The approach is so simple that it is hard to find sources that explain it. Let’s assume that a researcher wants to test the null-hypothesis that Bayesian statisticians and other statisticians are equally intelligent. The researcher recruits 20 Bayesian statisticians and 20 frequentist statisticians and administers an IQ test. The Bayesian statisticians have an average IQ of 130 points. The frequentists have an average IQ of 125 points. The standard deviation of IQ scores on this IQ test is 15 points. Moreover, it has been shown that IQ scores are approximately normally distributed. Thus, the standard error of the difference between the two means is 15 * sqrt(1/20 + 1/20) = 4.74 ≈ 5. The figure below shows the distribution of difference scores under the assumption that the null-hypothesis is true. The red lines show the 95% confidence interval. The observed 5 point difference is well within the 95% confidence interval. Thus, the result is consistent with the null-hypothesis that there is no difference in intelligence between the two groups. Of course, a 5 point difference is one-third of a standard deviation, but the sample size is simply too small to infer from the data that the null-hypothesis is false.
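The arithmetic of this example can be checked in a few lines (an illustrative sketch of the calculation described above):

```python
import math

sd, n = 15, 20                        # SD of IQ scores, participants per group
se_diff = sd * math.sqrt(1/n + 1/n)   # standard error of the mean difference
ci = (-1.96 * se_diff, 1.96 * se_diff)

print(round(se_diff, 2))              # ~4.74, i.e. roughly 5 IQ points
print(round(ci[0], 1), round(ci[1], 1))
# the observed 5-point difference falls inside this interval, so the
# result is consistent with the null-hypothesis
```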
A more stringent test of the null-hypothesis would require a larger sample. A frequentist researcher assumes that only a difference of 5 points or more would be meaningful and conducts a power analysis. She finds that a study with 143 participants in each group (N = 286) is needed to have 80% power to detect a difference of 5 points. A non-significant result would then suggest that the difference is smaller than 5 points or that a type-II error occurred, which has a 20% probability. The study yields a mean of 128 for frequentists and 125 for Bayesians. The 3 point difference is not significant. As a result, the data support the null-hypothesis that Bayesians and frequentists do not differ in intelligence by more than 5 points. A more stringent test of equality or invariance would require an even larger sample. There is no magic Bayesian bullet that can test a precise null-hypothesis in small samples.
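The required sample size can be found by scanning n until the noncentral-t power calculation reaches 80% (an illustrative sketch; dedicated tools such as G*Power give the same kind of answer):

```python
import numpy as np
from scipy import stats

d = 5 / 15          # a 5-point difference on a scale with SD = 15
n = 2
while True:
    df = 2 * n - 2
    nc = d * np.sqrt(n / 2)                       # noncentrality parameter
    t_crit = stats.t.ppf(0.975, df)               # two-sided alpha = .05
    power = (1 - stats.nct.cdf(t_crit, df, nc)
             + stats.nct.cdf(-t_crit, df, nc))
    if power >= 0.80:
        break
    n += 1

print(n, 2 * n)     # ~143 per group, N ~ 286
```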
Ignoring Small Effects is Rational: Parsimony and Occam’s Razor
Another common criticism of p-values is that they are prejudiced against the null-hypothesis because it is always possible to get a significant result simply by increasing sample size. With N = 1,000,000, a study has 95% power to detect even an effect size of d = .007. The argument is that it is meaningless to demonstrate significance in smaller samples if it is certain that significance can always be obtained in a larger sample. The argument is flawed because it is simply not true that p-values will eventually produce a significant result as sample sizes increase. P-values will only produce significant results when a true effect exists. When the null-hypothesis is true, an honest test of the hypothesis will only produce as many significant results as the type-I error criterion specifies. Moreover, Bayes-Factors are no solution to this problem. When a true effect exists, they will also favor the alternative hypothesis, no matter how small the effect is, when sample sizes are large enough to provide sufficient power. The only difference is that Bayes-Factors may falsely accept the null-hypothesis in smaller samples.
The more interesting argument against p-values is not that significant results in large studies are type-I errors, but that these results are practically meaningless. To make this point, statistics books often distinguish statistical significance and practical significance and warn that statistically significant results in large samples may have little practical significance. This warning was useful in the past when researchers would only report p-values (e.g., women have higher verbal intelligence than men, p &lt; .05). The p-value says nothing about the size of the effect. When only the p-value is available, it makes sense to assume that significant results in smaller samples reflect larger effects, because only large effects can reach significance in these samples. However, large effects can also be significant in large samples, and large effects in small studies can be inflated by sampling error. Thus, the notion of practical significance is outdated and should be replaced by questions about effect sizes. Neither p-values nor Bayes-Factors provide information about the size of the effect or the practical implications of a finding.
How can p-values be useful when there is clear evidence of a replication crisis?
Bem (2011) conducted 10 studies to demonstrate experimental evidence for anomalous retroactive influences on cognition and affect. His article reports 9 significant results and one marginally significant result. Subsequent studies have failed to replicate this finding. Wagenmakers et al. (2011) used Bem’s results as an example to highlight the advantages of Bayesian statistics. The logic was that p-values are flawed and that Bayes-Factors would have revealed that Bem’s (2011) evidence was weak. There are several problems with Wagenmakers et al.’s (2011) Bayesian analysis of Bem’s data.
First, the reported results differ from those of the default Bayesian t-test implemented on Dr. Rouder’s website (http://pcl.missouri.edu/bf-one-sample). The reason is that Bayes-Factors depend on the scaling factor of the Cauchy distribution. Wagenmakers et al. (2011) used a scaling factor of 1, whereas the online app uses .707 as the default. The choice of a scaling parameter gives researchers some degrees of freedom. Researchers who favor the null-hypothesis can choose a larger scaling factor, which makes the alternative hypothesis more extreme and easier to reject when observed effects are small. Smaller scaling factors make the Cauchy distribution narrower, which makes it easier to show evidence in favor of the alternative hypothesis with smaller effects. The behavior of Bayes-Factors for different scaling parameters is illustrated in Table 1 with Bem’s data.
Experiment 7 is highlighted because Bem (2011) already interpreted the non-significant result in this study as evidence that the effect disappears with supraliminal stimuli; that is, visible stimuli. The Bayes-Factor would support Bem’s (2011) conclusion that Experiment 7 shows evidence that the effect does not exist under this condition. The other studies essentially produced inconclusive Bayes-Factors, especially for the online default-setting with a scaling factor of .707. The only study that produced clear evidence for ESP was experiment 9. This study had the smallest sample size (N = 50), but a large effect size that was twice the effect size in the other studies. Of course, this difference is not reliable due to the small sample size, but it highlights how sensitive Bayes-Factors are to sampling error in small samples.
Another important feature of the Bayesian default t-test is that it centers the alternative hypothesis over 0. That is, it assigns the highest probability to the null-hypothesis, which is somewhat odd as the alternative hypothesis states that an effect should be present. The justification for this default setting is that the actual magnitude of the effect is unknown. However, it is typically possible to formulate an alternative hypothesis that allows for uncertainty, while predicting that the most likely outcome is a non-null effect size. This is especially true when previous studies provide some information about expected effect sizes. In fact, Bem (2011) explicitly planned his study with the expectation that the true effect size is small, d ~ .2. Moreover, it was demonstrated above that the default t-test is biased against small effects. Thus, the default Bayesian t-test with a scaling factor of 1 does not provide a fair test of Bem’s hypothesis against the null-hypothesis.
It is possible to use the default t-test to examine how consistent the data are with Bem’s (2011) a priori prediction that the effect size is d = .2. To do this, the null-hypothesis can be formulated as d = .2 and t-values can be computed as deviations from a population parameter d = .2. In this case, the null-hypothesis represents Bem’s (2011) a priori prediction, and the alternative prediction is that observed effect sizes will deviate from this prediction because the effect is smaller (or larger). The next table shows the results for the Bayesian t-test that tests H0: d = .2 against a diffuse alternative H1: Cauchy-distribution centered over d = .2. Results are presented as BF01 so that Bayes-Factors greater than 3 indicate support for Bem’s (2011) prediction.
The Bayes-Factor supports Bem’s prediction in all tests. Choosing a wider alternative this time provides even stronger support for Bem’s prediction because the data are very consistent with the point prediction of a small effect size, d = .2. Moreover, even Experiment 7 now shows support for the hypothesis because an effect size of d = .09 is still more likely to have occurred when the effect size is d = .2 than for a wide range of other effect sizes. Finally, Experiment 9 now shows the weakest support for the hypothesis. The reason is that Bem used only 50 participants in this study and the effect size was unusually large. This produced a low p-value in a test against zero, but it also produced the largest deviation from the a priori effect size of d = .2. However, this is to be expected in a small sample with large sampling error. Thus, the results are still supportive, but the evidence is rather weak compared to studies with larger samples and effect sizes close to d = .2.
The results demonstrate that Bayes-Factors cannot be interpreted as evidence for or against a specific hypothesis. They are influenced by the choice of the hypotheses that are being tested. In contrast, p-values have a consistent meaning. They quantify how probable it is that random sampling error alone could have produced a deviation between an observed sample parameter and a postulated population parameter. Bayesians have argued that this information is irrelevant and does not provide useful information for the testing of hypotheses. Although it is true that p-values do not quantify the probability that a hypothesis is true when significant results were observed, Bayes-Factors also do not provide this information. Moreover, Bayes-Factors are simply a ratio of two probabilities that compare two hypotheses against each other, but usually only one of the hypotheses is of theoretical interest. Without a principled and transparent approach to the formulation of alternative hypotheses, Bayes-Factors have no meaning and will change depending on different choices of the alternatives. The default approach aims to solve this by using a one-size-fits-all solution to the selection of priors. However, inappropriate priors will lead to invalid results and the diffuse Cauchy-distribution never fits any a priori theory.
Statisticians have been fighting for supremacy for decades. Like civilians in a war, empirical scientists have suffered from this fight because they have been bombarded by propaganda and criticized for misunderstanding statistics or using the wrong statistics. In reality, the statistical approaches are all related to each other, and they all rely on the ratio of the observed effect size to sampling error (i.e., the signal-to-noise ratio) to draw inferences from observed data about hypotheses. Moreover, all statistical inferences are subject to the rule that studies with less sampling error provide more robust empirical evidence than studies with more sampling error. The biggest challenge for empirical researchers is to optimize the allocation of resources so that each study has high statistical power to produce a significant result when an effect exists. With high statistical power, p-values are likely to be small (with 80% power, there is a 50% chance of obtaining a p-value of .005 or lower), and Bayes-Factors and p-values provide virtually the same information for matching criterion values when an effect is present. High power also implies a relatively low frequency of type-II errors, which makes it more likely that a non-significant result occurred because the hypothesis is wrong. Thus, planning studies with high power is important no matter whether data are analyzed with frequentist or Bayesian statistics.
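The claim that 80% power implies a 50% chance of obtaining p = .005 or lower can be checked by simulation (an illustrative Python sketch, using the 143-per-group design from the earlier power example, which has roughly 80% power for d = 1/3):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# a design with ~80% power: d = 1/3, n = 143 per group
n, d, sims = 143, 1 / 3, 5_000
g1 = rng.normal(d, 1, size=(sims, n))
g2 = rng.normal(0, 1, size=(sims, n))
p = stats.ttest_ind(g1, g2, axis=1).pvalue

print(round((p < .05).mean(), 3))     # close to .80, the design's power
print(round(float(np.median(p)), 4))  # close to .005: half of all p-values
                                      # are at least this small
```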
Studies that aim to demonstrate the lack of an effect or an invariance (there is no difference in intelligence between Bayesian and frequentist statisticians) need large samples to demonstrate invariance or have to accept that there is a high probability that a larger study would find a reliable difference. Bayes-Factors do not provide a magical tool to provide strong support for the null-hypothesis in small samples. In small samples Bayes-Factors can falsely favor the null-hypothesis even when effect sizes are in the moderate to large range.
In conclusion, like p-values, Bayes-Factors are not wrong. They are mathematically defined entities. However, when p-values or Bayes-Factors are used by empirical scientists to interpret their data, it is important that the numeric results are interpreted properly. False interpretation of Bayes-Factors is just as problematic as false interpretation of p-values. Hopefully, this blog post provided some useful information about Bayes-Factors and their relationship to p-values.
Yuan, K.-H., & Maxwell, S. (2005). On the post hoc power in testing mean differences. Journal of Educational and Behavioral Statistics, 141–167.
This blog post provides an accessible introduction to the concept of observed power. Most of the statistical points are based on Yuan and Maxwell's (2005) excellent but highly technical article about post-hoc power. This blog post tries to explain the statistical concepts in more detail and uses simulation studies to illustrate important points.
What is Power?
Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results. The exact numbers in an actual set of studies will vary as a function of random sampling error, just like 100 coin flips are not always going to produce a 50:50 split of heads and tails. However, as the number of studies increases, the percentage of significant results will get ever closer to the power of a specific study.
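This long-run frequency interpretation can be illustrated with a small simulation. The sketch below is not from the article; it assumes a known-variance z-test with a one-tailed criterion of 1.96, draws each study's z-score from a normal distribution centered on the non-centrality parameter implied by the true power, and counts how often the result is significant.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def simulate_significance_rate(true_power, n_studies):
    """Simulate exact replications and return the share of significant results."""
    # Invert power = 1 - Phi(1.96 - ncp) to get the non-centrality parameter
    ncp = 1.96 + norm.ppf(true_power)
    z = rng.normal(loc=ncp, scale=1.0, size=n_studies)
    return np.mean(z > 1.96)

print(simulate_significance_rate(0.50, 100))      # close to .50, varies by chance
print(simulate_significance_rate(0.50, 100_000))  # converges toward .50
```

With only 100 studies the observed rate bounces around 50%; with 100,000 studies it settles very close to the true power, just as the coin-flip analogy suggests.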
A priori power
Power analysis can be useful for the planning of sample sizes before a study is being conducted. A power analysis that is being conducted before a study is called a priori power analysis (before = a priori). Power is a function of three parameters: the actual effect size, sampling error, and the criterion value that needs to be exceeded to claim statistical significance. In between-subject designs, sampling error is determined by sample size alone. In this special case, power is a function of the true effect size, the significance criterion and sample size.
The problem for researchers is that power depends on the effect size in the population (e.g., the true correlation between height and weight amongst Canadians in 2015). The population effect size is sometimes called the true effect size. Imagine that somebody would actually obtain data from everybody in a population. In this case, there is no sampling error and the correlation is the true correlation in the population. However, typically researchers use much smaller samples and the goal is to estimate the correlation in the population on the basis of a smaller sample. Unfortunately, power depends on the correlation in the population, which is unknown to a researcher planning a study. Therefore, researchers have to estimate the true effect size to compute an a priori power analysis.
Cohen (1988) developed general guidelines for the estimation of effect sizes. For example, in studies that compare the means of two groups, a standardized difference of half a standard deviation (e.g., 7.5 IQ points on an IQ scale with a standard deviation of 15) is considered a moderate effect. Researchers who assume that their predicted effect has a moderate effect size can use d = .5 for an a priori power analysis. Assuming that they want to claim significance with the standard criterion of p < .05 (two-tailed), they would need N = 210 (n = 105 per group) to have a 95% chance to obtain a significant result (G*Power). I do not discuss a priori power analysis further because this blog post is about observed power. I merely introduced a priori power analysis to highlight the difference between a priori power analysis and a posteriori power analysis, which is the main topic of Yuan and Maxwell's (2005) article.
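The same a priori power analysis can be reproduced without G*Power; this sketch uses the statsmodels library instead, which implements power analysis for the independent-samples t-test.

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group for d = .5, alpha = .05 (two-tailed), 95% power
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05,
                                   power=0.95, alternative='two-sided')
print(round(n_per_group))  # ~105 per group, i.e., N = 210 in total
```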
A Posteriori Power Analysis: Observed Power
Observed power computes power after a study or several studies have been conducted. The key difference between a priori and a posteriori power analysis is that a posteriori power analysis uses the observed effect size in a study as an estimate of the population effect size. For example, assume a researcher found a correlation of r = .54 in a sample of N = 200 Canadians. Instead of guessing the effect size, the researcher uses the correlation observed in this sample as an estimate of the correlation in the population. There are several reasons why it might be interesting to conduct a power analysis after a study. First, the power analysis might be used to plan a follow up or replication study. Second, the power analysis might be used to examine whether a non-significant result might be the result of insufficient power. Third, observed power is used to examine whether a researcher used questionable research practices to produce significant results in studies that had insufficient power to produce significant results.
In sum, observed power is an estimate of the power of a study based on the observed effect size in that study. It is therefore not power that is being observed, but the effect size. Because the other parameters needed to compute power are known (sample size, significance criterion), the observed effect size is the only parameter that needs to be estimated from the data. Importantly, observed power does not mean that power was actually observed: power depends on the effect size in the population, which remains unobserved, and the observed effect size in a sample is merely an estimate of the population effect size.
A Posteriori Power Analysis after a Single Study
Yuan and Maxwell (2005) examined the statistical properties of observed power. The main question was whether it is meaningful to compute observed power based on the observed effect size in a single study.
The first statistical analysis of an observed mean difference is to examine whether the study produced a significant result. For example, the study may have examined whether music lessons produce an increase in children's IQ. The study had 95% power to produce a significant difference with N = 210 participants and a moderate effect size (d = .5; 7.5 IQ points).
One possibility is that the study actually produced a significant result. For example, the observed IQ difference was 4.5 IQ points. This is less than the expected difference of 7.5 points and corresponds to a standardized effect size of d = .3. Yet, the t-test shows a significant difference between the two groups, t(208) = 2.17, p = .031 (about 1/32). The p-value shows that random sampling error alone would produce differences of this magnitude or more in only about 1 out of 32 studies. Importantly, the p-value only makes it likely that the intervention contributed to the mean difference; it does not provide information about the size of the effect. The true effect size may be closer to the expected effect size of 7.5 or it may be closer to 0. The true effect size remains unknown even after the mean difference between the two groups is observed. Yet, the study provides some useful information about the effect size. Whereas the a priori power analysis relied exclusively on guesswork, observed power uses the effect size that was observed in a reasonably large sample of 210 participants. Everything else being equal, effect size estimates based on 210 participants are more likely to match the true effect size than those based on 0 participants.
The observed effect size can be entered into a power analysis to compute observed power. In this example, observed power with an effect size of d = .3 and N = 210 (n = 105 per group) is 58%. One question examined by Yuan and Maxwell (2005) is whether it can be useful to compute observed power after a study produced a significant result.
The other question is whether it can be useful to compute observed power when a study produced a non-significant result. For example, assume that the estimate of d = .5 is overly optimistic and that the true effect size of music lessons on IQ is a more modest 1.5 IQ points (d = .10, one-tenth of a standard deviation). The actual mean difference that is observed after the study happens to match the true effect size exactly. The difference between the two groups is not statistically significant, t(208) = .72, p = .47. A non-significant result is difficult to interpret. On the one hand, the means trend in the right direction. On the other hand, the mean difference is not statistically significant. The p-value suggests that a mean difference of this magnitude would occur in every second study by chance alone even if the music intervention had no effect on IQ at all (i.e., the true effect size is d = 0 and the null-hypothesis is true). Statistically, the correct conclusion is that the study provided insufficient information regarding the influence of music lessons on IQ. In other words, assuming that the true effect size is closer to the observed effect size in the sample (d = .1) than to the effect size that was used to plan the study (d = .5), the sample size was insufficient to produce a statistically significant result. Computing observed power merely provides some quantitative information to reinforce this correct conclusion. An a posteriori power analysis with d = .1 and N = 210 yields an observed power of 11%. This suggests that the study had insufficient power to produce a significant result, if the effect size in the sample matches the true effect size.
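The two observed-power calculations above can be checked in code; this sketch again substitutes statsmodels for G*Power.

```python
from statsmodels.stats.power import TTestIndPower

# Observed power for N = 210 (n = 105 per group), alpha = .05 two-tailed
power = TTestIndPower().power
print(round(power(effect_size=0.3, nobs1=105, alpha=0.05), 2))  # ~.58
print(round(power(effect_size=0.1, nobs1=105, alpha=0.05), 2))  # ~.11
```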
Yuan and Maxwell (2005) discuss false interpretations of observed power. One false interpretation is that a significant result implies that a study had sufficient power. Power is a function of the true effect size, whereas observed power relies on the effect size in a sample. Half of the time, sample effect sizes overestimate the true effect size, and observed power is inflated. It is therefore possible that observed power is considerably higher than the actual power of a study.
Another false interpretation is that low power in a study with a non-significant result means that the hypothesis is correct, but that the study had insufficient power to demonstrate it. The problem with this interpretation is that there are two potential reasons for a non-significant result. One of them, is that a study had insufficient power to show a significant result when an effect is actually present (this is called the type-II error). The second possible explanation is that the null-hypothesis is actually true (there is no effect). A non-significant result cannot distinguish between these two explanations. Yet, it remains true that the study had insufficient power to test these hypotheses against each other. Even if a study had 95% power to show an effect if the true effect size is d = .5, it can have insufficient power if the true effect size is smaller. In the example, power decreased from 95% assuming d = .5, to 11% assuming d = .1.
Yuan and Maxwell’s Demonstration of Systematic Bias in Observed Power
Yuan and Maxwell focus on a design in which a sample mean is compared against a population mean and the standard deviation is known. To modify the original example, a researcher could recruit a random sample of children, administer a music lesson intervention, and test the IQ after the intervention against the population mean of 100 with the population standard deviation of 15, rather than relying on the standard deviation in a sample as an estimate of the standard deviation. This scenario has some advantages for mathematical treatments because it uses the standard normal distribution. However, all conclusions can be generalized to more complex designs. Thus, although Yuan and Maxwell focus on an unusual design, their conclusions hold for more typical designs such as the comparison of two groups that use sample variances (standard deviations) to estimate the variance in a population (i.e., pooling observed variances in both groups to estimate the population variance).
Yuan and Maxwell (2005) also focus on one-tailed tests, although the default criterion in actual studies is a two-tailed test. Once again, this is not a problem for their conclusions because the two-tailed criterion value for p = .05 is equivalent to the one-tailed criterion value for p = .025 (.05 / 2). For the standard normal distribution, the value is z = 1.96. This means that an observed z-score has to exceed a value of 1.96 to be considered significant.
To illustrate this with an example, assume that the mean IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt(n). With a sample size of n = 100, sampling error is .10. The test statistic (z) is the ratio of the effect size and sampling error (.2 / .1 = 2). A z-score of 2 is just above the critical value of 1.96 and would produce a significant result, z = 2, p = .023 (one-tailed; remember that the criterion is .025 one-tailed to match .05 two-tailed). Based on this result, a researcher would be justified to reject the null-hypothesis (there is no effect of the intervention) and to claim support for the hypothesis that music lessons lead to an increase in IQ. Importantly, this hypothesis makes no claim about the true effect size. It merely states that the effect is greater than zero. The observed effect size in the sample (d = .2) provides an estimate of the actual effect size, but the true effect size can be smaller or larger than the effect size in the sample. The significance test merely rejects the possibility that the effect size is 0 or less (i.e., music lessons lower IQ).
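The arithmetic of this z-test example can be spelled out in a few lines:

```python
from math import sqrt
from scipy.stats import norm

mean_iq, pop_mean, pop_sd, n = 103, 100, 15, 100
d = (mean_iq - pop_mean) / pop_sd   # effect size: 3/15 = .2
se = 1 / sqrt(n)                    # sampling error: .10
z = d / se                          # test statistic: 2
p_one_tailed = 1 - norm.cdf(z)
print(z, round(p_one_tailed, 3))    # z = 2.0, p ~ .023
```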
Entering a non-centrality parameter of 3 for a generic z-test in G*power yields the following illustration of a non-central distribution.
Illustration of non-central distribution using G*Power output
The red curve shows the standard normal distribution for the null-hypothesis. With d = 0, the non-centrality parameter is also 0 and the standard normal distribution is centered over zero.
The blue curve shows the non-central distribution. It is the same standard normal distribution, but now it is centered over z = 3. The distribution shows how z-scores would be distributed for a set of exact replication studies, where exact replication studies are defined as studies with the same true effect size and sampling error.
The figure also illustrates power by showing the critical z-score of 1.96 with a green line. On the left side are studies where sampling error reduced the observed effect size so much that the z-score was below 1.96 and produced a non-significant result (p > .025 one-tailed, p > .05 two-tailed). On the right side are studies with significant results. The area under the curve on the left side is the type-II error (or beta error) probability. The area under the curve on the right side is power (1 – type-II error). The output shows that the beta error probability is 15% and power is 85%.
In sum, the formula

Power(d) = 1 – Φ(zcrit – d√n)

states that power for a given true effect size is the area under the curve to the right side of a critical z-score for a standard normal distribution that is centered over the non-centrality parameter, which is defined by the ratio of the true effect size over sampling error (d / (1/√n) = d√n).
[personal comment: I find it odd that sampling error is used on the right side of the formula but not on the left side. Power is a function of the non-centrality parameter and not just the effect size. Thus, I would have included sqrt(n) on the left side of the formula as well.]
Because the formula relies on the true effect size, it specifies true power given the (unknown) population effect size. To use it for observed power, power has to be estimated based on the observed effect size in a sample.
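The power formula described above is easy to implement; this sketch assumes the known-variance design with a one-tailed criterion of .025 (critical z = 1.96) and reproduces the G*Power illustration with a non-centrality parameter of 3.

```python
from math import sqrt
from scipy.stats import norm

def power_from_d(d, n, z_crit=1.96):
    """Power = 1 - Phi(z_crit - d * sqrt(n)); d * sqrt(n) is the
    non-centrality parameter (effect size / sampling error)."""
    ncp = d * sqrt(n)
    return 1 - norm.cdf(z_crit - ncp)

# d = .3 with n = 100 gives a non-centrality parameter of 3
print(round(power_from_d(0.3, 100), 2))  # ~.85, matching the figure
```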
The important novel contribution of Yuan and Maxwell (2005) was to develop a mathematical formula that relates observed power to true power and to find a mathematical formula for the bias in observed power.
The formula implies that the amount of bias is a function of the unknown population effect size. Yuan and Maxwell make several additional observations about bias. First, bias is zero when true power is 50%. The second important observation is that systematic bias is never greater than 9 percentage points. The third observation is that power is overestimated when true power is less than 50% and underestimated when true power is above 50%. The last observation has important implications for the interpretation of observed power.
50% power implies that the test statistic matches the criterion value. For example, if the criterion is p < .05 (two-tailed), 50% power is equivalent to p = .05. If observed power is less than 50%, a study produced a non-significant result. A posteriori power analysis might suggest that observed power is only 40%. This finding suggests that the study was underpowered and that a more powerful study might produce a significant result. Systematic bias implies that the estimate of 40% is more likely to be an overestimation than an underestimation. As a result, bias does not undermine the conclusion. Rather observed power is conservative because the actual power is likely to be even less than 40%.
The alternative scenario is that observed power is greater than 50%, which implies a significant result. In this case, observed power might be used to argue that a study had sufficient power because it did produce a significant result. A posteriori power analysis might show, however, that observed power is only 60%. This would indicate that there was a relatively high chance to end up with a non-significant result. However, systematic bias implies that observed power is more likely to underestimate true power than to overestimate it. Thus, true power is likely to be higher. Again, observed power is conservative when it comes to the interpretation of power for studies with significant results. This would suggest that systematic bias is not a serious problem for the use of observed power. Moreover, the systematic bias is never more than 9 percentage points. Thus, systematic bias alone cannot shift an observed power estimate of 60% by more than 9 percentage points.
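The direction of the bias can be checked with a small simulation. This is my own sketch, not from the article: it assumes the known-variance z-test design, draws observed non-centrality parameters around the true value, and averages the resulting observed power values.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
Z_CRIT = 1.96  # one-tailed .025 criterion

def mean_observed_power(true_ncp, n_sim=200_000):
    """Average observed power across simulated exact replications."""
    observed_ncp = rng.normal(true_ncp, 1.0, n_sim)  # observed effect / SE
    return np.mean(1 - norm.cdf(Z_CRIT - observed_ncp))

low = Z_CRIT + norm.ppf(0.30)   # non-centrality parameter for 30% true power
high = Z_CRIT + norm.ppf(0.85)  # non-centrality parameter for 85% true power
print(round(mean_observed_power(low), 2))   # above .30: overestimation
print(round(mean_observed_power(high), 2))  # below .85: underestimation
```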
In sum, Yuan and Maxwell (2005) provided a valuable analysis of observed power and demonstrated analytically the properties of observed power.
Practical Implications of Yuan and Maxwell’s Findings
Based on their analyses, Yuan and Maxwell (2005) draw the following conclusions in the abstract of their article.
Using analytical, numerical, and Monte Carlo approaches, our results show that the estimated power does not provide useful information when the true power is small. It is almost always a biased estimator of the true power. The bias can be negative or positive. Large sample size alone does not guarantee the post hoc power to be a good estimator of the true power.
Unfortunately, other scientists often only read the abstract, especially when an article contains mathematical formulas that applied scientists find difficult to follow. As a result, Yuan and Maxwell’s (2005) article has been cited mostly as evidence that observed power is a useless concept. I think this conclusion is justified based on Yuan and Maxwell’s abstract, but it does not follow from Yuan and Maxwell’s formula of bias. To make this point, I conducted a simulation study that paired 25 sample sizes (n = 10 to n = 250) and 20 effect sizes (d = .05 to d = 1) to create 500 non-centrality parameters. Observed effect sizes were randomly generated for a between-subject design with two groups (df = n*2 – 2). For each non-centrality parameter, two simulations were conducted for a total of 1000 studies with heterogeneous effect sizes and sample sizes (standard errors). The results are presented in a scatterplot with true power on the x-axis and observed power on the y-axis. The blue line shows the prediction of observed power from true power. The red curve shows the biased prediction based on Yuan and Maxwell’s bias formula.
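A minimal sketch of this kind of simulation is shown below. It is simplified relative to the one reported here: it uses the normal approximation to the two-group design (SE of d ≈ sqrt(2/n)) and draws one observed study per condition.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
Z_CRIT = 1.96

# 20 effect sizes (.05 to 1) crossed with 25 sample sizes (n = 10 to 250)
true_d = np.repeat(np.linspace(0.05, 1.0, 20), 25)
n = np.tile(np.arange(10, 260, 10), 20)       # n per group
se = np.sqrt(2 / n)                           # approximate SE of d, two groups

true_power = 1 - norm.cdf(Z_CRIT - true_d / se)
observed_d = rng.normal(true_d, se)           # one simulated study per cell
observed_power = 1 - norm.cdf(Z_CRIT - observed_d / se)

# Observed power scatters widely around true power but tracks it overall
print(round(float(np.corrcoef(true_power, observed_power)[0, 1]), 2))
```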
The most important observation is that observed power varies widely as a function of random sampling error in the observed effect sizes. In comparison, the systematic bias is relatively small. Moreover, observed power at the extremes clearly distinguishes between low-powered (< 25%) and high-powered (> 80%) studies. Observed power is particularly informative when it is close to the maximum value of 100%. Thus, observed power of 99% or more strongly suggests that a study had high power. The main problem for a posteriori power analysis is that observed effect sizes are imprecise estimates of the true effect size, especially in small samples. The next section examines the consequences of random sampling error in more detail.
Standard Deviation of Observed Power
Awareness has been increasing that point estimates of statistical parameters can be misleading. For example, an effect size of d = .8 suggests a strong effect, but if this effect size was observed in a small sample, the estimate is strongly influenced by sampling error. One solution to this problem is to report a confidence interval around the observed effect size. The half-width of the common 95% confidence interval is sampling error times 1.96 (approximately 2). With sampling error of .4, the confidence interval around d = .8 ranges all the way from about 0 to 1.6. As a result, it would be misleading to claim that an effect size of d = .8 in a small sample shows that the true effect size is strong. A 95% confidence interval means that, across repeated samples, 95% of the intervals constructed in this way will contain the population effect size.
To illustrate the use of confidence intervals, I computed the confidence interval for the example of music training and IQ in children. The example assumes that the mean IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt(n). With a sample size of n = 100, sampling error is .10. To compute a 95% confidence interval, sampling error is multiplied with the z-score that captures 95% of a standard normal distribution, which is 1.96. As sampling error is .10, the margin of error is .196. Given an observed effect size of d = .2, the 95% confidence interval ranges from .2 – .196 = .004 to .2 + .196 = .396.
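The confidence-interval arithmetic for this example is straightforward:

```python
from math import sqrt

d, n = 0.2, 100
se = 1 / sqrt(n)                           # sampling error = .10
lower = d - 1.96 * se
upper = d + 1.96 * se
print(round(lower, 3), round(upper, 3))    # .004 and .396
```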
A confidence interval can be used for significance testing by examining whether it includes 0. If the 95% confidence interval does not include zero, it is possible to reject the hypothesis that the effect size in the population is 0, which is equivalent to rejecting the null-hypothesis. In the example, the confidence interval ends at d = .004, which implies that the null-hypothesis can be rejected. At the upper end, the confidence interval ends at d = .396. This implies that the empirical results would also reject hypotheses that the population effect size is moderate (d = .5) or strong (d = .8).
Confidence intervals around effect sizes are also useful for a posteriori power analysis. Yuan and Maxwell (2005) demonstrated that the confidence interval for observed power is obtained by computing observed power at the limits of the confidence interval around the observed effect size.
The figure below illustrates the observed power for the lower bound of the confidence interval in the example of music lessons and IQ (d = .004).
The figure shows that the non-central distribution (blue) and the central distribution (red) nearly perfectly overlap. The reason is that the observed effect size (d = .004) is just slightly above the d-value of the central distribution when the effect size is zero (d = .000). When the null-hypothesis is true, power equals the type-I error rate (2.5%) because 2.5% of studies will produce a significant result by chance alone and chance is the only factor that produces significant results. When the true effect size is d = .004, power increases to 2.74 percent.
Remember that this power estimate is based on the lower limit of the 95% confidence interval around the observed effect size, and that the observed effect size itself corresponds to an observed power estimate of roughly 50%. Thus, this result means that true power can be as low as 2.7% when observed power is 50%.
The next figure illustrates power for the upper limit of the 95% confidence interval.
In this case, the non-central distribution and the central distribution overlap very little. Only 2.5% of the non-central distribution is on the left side of the criterion value, and power is 97.5%. This finding means that true power can be as high as 97.5% when observed power is 50%.
Taken together, these results show that the 95% confidence interval around an observed power estimate of 50% ranges from 2.7% to 97.5%. As this interval covers pretty much the full range of possible values, it follows that observed power of 50% in a single study provides virtually no information about the true power of that study. True power can be anywhere between 2.7% and 97.5%.
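The calculations behind these bounds can be reproduced in a few lines; this sketch assumes the known-variance z-test design from the example (one-tailed criterion of .025, n = 100) and evaluates power at the observed effect size and at both limits of its confidence interval.

```python
from math import sqrt
from scipy.stats import norm

n, z_crit = 100, 1.96
for d in (0.004, 0.2, 0.396):  # CI lower limit, observed d, CI upper limit
    power = 1 - norm.cdf(z_crit - d * sqrt(n))
    print(d, round(power, 3))
# d = .004 -> ~.027, d = .2 -> ~.52, d = .396 -> ~.977
```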
The next figure illustrates confidence intervals for different levels of power.
The data are based on the same simulation as in the previous simulation study. The green line is based on computation of observed power for the d-values that correspond to the 95% confidence interval around the observed (simulated) d-values.
The figure shows that confidence intervals for most observed power values are very wide. Accurate estimates of true power can only be achieved when power is high (upper right corner). Even 80% true power still has a wide confidence interval, with a lower bound below 20% observed power. Firm conclusions can only be drawn when observed power is high.
For example, when observed power is 95%, a one-sided 95% confidence interval (guarding only against underestimation) has a lower bound of 50% power. This finding would imply that observing power of 95% justifies the conclusion that the study had at least 50% power with an error rate of 5% (i.e., in 5% of the studies the true power is less than 50%).
The implication is that observed power is useless unless observed power is 95% or higher.
In conclusion, consideration of the effect of random sampling error on effect size estimates provides justification for Yuan and Maxwell’s (2005) conclusion that computation of observed power provides relatively little value. However, the reason is not that observed power is a problematic concept. The reason is that observed effect sizes in underpowered studies provide insufficient information to estimate observed power with any useful degree of accuracy. The same holds for the reporting of observed effect sizes that are routinely reported in research reports and for point estimates of effect sizes that are interpreted as evidence for small, moderate, or large effects. None of these statements are warranted when the confidence interval around these point estimates is taken into account. A study with d = .80 and a confidence interval of d = .01 to 1.59 does not justify the conclusion that a manipulation had a strong effect because the observed effect size is largely influenced by sampling error.
In conclusion, studies with large sampling error (small sample sizes) are at best able to determine the sign of a relationship. Significant positive effects are likely to be positive, and significant negative effects are likely to be negative. However, the effect sizes in these studies are too strongly influenced by sampling error to provide information about the population effect size, and thus about parameters that depend on accurate estimation of the population effect size, like power.
Meta-Analysis of Observed Power
One solution to the problem of insufficient information in a single underpowered study is to combine the results of several underpowered studies in a meta-analysis. A meta-analysis reduces sampling error because sampling error creates random variation in effect size estimates across studies and aggregation reduces the influence of random factors. If a meta-analysis of effect sizes can produce more accurate estimates of the population effect size, it would make sense that meta-analysis can also increase the accuracy of observed power estimation.
Yuan and Maxwell (2005) discuss meta-analysis of observed power only briefly.
A problem in a meta-analysis of observed power is that observed power is not only subject to random sampling error, but also systematically biased. As a result, the average of observed power across a set of studies would also be systematically biased. However, the reason for the systematic bias is the non-symmetrical distribution of observed power when power is not 50%. To avoid this systematic bias, it is possible to compute the median. The median is unbiased because 50% of the non-central distribution is on the left side of the non-centrality parameter and 50% is on the right side of the non-centrality parameter. Thus, the median provides an unbiased estimate of the non-centrality parameter and the estimate becomes increasingly accurate as the number of studies in a meta-analysis increases.
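The claim that the median removes the systematic bias can be checked with a simulation. This sketch is my own illustration under the known-variance z-test assumptions: it simulates 1,000 exact replications of a study with 30% true power and compares the median and the mean of observed power.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
Z_CRIT = 1.96
true_power = 0.30
true_ncp = Z_CRIT + norm.ppf(true_power)   # non-centrality for 30% power

# 1,000 exact replications: observed ncp scatters around the true ncp
observed_ncp = rng.normal(true_ncp, 1.0, size=1000)
observed_power = 1 - norm.cdf(Z_CRIT - observed_ncp)

print(round(float(np.median(observed_power)), 2))  # close to .30 (unbiased)
print(round(float(np.mean(observed_power)), 2))    # inflated above .30
```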
The next figure shows the results of a simulation with the same 500 studies (25 sample sizes and 20 effect sizes) that were simulated earlier, but this time each study was simulated to be replicated 1,000 times and observed power was estimated by computing the average or the median power across the 1,000 exact replication studies.
Purple = average observed power; Orange = median observed power
The simulation shows that Yuan and Maxwell’s (2005) bias formula predicts the relationship between true power and the average of observed power. It also confirms that the median is an unbiased estimator of true power and that observed power is a good estimate of true power when the median is based on a large set of studies. However, the question remains whether observed power can estimate true power when the number of studies is smaller.
The next figure shows the results for a simulation where estimated power is based on the median observed power in 50 studies. The maximum discrepancy in this simulation was 15 percentage points. This is clearly sufficient to distinguish low powered studies (<50% power) from high powered studies (>80%).
To obtain confidence intervals for median observed power estimates, the power estimate can be converted into the corresponding non-centrality parameter of a standard normal distribution. Because the standard deviation of a standard normal distribution equals 1, the half-width of the 95% confidence interval around this non-centrality parameter is 1.96 divided by the square root of the number of studies (k). Hence, the 95% confidence interval for a set of studies is defined by

Lower Limit = Normal(InverseNormal(power) – 1.96 / sqrt(k))

Upper Limit = Normal(InverseNormal(power) + 1.96 / sqrt(k))
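These formulas translate directly into code (Normal is the standard normal CDF, InverseNormal its inverse):

```python
from math import sqrt
from scipy.stats import norm

def median_power_ci(power, k):
    """95% CI for true power given median observed power across k studies."""
    half_width = 1.96 / sqrt(k)
    lower = norm.cdf(norm.ppf(power) - half_width)
    upper = norm.cdf(norm.ppf(power) + half_width)
    return lower, upper

# 50 studies with a median observed power of 50%
lo, hi = median_power_ci(0.50, 50)
print(round(lo, 2), round(hi, 2))  # ~.39 to ~.61
```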
Interestingly, the number of observations within each study is irrelevant; only the number of studies matters. The reason is that larger samples produce smaller confidence intervals around an effect size estimate and increase power at the same time. To hold power constant, the effect size has to decrease as the sample size increases. As a result, observed power estimates do not become more precise when sample sizes increase and effect sizes decrease proportionally.
The next figure shows simulated data for 1000 studies with 20 effect sizes (0.05 to 1) and 25 sample sizes (n = 10 to 250). Each study was repeated 50 times and the median value was used to estimate true power. The green lines are the 95% confidence interval around the true power value. In real data, the confidence interval would be drawn around observed power, but observed power does not provide a clear mathematical function. The 95% confidence interval around the true power values is still useful because it predicts how much observed power estimates can deviate from true power. 95% of observed power values are expected to fall within the area that is defined by the lower and upper bounds of the confidence interval. The figure shows that most values are within this area. This confirms that sampling error in a meta-analysis of observed power is a function of the number of studies. The figure also shows that sampling error is greatest when power is 50%. In the tails of the distribution, range restriction (power is bounded below by the type-I error rate and above by 1) produces precise estimates more quickly.
With 50 studies, the maximum absolute discrepancy is 15 percentage points. This level of precision is sufficient to draw broad conclusions about the power of a set of studies. For example, any median observed power estimate below 65% is sufficient to reveal that a set of studies had less power than Cohen’s recommended level of 80% power. A value of 35% would strongly suggest that a set of studies was severely underpowered.
Yuan and Maxwell (2005) provided a detailed statistical examination of observed power. They concluded that observed power typically provides little to no useful information about the true power of a single study. The main reason for this conclusion was that sampling error in studies with low power is too large to estimate true power with sufficient precision. A precise estimate of power can be obtained only when sampling error is small and effect sizes are large. In this case, power is near the maximum value of 1 and observed power correctly estimates true power as being close to 1. Thus, observed power can be useful when it suggests that a study had high power.
Yuan and Maxwell (2005) also showed that observed power is systematically biased unless true power is 50%. The amount of bias is relatively small, and even without this systematic bias, the amount of random error is so large that observed power estimates based on a single study cannot be trusted.
Unfortunately, Yuan and Maxwell’s (2005) article has been misinterpreted as evidence that observed power calculations are inherently biased and useless. However, observed power can provide useful and unbiased information in a meta-analysis of several studies. First, a meta-analysis can provide unbiased estimates of power because the median value is an unbiased estimator of power. Second, aggregation across studies reduces random sampling error, just like aggregation across studies reduces sampling error in meta-analyses of effect sizes.
The demonstration that median observed power provides useful information about true power is important because observed power has become a valuable tool in the detection of publication bias and other biases that lead to inflated estimates of effect sizes. Starting with Sterling, Rosenbaum, and Weinkam's (1995) seminal article, observed power has been used by Ioannidis and Trikalinos (2007), Schimmack (2012), Francis (2012), Simonsohn (2014), and van Assen, van Aert, and Wicherts (2014) to draw inferences about a set of studies with the help of a posteriori power analysis. The methods differ in the way observed data are used to estimate power, but they all rely on the assumption that observed data provide useful information about the true power of a set of studies. This blog post shows that Yuan and Maxwell's (2005) critical examination of observed power does not undermine the validity of statistical approaches that rely on observed data to estimate power.
This blog post focused on meta-analyses of exact replication studies that have the same population effect size and the same sample size (sampling error). It also assumed that the set of studies is representative. An important challenge for future research is to examine the statistical properties of observed power when power varies across studies (heterogeneity) and when publication bias and other biases are present. A major limitation of existing methods is that they assume a fixed population effect size (Ioannidis & Trikalinos, 2007; Francis, 2012; Simonsohn, 2014; van Assen, van Aert, & Wicherts, 2014). At present, the Incredibility Index (Schimmack, 2012) and the R-Index (Schimmack, 2014) have been proposed as methods for sets of studies that are biased and heterogeneous. An important goal for future research is to evaluate these methods in simulation studies with heterogeneous and biased sets of data.
John, Loewenstein, and Prelec (2012) distinguish between fraud and QRPs. Fraud is typically limited to cases in which researchers create false data. In contrast, QRPs typically involve the exclusion of data that are inconsistent with a theoretical hypothesis. QRPs are treated differently than fraud because QRPs can sometimes be used for legitimate purposes.
For example, a data entry error may produce a large outlier that leads to a non-significant result when all data are included in the analysis. The results are significant when the outlier is removed. Statistical textbooks often advise excluding outliers for this reason. However, removal of outliers becomes a QRP when it is used selectively; that is, when outliers are retained whenever a result is significant without their removal, but removed whenever their removal helps to get a significant result.
The use of QRPs is damaging because published results provide false impressions about the replicability of empirical results and misleading evidence about the size of an effect.
Below is a list of QRPs.
Selective reporting of (dependent) variables. For example, a researcher may include 10 items to measure depression. Typically, the 10 items are averaged to get the best measure of depression. However, if this analysis does not produce a significant result, the researcher can conduct analyses of each individual item or average items that trend in the right direction. By creating different dependent variables after the study is completed, a researcher increases the chances of obtaining a significant result that will not replicate in a replication study with the same dependent variable.
A simple way to prevent this QRP is to ask authors to use well-established measures as dependent variables and/or to ask for pre-registration of all measures that are relevant to the test of a theoretical hypothesis (i.e., it is not necessary to specify that the study also asked about handedness because handedness is not a measure of depression).
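A rough simulation illustrates this QRP; the sample size, item count, and seed below are arbitrary assumptions. With no true effect, declaring success if any of 10 item-level tests is significant inflates the false-positive rate far above the nominal 5%.

```python
# Selective reporting of dependent variables under the null:
# test each of 10 items separately instead of the pre-specified
# scale average, and count a "success" if any test has p < .05.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
n, items, sims = 30, 10, 2000
false_pos = 0
for _ in range(sims):
    group1 = rng.normal(0, 1, (n, items))  # null is true: no group difference
    group2 = rng.normal(0, 1, (n, items))
    pvals = [ttest_ind(group1[:, i], group2[:, i]).pvalue for i in range(items)]
    false_pos += min(pvals) < 0.05
print(false_pos / sims)  # far above .05 (≈ .40 with 10 independent items)
```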
Deciding whether to collect more data after looking to see whether the results will be significant. It is difficult to distinguish random variation from a true effect in small samples. At the same time, it can be a costly waste of resources (or even unethical in animal research) to conduct studies with large samples when the effect can be detected in a smaller sample. It is also difficult to know a priori how large a sample should be to obtain a significant result. It therefore seems reasonable to check data for significance while they are being collected. If an effect does not seem to be present in a reasonably large sample, it may be better to abandon the study. None of these practices are problematic unless a researcher constantly checks for significance and stops data collection immediately once the data show a significant result. This practice capitalizes on sampling error, because the experiment will typically stop at the moment when sampling error inflates the observed effect size.
A simple solution to this problem is to set a priori rules about the end of data collection. For example, a researcher may calculate sample size based on a rough power analysis. Based on the optimistic assumption that the true effect is large, the data are checked when the study has 80% power for a large effect (d = .8). If this does not produce a significant result, the researcher continues with the revised hypothesis that the true effect is moderate and checks the data again when 80% power for a moderate effect is reached. If this still does not produce a significant result, the researcher may give up or continue with the revised hypothesis that the true effect is small. This procedure allows researchers to use an optimal amount of resources. Moreover, they can state their sampling strategy openly so that meta-analysts can correct for the small amount of bias that is still introduced by this reasonable form of optional stopping.
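The difference between this disciplined procedure and unconstrained optional stopping can be sketched in a small simulation (all settings below are illustrative): peeking repeatedly under the null and stopping at the first p < .05 inflates the type-I error rate well above 5%.

```python
# Unconstrained optional stopping under the null: check significance
# after every 5 added observations per group and stop at the first
# p < .05. The null is true, so any "hit" is a false positive.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
sims, n_min, n_max = 1000, 10, 100
hits = 0
for _ in range(sims):
    a = rng.normal(0, 1, n_max)
    b = rng.normal(0, 1, n_max)                 # no true effect
    for n in range(n_min, n_max + 1, 5):        # repeated peeking
        if ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            hits += 1
            break
print(hits / sims)  # well above the nominal .05
```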
Failing to disclose experimental conditions. There are no justifiable reasons for the exclusion of conditions. Evidently, researchers are not going to exclude conditions that are consistent with theoretical predictions. So, the exclusion of conditions can only produce results that are overly consistent with theoretical predictions. If there are reasonable doubts about a condition (e.g., a manipulation check shows that it did not work), the condition can be included and it can be explained why the results may not conform to predictions.
A simple explanation for conditions with unexpected results is that researchers include too many conditions in their design. A 2 x 2 x 2 factorial design has 8 cells, which allows for 28 comparisons of means. What are the chances that all of these 28 comparisons produce results that are consistent with theoretical predictions?
Another simple solution is to avoid the use of statistical methods with low power. To demonstrate a three-way interaction requires a lot more data than to demonstrate that a pattern of means is consistent with an a priori theoretically predicted pattern.
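The arithmetic behind the 2 x 2 x 2 example is simple; the 80% power per predicted comparison below is an assumed figure, not one from the text.

```python
# 8 cells allow C(8, 2) = 28 pairwise mean comparisons. Even with
# 80% power for each predicted comparison, the chance that all 28
# come out as predicted is tiny.
from math import comb

cells = 2 * 2 * 2
comparisons = comb(cells, 2)
print(comparisons)                     # → 28
print(round(0.80 ** comparisons, 4))   # → 0.0019
```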
In a paper, selectively reporting studies that worked.
There is no reason for excluding studies that did not work. Studies that were planned as demonstrations of an effect need to be reported, even when they failed. Otherwise the published evidence provides an overly positive picture of the robustness of a phenomenon, and effect sizes are inflated.
Just like failed conditions, failed studies can be reported if there is a plausible explanation why it failed whereas other studies worked. However, to justify this claim, it should be demonstrated that the effects in failed and successful studies are really significantly different (a significant moderator effect). If this is not the case, there is no reason to treat failed and successful studies as different from each other.
A simple solution to this problem is to conduct studies with high statistical power, because the main reason for failed studies is low power. If a study has only 30% power, only one out of three studies will produce a significant result. The other two studies are likely to produce a type-II error (failing to show a significant result even though the effect exists). Rather than throwing away the two failed studies, a researcher should have conducted one study with higher power. Another solution is to report all three studies and to test for significance only in a meta-analysis across the three studies.
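A back-of-the-envelope illustration of this point, using a z-test approximation (the 30% power per study comes from the text; everything else is my own assumption):

```python
# Three studies with 30% power each versus one study with the
# combined (tripled) sample size.
from scipy.stats import norm

z_crit = norm.ppf(0.975)
power_small = 0.30
ncp = z_crit + norm.ppf(power_small)        # noncentrality giving 30% power
print(round(power_small ** 3, 3))           # chance all three are significant
print(round(1 - (1 - power_small) ** 3, 3)) # chance at least one is
# tripling the sample size multiplies the noncentrality by sqrt(3)
power_big = norm.cdf(ncp * 3 ** 0.5 - z_crit)
print(round(power_big, 2))                  # one pooled study: ≈ .70
```

With 30% power per study, the chance that all three succeed is under 3%, while a single study with the pooled sample reaches roughly 70% power.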
In a paper, rounding down a p-value just above .05 (e.g., reporting p = .054 as p < .05). This is a minor problem. It is silly to change a p-value, but it does not bias a meta-analysis of effect sizes because researchers do not change effect size information. Moreover, it would be even sillier not to round the p-value and to conclude that there is no effect, as is often done when results are not significant. After all, a p-value of .054 means that a result this extreme would occur only 5.4% of the time if the true effect were zero.
If a type-I error probability of 5.4% is considered too high, it would be possible to collect more data and test again with a larger sample (taking multiple testing into account).
Moreover, this problem should arise very infrequently. Even if a study is underpowered and has only 50% power, only about 1–2% of p-values are expected to fall into the narrow range between .050 and .054.
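This expectation can be checked with a z-test approximation (50% true power assumed, as above); the exact fraction comes out at roughly 1.3%.

```python
# Fraction of two-sided p-values falling between .050 and .054
# when true power is 50% (i.e., the noncentrality equals z_crit).
from scipy.stats import norm

z_crit = norm.ppf(0.975)                    # two-sided .05 threshold
ncp = z_crit                                # ncp = 1.96 gives 50% power
# z-scores corresponding to two-sided p = .054 and p = .050
z_low, z_high = norm.ppf(1 - 0.054 / 2), z_crit
frac = norm.cdf(z_high - ncp) - norm.cdf(z_low - ncp)
print(round(frac, 3))  # → 0.013
```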
In a paper, reporting an unexpected finding as having been predicted from the start. I am sure some statisticians disagree with me, and I may be wrong about this one, but I simply do not understand how a statistical analysis of data cares about the expectations of a researcher. Say I analyze some data and find a significant effect. How can this effect be influenced by the way I report it later? It either is a type-I error or it is not, but my expectations have no influence on the causal processes that produced the empirical data. I think the practice of writing up exploratory studies as if they were conducted with an a priori hypothesis is considered questionable because it often requires other QRPs (e.g., excluding additional tests that did not work) to produce a story that is concocted to explain unexpected results. However, if the results are presented honestly and one out of five predictor variables in a multiple regression is significant at p < .0001, it is likely to be a replicable finding, even if it is presented with a post-hoc prediction.
In a paper, claiming that results are unaffected by demographic variables (e.g., gender) when one is actually unsure (or knows that they are affected). Again, this is a relatively minor point because it only concerns potential moderators of a reported effect. Moderation is important, but the conclusion about the main effect remains unchanged. For example, if an effect exists for men but not for women, it is still true that on average there is an effect. Furthermore, a more common mistake is to claim that gender or other factors did not moderate an effect based on an underpowered comparison of 10 men and 30 women in a study with 40 participants. Thus, false claims about moderating variables are annoying, but not a threat to the replicability of empirical results.
Falsifying data. I personally do not include falsifying or fabricating data in the list of questionable research practices. Falsification and fabrication of data are not research practices. They are also clearly considered fraudulent and punished when discovered. In contrast, questionable research practices are tolerated in many scientific communities, and there are no clear guidelines against their use.
In conclusion, the most problematic research practices that undermine the replicability of published studies are selective reporting of dependent variables, conditions, or entire studies, and optional stopping when significance is reached. These practices make it possible to produce significant results when a study has insufficient power. However, to achieve significance without power, the type-I error rate also increases and replicability decreases. John et al. (2012) aptly compared these QRPs to the use of doping in sports. I consider the R-Index a doping test for science because it reveals that researchers used these QRPs. I hope that the R-Index will discourage the use of QRPs and increase the power and replicability of published studies.
Whether scientific organizations should ban QRPs just like sports organizations ban doping is an interesting question. Meanwhile the R-Index can be used without draconian consequences. Researchers can self-examine the replicability of their findings and they can examine the replicability of published results before they invest resources, time, and the future of their graduate students in research projects that fail. Granting agencies can use the R-Index to reward researchers who conduct fewer studies with replicable results rather than researchers with many studies that fail to replicate. Finally, the R-Index can be used to track how successful current initiatives are to increase the replicability of published studies.
A previous blog examined how and why Dr. Förster’s data showed incredibly improbable linearity.
The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.
Assuming that the average effect size of a single manipulation is similar to effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (change by half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40, and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in 12 comparisons (Table 1).
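The 87% power figure can be reproduced with the noncentral t-distribution (two-sample t-test, d = 1, n = 20 per group, two-tailed alpha = .05):

```python
# Power of a two-sample t-test with d = 1 and n = 20 per group.
from scipy.stats import t, nct

d, n = 1.0, 20
df = 2 * n - 2
ncp = d / (2 / n) ** 0.5                 # noncentrality ≈ 3.16
t_crit = t.ppf(0.975, df)
power = nct.sf(t_crit, df, ncp) + nct.cdf(-t_crit, df, ncp)
print(round(power, 2))  # → 0.87
```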
The R-Index for the comparison of the two experimental groups in Table 1 is Ř = 87%
(Success Rate = 100%, Median Observed Power = 94%, Inflation Rate = 6%).
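For readers unfamiliar with the R-Index, the computation behind these numbers is simply median observed power minus the inflation rate (success rate minus median observed power). With the rounded values reported here it comes out at .88, matching the reported 87% up to rounding of the inputs.

```python
# R-Index = median observed power - inflation rate,
# where inflation rate = success rate - median observed power.
success_rate = 1.00
median_obs_power = 0.94
inflation = success_rate - median_obs_power   # .06
r_index = median_obs_power - inflation        # .94 - .06 = .88
print(round(r_index, 2))  # → 0.88
```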
The Test of Insufficient Variance (TIVA) shows that the variance in z-scores is less than 1, but the probability that this would occur by chance is 10%, Var(z) = .63, Chi-square (df = 11) = 17.43, p = .096.
Thus, the results for the two experimental groups are perfectly consistent with real empirical data and the large effect size could be the result of two moderately strong manipulations with opposite effects.
The problem for Dr. Förster started when he included a control condition and wanted to demonstrate in each study that the two experimental groups also differed significantly from the control group. As already pointed out in the original post, samples of 20 participants per condition do not provide sufficient power to demonstrate effect sizes of d = .5 consistently.
To make matters worse, the three-group design has even less power than two independent studies because the same control group is used in both comparisons. When sampling error inflates the mean in the control group (e.g., true mean = 33, estimated mean = 36), it benefits the comparison for the experimental group with the lower mean, but it hurts the comparison for the experimental group with the higher mean (e.g., M = 27, M = 33, M = 39 vs. M = 27, M = 36, M = 39). When sampling error leads to an underestimation of the true mean in the control group (e.g., true mean = 33, estimated mean = 30), it benefits the comparison of the higher experimental group with the control group, but it hurts the comparison of the lower experimental group with the control group.
Thus, total power to produce significant results for both comparisons is even lower than for two independent studies.
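The shared-control argument can be sketched in a small simulation. The values below are assumptions chosen to match the example above (true means 27, 33, 39 with SD = 6, so each contrast has d = 1; n = 20 per cell; arbitrary seed): because the two tests share the control mean, their outcomes are negatively correlated, and the chance that both reach significance in the same sample tends to fall slightly below the .87² ≈ .76 expected for two independent studies.

```python
# Joint power of both control comparisons in a three-group design
# with a shared control group (n = 20 per cell, d = 1 per contrast).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(11)
n, sd, sims = 20, 6.0, 2000
both = 0
for _ in range(sims):
    low = rng.normal(27, sd, n)
    control = rng.normal(33, sd, n)   # same control group in both tests
    high = rng.normal(39, sd, n)
    p1 = ttest_ind(low, control).pvalue
    p2 = ttest_ind(high, control).pvalue
    both += (p1 < 0.05) and (p2 < 0.05)
print(round(both / sims, 2))  # compare with .87**2 ≈ .76 for independent studies
```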
It follows that the problem for a researcher with real data was the control group. Most studies would have produced significant results for the comparison of the two experimental groups, but failed to show significant differences between one of the experimental groups and the control group.
At this point, it is unclear how Jens Förster achieved significant results under the contested assumption that real data were collected. However, it seems most plausible that QRPs would be used to move the mean of the control group to the center so that both experimental groups show a significant difference. When this was impossible, the control group could be dropped, which may explain why 3 studies in Table 1 did not report results for a control group.
The influence of QRPs on the control group can be detected by examining the variation of means in Table 1 across the 12 (9) studies. Sampling error should randomly increase or decrease means relative to the overall mean of an experimental condition, so there is no reason to expect a correlation in the pattern of means. Consistent with this prediction, the means of the two experimental groups are unrelated, r(12) = .05, p = .889; r(9) = .36, p = .347. In contrast, the means of the control group are correlated with the means of the two experimental groups, r(9) = .73 and r(9) = .71. If the means in the control group were derived from the means in the experimental groups, it makes sense to predict the control means from the means of the two experimental groups. A regression equation shows that 77% of the variance in the means of the control group is explained by the variation in the means of the experimental groups, R = .88, F(2,6) = 10.06, p = .01.
This analysis clarifies the source of the unusual linearity in the data. Studies with n = 20 per condition have very low power to demonstrate significant differences between a control group and opposite experimental groups because sampling error in the control group is likely to move the mean of the control group too close to one of the experimental groups to produce a significant difference.
This problem of low power may lead researchers to use QRPs to move the mean of the control group to the center. The problem for users of QRPs is that this statistical boost of power leaves a trace in the data that can be detected with various bias tests. The pattern of the three means will be too linear, there will be insufficient variance in the effect sizes, p-values, and observed power in the comparisons of experimental groups and control groups, the success rate will exceed median observed power, and, as shown here, the means in the control group will be correlated with the means in the experimental group across conditions.
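The last of these diagnostics can be illustrated with a toy simulation (all numbers below are invented for illustration): independently sampled control means do not track the experimental means across studies, whereas control means placed near the midpoint of the two experimental means correlate strongly with them.

```python
# Control means sampled honestly vs. placed near the midpoint of the
# two experimental means, across many hypothetical studies.
import numpy as np

rng = np.random.default_rng(5)
studies = 1000
low = rng.normal(27, 2, studies)    # study-to-study variation in means
high = rng.normal(39, 2, studies)
honest_control = rng.normal(33, 2, studies)           # independent sampling
midpoint_control = (low + high) / 2 + rng.normal(0, 0.5, studies)

print(round(np.corrcoef(honest_control, low)[0, 1], 2))    # close to zero
print(round(np.corrcoef(midpoint_control, low)[0, 1], 2))  # around .6-.7
```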
In a personal email Dr. Förster did not comment on the statistical analyses because his background in statistics is insufficient to follow the analyses. However, he rejected this scenario as an account for the unusual linearity in his data; “I never changed any means.” Another problem for this account of what could have happened is that dropping cases from the middle group would lower the sample size of this group, but the sample size is always close to n = 20. Moreover, oversampling and dropping of cases would be a QRP that Dr. Förster would remember and could report. Thus, I now agree with the conclusion of the LOWI commission that the data cannot be explained by using QRPs, mainly because Dr. Förster denies having used any plausible QRPs that could have produced his results.
Some readers may be confused about this conclusion because it may appear to contradict my first blog. However, my first blog merely challenged the claim by the LOWI commission that linearity cannot be explained by QRPs. I found a plausible way in which QRPs could have produced linearity, and these new analyses still suggest that secretive and selective dropping of cases from the middle group could be used to show significant contrasts. Depending on the strength of the original evidence, this use of QRPs would be consistent with the widespread use of QRPs in the field and would not be considered scientific misconduct. As Roy F. Baumeister, a prominent social psychologist put it, “this is just how the field works.” However, unlike Roy Baumeister, who explained improbable results with the use of QRPs, Dr. Förster denies any use of QRPs that could potentially explain the improbable linearity in his results.
In conclusion, the following facts have been established with sufficient certainty:
(a) the reported results are too improbable to reflect just true effects and sampling error; they are not credible.
(b) the main problem for a researcher to obtain valid results is the low power of multiple-study articles and the difficulty of demonstrating statistical differences between one control group and two opposite experimental groups.
(c) to avoid reporting non-significant results, a researcher would have to drop failed studies and selectively drop cases from the middle group to move its mean toward the center.
(d) Dr. Förster denies the use of QRPs and he denies data manipulation.
Evidently, the facts do not add up.
The new analyses suggest that there is one simple way for Dr. Förster to show that his data have some validity. The comparison of the two experimental groups shows an R-Index of 87%, which implies that there is nothing statistically improbable about these data. If the reported results are based on real data, a replication study is highly likely to reproduce the mean difference between the two experimental groups. With n = 20 in each cell (N = 40), it would be relatively easy to conduct a preregistered and transparent replication study. However, without such evidence, the published results lack scientific credibility, and it would be prudent to retract all articles that show unusual statistical patterns that cannot be explained by the author.