Preprint. Draft. Comments are welcome.
A year ago, a group of 71 scientists published a commentary in the journal Nature Human Behaviour (Benjamin et al., 2017). Several of the authors are prominent members of a revolutionary movement that aims to change the way behavioral scientists do research (Brian A. Nosek, E.-J. Wagenmakers, Kenneth A. Bollen, Christopher D. Chambers, Andy P. Field, Donald P. Green, Anthony Greenwald, Larry V. Hedges, John P. A. Ioannidis, Scott E. Maxwell, Felix D. Schönbrodt, & Simine Vazire).
The main argument made in this article is that the standard criterion for statistical significance, a 5% risk of reporting a false positive result (i.e., the type-I error probability in Neyman-Pearson’s framework), is too high. The authors recommend lowering the false-positive risk from 5% to 0.5% (p < .005).
This recommendation is based on the authors’ shared belief that “a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming new discoveries in many fields of science are simply too low” (p. 1).
In contrast, others, including myself, have argued that the main cause of low reproducibility is that researchers conduct studies with a low probability of producing p-values less than .05 even if the null-hypothesis is false (i.e., a high type-II error probability in Neyman-Pearson’s framework) (Open Science Collaboration, 2015; Schimmack, 2012).
The probability of obtaining a true positive result is called statistical power. The main problem of low power is that many studies produce inconclusive, non-significant results (Cohen, 1962). However, another problem is that low power also produces significant results that are difficult to reproduce because significance can only be obtained if sampling error boosts observed effect sizes and test statistics. Replication studies do not reproduce the same random sampling error and are likely to produce non-significant results.
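The inflation of significant effect sizes under low power can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not part of the original argument: a two-group design with known unit variance (so a z-test stands in for the t-test), a true standardized effect of 0.2, and 50 participants per group. The function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean


def simulate_significance_filter(true_d=0.2, n_per_group=50, alpha=0.05,
                                 n_studies=20000, seed=1):
    """Simulate many studies and keep only the significant ones.

    Returns the share of significant studies (empirical power) and the
    mean observed effect size among those significant studies.
    """
    rng = random.Random(seed)
    # Standard error of a standardized mean difference with known unit variance.
    se = (2 / n_per_group) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed criterion
    observed = [rng.gauss(true_d, se) for _ in range(n_studies)]
    significant = [d for d in observed if abs(d / se) > z_crit]
    return len(significant) / n_studies, mean(significant)


power, mean_significant_d = simulate_significance_filter()
print(f"empirical power: {power:.2f}")
print(f"true effect: 0.20, mean significant effect: {mean_significant_d:.2f}")
```

With these settings power is low, and the studies that pass the significance filter overestimate the true effect on average. An exact replication with the same sample size draws fresh sampling error, so it has the same low power and will usually not be significant.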
The problem of low power is amplified by the use of questionable research practices, such as selective publishing of results that support a hypothesis. At least in 2011, most psychologists did not consider these practices problematic or unethical (John, Loewenstein, & Prelec, 2012).
The problem with these practices is that replication studies no longer weed out false positives that passed the significance filter in an original discovery. These practices explain why psychologists often report multiple successful replications of their original study, even if the statistical power to do so is low (Schimmack, 2012).
Benjamin et al. (2017) dismiss this explanation for low reproducibility.
“There has been much progress toward documenting and addressing several causes of this lack of reproducibility (for example, multiple testing, P-hacking, publication bias and under-powered studies).”
Notably, the authors provide no references for the claim that low power and questionable research practices have been documented, let alone addressed in the behavioral sciences.
The Open Science Collaboration documented that these problems contribute to replication failures (OSC, 2015) and there is no evidence that these practices have changed.
Even if some problems have been addressed, nobody really knows how researchers produce more significant results than statistical power predicts. As these factors remain unidentified, I call them from now on “unidentified researcher influence” (URI).
Because Benjamin et al. ignore URIs in their comment, they fail to make the most persuasive argument in favor of lowering the significance criterion from .05 to .005; namely, the significance criterion influences how much URIs contribute to discoveries. This was shown in a set of simulation studies by Simmons, Nelson, and Simonsohn (2011, see Table 1 of their article).
The most dramatic simulation of questionable research practices shows that the type-I error risk increases from 5% to 81.5% for marginally significant results, p < .10 (two-tailed). The actual type-I error risk is still 60.7% with the current standard of p < .05 (two-tailed). However, it drops to “just” 21.5% with a more conservative criterion of p < .01 (two-tailed). It would be even lower for the proposed criterion of p < .005.
Thus, a simple solution to the problem of URIs is to lower the significance criterion. Unfortunately, lowering the significance criterion for everybody has the negative effect of increasing costs for researchers who minimize the influence of URIs and conduct a priori power analyses to plan their studies.
This can easily be seen by computing the sample sizes required for 80% power with a small, medium, or large effect size and alpha = .05 versus alpha = .005 in a power-hungry between-subject design (independent t-test).
This is the reason why I think minimizing URIs and honest reporting of replication results is the most effective way to solve the reproducibility problem in the behavioral sciences. This is also the reason why I developed statistical tests that can reveal URIs in published data and that could be used by editors and reviewers to reduce the risk of publishing false positive discoveries.
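The cost of a stricter criterion can be approximated with a few lines of code. The sketch below uses the normal approximation to the independent t-test, n per group ≈ 2((z_{1-alpha/2} + z_{power}) / d)^2, which slightly underestimates the exact t-test sample sizes. The function name and the effect sizes of 0.2, 0.5, and 0.8 (Cohen's conventional small, medium, and large values) are assumptions chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha, power=0.80):
    """Approximate sample size per group for a two-tailed, two-sample test.

    Uses the normal approximation, so exact t-test values are a bit larger.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the criterion
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    n_05 = n_per_group(d, alpha=0.05)
    n_005 = n_per_group(d, alpha=0.005)
    print(f"{label:6s} d = {d}: n per group = {n_05} (alpha = .05), "
          f"{n_005} (alpha = .005)")
```

Under this approximation the ratio of the two sample sizes does not depend on the effect size: moving from alpha = .05 to alpha = .005 requires roughly 70% more participants to keep 80% power.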
Should we lower alpha even if the problem of URIs were addressed?
Benjamin et al. (2017) claim that a 5% false positive risk is still too high even if URIs were no longer a problem. I think their argument ignores the importance of statistical power. The percentage of false discoveries among all statistically significant results is a function of type-I error and type-II error. This can be easily seen by examining a few simple scenarios. The scenarios assume that a high percentage of tested hypotheses, 50%, are false (i.e., the null hypothesis is true in half of all studies).
With 20% power and alpha = .05, 20% of all significant results would be false positives (1 out of 5). This seems, indeed, unreasonably high. However, nobody should conduct studies with 20% power. Tversky & Kahneman (1971) suggested that reasonable scientists would have at least 50% power in their studies. Now the risk of a false positive is 1 out of 11 significant results. Even 50% power is low and the most widely accepted standard for statistical power is Cohen’s (1988) recommendation to plan for 80% power. Now, the risk of a false positive is reduced to 1 out of 17 significant results.
Most important, the scenario assumes only a single study is being conducted. With each honestly reported replication study, the percentage of false positives decreases exponentially with the number of replication studies. For example, with a pair of an original study and a replication study and 80% power, only 1 out of 257 pairs of significant results would be a false positive, while a non-significant result in a replication study would flag the original result as a potential false positive.
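The arithmetic behind these scenarios can be checked with a short calculation. This is a minimal sketch, assuming that half of all tested hypotheses are true nulls and that original and replication studies are independent with the same power and alpha. The function name and the `prior_null` parameter are introduced here for illustration only.

```python
from fractions import Fraction


def false_positive_share(power, alpha=Fraction(5, 100),
                         prior_null=Fraction(1, 2), n_studies=1):
    """Share of false positives among results significant in all n_studies.

    A true null is significant n times in a row with probability alpha**n,
    a true effect with probability power**n.
    """
    false_pos = prior_null * alpha ** n_studies
    true_pos = (1 - prior_null) * power ** n_studies
    return false_pos / (false_pos + true_pos)


for power in (Fraction(20, 100), Fraction(50, 100), Fraction(80, 100)):
    share = false_positive_share(power)
    print(f"power = {float(power):.2f}, single study: {share}")

pair = false_positive_share(Fraction(80, 100), n_studies=2)
print(f"power = 0.80, original plus replication: {pair}")
```

Exact fractions are used so that the output can be compared directly with the "1 out of N" figures in the text. Passing a smaller `alpha` shows how a stricter criterion lowers the share further, at the price of the larger samples needed to maintain power.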
The table also shows that lowering the significance criterion reduces the percentage of false positives. However, this is achieved at the cost of using more resources for a single study. It is important to consider these trade-offs. Sometimes, it might be beneficial to demonstrate significant results in two conceptual replication studies rather than a single study that tests a hypothesis with one specific paradigm. It might even be beneficial to relax the alpha level for a first study to 20% and require a larger sample and stronger evidence with an alpha level of 5% or 0.5% for a confirmatory replication study.
While these are important questions to consider in the planning of studies, the balancing of type-I and type-II errors according to the specific question being asked is at the core of Neyman-Pearson’s statistical framework. Whether lowering alpha to a fixed level of .005 is always the best option can be debated.
However, I don’t think we should have a debate about URIs. The goal of empirical science is to reduce error in human observations wherever possible. One might even define science as the practice of observing things with the least amount of human error. This also seems to be a widely held view of scientific activity. Unfortunately, science is a human activity and the results reported in scientific journals can be biased by a myriad of URIs.
As Fiske and Taylor (1984) described human information processing: “Instead of a naive scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (p. 88).
What separates charlatans from scientists is the proper use of scientific methods and the honest reporting of all data and all the assumptions that were made in the inferential steps from data to conclusions. Unfortunately, scientists are human and the human motivation to be right can distort the collective human effort to understand the world and themselves.
Thus, I think there are no trade-offs when it comes to URIs. URIs need to be minimized as much as possible because they undermine the collective goal of science, waste resources, and undermine the credibility of scientists to inform the public about important issues.
If you agree, you can say so in the comment section and maybe become an author of another multiple-author comment on the replication crisis that calls for clear guidelines about scientific integrity that all behavioral scientists need to follow with clear consequences for violating these standards. Researchers who violate this code should not receive public money to support their research.
In conclusion, I argued that Benjamin et al. (2017) made an attribution error when they blamed the traditional significance criterion for the reproducibility crisis. The real culprits are unidentified researcher influences (URIs) that increase the false positive risk and inflate effect sizes. One solution to this problem is to lower alpha, but this approach requires that more resources are spent on demonstrating true findings. A better approach is to ensure that researchers minimize unidentified researcher influences in their labs and that scientific organizations provide clear guidelines about best practices. Most important, it is not acceptable to suppress conceptual or direct replication studies that failed to support an original discovery. Nobody should have to trust original discoveries if researchers do not replicate their work or their self-replications cannot be trusted.
4 thoughts on “The Misattribution Error in the Alpha Wars about Significance Criteria”
Both/And. You continue to make a good case that URIs are prevalent and need to be addressed. I don’t think that much weakens the case for reducing the conventional threshold. Although I’d prefer we drop significance thresholds entirely, I support the .005 threshold as an achievable first step that works within the existing system and machinery. If I’m not mistaken it can be pursued independently of URI-targeted improvements.
You do argue that .005 would make experiments more expensive. I’m all for it — fewer, better-run experiments perhaps collaborating among multiple labs to help ensure clear, reliable, reproducible results. I dreamed of such things when I was in graduate school in philosophy of science and cognitive science.
Minor: You have two tables labeled “Sample Size for 80% Power”. The second should be something like, “Proportion of False Positives”, and the first entry is 1/4 but the text says 1 in 5.
Thank you for your detailed comments. Very much appreciated. My concern is that a rigid threshold of .005 is difficult to achieve in areas like clinical psychology, medicine, or animal research. Work that is clearly specified as exploratory and honestly reported with preregistered hypotheses could use alpha = .20 and report promising results that need to be verified in later research. .005 could be used for clearly confirmatory work, although I think 5 sigma makes more sense for “conclusive” studies.
Ah, a two-tiered approach where it’s obvious we can’t put too much weight on the first-tier results, but demand much more from the second tier. That also sounds like an achievable improvement, provided we *get* enough tier 2 studies, and downplay the tier 1 studies.