Open draft for a Response Article to be submitted to PNAS (1,200 words; PNAS Commentaries/Letters allow only 500 words). Co-authors are welcome. Please indicate intent and make contributions in the comments section. This attack on valuable replication work needs a response.
Bryan, Walton, Rogers, and Dweck reported three studies suggesting that a slight change in message wording can have dramatic effects on voter turnout (1). Gerber, Huber, Biggers, and Hendry reported a failure to replicate this result (2). Bryan, Yeager, and O’Brien reanalyzed Gerber et al.’s data and found a significant result consistent with the original findings (3). Based on this reanalysis, Bryan et al. (2019) make two claims that go beyond the question of how to increase voter turnout. First, they accuse Gerber et al. (2016) of exploiting replicators’ degrees of freedom to produce a non-significant result, a practice others have called reverse p-hacking (4). Second, they claim that many replicators may engage in deceptive practices to produce non-significant results because such results are deemed easier to publish. We take issue with these claims about the intentions and practices of researchers who conduct replication studies. Moreover, we present evidence that Bryan et al.’s (2011) results are likely to be biased by the exploitation of researchers’ degrees of freedom. This conclusion is consistent with widespread evidence that social psychologists in 2011 were abusing statistical methods to inflate effect sizes in order to publish eye-catching results that often do not replicate (5). We argue that only a pre-registered replication study with high precision can settle the dispute about the influence of subtle linguistic cues on voter turnout.
Bryan et al. (2011)
Study 1 used a very small sample of n = 16 participants per condition. After transforming the dependent variable, a t-test produced a just-significant result (.005 < p < .05), p = .044. Study 2 had 88 participants, but power was reduced because the outcome variable was dichotomous. A chi-square test again produced a just-significant result, p = .018. Study 3 increased the sample size considerably (N = 214), which should also increase power and produce a smaller p-value if the population effect size is the same as in Study 2. However, the observed effect size was weaker and the result was again just significant, p = .027. In the wake of the replication crisis, awareness has increased that sampling error produces large variability in p-values and that a string of just-significant p-values is unlikely to occur by chance. Thus, the results reported by Bryan et al. (2011) suggest that researchers’ degrees of freedom were used to produce significant results (6). For example, converted into observed power, the p-values imply 52%, 66%, and 60% power, respectively. It is unlikely that three studies with an average power of 60% would all produce significant results; the expected number of significant results is only 1.8. These calculations are conservative because questionable research practices inflate estimates of observed power. The replication index (R-Index) corrects for this bias by subtracting the inflation rate from the estimate of observed power (7). With 60% mean observed power and a 100% success rate, the inflation rate is 40 percentage points, and the R-Index is 60% − 40% = 20%. Simulations show that an R-Index of 20% is obtained when the null hypothesis is true. Thus, the published results provide no empirical evidence that subtle linguistic cues influence voter turnout because the published results are incredible.
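These calculations can be reproduced with a short script. The sketch below uses Python’s standard-library `statistics.NormalDist` and takes the three p-values from the text above; observed power is computed by converting each two-sided p-value to a z-score and treating that z-score as the true effect:

```python
from statistics import NormalDist

nd = NormalDist()
p_values = [.044, .018, .027]          # Bryan et al. (2011), Studies 1-3
z_crit = nd.inv_cdf(1 - .05 / 2)       # two-sided critical value, ~1.96

# observed power: probability of a significant result if the observed
# z-score were the true (noncentral) effect
obs_power = [nd.cdf(nd.inv_cdf(1 - p / 2) - z_crit) for p in p_values]
mean_power = sum(obs_power) / len(obs_power)   # ~0.59, i.e. ~60%
expected_sig = mean_power * len(p_values)      # ~1.8 expected significant results

success_rate = 1.0                             # 3 out of 3 reported results were significant
inflation = success_rate - mean_power          # ~0.41
r_index = mean_power - inflation               # ~0.19, i.e. ~20%
```

Rounding to whole percentages gives the 52%, 66%, and 60% observed-power estimates and the R-Index of roughly 20% reported above.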
Gerber et al. (2016)
Gerber et al. conducted a conceptual replication study with a much larger sample (n = 2,236 in the noun condition, n = 2,232 in the verb condition). Their effect was in the same direction, but much weaker and not statistically significant, 95%CI = -1.8 to 3.8. They also noted that the original studies were conducted on the day before elections or on the morning of election day, limited their analysis to the day of elections, and reported a non-significant result for this analysis as well. Gerber et al. discuss various reasons for their replication failure that assume the original results are credible (e.g., internet vs. phone contact). They even consider the possibility that their finding could be a type-II error, although this implies that the population effect size is much smaller than the estimates in Bryan et al.’s (2011) study.
Bryan et al. (2019)
Bryan et al. (2019) noted that Gerber et al. never reported the results of a simple comparison of the two linguistic conditions when the sample was limited to participants who were contacted on the day before elections. When they conducted this analysis with a one-sided test and alpha = .05, they obtained a significant result, p = .036. They consider this a successful replication and allege that Gerber et al. intentionally withheld this result. We do not know why Gerber et al. (2016) did not report this result, but we are skeptical that it can be considered a successful replication for several reasons. First, adding another just-significant result to a series of just-significant results makes the evidence weaker, not stronger (5). The reason is that a credible set of studies with modest power should contain some non-significant results. The absence of such non-significant results undermines the trustworthiness of the reported results. The maximum probability of obtaining a just-significant result (.05 to .005) is 33%. The probability of this outcome in four out of four studies is just .33^4 = .012. Thus, even if we consider Gerber et al.’s study a successful replication, the results do not provide strong support for the hypothesis that subtle linguistic manipulations have a strong effect on voter turnout. Another problem with Bryan et al.’s conclusions is that they put too much weight on the point estimates of effect sizes. “In sum, the evidence across the expanded set of model specifications that includes the main analytical choices by Gerber et al. supports a substantial and robust effect consistent with the original finding by Bryan et al.” (p. 6). This claim ignores that just-significant p-values imply that the corresponding confidence intervals barely exclude an effect size of zero (i.e., p = .05 implies that 0 is the lower bound of the 95%CI). Thus, each result individually cannot be used to claim that the population effect size is large.
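The 33% figure can be checked numerically. For a normally distributed test statistic, the probability that a two-sided p-value lands in the just-significant range (.005 to .05) depends on the true effect size; a grid search over effect sizes finds the maximum of that probability (a sketch under the standard normal model, using Python’s stdlib `NormalDist`):

```python
from statistics import NormalDist

nd = NormalDist()
z_05 = nd.inv_cdf(1 - .05 / 2)    # ~1.960, two-sided .05 cutoff
z_005 = nd.inv_cdf(1 - .005 / 2)  # ~2.807, two-sided .005 cutoff

def p_just_sig(mu):
    """Probability that the observed z falls between the .05 and .005 cutoffs
    (upper tail only; for the relevant mu the opposite tail is negligible)."""
    return nd.cdf(z_005 - mu) - nd.cdf(z_05 - mu)

# maximize over true effect sizes (noncentrality parameters) on a fine grid
max_prob = max(p_just_sig(mu / 100) for mu in range(0, 601))
four_in_a_row = max_prob ** 4
```

The maximum is reached for a true effect roughly midway between the two cutoffs and comes out near .33, so four just-significant results in a row have a probability of at most about .33^4 ≈ .012.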
It is also not possible to use standard meta-analysis to reduce sampling error because there is evidence of selection bias. In short, the reanalysis found a significant result with a one-sided test for a subset of the data. This finding is noteworthy, but hardly a smoking gun to make claims that reverse p-hacking was used to hide a robust effect.
Broader Implications for the Replication Movement
Bryan et al. (2019) generalize from their finding of a one-sided significant p-value in a conceptual replication study to replication studies in general. Many of these generalizations are invalid because Bryan et al. do not differentiate between different types of replication studies. First, there are registered replication reports (6). Registered replication reports are approved before data are collected and are guaranteed publication regardless of the study outcome. Thus, Bryan et al.’s claim that replicators use researchers’ degrees of freedom to produce null results because they are easier to publish does not apply to these replication studies. Nevertheless, registered replication reports have shaken the foundations of social psychology by failing to replicate ego depletion or facial feedback effects. Moreover, these replication failures were predicted by incredible p-values in the original articles. In contrast, bias tests fail to show reverse p-hacking in replication studies. Readers of Bryan et al. (2019) should therefore simply ignore their speculations about the motives and practices of researchers who conduct replication studies. Our advice for Bryan et al. (2019) is to demonstrate that subtle linguistic cues can influence voter turnout with a preregistered replication report. The 2020 elections are just around the corner. Good luck, you guys.
(1) C. J. Bryan, G. M. Walton, T. Rogers, C. S. Dweck, Motivating voter turnout by invoking the self. Proc. Natl. Acad. Sci. U.S.A. 108, 12653–12656 (2011).
(2) A. S. Gerber, G. A. Huber, D. R. Biggers, D. J. Hendry, A field experiment shows that subtle linguistic cues might not affect voter behavior. Proc. Natl. Acad. Sci. U.S.A. 113, 7112–7117 (2016).
(3) C. J. Bryan, D. S. Yeager, J. M. O’Brien, Replicator degrees of freedom allow publication of misleading failures to replicate. Proc. Natl. Acad. Sci. U.S.A. 116, 25535–25545 (2019).
(4) F. Strack, Reflection on the smiling registered replication report. Perspectives on Psychological Science 11, 929–930 (2016).
(5) G. Francis, The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review 21, 1180–1187 (2014).
(6) U. Schimmack, The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566 (2012).
(7) U. Schimmack, A Revised Introduction to the R-Index. https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index/
5 thoughts on “Christopher J. Bryan claims replicators p-hack to get non-significant results. I claim he p-hacked his original results.”
I’m somewhat detached from this debate as a computer scientist who undertakes computational experiments. However, some issues resonate but equally many do not.
I’m struck by:
1. the tenacity and detailed analysis of Uri (in my opinion the devil is really in the detail and I’m full of respect for researchers who really engage).
2. the serious scrutiny of major claims and their replications.
3. the absence of any prior theory, model or expectation (from Bayesian perspective the priors are completely uninformative), which surprises me since I would have thought quite a few things are improbable (and presumably other things might be predicted?).
4. I’m mystified by researchers who want everything to be true or false (especially since it is by definition the case that the null of precisely zero effect cannot ever be true for any continuously valued response measure). I mean what is the point? I could be picky and say evidence against the null is not evidence in favour of the alternate hypothesis but you all know that anyway!
5. the informal, and generally implicit, views surrounding replication logic seem to fuel the personal animosity between different researchers
6. whilst there are many dangers with naive meta-analysis, some sort of hierarchical modelling might be a useful starting point? Even random effects models can be seen as hierarchical models. So why not investigate the tau?
As I say it’s not my field. I’m impressed by the energy and integrity of many researchers but perhaps less by the lack of harmony. Surely everyone wants science to progress; but then I’m just a geek 😉
Just saying that NHST is a first step and the null hypothesis doesn’t have to be an effect size of zero. For example, we can use the effect of contacting somebody and only examine whether the effect for verbs is larger than that. The main problem with meta-analyzing these data is that publication bias is present, and correcting for publication bias with four studies gives very imprecise effect size estimates. In short, the data are inconclusive. There is no statistical remedy for bad data.
This is not my discipline but I love the flow of the response and it seems to incorporate a sensible summary of the dispute.
I am puzzled how Bryan is able to publish an article attacking replication efforts while seemingly employing questionable practices that he accuses replicators of.
I’d have several comments mostly regarding citations and some explanations for non-replication-experts:
from a reader’s perspective this needs further explanation:
– “Simulations show that an R-Index of 20% is obtained when the null-hypothesis is true.” -> it’s quite interesting but it seems like a probabilistic statement is missing. An R-index of 20% certainly does not prove that the null is always true, so how likely is it?
-“with a much larger sample (n = 2,236 noun condition, n = 2,232 verb condition)” -> it would be better if the conditions were introduced in the Bryan study
-“their effect was in the same direction, but much weaker and not statistically significant, 95%CI = -1.8 to 3.8.” -> what effects are you talking about? the paragraph before is only about p-values. an introduction of the effect sizes in Bryan would be helpful
– “The maximum probability of obtaining a just significant result (.05 to .005) is 33%.”
-> where does this number come from? shouldn’t the range “.05 to .005” be adjusted to Bryans’ reported range? or why is this range important in particular?
There are also several important statements missing citations and some minor grammatical errors, which I guess you are already aware of.
Replication, replication replication (and no mind-reading, please)
Bryan, Walton, Rogers, and Dweck (1) reported that slight changes in messaging dramatically affect voter turnout. Gerber, Huber, Biggers, and Hendry reported a conceptual replication ~100 times larger that failed to corroborate this claim, 95%CI [-1.8, 3.8] (2). Gerber et al. suggest the original result may yet be true, but much smaller, or dependent on mode of contact (e.g., internet vs. phone), etc. Bryan, Yeager, and O’Brien (2019) report a post-hoc alternate analysis of a subset of the Gerber data, with a one-sided test yielding p = .036. Perhaps noteworthy, but hardly a smoking gun. Bryan et al., however, conclude not only that the results show a “substantial and robust effect consistent with the original finding” (p. 6) but also that Gerber et al. (2016) exploited replicators’ degrees of freedom to produce a non-significant result. They further claim that many failures to replicate rely on such practices to produce non-significant results. We take issue with these claims about the intentions and practices of researchers who conduct replication studies.
First, regarding this particular study, no result individually can be used to claim that the population effect size is large (the corresponding confidence intervals barely exclude an effect size of zero: p = .05 implies that 0 is the lower bound of the 95%CI). It is also not possible to use standard meta-analysis to reduce sampling error because there is evidence of selection bias. Bryan et al.’s (2011) Studies 1–3 yielded a series of just-significant p-values: .044, .018, .027. Multiple factors suggest the results may not be replicable: despite low observed power (52%, 66%, and 60%), no study failed to reach significance (only 1.8 significant results are expected in a sequence of such underpowered studies), and p-values did not track study power. Thus, the published results provide little to no empirical evidence that subtle linguistic cues influence voter turnout. The original results and the failure to replicate are consistent with widespread evidence that eye-catching results in social psychology often result from research practices that inflate effect sizes, leading to failures to replicate or undermining validity (5).
The claim that failures to replicate rely on “reverse p-hacking” (4) is controverted by the evidence. For instance, on this view registered replication reports (6) should be highly successful. In fact, however, they have shaken the foundations of social psychology, failing to replicate such classics as ego depletion or facial feedback effects. Moreover, these failures are predicted by the (in)credibility of the original p-values. Attempts to create a citable literature undermining replication, especially when themselves accompanied by ongoing questionable research practices, are harmful if taken seriously.