Open draft for a Response Article to be submitted to PNAS (1,200 words; Commentary/Letters are limited to 500 words). Co-authors are welcome. Please indicate intent and make contributions in the comments section. This attack on valuable replication work needs a response.
Bryan, Walton, Rogers, and Dweck reported three studies suggesting that a slight change in message wording can have dramatic effects on voter turnout (1). Gerber, Huber, Biggers, and Hendry reported a failure to replicate this result (2). Bryan, Yeager, and O’Brien reanalyzed Gerber et al.’s data and found a significant result consistent with the original findings (3). Based on this finding, Bryan et al. (2019) make two claims that go beyond the question of how to increase voter turnout. First, they accuse Gerber et al. (2016) of exploiting replicators’ degrees of freedom to produce a non-significant result, a practice others have called reverse p-hacking (4). Second, they claim that many replicators may engage in deceptive practices to produce non-significant results because such results are deemed easier to publish. We take issue with these claims about the intentions and practices of researchers who conduct replication studies. Moreover, we present evidence that Bryan et al.’s (2011) results are likely to be biased by the exploitation of researchers’ degrees of freedom. This conclusion is consistent with widespread evidence that social psychologists in 2011 abused statistical methods to inflate effect sizes and publish eye-catching results that often do not replicate (5). We argue that only a pre-registered replication study with high precision can settle the dispute about the influence of subtle linguistic cues on voter turnout.
Bryan et al. (2011)
Study 1 used a very small sample of n = 16 participants per condition. After transformation of the dependent variable, a t-test produced a just-significant result (p < .05 and p > .005), p = .044. Study 2 had 88 participants, but power was reduced because the outcome variable was dichotomous. A chi-square test again produced a just-significant result, p = .018. Study 3 increased the sample size considerably (N = 214), which should increase power and produce a smaller p-value if the population effect size is the same as in Study 2. However, the observed effect size was weaker and the result was again just significant, p = .027. In the wake of the replication crisis, awareness has increased that sampling error produces large variability in p-values and that a string of just-significant p-values is unlikely to occur by chance. Thus, the results reported by Bryan et al. (2011) suggest that researchers’ degrees of freedom were used to produce significant results (6). For example, converted into observed power, the p-values imply 52%, 66%, and 60% power, respectively. It is unlikely that three studies with an average power of 60% would all produce significant results; the expected number of significant results is only 3 × .60 = 1.8. These calculations are conservative because questionable research practices inflate estimates of observed power. The replication index (R-Index) corrects for this bias by subtracting the inflation rate from the estimate of observed power (7). With 60% mean observed power and a 100% success rate, the inflation rate is 40 percentage points, and the R-Index is 60% − 40% = 20%. Simulations show that an R-Index of 20% is obtained when the null hypothesis is true. Thus, the published results provide no empirical evidence that subtle linguistic cues influence voter turnout, because the published results are not credible.
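The observed-power and R-Index calculations above can be reproduced with a short script. This is a minimal sketch under the standard two-sided normal approximation; the function name `observed_power` is ours, not from the cited papers.

```python
from statistics import NormalDist

def observed_power(p, alpha=0.05):
    """Observed power implied by a two-sided p-value (normal approximation)."""
    nd = NormalDist()
    z_obs = nd.inv_cdf(1 - p / 2)       # z-statistic implied by the p-value
    z_crit = nd.inv_cdf(1 - alpha / 2)  # significance threshold (~1.96)
    return nd.cdf(z_obs - z_crit)

p_values = [0.044, 0.018, 0.027]                # Bryan et al. (2011), Studies 1-3
powers = [observed_power(p) for p in p_values]  # ~0.52, 0.66, 0.60
mean_power = sum(powers) / len(powers)          # ~0.60
expected_sig = 3 * mean_power                   # ~1.8 expected significant results
success_rate = 1.0                              # 3 of 3 reported results significant
inflation = success_rate - mean_power           # ~0.40
r_index = mean_power - inflation                # ~0.19, i.e., roughly 20%
```

With rounding, the script recovers the 52%, 66%, and 60% observed-power values and an R-Index of about 20%.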
Gerber et al. (2016)
Gerber et al. conducted a conceptual replication with a much larger sample (n = 2,236 in the noun condition, n = 2,232 in the verb condition). Their effect was in the same direction, but much weaker and not statistically significant, 95% CI = -1.8 to 3.8. They also noted that the original studies were conducted on the day before an election or on the morning of election day, limited their analysis accordingly, and reported a non-significant result for this analysis as well. Gerber et al. discuss various reasons for their replication failure that assume the original results are credible (e.g., internet vs. phone contact). They even consider the possibility that their finding is a type-II error, although this implies that the population effect size is much smaller than the estimates in Bryan et al.’s (2011) studies.
Bryan et al. (2019)
Bryan et al. (2019) noted that Gerber et al. never reported a simple comparison of the two linguistic conditions with the sample limited to participants contacted on the day before the election. When they conducted this analysis with a one-sided test and alpha = .05, they obtained a significant result, p = .036. They consider this a successful replication and allege that Gerber et al. intentionally withheld this result. We do not know why Gerber et al. (2016) did not report this result, but we are skeptical that it can be considered a successful replication for several reasons. First, adding another just-significant result to a series of just-significant results makes the evidence weaker, not stronger (5). A credible set of studies with modest power should contain some non-significant results; the absence of any non-significant results undermines the trustworthiness of the reported findings. The maximum probability of obtaining a just-significant result (p between .005 and .05) is 33%. The probability of this outcome in four out of four studies is just .33^4 = .012. Thus, even if we count Gerber et al.’s study as a successful replication, the results do not provide strong support for the hypothesis that subtle linguistic manipulations have a strong effect on voter turnout. Another problem with Bryan et al.’s conclusions is that they put too much weight on point estimates of effect sizes: “In sum, the evidence across the expanded set of model specifications that includes the main analytical choices by Gerber et al. supports a substantial and robust effect consistent with the original finding by Bryan et al.” (p. 6). This claim ignores that just-significant p-values imply confidence intervals that barely exclude an effect size of zero (i.e., p = .05 implies that 0 is the lower bound of the 95% CI). Thus, no individual result can be used to claim that the population effect size is large.
It is also not possible to use a standard meta-analysis to reduce sampling error, because there is evidence of selection bias. In short, the reanalysis found a significant result with a one-sided test for a subset of the data. This finding is noteworthy, but hardly a smoking gun that justifies claims that reverse p-hacking was used to hide a robust effect.
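The 33% bound on just-significant results and the four-out-of-four probability can be checked numerically. This is a sketch under the normal approximation: the probability that a two-sided p-value lands between .005 and .05 is maximized when the true noncentrality of the z-statistic sits midway between the two critical values (we ignore the negligible chance of a just-significant result in the opposite tail).

```python
from statistics import NormalDist

nd = NormalDist()
z_05 = nd.inv_cdf(1 - 0.05 / 2)    # two-sided critical value for p = .05 (~1.96)
z_005 = nd.inv_cdf(1 - 0.005 / 2)  # two-sided critical value for p = .005 (~2.81)

# Chance of .005 < p < .05, maximized over the true effect (noncentrality delta):
delta = (z_05 + z_005) / 2                                 # maximizing noncentrality
p_just_sig = nd.cdf(z_005 - delta) - nd.cdf(z_05 - delta)  # ~0.33
p_all_four = p_just_sig ** 4                               # ~0.012
```

Even under the most favorable assumption about the true effect size, four just-significant results in four studies have a probability of about 1%.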
Broader Implications for the Replication Movement
Bryan et al. (2019) generalize from their finding of a one-sided significant p-value in a conceptual replication study to replication studies in general. Many of these generalizations are invalid because Bryan et al. do not differentiate between types of replication studies. Most importantly, registered replication reports (6) are approved before data are collected and are guaranteed publication regardless of the study outcome. Thus, Bryan et al.’s claim that replicators use researchers’ degrees of freedom to produce null results because null results are easier to publish does not apply to these studies. Nevertheless, registered replication reports have shaken the foundations of social psychology by failing to replicate ego-depletion and facial-feedback effects. Moreover, these replication failures were predicted by incredible p-values in the original articles, whereas bias tests show no evidence of reverse p-hacking in replication studies. Readers of Bryan et al. (2019) should therefore simply ignore their speculations about the motives and practices of researchers who conduct replication studies. Our advice for Bryan et al. (2019) is to demonstrate that subtle linguistic cues can influence voter turnout with a preregistered replication report. The 2020 elections are just around the corner. Good luck, you guys.
(1) C. J. Bryan, G. M. Walton, T. Rogers, C. S. Dweck, Motivating voter turnout by invoking the self. Proc. Natl. Acad. Sci. U.S.A. 108, 12653–12656 (2011).
(2) A. S. Gerber, G. A. Huber, D. R. Biggers, D. J. Hendry, A field experiment shows that subtle linguistic cues might not affect voter behavior. Proc. Natl. Acad. Sci. U.S.A. 113, 7112–7117 (2016).
(3) C. J. Bryan, D. S. Yeager, J. M. O’Brien, Replicator degrees of freedom allow publication of misleading failures to replicate. Proc. Natl. Acad. Sci. U.S.A. (2019).
(4) F. Strack, Reflection on the smiling registered replication report. Perspectives on Psychological Science 11, 929–930 (2016).
(5) G. Francis, The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review 21, 1180–1187 (2014).
(6) U. Schimmack, The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods 17, 551–566 (2012).
(7) U. Schimmack, A revised introduction to the R-Index. https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index/ (2016).