Department of Psychology, University of Toronto, Mississauga, ON L5L 1C6, Canada
Abstract: Psychology is in the middle of a replicability revolution. High-profile
replication studies have produced a large number of replication
failures. The main reason why replication studies in psychology often fail
is that original studies were selected for significance. If all studies were
reported, original studies would fail to produce significant results as
often as replication studies. Replications would be less contentious if
original results were not selected for significance.
The history of psychology is characterized by revolutions. This
decade is marked by the replicability revolution. One prominent
feature of the replicability revolution is the publication of replication
studies with nonsignificant results.
The publication of several high-profile replication failures has triggered a confidence crisis. Zwaan et al. have been active participants in the replicability revolution. Their target article addresses criticisms of direct replication studies.
One concern is the difficulty of re-creating original studies, which may explain replication failures, particularly in social psychology. This argument fails on three counts. First, it does not explain why published studies have an apparent success rate
greater than 90%. If social psychological studies were difficult to replicate, the success rate should be lower. Second, it is not clear why it would be easier to conduct conceptual replication studies that vary crucial aspects of a successful original study.
If social priming effects were, indeed, highly sensitive to contextual variations, conceptual replication studies would be even more likely to fail than direct replication studies; however, miraculously they always seem to work. The third problem with this argument is that it ignores selection for significance. It treats successful conceptual
replication studies as credible evidence, but bias tests reveal that these studies have been selected for significance and that many original studies that failed are simply not reported (Schimmack 2017; Schimmack et al. 2017).
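The effect of selection for significance on the published record can be made concrete with a minimal simulation sketch. The numbers below (true effect size d = 0.4, n = 20 per cell) are illustrative assumptions, not figures from the commentary; the point is only that when nonsignificant attempts are unreported, a modest true success rate still yields a published record of near-100% successes.

```python
# Illustrative sketch of selection for significance (assumed parameters).
import random

random.seed(1)
d, n = 0.4, 20            # assumed true effect size and per-cell sample size
ncp = d * (n / 2) ** 0.5  # approximate noncentrality of a two-sample z-test

attempted = 10_000
# Each attempted study's test statistic is roughly Normal(ncp, 1).
zs = [random.gauss(ncp, 1) for _ in range(attempted)]
# Selection for significance: only |z| > 1.96 results get published.
published = [z for z in zs if abs(z) > 1.96]

rate = len(published) / attempted
print(f"share of attempted studies that are significant: {rate:.0%}")
print("share of published (selected) studies that are significant: 100%")
```

Under these assumed parameters roughly a quarter of attempted studies are significant, yet every published study is, which matches the apparent success rate above 90% without requiring that the underlying effects be easy to replicate.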
A second concern about direct replications is that they are less informative than conceptual replications (Crandall & Sherman 2016). This argument is misguided because it assumes a successful outcome. If a conceptual replication study is successful, it increases the probability that the original finding was true and it expands the range of conditions under which an effect can be observed. However, the advantage of a conceptual replication study becomes a disadvantage when a study fails. For example,
if the original study showed that eating green jelly beans increases happiness and a conceptual replication study with red jelly beans does not show this effect, it remains unclear whether green jelly beans make people happier or not. Even the nonsignificant finding with red jelly beans is inconclusive because the result could be a false negative. Meanwhile, a failure to replicate the green jelly bean effect in a direct replication study is informative because it casts doubt on the original finding.
In fact, a meta-analysis of the original and replication study might produce a nonsignificant result and reverse the initial inference that green jelly beans make people happy. Crandall and Sherman’s argument rests on the false assumption that only significant studies are informative. This assumption is flawed because selection for significance renders significance uninformative (Sterling 1959).
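The meta-analytic point can be sketched with hypothetical numbers (the z-values below are illustrative assumptions, not results from any actual study): pooling a just-significant original result with a failed direct replication, here via Stouffer's unweighted method, can leave the combined evidence nonsignificant.

```python
# Hypothetical worked example of pooling an original study with a
# failed direct replication (Stouffer's unweighted z-combination).
from statistics import NormalDist

norm = NormalDist()

z_original = 2.10      # assumed original result, two-sided p ~ .036
z_replication = -0.50  # assumed direct replication, clearly nonsignificant

# Stouffer's method for two independent z-scores
z_combined = (z_original + z_replication) / 2 ** 0.5
p_combined = 2 * (1 - norm.cdf(abs(z_combined)))

print(f"combined z = {z_combined:.2f}, two-sided p = {p_combined:.3f}")
```

With these assumed inputs the combined p-value exceeds .05, so the meta-analysis reverses the initial inference drawn from the original study alone.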
A third argument against direct replication studies is that there are multiple ways to compare the results of original and replication studies. I believe the discussion of this point also benefits from taking publication bias into account. Selection for significance
explains why the Reproducibility Project obtained only 36% significant results in direct replications of original studies with significant results (Open Science Collaboration 2015). As a result, the significant results of original studies are less credible than the nonsignificant results in direct replication studies. This generalizes to all comparisons of original studies and direct replication studies.
Once there is suspicion or evidence that selection for significance occurred, the results of original studies are less credible, and more weight should be given to replication studies that are not biased by selection for significance. Without selection for significance,
there is no reason why replication studies should be more likely to fail than original studies. If replication studies correct mistakes in original studies and use larger samples, they are actually more likely to produce a significant result than original studies.
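This last claim follows directly from statistical power. A minimal calculation, with an assumed true effect size and assumed sample sizes (none of these numbers come from the commentary), shows that a larger-sample replication of the same true effect is more, not less, likely to reach significance than the original study.

```python
# Illustrative power calculation with assumed effect and sample sizes.
from statistics import NormalDist

norm = NormalDist()

def power_two_sample(d, n_per_cell, alpha_z=1.96):
    """Approximate power of a two-sided z-test for a standardized
    mean difference d with n_per_cell participants per condition."""
    ncp = d * (n_per_cell / 2) ** 0.5
    return 1 - norm.cdf(alpha_z - ncp) + norm.cdf(-alpha_z - ncp)

d = 0.4  # assumed true effect size
print(f"original    (n = 20/cell): power = {power_two_sample(d, 20):.0%}")
print(f"replication (n = 80/cell): power = {power_two_sample(d, 80):.0%}")
```

Under these assumptions the original study has power of roughly 24%, while the fourfold-larger replication has power around 72%, so absent selection for significance the replication is the study more likely to succeed.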
Selection for Significance Explains Reputational Damage of Replication Failures
Selection for significance also explains why replication failures are damaging to the reputation of researchers. The reputation of researchers is based on their publication record, and this record is biased in favor of successful studies. Thus, researchers’ reputations are inflated by selection for significance. Once an unbiased replication
produces a nonsignificant result, the unblemished record is tainted, and it becomes apparent that a perfect published record is illusory and not the result of research excellence (a.k.a. flair). Thus, unbiased failed replication studies not only provide new evidence; they also
undermine the credibility of existing studies. Although positive illusions may be beneficial for researchers’ eminence, they have no place in science. It is therefore inevitable that the ongoing correction of the scientific record damages the reputation of researchers, if this reputation was earned by selective publishing of significant results.
In this way, direct replication studies complement statistical tools that can reveal selective publishing of significant results in original studies (Schimmack 2012; 2014; Schimmack & Brunner, submitted for publication).
Crandall, C. S. & Sherman, J. W. (2016) On the scientific superiority of conceptual replications for scientific progress. Journal of Experimental Social Psychology 66:93–99.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716.
Schimmack, U. (2012) The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods 17:551–56. [US]
Schimmack, U. (2014) The test of insufficient variance (TIVA): A new tool for the
detection of questionable research practices. Working paper. Available at: https://replicationindex.wordpress.com/2014/12/30/the-test-of-insufficient-variance-
Schimmack, U. (2017) ‘Before you know it’ by John A. Bargh: A quantitative book
review. Available at: https://replicationindex.wordpress.com/2017/11/28/beforeyou-
Schimmack, U. & Brunner, J. (submitted for publication) Z-Curve: A method for estimating replicability based on test statistics in original studies. Submitted for publication.
Schimmack, U., Heene, M. & Kesavan, K. (2017) Reconstruction of a train wreck:
How priming research went off the rails. Blog post. Available at: https://replicationindex.
Sterling, T. D. (1959) Publication decisions and their possible effects on inferences drawn from tests of significance – or vice versa. Journal of the American Statistical Association 54(285):30–34.