This blog post is a review of a manuscript that hopefully will never be published, but it probably will be. In that case, it is a draft for a PubPeer comment. As the ms. is under review, I cannot share the actual ms., but the review makes clear what the authors are trying to do.
I assume that I was selected as a reviewer for this manuscript because the editor recognized my expertise in this research area. While most of my work on replicability has been published in the form of blog posts, I have also published a few peer-reviewed publications that are relevant to this topic. Most important, I have provided estimates of replicability for social psychology using the most advanced method to do so, z-curve (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020), using the extensive coding by Motyl et al. (2017) (see Schimmack, 2020). I was surprised that this work was not mentioned.
In contrast, Yeager et al.’s (2019) replication study of 12 experiments is cited and, as I recall, 11 of the 12 studies replicated successfully. So, it is not clear why this study is cited as evidence of replication attempts often “producing pessimistic results.”
While I agree that there are many explanations that have been offered for replication failures, I do not agree that listing all of these explanations is impossible and that it is reasonable to focus on some of these explanations, especially if the main reason is left out. Namely, the main reason for replication failures is that original studies are conducted with low statistical power and only those that achieve significance are published (Sterling et al., 1995; Schimmack, 2020). Omitting this explanation undermines the contribution of this article.
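To make the low-power-plus-selection explanation concrete, here is a minimal simulation sketch. All numbers in it (a true effect of d = 0.4, 20 participants per group, a two-sided z-test with known variance) are illustrative assumptions of mine, not values from the manuscript or from any published replicability estimate, and the function name `simulate` is hypothetical. With these settings power is roughly 24%, so only about a quarter of exact replications of published (significant) results should be significant again, even though the effect is real.

```python
import random
from statistics import NormalDist, mean


def simulate(true_d=0.4, n_per_group=20, n_studies=20000, alpha=0.05, seed=1):
    """Simulate original studies, publish only significant ones, then replicate.

    Assumes a two-group design analyzed with a two-sided z-test (known SD = 1),
    so each study's test statistic is drawn directly as z ~ Normal(ncp, 1).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    se = (2 / n_per_group) ** 0.5                # standard error of the mean difference
    ncp = true_d / se                            # expected z-value of a study

    # Analytic power of the two-sided test (both rejection regions).
    power = (1 - std_normal.cdf(z_crit - ncp)) + std_normal.cdf(-z_crit - ncp)

    published_effects = []
    successes = 0
    for _ in range(n_studies):
        z_orig = rng.gauss(ncp, 1)
        if abs(z_orig) <= z_crit:
            continue                             # non-significant: stays in the file drawer
        published_effects.append(z_orig * se)    # observed effect size of the published study

        # Exact replication: same design, same sample size, same true effect.
        z_rep = rng.gauss(ncp, 1)
        same_sign = (z_rep > 0) == (z_orig > 0)
        if abs(z_rep) > z_crit and same_sign:
            successes += 1

    return {
        "power": power,
        "n_published": len(published_effects),
        "mean_published_d": mean(published_effects),
        "replication_rate": successes / len(published_effects),
    }


if __name__ == "__main__":
    for key, value in simulate().items():
        print(f"{key}: {value:.3f}")
```

In this sketch the replication rate tracks the power of the original design, and the average published effect size is well above the true effect because only the lucky, inflated estimates pass the significance filter. No hidden moderators or design differences are needed to produce a low replication rate.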
The listed explanations are
(1) original articles making use of questionable research practices that result in Type I errors
This explanation conflates two problems. QRPs are used to get significance when power to do so is low, but we do not know whether the population effect size is zero (the original result is a type-I error) or above zero (the replication failure is a type-II error).
(2) original research’s pursuit of counterintuitive findings that may have lower a priori probabilities and thus poor chances at replication
This explanation assumes that there are a lot of type-I errors, but we don’t really know whether the population effect size is zero or not. So, this is not a separate explanation, but rather an explanation of why we might have many type-I errors, assuming that we do have many type-I errors, which we do not know.
(3) the presence of unexamined moderators that produce differences between original and replication research (Dijksterhuis, 2014; Simons et al., 2017),
This citation ignores that empirical tests of this hypothesis have failed to provide evidence for it (van Bavel et al., 2016).
(4) specific design choices in original or replication research that produce different conclusions (Bouwmeester et al., 2017; Luttrell et al., 2017; Noah et al., 2018).
This argument is not different from (3). Replication failures are attributed to moderating factors that are always possible because exact replications are impossible.
To date, discussions of possible explanations for poor replication have generally been presented as distinct accounts for poor replication, with little attempt being made to organize them into a coherent conceptual framework.
This claim ignores my detailed discussion of the various explanations including some not discussed by the authors (Schooler decline effect; Fiedler, regression to the mean; Schimmack, 2020).
The selection of journals is questionable. Psychological Science is not a general (meta)-psychological journal. Instead, there are two journals, The Journal of General Psychology and Meta-Psychology, that contain relevant articles.
The authors then introduce Cook and Campbell’s typology of validity and try to relate it to accounts of replication failures based on some work by Fabrigar et al. (2020). This attempt is flawed because validity is a broader construct than replicability or reliability. Measures can be reliable and correlations can be replicable even if the conclusions drawn from these findings are invalid. This is Intro Psych level stuff.
Statistical conclusion validity is concerned with the question of “whether or not two or more variables are related.” This is of course nothing else than the distinction between true and false conclusions based on significant or non-significant results. As noted above, even statistical conclusion validity is not directly related to replication failures because replication failures do not tell us whether the population effect size is zero or not. Yet, we might argue that there is a risk of false positive conclusions when statistical significance is achieved with QRPs and these results do not replicate. So, in some sense statistical conclusion validity is tied to the replication crisis in experimental social psychology.
Internal validity is about the problem of inferring causality from correlations. This issue has nothing to do with the replication crisis because replication failures can occur in experiments and correlational studies. The only indirect link to internal validity is that experimental social psychology prided itself on the use of between-subject experiments to maximize internal validity and minimize demand effects, but often used ineffective manipulations (priming) that required QRPs to get significance, especially in the tiny samples that were used because experiments are more time-consuming and labor-intensive. In contrast, survey studies often are more replicable because they have larger samples. But the key point remains: it would be absurd to explain replication failures directly as a function of low internal validity.
Construct validity is falsely described as “the degree to which the operationalizations used in the research effectively capture their intended constructs.” The problem here is the term operationalization. Once a construct is operationalized with some procedure, it is defined by the procedure (intelligence is what the IQ test measures) and there is no way to challenge the validity of the construct. In contrast, measurement implies that constructs exist independent of one specific procedure and it is possible to examine how well a measure reflects variation in the construct (Cronbach & Meehl, 1955). That said, there is no relationship between construct validity and replicability because systematic measurement error can produce spurious correlations between measures in correlational studies that are highly replicable (e.g., socially desirable responding). In experiments, systematic measurement error will attenuate effect sizes, but it will do so equally in original studies and replication studies. Thus, low construct validity also provides no explanation for replication failures.
External validity is defined as “the degree to which an effect generalizes to different populations and contexts.” This validation criterion is also only slightly related to replication failures, namely when there are concerns about contextual sensitivity or hidden moderators. A replication study in a different population or context might fail because the population effect size varies across populations or contexts. While this is possible, there is little evidence that contextual sensitivity is a major factor.
In short, it is a red herring in explanations for replication failures or the replication crisis to talk about validity. Replicability is necessary but not sufficient for good science.
It is therefore not surprising that the authors found most discussions of replication failures focus on statistical conclusion validity. Any other finding would make no sense. It is just not clear why we needed a text analysis to reveal this.
However, the authors seem to be unable to realize that the other types of validity are not related to replication failures when they write “What does this study add? Identifies that statistical conclusion validity is over-emphasized in replication analysis.”
Over-emphasized??? This is an absurd conclusion based on a failure to make a clear distinction between replicability/reliability and validity.