Dr. Ulrich Schimmack Blogs about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY

In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of the original study with the same sample size and significance criterion (Schimmack, 2017).

See Reference List at the end for peer-reviewed publications.

Mission Statement

The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research to problems in published articles.

To evaluate the credibility or “incredibility” of published research, my colleagues and I developed several statistical tools such as the Incredibility Test (Schimmack, 2012), the Test of Insufficient Variance (Schimmack, 2014), and z-curve (Version 1.0: Brunner & Schimmack, 2020; Version 2.0: Bartos & Schimmack, 2021).

I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a. untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). Bem’s article triggered a crisis of confidence in the credibility of psychology as a science.

Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking, Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).

Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I have also started providing information about the replicability of individual researchers, along with guidelines on how to evaluate their published findings (Schimmack, 2021).

Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrated how well they measure what they are supposed to measure. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).

If you are interested in the story of how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).

References

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22
https://doi.org/10.15626/MP.2018.874

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566
http://dx.doi.org/10.1037/a0029487

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. 
https://doi.org/10.1037/cap0000246

Reevaluating the Predictive Validity of the Race Implicit Association Test

Over the past two decades, social psychological research on prejudice has been dominated by the implicit cognition paradigm (Meissner, Grigutsch, Koranyi, Müller, & Rothermund, 2019). This paradigm is based on the assumption that many individuals of the majority group (e.g., White US Americans) have an automatic tendency to discriminate against members of a stigmatized minority group (e.g., African Americans). It is assumed that this tendency is difficult to control because many people are unaware of their prejudices.

The implicit cognition paradigm also assumes that biases vary across individuals of the majority group. The most widely used measure of individual differences in implicit biases is the race Implicit Association Test (rIAT; Greenwald, McGhee, & Schwartz, 1998). Like any other measure of individual differences, the race IAT has to meet psychometric criteria to be a useful measure of implicit bias. Unfortunately, the race IAT has been used in hundreds of studies before its psychometric properties were properly evaluated in a program of validation research (Schimmack, 2021a, 2021b).

Meta-analytic reviews of the literature suggest that the race IAT is not as useful for the study of prejudice as it was promised to be (Greenwald et al., 1998). For example, Meissner et al. (2019) concluded that “the predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1).

In response to criticism of the race IAT, Greenwald, Banaji, and Nosek (2015) argued that “statistically small effects of the implicit association test can have societally large effects” (p. 553). At the same time, Greenwald (1975) warned psychologists that they may be prejudiced against the null-hypothesis. To avoid this bias, he proposed that researchers should define a priori a range of effect sizes that are close enough to zero to decide in favor of the null-hypothesis. Unfortunately, Greenwald did not follow his own advice, and a clear criterion for a small but practically significant amount of predictive validity is lacking. This is a problem because estimates have decreased over time, from r = .39 (McConnell & Leibold, 2001), to r = .24 in 2009 (Greenwald, Poehlman, Uhlmann, & Banaji, 2009), to r = .148 in 2013 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013), to r = .097 in 2019 (Greenwald & Lai, 2020; Kurdi et al., 2019). Without a clear criterion value, it is not clear how this new estimate of predictive validity should be interpreted. Does it still provide evidence for a small, but practically significant effect, or does it provide evidence for the null-hypothesis (Greenwald, 1975)?

Measures are not Causes

To justify the interpretation of a correlation of r = .1 as small but important, it is important to revisit Greenwald et al.’s (2015) arguments for this claim. Greenwald et al. (2015) interpret this correlation as evidence for an effect of the race IAT on behavior. For example, they write “small effects can produce substantial discriminatory impact also by cumulating over repeated occurrences to the same person” (p. 558). The problem with this causal interpretation of a correlation between two measures is that scores on the race IAT have no influence on individuals’ behavior. This simple fact is illustrated in Figure 1. Figure 1 shows a causal model that assumes the race IAT reflects valid variance in prejudice and that prejudice influences actual behaviors (e.g., not voting for a Black political candidate). The model makes it clear that the correlation between scores on the race IAT (i.e., the iat box) and scores on a behavioral measure (i.e., the crit box) does not reflect a causal link (i.e., no path leads from the iat box to the crit box). Rather, the two measured variables are correlated because they both reflect the effect of a third variable. That is, prejudice influences race IAT scores and prejudice influences the variance in the criterion variable.

There is general consensus among social scientists that prejudice is a problem and that individual differences in prejudice have important consequences for individuals and society. The effect size of prejudice on a single behavior has not been clearly examined, but to the extent that race IAT scores are not perfectly valid measures of prejudice, the simple correlation of r = .1 is a lower limit of the effect size. Schimmack (2021) estimated that no more than 20% of the variance in race IAT scores is valid variance. With this validity coefficient, a correlation of r = .1 implies an effect of prejudice on actual behaviors of .1 / sqrt(.2) = .22.

Greenwald et al. (2015) correctly point out that effect sizes of this magnitude, r ~ .2, can have practical, real-world implications. The real question, however, is whether predictive validity of .1 justifies the use of the race IAT as a measure of prejudice. This question has to be evaluated in a comparison of predictive validity for the race IAT with other measures of prejudice. Thus, the real question is whether the race IAT has sufficient incremental predictive validity over other measures of prejudice. However, this question has been largely ignored in the debate about the utility of the race IAT (Greenwald & Lai, 2020; Greenwald et al., 2015; Oswald et al., 2013).

Kurdi et al. (2019) discuss incremental predictive validity, but this discussion is not limited to the race IAT and makes the mistake of correcting for random measurement error. As a result, the incremental predictive validity for IATs of b = .14 is a hypothetical estimate for IATs that are perfectly reliable. However, it is well-known that IATs are far from perfectly reliable. Thus, this estimate overestimates the incremental predictive validity. Using Kurdi et al.’s data and limiting the analysis to studies with the race IAT, I estimated incremental predictive validity to be b = .08, 95%CI = .04 to .12. It is difficult to argue that this is a practically significant amount of incremental predictive validity. At the very least, it does not justify the reliance on the race IAT as the only measure of prejudice or the claim that the race IAT is a superior measure of prejudice (Greenwald et al., 2009).

The meta-analytic estimate of b = .1 has to be interpreted in the context of evidence of substantial heterogeneity across studies (Kurdi et al., 2019). Kurdi et al. (2019) suggest that “it may be more appropriate to ask under what conditions the two [race IAT scores and criterion variables] are more or less highly correlated” (p. 575). However, little progress has been made in uncovering moderators of predictive validity. One possible explanation for this is that previous meta-analyses may have overlooked one important source of variation in effect sizes, namely publication bias. Traditional meta-analyses may be unable to reveal publication bias because they include many articles and outcome measures that did not focus on predictive validity. For example, Kurdi’s meta-analysis included a study by Luo, Li, Ma, Zhang, Rao, and Han (2015). The main focus of this study was to examine the potential moderating influence of oxytocin on neurological responses to pain expressions of Asian and White faces. Like many neurological studies, the sample size was small (N = 32), but the study reported 16 brain measures. For the meta-analysis, correlations were computed across N = 16 participants separately for two experimental conditions. Thus, this study provided as many effect sizes as it had participants. Evidently, power to obtain a significant result with N = 16 and r = .1 is extremely low, and adding these 32 effect sizes to the meta-analysis merely introduced noise. This may undermine the validity of meta-analytic results (Sharpe, 1997). To address this concern, I conducted a new meta-analysis that differs from the traditional approach. Rather than coding as many effects from as many studies as possible, I included only the focal hypothesis test from studies that aimed to investigate predictive validity. I call this a focused meta-analysis.
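
To illustrate just how low this power is, here is a minimal sketch of the power calculation using the Fisher-z approximation (my own calculation, not part of the original coding):

```python
# Sketch: approximate power for r = .1 with N = 16, using the Fisher-z approximation.
from scipy.stats import norm
import math

r, n, alpha = 0.10, 16, 0.05
fz = math.atanh(r)                     # Fisher-z transformed effect size
se = 1 / math.sqrt(n - 3)              # sampling error of the Fisher-z correlation
power = norm.sf(norm.isf(alpha / 2) - fz / se)   # power in the predicted direction
print(round(power, 3))                 # ~0.055, barely above the 5% chance level
```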

Focused Meta-Analysis of Predictive Validity

Coding of Studies

I relied on Kurdi et al.’s meta-analysis to find articles. I selected only published articles that used the race IAT (k = 96). The main purpose of including unpublished studies is often to correct for publication bias (Kurdi et al., 2019). However, it is unlikely that only 14 (8%) of the studies that were conducted remained unpublished. Thus, the unpublished studies are not representative and may distort effect size estimates.

Coding of articles in terms of outcome measures that reflect discrimination yielded 60 studies in 45 articles. I examined whether this selection of studies influenced the results by limiting a meta-analysis with Kurdi et al.’s coding to these 60 studies. The weighted average effect size was larger than the reported effect size, a = .167, se = .022, 95%CI = .121 to .212. Thus, Kurdi et al.’s inclusion of a wide range of studies with questionable criterion variables diluted the effect size estimate. However, there remained substantial variability around this effect size estimate using Kurdi et al.’s data, I2 = 55.43%.

Results

The focused coding produced one effect size per study. It is therefore not necessary to model a nested structure of effect sizes, and I used the widely used metafor package to analyze the data (Viechtbauer, 2010). The intercept-only model produced an estimate similar to the results for Kurdi et al.’s coding scheme, a = .201, se = .020, 95%CI = .171 to .249. Thus, focused coding seems to produce a similar effect size estimate to traditional coding. There was also a similar amount of heterogeneity in the effect sizes, I2 = 50.80%.

However, results for publication bias differed. Whereas Kurdi et al.’s coding shows no evidence of publication bias, focused coding produced a significant relationship between sampling error and effect sizes, b = 1.83, se = .41, z = 4.54, 95%CI = 1.03 to 2.64. The intercept was no longer significant, a = .014, se = .0462, z = 0.31, 95%CI = -.077 to .105. This would imply that the race IAT has no incremental predictive validity. Adding sampling error as a predictor reduced heterogeneity from I2 = 50.80% to 37.71%. Thus, some portion of the heterogeneity is explained by publication bias.

Stanley (2017) recommends accepting the null-hypothesis when the intercept in the previous model is not significant. However, a better criterion is to compare this model to other models. The most widely used alternative model regresses effect sizes on the squared sampling error (Stanley, 2017). This model explained more of the heterogeneity in effect sizes, as reflected in a reduction of unexplained heterogeneity from 50.80% to 23.86%. The intercept for this model was significant, a = .113, se = .0232, z = 4.86, 95%CI = .067 to .158.
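
For readers who want to see the mechanics, here is a minimal sketch of the two bias-correction regressions (PET, with sampling error as the predictor, and the squared-sampling-error model commonly called PEESE) on simulated study-level data. The analyses reported here used the metafor package in R, so the weighted least-squares fit below is only a stand-in for that random-effects meta-regression:

```python
# Minimal sketch of the PET and PEESE bias-correction regressions (Stanley, 2017)
# on simulated Fisher-z effect sizes with selection for significance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n_studies, true_fz = 60, 0.10
N = rng.integers(30, 300, size=n_studies)       # hypothetical sample sizes
se = 1 / np.sqrt(N - 3)                         # sampling error of Fisher-z effects

# Simulate selection for significance: only significant results get "published".
fz = np.empty(n_studies)
for i in range(n_studies):
    while True:
        draw = rng.normal(true_fz, se[i])
        if draw / se[i] > 1.96:
            fz[i] = draw
            break

w = 1 / se**2                                   # inverse-variance weights

# PET: regress effect sizes on sampling error; the intercept estimates the
# effect size of an ideal study with SE = 0.
pet = sm.WLS(fz, sm.add_constant(se), weights=w).fit()

# PEESE: same logic, but with squared sampling error as the predictor.
peese = sm.WLS(fz, sm.add_constant(se**2), weights=w).fit()

print(pet.params)    # intercept and slope for the sampling-error model
print(peese.params)  # intercept and slope for the squared-sampling-error model
```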

Figure 2 shows the effect sizes as a function of sampling error and the regression lines for the three models.

Inspection of Figure 2 provides further evidence that the squared-SE model fits the data better: the red line (squared sampling error) tracks the observed effect sizes more closely than the blue line (sampling error). In particular, for large samples, PET underestimates effect sizes.

The significant relationship between sample size (sampling error) and effect sizes implies that large effects in small studies cannot be interpreted at face value. For example, the most highly cited study of predictive validity had only a sample size of N = 42 participants (McConnell & Leibold, 2001). The squared-sampling-error model predicts an effect size estimate of r = .30, which is close to the observed correlation of r = .39 in that study.

In sum, a focal meta-analysis replicates Kurdi et al.’s (2019) main finding that the average predictive validity of the race IAT is small, r ~ .1. However, the focal meta-analysis also produced a new finding. Whereas the initial meta-analysis suggested that effect sizes are highly variable, the new meta-analysis suggests that a large portion of this variability is explained by publication bias.

Moderator Analysis

I explored several potential moderator variables, namely (a) number of citations, (b) year of publication, (c) whether IAT effects were direct or moderator effects, (d) whether the correlation coefficient was reported or computed based on test statistics, and (e) whether the criterion was an actual behavior or an attitude measure. The only statistically significant result was a weaker correlation in studies that predicted a moderating effect of the race IAT, b = -.11, se = .05, z = 2.28, p = .032. However, the effect would not be significant after correction for multiple comparisons, and heterogeneity remained virtually unchanged, I2 = 27.15%.

During the coding of the studies, the article “Ironic effects of racial bias during interracial interactions” stood out because it reported a counter-intuitive result. In this study, Black confederates rated White participants with higher (pro-White) race IAT scores as friendlier. However, other studies find the opposite effect (e.g., McConnell & Leibold, 2001). If the ironic result was reported because it was statistically significant, it would be a selection effect that is not captured by the regression models and it would produce unexplained heterogeneity. I therefore also tested a model that excluded all negative effects. As bias is introduced by this selection, the model is not a test of publication bias, but it may be better able to correct for publication bias. The effect size estimate was very similar, a = .133, se = .017, 95%CI = .100 to .166. However, heterogeneity was reduced to 0%, suggesting that selection for significance fully explains heterogeneity in effect sizes.

In conclusion, the moderator analyses did not find any meaningful moderators, and heterogeneity was fully explained by publication bias, including the publication of counterintuitive findings that suggest less discrimination by individuals with more prejudice. The finding that publication bias explains most of the variance is extremely important because Kurdi et al. (2019) suggested that heterogeneity is large and meaningful, which would imply that higher predictive validity could be found in future studies. In contrast, the current results suggest that correlations greater than .2 in previous studies were largely due to selection for significance with small samples, which also explains unrealistically high correlations in neuroscience studies with the race IAT (cf. Schimmack, 2021b).

Predictive Validity of Self-Ratings

The predictive validity of self-ratings is important for several reasons. First, it provides a comparison standard for the predictive validity of the race IAT. For example, Greenwald et al. (2009) emphasized that predictive validity for the race IAT was higher than for self-reports. However, Kurdi et al.’s (2019) meta-analysis found the opposite. Another reason to examine the predictive validity of explicit measures is that implicit and explicit measures of racial attitudes are correlated with each other. Thus, it is important to establish the predictive validity of self-ratings to estimate the incremental predictive validity of the race IAT.

Figure 2 shows the results. The sampling-error model shows a non-zero effect size, but sampling error is large and the confidence interval includes zero, a = .121, se = .117, 95%CI = -.107 to .350. Effect sizes are also extremely heterogeneous, I2 = 62.37%. The intercept for the squared-sampling-error model is significant, a = .176, se = .071, 95%CI = .036 to .316, but the model does not explain more of the heterogeneity in effect sizes than the sampling-error model, I2 = 63.33%. To maintain comparability, I use the squared-sampling-error estimate. This confirms Kurdi et al.’s finding that self-ratings have slightly higher predictive validity, but the confidence intervals overlap. For any practical purpose, the predictive validity of the race IAT and of self-reports is similar. Repeating the moderator analyses that were conducted with the race IAT revealed no notable moderators.

Implicit-Explicit Correlations

Only 21 of the 60 studies reported information about the correlation between the race IAT and self-report measures. There was no indication of publication bias, and the effect size estimates of the three models converge on an estimate of r ~ .2 (Figure 3). Fortunately, this result can be compared with estimates from large internet studies (Axt, 2017) and a meta-analysis of implicit-explicit correlations (Hofmann et al., 2005). These estimates are a bit higher, r ~ .25. Thus, using an estimate of r = .2 is conservative for a test of the incremental predictive validity of the race IAT.

Incremental Predictive Validity

It is straightforward to estimate the incremental predictive validity of the race IAT and self-reports on the basis of the correlations between the race IAT, self-ratings, and criterion variables. However, it is a bit more difficult to provide confidence intervals around these estimates. I used a simulated dataset with missing values to reproduce the correlations and sampling error of the meta-analysis. I then regressed the criterion on the implicit and explicit variables. The incremental predictive validity for the race IAT was b = .07, se = .02, 95%CI = .03 to .12. This finding implies that the race IAT on average explains less than 1% of unique variance in prejudiced behavior. The incremental predictive validity of the explicit measure was b = .165, se = .03, 95%CI = .11 to .23. This finding suggests that explicit measures explain between 1 and 4 percent of the variance in prejudiced behaviors.
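
The point estimates can be approximated directly from the meta-analytic correlations with standard least-squares algebra. The sketch below uses rounded inputs taken from the estimates above (IAT-criterion r ≈ .11, explicit-criterion r ≈ .18, IAT-explicit r ≈ .20); the small differences from the reported b = .07 and b = .165 presumably reflect the simulated-data approach with missing values.

```python
# Sketch: standardized regression weights implied by the meta-analytic correlations.
import numpy as np

r_iat_crit, r_exp_crit, r_iat_exp = 0.11, 0.18, 0.20   # rounded meta-analytic inputs

R = np.array([[1.0, r_iat_exp],          # correlations among the two predictors
              [r_iat_exp, 1.0]])
r_y = np.array([r_iat_crit, r_exp_crit]) # predictor-criterion correlations

betas = np.linalg.solve(R, r_y)
print(betas)   # ~[.08, .16]: incremental validity of the race IAT and of self-reports
```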

Assuming that there is no shared method variance between implicit and explicit measures and criterion variables and that implicit and explicit measures reflect a common construct, prejudice, it is possible to fit a latent variable model to the correlations among the three indicators of prejudice (Schimmack, 2021). Figure 4 shows the model and the parameter estimates.

According to this model, prejudice has a moderate effect on behavior, b = .307, se = .043. This is consistent with general findings about effects of personality traits on behavior (Epstein, 1973; Funder & Ozer, 1983). The loading of the explicit variable on the prejudice factor implies that .582^2 = 34% of the variance in self-ratings of prejudice is valid variance. The loading of the implicit variable on the prejudice factor implies that .353^2 = 12% of the variance in race IAT scores is valid variance. Notably, similar estimates were obtained with structural equation models of data that are not included in this meta-analysis (Schimmack, 2021). Using data from Cunningham et al. (2001), I estimated .43^2 = 18% valid variance. Using Bar-Anan and Vianello (2018), I estimated .44^2 = 19% valid variance. Using data from Axt, I found .44^2 = 19% valid variance, but 8% of the variance could be attributed to group differences between African American and White participants. Thus, the present meta-analytic results are consistent with the conclusion that no more than 20% of the variance in race IAT scores reflects actual prejudice that can influence behavior.
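
With only three indicators, the loadings of such a single-factor model are identified in closed form by the triad rule, so the reported estimates can be approximately reproduced from the meta-analytic correlations alone. This is a sketch with rounded inputs, not the actual model fit:

```python
# Sketch: closed-form loadings of a one-factor (prejudice) model from three correlations.
import math

r_iat_exp, r_iat_crit, r_exp_crit = 0.20, 0.11, 0.18   # rounded meta-analytic inputs

l_iat  = math.sqrt(r_iat_exp * r_iat_crit / r_exp_crit)   # ~.35, so ~12% valid variance
l_exp  = math.sqrt(r_iat_exp * r_exp_crit / r_iat_crit)   # ~.57, so ~33% valid variance
l_crit = math.sqrt(r_iat_crit * r_exp_crit / r_iat_exp)   # ~.31, effect of prejudice on behavior

print(round(l_iat**2, 2), round(l_exp**2, 2), round(l_crit, 2))
```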

In sum, incremental predictive validity of the race IAT is low for two reasons. First, prejudice has only modest effects on actual behavior in a specific situation. Second, only a small portion of the variance in race IAT scores is valid.

Discussion

In the 1990s, social psychologists embraced the idea that behavior is often influenced by processes that occur without conscious awareness. This assumption triggered the implicit revolution (Greenwald & Banaji, 2017). The implicit paradigm provided a simple explanation for low correlations between self-ratings of prejudice and implicit measures of prejudice, r ~ .2. Accordingly, many people are not aware of how prejudiced their unconscious is. The Implicit Association Test seemed to support this view because participants showed more prejudice on the IAT than on self-report measures. The first studies of predictive validity also seemed to support this new model of prejudice (McConnell & Leibold, 2001), and the first meta-analysis suggested that implicit bias has a stronger influence on behavior than self-reported attitudes (Greenwald, Poehlman, Uhlmann, & Banaji, 2009, p. 17).

However, the following decade produced many findings that require a reevaluation of the evidence. Greenwald et al. (2009) published the largest test (N = 1057) of predictive validity. This study examined the ability of the race IAT to predict racial bias in the 2008 US presidential election. Although the race IAT was correlated with voting for McCain versus Obama, incremental predictive validity was close to zero and no longer significant when explicit measures were included in the regression model. Then subsequent meta-analyses produced lower estimates of predictive validity and it is no longer clear that predictive validity, especially incremental predictive validity, is high enough to reject the null-hypothesis. Although incremental predictive validity may vary across conditions, no conditions have been identified that show practically significant incremental predictive validity. Unfortunately, IAT proponents continue to make misleading statements based on single studies with small samples. For example, Kurdi et al. claimed that “effect sizes tend to be relatively large in studies on physician–patient interactions” (p. 583). However, this claim was based on a study with just 15 physicians, which makes it impossible to obtain precise effect size estimates about implicit bias effects for physicians.

Beyond Nil-Hypothesis Testing

Just like psychology in general, meta-analyses also suffer from the confusion of nil-hypothesis testing and null-hypothesis testing. The nil-hypothesis is the hypothesis that an effect size is exactly zero. Many methodologists have pointed out that it is rather silly to take the nil-hypothesis at face value because the true effect size is rarely zero (Cohen, 1994). The more important question is whether an effect size is sufficiently different from zero to be theoretically and practically meaningful. As pointed out by Greenwald (1975), effect size estimation has to be complemented with theoretical predictions about effect sizes. However, research on predictive validity of the race IAT lacks clear criteria to evaluate effect size estimates.

As noted in the introduction, there is agreement about the practical importance of statistically small effects for the prediction of discrimination and other prejudiced behaviors. The contentious question is whether the race IAT is a useful measure of dispositions to act in prejudiced ways. Viewed from this perspective, the focus on the race IAT is myopic. The real challenge is to develop and validate measures of prejudice. IAT proponents have often dismissed self-reports as invalid, but the actual evidence shows that self-reports have some validity that is at least equal to the validity of the race IAT. Moreover, even distinct self-report measures like the feeling thermometer and the symbolic racism scale have incremental predictive validity. Thus, prejudice researchers should use a multi-method approach. At present it is not clear that the race IAT can improve the measurement of prejudice (Greenwald et al., 2009; Schimmack, 2021a).

Methodological Implications

This article introduced a new type of meta-analysis. Rather than trying to find as many vaguely related studies and code as many outcomes as possible, a focused meta-analysis is limited to the main test of the key hypothesis. This approach has several advantages. First, the classic approach creates a large amount of heterogeneity that is unique to a few studies. This noise makes it harder to find real moderators. Second, the inclusion of vaguely related studies may dilute effect sizes. Third, the inclusion of non-focal studies may mask evidence of publication bias that is present in virtually all literatures. Finally, focused meta-analyses are much easier to do and can produce results much faster than the laborious meta-analyses that psychologists are used to. Even when classic meta-analyses exist, they often ignore publication bias. Thus, an important task for the future is to complement existing meta-analyses with focused meta-analyses to ensure that published effect size estimates are not diluted by irrelevant studies and not inflated by publication bias.

Prejudice Interventions

Enthusiasm about implicit biases has led to interventions that aim to reduce implicit biases. This focus on implicit biases in the real world needs to be reevaluated. First, there is no evidence that prejudice typically operates outside of awareness (Schimmack, 2021a). Second, individual differences in prejudice have only a modest impact on actual behaviors and are difficult to change. Not surprisingly, interventions that focus on implicit bias are not very effective. Rather than focusing on changing individuals’ dispositions, interventions may be more effective by changing situations. In this regard, the focus on internal factors is rather different from the general focus in social psychology on situational factors (Funder & Ozer, 1983). In recent years, it has become apparent that prejudice is often systemic. For example, police training may have a much stronger influence on racial disparities in fatal use of force than individual differences in the prejudice of individual officers (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021).

Conclusion

The present meta-analysis of the race IAT provides further support for Meissner et al.’s (2019) conclusion that IATs’ “predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1). The present meta-analysis provides a quantitative estimate of b = .07. Although researchers can disagree about the importance of small effect sizes, I agree with Meissner et al. that the gains from adding a race IAT to the measurement of prejudice are negligible. Rather than looking for specific contexts in which the race IAT has higher predictive validity, researchers should use a multi-method approach to measure prejudice. The race IAT may be included to further explore its validity, but there is no reason to rely on the race IAT as the single most important measure of individual differences in prejudice.

References

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107–112.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., et al. (2019). Relationship between the implicit association test and intergroup behavior: a meta-analysis. American Psychologist. 74, 569–586. doi: 10.1037/amp0000364

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://www.jstatsoft.org/v036/i03.

Failure to Accept the Null-Hypothesis: A Case Study

The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s framework, the null-hypothesis is typically specified as the absence of an effect in either direction. That is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the result, or of even more extreme deviations, is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inferences can be drawn from the finding because there are two explanations for it: either the null-hypothesis is true, or it is false and the result is a false negative. The probability of false negative results is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.

Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).

“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).

The reason is simply that it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show that an effect size equals any other precise value (e.g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g., we expected the effect size to be in the range between r = .3 and r = .5, or we expected the correlation to be larger than r = .1).

Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) requires N = 1,035 participants. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypotheses that predict small effects. A solution to this might be to focus on hypotheses that predict large effect sizes. However, to show that a population correlation of r = .4 is greater than r = .3 still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero, N = 138. In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are smaller than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
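
A sketch of the Fisher-z approximation that presumably underlies these sample size figures (an assumption; rounding can shift the results by a participant or two):

```python
# Sketch: N needed to show that a correlation exceeds a minimum value (Fisher-z approximation).
from scipy.stats import norm
import math

def n_required(r_expected, r_min, alpha=0.05, power=0.95, one_sided=True):
    fz_diff = math.atanh(r_expected) - math.atanh(r_min)   # distance on the Fisher-z scale
    z_alpha = norm.isf(alpha if one_sided else alpha / 2)
    z_beta = norm.isf(1 - power)
    return math.ceil(((z_alpha + z_beta) / fz_diff) ** 2 + 3)

print(n_required(0.2, 0.1))                     # ~1,035 (one-sided), as in the text
print(n_required(0.4, 0.3))                     # ~833
print(n_required(0.3, 0.0, one_sided=False))    # ~138, r = .3 against zero
```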

However, occasionally studies do have large samples, and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify a minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT – criterion correlation was r = .236. The explicit – criterion correlation was r = .186, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behavior.

In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).

A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would be to simply add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias, as measured with the race IAT, contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .0010). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affect Misattribution Procedure) explained incremental variance in addition to political orientation. In this model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance for the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most important, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.

Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.

They then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism” (p. 247).

To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.

Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to test the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.

Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypothesis. Rather than reporting a straightforward result, they present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.

They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.

The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.

The deceptive analysis of these data has led to many citations claiming that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot,” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49).

Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).

Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi, with Greenwald as co-author, provides new evidence about the typical amount of incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity for the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.’s (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely (p < .05) that it is greater than .12.
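
For completeness, the same two-predictor regression algebra used earlier approximately reproduces this estimate from the three reported correlations (a sketch, not the analysis of the raw data):

```python
# Sketch: incremental predictive validity implied by the new meta-analytic correlations.
import numpy as np

R = np.array([[1.0, 0.139],         # implicit-explicit correlation
              [0.139, 1.0]])
r_y = np.array([0.100, 0.127])      # implicit-criterion and explicit-criterion correlations

print(np.linalg.solve(R, r_y))      # ~[.084, .115]; the first value is the reported b = .084
```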

Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et al., 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from the causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.

It is ironic to illustrate the prejudice against falsification with Greenwald’s own research, but it also shows that one-sided testing of theories that avoids failures is not merely a result of inadequate training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once researchers have made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a not-guilty verdict. To make matters worse, science does not really have an adversarial system: no defense lawyer stands up for the defendant (i.e., the null-hypothesis), and no evidence is presented in the defendant’s favor.

Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just like we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologists to become meta-psychologists. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.

Incidental Anchoring Bites the Dust

Update: 6/10/21

After I posted this post, I learned about a published meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).

Introduction

“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”

According to anchoring theory, information about the most expensive car can lead to higher estimates for the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Tversky & Kahneman, 1974).

A more controversial claim is that anchoring effects occur even when the numbers are unrelated to the question and presented incidentally (Critcher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to guess how likely it was that the player would sack the quarterback in the next game. The player’s jersey number was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher number makes people give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this theme are studies that present numbers so briefly on a computer screen that most participants do not actually see them. This is called subliminal priming. Allegedly, subliminal priming also produces anchoring effects (Mussweiler & Englich, 2005).

Since 2011, many psychologists have become skeptical about whether statistically significant results in published articles can be trusted. The reason is that researchers only published results that supported their theoretical claims, even when the claims were outlandish. For example, significant results also suggested that extraverts can foresee where pornographic images will be displayed on a computer screen even before the computer randomly selects the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.

There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independent of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.

Results

The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analyses is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into Fisher-z transformed correlations and sampling errors, 1/sqrt(N – 3), based on the sample sizes of the studies.
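
The conversion step can be sketched as follows (my reading of the procedure; the exact handling of one- versus two-sided tests may differ in details):

```python
# Sketch: convert a study's focal two-sided p-value into a Fisher-z effect size.
from scipy.stats import norm
import math

def p_to_fz(p_two_sided, n):
    z = norm.isf(p_two_sided / 2)      # z-statistic implied by the p-value
    se = 1 / math.sqrt(n - 3)          # sampling error of a Fisher-z correlation
    return z * se, se                  # effect size (Fisher-z units) and its sampling error

print(p_to_fz(0.03, 80))               # e.g., p = .03 with N = 80 gives fz ~ .25, se ~ .11
```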

Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.

Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green lines diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.

However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics reveals publication bias when z-statistics cluster just above the significance criterion, z = 1.96.

Figure 2 shows a big pile of just-significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent. While the published articles have a 73% success rate (the observed discovery rate), the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% and has an upper limit of 26%. Thus, researchers who reported significant results did so based on studies with very low power, and they needed luck or questionable research practices to get significant results.

A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.
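
For readers who want to see where this number comes from: z-curve's false discovery risk is, to my understanding, Soric's (1989) maximum false discovery rate applied to the expected discovery rate, as sketched below; the reported 76% implies an unrounded EDR slightly above 6%.

```python
# Sketch: Soric's maximum false discovery rate as a function of the expected discovery rate (EDR).
def soric_fdr(edr, alpha=0.05):
    return (1 / edr - 1) * alpha / (1 - alpha)

print(round(soric_fdr(0.065), 2))   # ~0.76, matching the reported false discovery risk
print(round(soric_fdr(0.06), 2))    # the rounded 6% EDR would give ~0.82
```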

One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04 that translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study is still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player’s number) or anchors presented outside of awareness.

Conclusion

There is clear evidence that published results on incidental anchoring cannot be trusted at face value. Consistent with research practices in general, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and meta-analyses suggest that incidental anchoring effects are either very small or zero. Thus, there is currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.

Why Are Ease of Retrieval Effects so Hard to Replicate?

Abstract

Social psychology suffers from a replication crisis because publication bias undermines the evidential value of published significant results. Meta-analyses that do not correct for publication bias are biased and cannot be used to estimate effect sizes. Here I show that a meta-analysis of the ease-of-retrieval effect (Weingarten & Hutchinson, 2018) did not fully correct for publication bias and that 200 significant results for the ease-of-retrieval effect can be fully explained by publication bias. This conclusion is consistent with the results of the only registered replication study of ease of retrieval (Groncki et al., 2021). As a result, there is no empirical support for the ease-of-retrieval effect. Implications for the credibility of social psychology are discussed.

Introduction

Until 2011, social psychology appeared to have made tremendous progress. Daniel Kahneman (2011) reviewed many of the astonishing findings in his book “Thinking, Fast and Slow.” His book used Schwarz et al.’s (1991) ease-of-retrieval research as an example of rigorous research on social judgments.

The ease-of-retrieval paradigm is simple. Participants are assigned to two groups. In one group, they are asked to recall a small number of examples from memory. The number is chosen to make this easy to do. In the other group, participants are asked to recall a larger number of examples. The number is chosen so that it is hard to come up with the requested number of examples. This task is used to elicit a feeling of ease or difficulty. Hundreds of studies have used this paradigm to study the influence of ease of retrieval on a variety of judgments.

In the classic studies that introduced the paradigm, participants were asked to retrieve a few or many examples of assertiveness behaviors before answering a question about their assertiveness. Three studies suggested that participants based their personality judgments on the ease of retrieval.

However, this straightforward finding is not always found. Kahneman points out that participants sometimes do not rely on the ease of retrieval. Paradoxically, they sometimes rely on the number of examples they retrieved even though the number was given by the experimenter. What made ease-of-retrieval a strong theory was that ease of retrieval researchers seemed to be able to predict the conditions that made people use ease as information and the conditions when they would use other information. “The proof that you truly understand a pattern of behavior is that you know how to reverse it” (Kahneman, 2011).

This success story had one problem. It was not true. In 2011, it became apparent that social psychologists used questionable research practices to produce significant results. Thus, rather than making amazing predictions about the outcome of studies, they searched for statistical significance and then claimed that they had predicted these effects (John, Loewenstein, & Prelec, 2012; Kerr, 1998). Since 2011, it has become clear that only a small percentage of results in social psychology can be replicated without questionable practices (Open Science Collaboration, 2015).

I had my doubts about the ease-of-retrieval literature because I had heard rumors that researchers were unable to replicate these effects, but it was common not to publish these replication failures. My suspicions appeared to be confirmed when John Krosnick gave a talk about a project that replicated 12 experiments in a large nationally representative sample. All but one experiment replicated successfully. The exception was the ease-of-retrieval study, a direct replication of Schwarz et al.’s (1991) assertiveness studies. These results were published several years later (Yeager et al., 2019).

I was surprised when Weingarten and Hutchinson (2018) published a detailed and comprehensive meta-analysis of published and unpublished ease-of-retrieval studies and found evidence for a moderate effect size (d ~ .4) even after correcting for publication bias. This conclusion, based on many small studies, seemed inconsistent with the replication failure in the large nationally representative sample (Yeager et al., 2019). Moreover, the first pre-registered direct replication of Schwarz et al. (1991) also produced a replication failure (Groncki et al., 2021). One possible explanation for the discrepancy between the meta-analytic results and the replication results could be that the meta-analysis did not fully correct for publication bias. To test this hypothesis, I used the openly shared data to examine the robustness of the effect size estimate. I also conducted a new meta-analysis that included studies published after 2014, using a different coding scheme that codes only one focal hypothesis test per study. The results show that the effect size estimate in Weingarten and Hutchinson’s (2018) meta-analysis is not robust and depends heavily on outliers. I also find that their coding scheme attenuates the detection of bias, which leads to inflated effect size estimates. The new meta-analysis shows an effect size estimate close to zero. It also shows that heterogeneity is fully explained by publication bias.

Reproducing the Original Meta-Analysis

All effect sizes are Fisher-z transformed correlation coefficients. The predictor is the standard error; 1/sqrt(N – 3). Figure 1 reproduces the funnel plot in Weingarten and Hutchinson (2018), with the exception that sampling error is plotted on the x-axis and effect sizes are plotted on the y-axis.

Figure 1 also includes the predictions (regression lines) of three models. The first model is an unweighted average. This model assumes that there is no publication bias. The straight orange line shows that this model assumes an average effect size of z = .23 for all sample sizes. The second model assumes that there is publication bias and that bias increases in a linear fashion with sampling error. The slope of the blue regression line is significant and suggests that publication bias is present. The intercept of this model can be interpreted as the unbiased effect size estimate (Stanley, 2017). The intercept is z = .115 with a 95% confidence interval that ranges from .036 to .193. These results reproduce the results in Weingarten and Hutchinson (2018) closely, but not exactly, r = .104, 95%CI = .034 to .172. Simulation studies suggest that this effect size estimate underestimates the true effect size when the intercept is significantly different from zero (Stanley, 2017). In this case, it is recommended to use the variance (sampling error squared) as a model of publication bias. The red curve shows the predictions of this model. Most important, the intercept is now nearly at the same level as in the model without publication bias, z = .221, 95%CI = .174 to .267. Once more, these results closely reproduce the published results, r = .193, 95%CI = .153 to .232.

The problem with unweighted models is that data points from small studies are given the same weight as data points from large studies. In this particular case, small studies are given even more weight than larger studies because studies with extremely small sample sizes (N < 20) are outliers, and outliers exert disproportionate leverage in regression analysis. Inspection of the scatter plot shows that 7 studies with sample sizes less than 10 (5 per condition) have a strong influence on the regression line. As a result, all three regression lines in Figure 1 overestimate effect sizes for studies with more than 100 participants. Thus, the intercept overestimates the effect sizes for large studies, including Yeager et al.’s (2019) study with N = 1,323 participants. In short, the effect size estimate in the meta-analysis is strongly influenced by 7 data points that represent fewer than 100 participants.

A simple solution to this problem is to weight observations by sample size so that larger samples are given more weight. This is the default option in many meta-analysis programs, such as the metafor package in R (Viechtbauer, 2010). Thus, I reran the same analyses with observations weighted by sample size. Figure 2 shows the results; the size of the observations reflects their weights. The most important difference in the results is that the intercept for the model with a linear effect of sampling error is now practically zero and not statistically significant, z = .006, 95%CI = -.040 to .052. The confidence interval is narrow enough to infer that the typical effect size is close enough to zero to accept the null-hypothesis.
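
For readers who want to reproduce these models, the sketch below shows how the unweighted and weighted PET/PEESE regressions described above can be set up in R with the metafor package. The data frame and column names (z for Fisher-z effect sizes, n for sample sizes) are illustrative assumptions, not the original coding sheet.

library(metafor)

dat$sei <- 1 / sqrt(dat$n - 3)                     # sampling error of Fisher-z

# Unweighted PET: ordinary regression of effect size on sampling error
pet_unweighted <- lm(z ~ sei, data = dat)

# Weighted PET: inverse-variance weights (metafor's default), which are
# roughly proportional to sample size for Fisher-z effect sizes
pet_weighted <- rma(yi = z, vi = sei^2, mods = ~ sei, data = dat)

# PEESE: use the sampling variance instead of the standard error as predictor
peese_weighted <- rma(yi = z, vi = sei^2, mods = ~ I(sei^2), data = dat)

# The intercepts are the bias-corrected effect size estimates
coef(pet_unweighted)[1]; coef(pet_weighted)[1]; coef(peese_weighted)[1]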

Proponents of ease-of-retrieval will, however, not be satisfied with this answer. First, inspection of Figure 2 shows that the intercept is now strongly influenced by a few large samples. Moreover, the model does show heterogeneity in effect sizes, I2 = 33.38%, suggesting that at least some of the significant results were based on real effects.

Coding of Studies

Effect size meta-analysis evolved without serious consideration of publication bias. Although publication bias had been documented long before meta-analysis was invented (Sterling, 1959), it was often an afterthought rather than part of the meta-analytic model (Rosenthal, 1979). Without having to think about publication bias, it became common practice to code individual studies without a focus on the critical test that was used to publish a study. This practice obscures the influence of publication bias and may lead to an overestimation of the average effect size. To illustrate this, I am going to focus on the 7 data points in Figure 1 that were coded with sample sizes less than 10.

Six of the observations stem from an unpublished dissertation by Bares (2007) that was supervised by Norbert Schwarz. The dissertation was a study with children. The design had the main manipulation of ease of retrieval (few vs. many) as a between-subject factor. Additional factors were gender, age (kindergartners vs. second graders), and 5 content domains (books, shy, friendly, nice, mean). The key dependent variables were frequency estimates. The total sample size was 198, with 98 participants in the easy condition and 100 in the difficult condition. The hypothesis was that ease of retrieval would influence judgments independent of gender or content. However, rather than testing the overall main effect across all participants, the dissertation presents analyses separately for different ages and contents. This led to the coding of a study with a reasonable sample size of N = 198 as 20 effects with sample sizes of N = 7 to 9. Only six of these effects were included in the meta-analysis. Thus, the meta-analysis added 6 studies with non-significant results, when in fact only one study with a non-significant result was put in the file drawer. As a result, the meta-analysis no longer represents the amount of publication bias in the ease-of-retrieval literature. Adding these six effects to the meta-analysis makes the data look less biased and attenuates the regression of effect sizes on sampling error, which in turn leads to a higher intercept. Thus, traditional coding of effect sizes in meta-analyses can lead to inflated effect size estimates even in models that aim to correct for publication bias.

An Updated Meta-Analysis of Ease-of-Retrieval

Building on Weingarten and Hutchinson’s (2018) meta-analysis, I conducted a new meta-analysis that relied on the test statistics that were reported to test ease-of-retrieval effects. I only used published articles because the only reason to search for unpublished studies is to correct for publication bias, and Weingarten and Hutchinson’s meta-analysis showed that publication bias remains present even after a diligent attempt to obtain all data. I extended the time frame of the meta-analysis by searching for new publications since the last year included in Weingarten and Hutchinson’s meta-analysis (i.e., 2014). For each study, I identified the focal hypothesis test of the ease-of-retrieval effect. In some studies, this was a main effect; in other studies, it was a post-hoc test following an interaction effect. The exact p-values were converted into t-values, and t-values were converted into Fisher-z scores as effect sizes. Sampling error was based on the sample size of the study or the subgroup in which the ease-of-retrieval effect was predicted. For the sake of comparability, I again show unweighted and weighted results.
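
The coding scheme can be summarized in a few lines of R. The function below is a minimal sketch under the assumptions stated above (two-sided p-values of the focal tests, converted to t-values, then to correlations and Fisher-z scores, with sampling error based on N); the function name and the example values are mine.

p_to_fisher_z <- function(p, df, N) {
  t  <- qt(1 - p / 2, df)        # two-sided p-value back to a t-value
  r  <- t / sqrt(t^2 + df)       # t-value to a correlation coefficient
  fz <- atanh(r)                 # correlation to a Fisher-z score
  se <- 1 / sqrt(N - 3)          # sampling error of the Fisher-z score
  c(fz = fz, se = se)
}

p_to_fisher_z(p = .05, df = 48, N = 50)   # example: a just-significant t-test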

The effect size estimate for the random effects model that ignores publication bias is z = .340, 95%CI = .317 to .363. This would be a moderate effect size (d ~ .6). The model also shows a moderate amount of heterogeneity, I2 = 33.48%. Adding sampling error as a predictor dramatically changes the results. The effect size estimate is now practically zero, z = .020, and the 95%CI is small enough to conclude that any effect would be small, 95%CI = -.048 to .088. Moreover, publication bias fully explains the heterogeneity, I2 = 0.00%. Based on this finding, it is not recommended to use the variance as a predictor (Stanley, 2017). However, for the sake of comparison, Figure 3 also shows the results for this model. The red curve shows that the model makes similar predictions in the middle, but overestimates effect sizes for large samples and for small samples. Thus, the intercept is not a reasonable estimate of the average effect size, z = .183, 95%CI = .144 to .222. In conclusion, the new coding shows clearer evidence of publication bias, and even the unweighted analysis shows no evidence that the average effect size differs from zero.

Figure 4 shows that the weighted models produce very similar results to the unweighted results.

The key finding is that the intercept is not significantly different from zero, z = -.016, 95%CI = -.053 to .022. The upper bound of the 95%CI corresponds to an effect size of r = .022 or d = .04. Thus, the typical ease of retrieval effect is practically zero and there is no evidence of heterogeneity.

Individual Studies

Meta-analysis treats individual studies as interchangeable tests of a single hypothesis. This makes sense when all studies are more or less direct replications of the same experiment. However, meta-analyses in psychology often combine studies that vary in important details, such as the population (adults vs. children) and the dependent variables (frequency judgments vs. attitudes). Even if a meta-analysis showed a significant average effect size, it would remain unclear which particular conditions show the effect and which do not. This is typically examined in moderator analyses, but when publication bias is strong and effect sizes are dramatically inflated, moderator analyses have low power to detect signals in the noise.

In Figure 4, real moderators would produce systematic deviations from the blue regression line. As these residuals are small and strongly influenced by sampling error, finding a moderator is like looking for a needle in a haystack. To do so, it is useful to look for individual studies that produced more credible results than the average study. A new tool that can be used for this purpose is z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

Z-curve does not decompose p-values into separate effect size and sampling error components. Rather it converts p-values into z-scores and models the distribution of z-scores with a finite mixture model. The results provide complementary information about publication bias that does not rely on variation in sample sizes. As correlations between sampling error and effect sizes can be produced by other factors, z-curve provides a more direct test of publication bias.

Z-curve also provides information about the false positive risk in individual studies. If a literature has a low discovery rate (many studies produce non-significant results), the false discovery risk is high (Soric, 1989). Z-curve estimates the size of the file drawer and provides a corrected estimate of the expected discovery rate. To illustrate z-curve, I fitted z-curve to the ease-of-retrieval studies in the new meta-analysis (Figure 5).
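
The sketch below shows how such an analysis can be run with the zcurve R package (Bartos & Schimmack, 2021), assuming a vector p of two-sided p-values for the focal tests; the exact options of the package may differ across versions.

library(zcurve)

z   <- qnorm(1 - p / 2)   # convert two-sided p-values to z-scores
fit <- zcurve(z)          # fit the finite mixture model to the significant z-scores
summary(fit)              # expected replication rate (ERR) and expected discovery rate (EDR)
plot(fit)                 # the z-curve plot shown in Figure 5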

Visual inspection shows that most z-statistics are just above the criterion for statistical significance z = 1.96. This corresponds to the finding that most effect sizes are about 2 times the magnitude of sampling error, which produces a just significant result. The z-curve shows a rapid decline of z-statistics as z-values increase. The z-curve model uses the shape of this distribution to estimate the expected discovery rate; that is, the proportion of significant results that are observed if all tests that were conducted were available. The estimate of 8% implies that most ease-of-retrieval tests are extremely underpowered and can only produce significant results with the help of sampling error. Thus, most of the observed effect size estimates in Figure 4 reflect sampling error rather than any population effect sizes.

The expected discovery rate can be compared to the observed discovery rate to assess the amount of publication bias. The observed discovery rate is simply the percentage of significant results for the 128 studies. The observed discovery rate is 83% and would be even higher if marginally significant results, p < .10, z > 1.65, were counted as significant. Thus, the observed discovery rate is 10 times higher than the expected discovery rate. This shows massive publication bias.

The difference between the expected and observed discovery rate is also important for the assessment of the false positive risk. As Soric (1989) showed, the risk of false positives increases as the discovery rate decreases. The observed discovery rate of 83% implies that the false positive risk is very small (1%). Thus, readers of journals are given the illusion that ease-of-retrieval effects are robust and that researchers have a very good understanding of the conditions that can produce the effect; hence Kahneman’s praise of researchers’ ability to show the effect and to reverse it seemingly at will. The z-curve results show that this is an illusion because researchers only publish results when a study was successful. With an expected discovery rate of 8%, the false discovery risk is 61%. Thus, there is a high chance that studies with large samples will produce effect size estimates close to zero, which is consistent with the meta-analytic effect size estimates reported above.
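
Soric’s (1989) upper bound on the false discovery risk only requires the discovery rate and the alpha level. The short sketch below reproduces the numbers used here (1% for a discovery rate of 83%, 61% for a discovery rate of 8%); the function name is mine.

soric_fdr <- function(discovery_rate, alpha = .05) {
  (1 / discovery_rate - 1) * alpha / (1 - alpha)   # maximum false discovery risk
}

soric_fdr(.83)   # observed discovery rate of 83%  -> ~.01
soric_fdr(.08)   # expected discovery rate of  8%  -> ~.61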

One solution to reduce the false-positive risk is to lower the significance criterion (Benjamin et al., 2017). Z-curve can be fitted with different alpha-levels to examine the influence on the false positive risk. By setting alpha to .005, the false positive risk is below 5% (Figure 6).

This leaves 36 studies that may have produced a real effect. A true positive result does not mean that a direct replication study will produce a significant result. To estimate replicability, we can select only the studies with p < .005 (z > 2.8) and fit z-curve to these studies using the standard significance criterion of .05. The false discovery risk inches up a bit, but may be considered acceptable at 8%. However, the expected replication rate with the same sample sizes is only 47%. Thus, replication studies need to increase sample sizes to avoid false negative results.

Five of the studies with strong evidence are by Sanna and colleagues. This is noteworthy because Sanna retracted 8 articles, including an article with ease-of-retrieval effects under suspicion of fraud (Yong, 2012). It is therefore unlikely that these studies provide credible evidence for ease-of-retrieval effects.

An article with three studies reported consistently strong evidence (Ofir et al., 2008). All studies manipulated the ease of recall of products and found that recalling a few low priced items made participants rate a store as less expensive than recalling many low priced items. It seems simple enough to replicate this study to test the hypothesis that ease of retrieval effects influence judgments of stores. Ease of retrieval may have a stronger influence for these judgments because participants may have less chronically accessible and stable information to make these judgments. In contrast, assertiveness judgments may be harder to move because people have highly stable self-concepts that show very little situational variance (Eid & Diener, 2004; Anusic & Schimmack, 2016).

Another article that provided three studies examined willingness to pay for trips to England (Sinha & Naykankuppam, 2013). A major difference to other studies was that this study supplied participants with information about tourist destinations in England and after a delay used recall of this information to manipulate ease of retrieval. Once more, ease-of-retrieval may have had an effect in these studies because participants had little chronically accessible information to make willingness-to-pay judgments.

A third, three study article with strong evidence found that participants rated the quality of their memory for specific events (e.g., New Year’s Eve) worse when they were asked to recall many (vs. few) facts about the event (Echterhoff & Hirst, 2006). These results suggest that ease-of-retrieval is used for judgments about memory, but may not influence other judgments.

The focus on individual studies shows why moderator analyses in effect-size meta-analysis often produce non-significant results. Most of the moderators that can be coded are not relevant, whereas moderators that are relevant can be limited to a single article and are not coded.

The Original Paradigm

It is not clear why Schwarz et al. (1991) decided to manipulate personality ratings of assertiveness. A look into the personality literature suggests that these judgments are often made quickly and with high temporal stability. Thus, they seemed a challenging target to demonstrate the influence of situational effects.

It was also risky to conduct these studies with small sample sizes that require large effect sizes to produce significant results. Nevertheless, the first study with 36 participants produced an encouraging, marginally significant result, p = .07. Study 2 followed up on this result with a larger sample to boost power and did produce a significant result, F(1, 142) = 6.35, p = .01. However, observed power (70%) was still below the recommended level of 80%. Thus, the logical next step would have been to test the effect again with an even larger sample. Instead, the authors tested a moderator hypothesis in a smaller sample, which surprisingly produced a significant three-way interaction, F(1, 70) = 9.75, p < .01. Despite this strong interaction, the predicted ease-of-retrieval effects were not statistically significant because sample sizes were very small, assertive: t(18) = 1.55, p = .14; unassertive: t(18) = 1.91, p = .07.

It is unlikely that three underpowered studies would all produce supportive evidence (Schimmack, 2012), suggesting that the reported results were selected from a larger set of tests. This hypothesis can be tested with the Test of Insufficient Variance (TIVA), a bias test for small sets of studies (Renkewitz & Keiner, 2019; Schimmack, 2015). TIVA shows that the variation in p-values is less than expected, but the evidence is not conclusive. Nevertheless, even if the authors were just lucky, future studies would be expected to produce non-significant results unless sample sizes were increased considerably. However, most direct replication studies of the original design used equally small sample sizes, yet reported successful outcomes.
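
For readers who want to apply TIVA themselves, the sketch below shows the basic logic, assuming a vector of two-sided p-values: without selection, the z-scores of independent studies should vary with a variance of at least 1, and a variance well below 1 indicates bias. The p-values in the example are illustrative, not the original data.

tiva <- function(p) {
  z <- qnorm(1 - p / 2)                        # p-values to z-scores
  k <- length(z)
  v <- var(z)                                  # observed variance of the z-scores
  p_tiva <- pchisq(v * (k - 1), df = k - 1)    # left-tail test against variance = 1
  c(variance = v, p = p_tiva)
}

tiva(c(.07, .01, .14))   # illustrative example with three p-values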

Yahalom and Schul (2016) reported a successful replication in another small sample (N = 20), with an inflated effect size estimate, t(18) = 2.58, p < .05, d = 1.15. Rather than demonstrating the robustness of the effect, this strengthens the evidence that bias is present, TIVA p = .05. Another study in the same article found evidence for the effect again, but only when participants were instructed to hear some background noise and not when they were instructed to listen to background noise, t(25) = 2.99, p = .006. The bias test remained significant, TIVA p = .05. Kuehnen did not find the effect, but claimed an interaction with item-order for questions about ease-of-retrieval and assertiveness. A non-significant trend emerged when ease-of-retrieval questions were asked first, which was not reported, t(34) = 1.10, p = .28. The bias test remained significant, TIVA p = .08. More evidence from small samples comes from Caruso (2008). In Study 1a, 30 participants showed an ease-of-retrieval effect, F(1,56) = 6.91, p = .011. The bias test remained significant, TIVA p = .06. In Study 1b with more participants (N = 55), the effect was not significant, F(1, 110) = 1.05, p = .310. The bias test remained significant despite the non-significant result, TIVA p = .08. Tomala et al. (2007) added another just significant result with 79 participants, t(72) = 1.97, p = .05, which only strengthened the evidence of bias, TIVA p = .05. Yahalom and Schul (2013) also found a just significant effect with 130 students, t(124) = 2.54, which again strengthened the evidence of bias, TIVA p = .04. Their Study 2 reduced the number of participants to 40, yet reported a significant result, F(1,76) = 8.26, p = .005. Although this p-value nearly reached the .005 level, there is no theoretical explanation why this particular direct replication of the original finding should have produced a real effect; evidence for bias remained significant, TIVA p = .05. Study 3 reverted to a marginally significant result, t(114) = 1.92, p = .06, which only strengthened the evidence of bias, TIVA p = .02. Greifeneder and Bless (2007) manipulated cognitive load and found the predicted trend only in the low-load condition, t(76) = 1.36, p = .18. Evidence for bias remained unchanged, TIVA p = .02.

In conclusion, from 1991 to 2016 published studies appeared to replicate the original findings, but this evidence is not credible because there is evidence of publication bias. Not a single one of these studies produced a p-value below .005, which has been suggested as a significance level that keeps the type-I error rate at an acceptable level (Benjamin et al., 2017).

Even meta-analyses of these small studies that correct for bias are inconclusive because sampling error is large and effect size estimates are imprecise. The only way to provide strong and credible evidence is to conduct a transparent and ideally pre-registered replication study with a large sample. One study like this was published by Yeager et al. (2019). With N = 1,325 participants the study failed to show a significant effect, F(1, 1323) = 1.31, p = .25. Groncki et al. (2021) conducted the first pre-registered replication study with N = 659 participants. They also ended up with a non-significant result, F(1, 657) = 1.34, p = .25.

These replication failures are only surprising if the inflated observed discovery rate is used to predict the outcome of future studies. On that basis, we would expect an 80% probability of significant results, and an even higher probability given the larger sample sizes. However, when we take publication bias into account, the expected discovery rate is only 8%, and even large samples will not produce significant results if the true effect size is close to zero.

In conclusion, the clear evidence of bias and the replication failures in two large replication studies suggest that the original findings were only obtained with luck or with questionable research practices. However, naive interpretation of these results created a literature with over 200 published studies without a real effect. In this regard, ease of retrieval is akin to the ego-depletion literature that is now widely considered invalid (Inzlicht, Werner, Briskin, & Roberts, 2021).

Discussion

2011 was a watershed moment in the history of social psychology. It split social psychology into two camps. One camp denies that questionable research practices undermine the validity of published results and continues to rely on published studies as credible empirical evidence (Schwarz & Strack, 2016). The other camp assumes that most published results are false positives and trusts only new studies that are published following open science practices, with badges for sharing of materials and data and, ideally, pre-registration.

Meta-analyses can help to find a middle ground by examining carefully whether published results can be trusted, even if some publication bias is present. To do so, meta-analyses have to take publication bias seriously. Given the widespread use of questionable practices in social psychology, we have to assume that bias is present (Schimmack, 2020). Published meta-analyses that did not properly correct for publication bias can at best provide an upper limit for effect sizes; they cannot establish that an effect exists or that the effect size has practical significance.

Weingarten and Hutchinson (2018) tried to correct for publication bias by using the PET-PEESE approach (Stanley, 2017). This is currently the best bias-correction method, but it is by no means perfect (Hong & Reed, 2021; Stanley, 2017). Here I demonstrated one pitfall in the use of PET-PEESE. Coding of studies that does not match the bias in the original articles can obscure the amount of bias and lead to inflated effect size estimates, especially if the PET model is incorrectly rejected and the PEESE results are accepted at face value. As a result, the published effect size of r = .2 (d = .4) was dramatically inflated and new results suggest that the effect size is close to zero.

I also showed in a z-curve analysis that the false positive risk for published ease-of-retrieval studies is high because the expected discovery rate is low and the file drawer of unpublished studies is large. To reduce the false positive risk, I recommend adjusting the significance level to alpha = .005, which is consistent with other calls for more stringent criteria to claim discoveries (Benjamin et al., 2017). Based on this criterion, neither the original studies nor any direct replications of the original studies were significant. A few variations of the paradigm may have produced real effects, but pre-registered replication studies are needed to examine this question. For now, ease of retrieval is a theory without credible evidence.

For many social psychologists, these results are shocking and hard to believe. However, the results are by no means unique to the ease-of-retrieval literature. It has been estimated that only 25% to 50% of published results in social psychology can be replicated (Schimmack, 2020). Other large literatures such as implicit priming, ego-depletion, and facial feedback have also been questioned by rigorous meta-analyses and large replication studies.

For methodologists, the replication crisis in psychology is not a surprise. They have warned for decades that selection for significance renders significant results insignificant (Sterling, 1959) and that sample sizes are too low (Cohen, 1962). To avoid similar mistakes in the future, researchers should conduct continuous power analyses and bias tests. As demonstrated here for the assertiveness paradigm, bias tests ring the alarm bells from the start and continue to show bias. In the future, we should not wait 40 years before realizing that researchers are chasing an elusive phenomenon: sample sizes need to be increased or the research needs to stop. Amassing a literature of 200 studies with a median sample size of N = 53 and 8% power is a mistake that should not be repeated.

Social psychologists should be the least surprised that they fooled themselves into believing their results. After all, they have demonstrated with credible studies that confirmation bias has a strong influence on human information processing. They should therefore embrace open science, bias checks, and replication studies as necessary evils that minimize confirmation bias and make scientific progress possible.

References

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://www.jstatsoft.org/v036/i03

Aber bitte ohne Sanna

Abstract

Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who retracted several articles. I point out that non-retraction does not equal credibility and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not correct properly for the use of questionable research practices.

Introduction

When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne” (“But please, with whipped cream”). The song is about Germans’ love of desserts with whipped cream. So, when I saw articles by Sanna, I had to think about whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna’s fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna’s retracted articles was 8 (Retraction Watch, 2013).

WebOfScience shows 5 retraction notices for 67 articles, which means that 62 articles have not been retracted. The question is whether these articles can be trusted to provide valid scientific information. The answer to this question matters because Sanna’s articles are still being cited at a rate of over 100 citations per year.

Meta-Analysis of Ease of Retrieval

The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analyses because fabricated results can produce extremely large effect sizes that inflate meta-analytic effect size estimates. Here I report the results of my own investigation, which focuses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).

The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.

Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]

Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-7393.28.3.497
[Study 1 & 2]

Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]

Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784
[Study 3a]

Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-2699.9.3.173 [Study 3a, 3b]

Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]

When I examined Sanna’s results, I found that all 9 of these effect sizes were extremely large, with effect size estimates larger than one standard deviation. A logistic regression analysis that predicted authorship (with Sanna vs. without Sanna) from effect size showed that the large effect sizes in Sanna’s articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).
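
A minimal sketch of this kind of analysis is shown below, assuming a data frame with one row per effect size, a Fisher-z effect size column z, and an indicator sanna (1 = article co-authored by Sanna, 0 = other articles); the column names are mine, not the original coding.

fit <- glm(sanna ~ z, family = binomial, data = dat)
summary(fit)   # a positive slope means larger effect sizes predict Sanna's authorship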

These results show that Sanna’s effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of-retrieval paradigm, it is possible that these non-retracted articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analyses.

These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects under the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies, and it remains unknown what role ease of retrieval plays under the conditions specified in Sanna’s articles.

Discussion

This blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna’s fraudulent practices. However, he seemed to lack any concern that the results might be too good to be true. After all, he encountered replication failures in his own lab:

“Of course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18/21).

So why was he not suspicious when Sanna only produced successful results? I wondered whether Schwarz, with the help of hindsight bias, now had some doubts about these studies. After all, a decade or more later, we know that Sanna committed fraud in some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna’s reported effect sizes (Weingarten & Hutchinson, 2018).

Hi Norbert, 
   thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1).  Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure?
Best, Uli (email)

His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.

Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation.  I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack,  JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.  

This response is symptomatic of social psychologists’ response to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem would rather be remembered as a loony para-psychologist than as a real scientist (Bem, 2021).

The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they cling to their CVs and H-Indices to protect their self-esteem, they further erode trust in psychology as a science and force junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature that distinguishes science from ideology is the ability to correct itself. A science that shows no signs of self-correction is a para-science and not a real science. Thus, social psychology is currently a para-science (i.e., “a broad category of academic disciplines that are outside the scope of scientific study,” Wikipedia).

The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.

Nobody should believe them.

Predicting Replication Outcomes: Prediction Markets vs. R-Index

Conclusion

Gordon et al. (2021) conducted a meta-analysis of 103 studies that were included in prediction markets to forecast the outcome of replication studies. The results show that prediction markets can forecast replication outcomes above chance levels, but the value of this information is limited. Without actual replication studies, it remains unclear which published results can be trusted or not. Here I compare the performance of prediction markets to the R-Index and the closely related p < .005 rule. These statistical forecasts perform nearly as well as markets and are much easier to use to make sense of thousands of published articles. However, even these methods have a high failure rate. The best solution to this problem is to rely on meta-analyses of studies rather than to predict the outcome of a single study. In addition to meta-analyses, it will be necessary to conduct new studies that are conducted with high scientific integrity to provide solid empirical foundations for psychology. Claims that are not supported by bias-corrected meta-analyses or new preregistered studies are merely suggestive and currently lack empirical support.

Introduction

Since 2011, it has become apparent that many published results in psychology, especially in social psychology, fail to replicate in direct replication studies (Open Science Collaboration, 2015). In social psychology, the success rate of replication studies is so low (25%) that it makes sense to bet on replication failures. This strategy would be correct 75% of the time, but it would also imply that an entire literature has to be discarded.

It is practically impossible to redo all of the published studies to assess their replicability. Thus, several projects have attempted to predict replication outcomes of individual studies. One strategy is to conduct prediction markets in which participants can earn real money by betting on replication outcomes. There have been four prediction markets with a total of 103 studies with known replication outcomes (Gordon et al., 2021). The key findings are summarized in Table 1.

Markets have a good overall success rate, (28+47)/103 = 73%, which is above chance (flipping a coin). Prediction markets are better at predicting failures, 28/31 = 90%, than at predicting successes, 47/72 = 65%. The modest success rate for predicted successes is a problem because it would be more valuable to be able to identify studies that will replicate and therefore do not require a new study to verify the results.

Another strategy to predict replication outcomes relies on the fact that the p-values of original studies and the p-values of replication studies are influenced by the statistical power of a study (Brunner & Schimmack, 2020). Studies with higher power are more likely to produce lower p-values and more likely to produce significant p-values in replication studies. As a result, p-values also contain valuable information about replication outcomes. Gordon et al. (2021) used p < .005 as a rule to predict replication outcomes. Table 2 shows the performance of this simple rule.

The overall success rate of this rule is nearly as good as that of the prediction markets, (39+35)/103 = 72%; a difference of just one study. The rule does not predict failures as well as the markets, 39/54 = 72% (vs. 90%), but it predicts successes slightly better than the markets, 35/49 = 71% (vs. 65%).

A logistic regression analysis showed that both predictors independently contribute to the prediction of replication outcomes, market b = 2.50, se = .68, p = .0002; p < .005 rule: b = 1.44, se = .48, p = .003.

In short, p-values provide valuable information about the outcome of replication studies.

The R-Index

Although a correlation between p-values and replication outcomes follows logically from the influence of power on p-values in original and replication studies, the cut-off value of .005 appears to be arbitrary. Gordon et al. (2021) justify its choice with an article by Benjamin et al. (2017) that recommended a lower significance level (alpha) to ensure a lower false positive risk. Moreover, Benjamin et al. advocated this rule for new studies that preregister hypotheses and do not suffer from selection bias. In contrast, the replication crisis was caused by selection for significance, which produced success rates of 90% or more in psychology journals (Motyl et al., 2017; Sterling, 1959; Sterling et al., 1995). One main reason for replication failures is that selection for significance inflates effect sizes; due to regression to the mean, effect sizes in replication studies are bound to be weaker, resulting in non-significant results, especially if the original p-value was close to the threshold value of alpha = .05. The Open Science Collaboration (2015) replicability project showed that effect sizes are on average inflated by over 100%.

The R-Index provides a theoretical rationale for the choice of a cut-off value for p-values. The theoretical cutoff value happens to be p = .0084. The fact that it is close to Benjamin et al.’s (2017) value of .005 is merely a coincidence.

P-values can be transformed into estimates of the statistical power of a study. These estimates rely on the observed effect size of a study and are sometimes called observed power or post-hoc power because power is computed after the results of a study are known. Figure 1 illustrates observed power with an example of a z-test that produced a z-statistic of 2.8 which corresponds to a two-sided p-value of .005.

A p-value of .005 corresponds to z-value of 2.8 for the standard normal distribution centered over zero (the nil-hypothesis). The standard level of statistical significance, alpha = .05 (two-sided) corresponds to z-value of 1.96. Figure 1 shows the sampling distribution of studies with a non-central z-score of 2.8. The green line cuts this distribution into a smaller area of 20% below the significance level and a larger area of 80% above the significance level. Thus, the observed power is 80%.
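
In R, observed power for a two-sided z-test can be computed directly from the z-statistic; the one-line function below is a sketch (the function name is mine) that ignores the negligible probability of a significant result in the wrong direction.

observed_power <- function(z, alpha = .05) {
  pnorm(z - qnorm(1 - alpha / 2))   # area of the sampling distribution above the criterion
}

observed_power(2.8)   # ~ .80, as in Figure 1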

Selection for significance implies truncating the normal distribution at the level of significance. This means the 20% of non-significant results are discarded. As a result, the median of the truncated distribution is higher than the median of the full normal distribution. The new median can be found using the truncnorm package in R.

library(truncnorm)
qtruncnorm(.5, a = qnorm(1 - .05/2), mean = 2.8)  # returns 3.05

This value corresponds to an observed power of

pnorm(3.05 - qnorm(1 - .05/2))  # returns .86

Thus, selection for significance inflates observed power from 80% to 86%. The amount of inflation is larger when power is lower. With 20% power, the inflated power after selection for significance is 67%.
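
The whole chain from true power to inflated observed power can be wrapped into a small helper function; the sketch below uses the truncnorm package and reproduces the two examples above (80% -> 86%, 20% -> 67%). The function name is mine.

library(truncnorm)

inflated_power <- function(true_power, alpha = .05) {
  crit <- qnorm(1 - alpha / 2)                  # significance criterion (1.96)
  ncz  <- crit + qnorm(true_power)              # non-central z that yields this power
  med  <- qtruncnorm(.5, a = crit, mean = ncz)  # median z after selection for significance
  pnorm(med - crit)                             # observed (inflated) power
}

inflated_power(.80)   # ~ .86
inflated_power(.20)   # ~ .67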

Figure 3 shows the relationship between inflated power on the x-axis and adjusted power on the y-axis. The blue curve uses the truncnorm package. The green line shows the simplified R-Index that simply subtracts the amount of inflation from the inflated power. For example, if inflated power is 86%, the inflation is 1 - .86 = 14%, and subtracting the inflation gives an R-Index of 86 - 14 = 72%. This is somewhat below the actual value of 80% that produced the inflated value of 86%.

Figure 4 shows that the R-Index is conservative (underestimates power) when power is over 50%, but is liberal (overestimates power) when power is below 50%. The two methods are identical when power is 50% and inflated power is 75%. This is a fortunate co-incidence because studies with more than 50% power are expected to replicate and studies with less than 50% power are expected to fail in a replication attempt. Thus, the simple R-Index makes the same dichotomous predictions about replication outcomes as the more sophisticated approach to find the median of the truncated normal distribution.

The inflated power for actual power of 50% is 75% and 75% power corresponds to a z-score of 2.63, which in turn corresponds to a p-value of p = .0084.
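
This cutoff can be verified in two lines: an inflated (median observed) power of 75% corresponds to z = 1.96 + qnorm(.75) = 2.63, which translates into a two-sided p-value of .0084.

z_cut <- qnorm(1 - .05 / 2) + qnorm(.75)   # = 2.63
2 * (1 - pnorm(z_cut))                     # = .0084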

Performance of the R-Index is slightly worse than the p < .005 rule because the R-Index predicts 5 more successes, but 4 of these predictions are failures. Given the small sample size, it is not clear whether this difference is reliable.

In sum, the R-Index is based on a transformation of p-values into estimates of statistical power, while taking into account that observed power is inflated when studies are selected for significance. It provides a theoretical rationale for the atheoretical p < .005 rule, because this rule roughly divides p-values into those with more and those with less than 50% power.

Predicting Success Rates

The overall success rate across the 103 replication studies was 50/103 = 49%. This percentage cannot be generalized to a specific population of studies because the 103 are not a representative sample of studies. Only the Open Science Collaboration project used somewhat representative sampling. However, the 49% success rate can be compared to the success rates of different prediction methods. For example, prediction markets predict a success rate of 72/103 = 70%, a significant difference (Gordon et al., 2021). In contrast, the R-Index predicts a success rate of 54/103 = 52%, which is closer to the actual success rate. The p < .005 rule does even better with a predicted success rate of 49/103 = 48%.

Another method that has been developed to estimate the expected replication rate is z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Z-curve transforms p-values into z-scores and then fits a finite mixture model to the distribution of significant p-values. Figure 5 illustrates z-curve with the p-values from the 103 replicated studies.

The z-curve estimate of the expected replication rate is 60%. This is better than the prediction market, but worse than the R-Index or the p < .005 rule. However, the 95%CI around the ERR includes the true value of 49%. Thus, sampling error alone might explain this discrepancy. However, Bartos and Schimmack (2021) discussed several other reasons why the ERR may overestimate the success rate of actual replication studies. One reason is that actual replication studies are not perfect replicas of the original studies. So-called hidden moderators may create differences between original and replication studies. In this case, selection for significance produces even more inflation than the model assumes. In the worst case scenario, a better estimate of actual replication outcomes might be the expected discovery rate (EDR), which is the power of all studies that were conducted, including non-significant studies. The EDR for the 103 studies is 28%, but the 95%CI is wide and includes the actual rate of 49%. Thus, the dataset is too small to decide between the ERR and the EDR as the best estimate of actual replication outcomes. At present, it is best to consider the EDR the worst-case and the ERR the best-case scenario and to expect the actual replication rate to fall within this interval.

Social Psychology

The 103 studies cover experimental economics, cognitive psychology, and social psychology. Social psychology has the largest set of studies (k = 54) and the lowest success rate, 33%. The prediction markets overpredicted successes, 50%. The R-Index also overpredicted successes, 46%. The p < .005 rule had the least amount of bias, 41%.

Z-curve predicted an ERR of 55%, and the actual success rate fell outside the 95% confidence interval, 34% to 74%. The EDR of 22% underestimates the success rate, but the 95%CI is wide and includes the true value, 95%CI = 5% to 70%. Once more, the actual success rate falls between the EDR and the ERR estimates, 22% < 34% < 55%.

In short, prediction models appear to overpredict replication outcomes in social psychology. One reason for this might be that hidden moderators make it difficult to replicate studies in social psychology which adds additional uncertainty to the outcome of replication studies.

Regarding predictions of individual studies, prediction markets achieved an overall success rate of 76%. Prediction markets were good at predicting failures, 25/27 = 93%, but not so good in predicting successes, 16/27 = 59%.

The R-Index performed as well as the prediction markets with one more prediction of a replication failure.

The p < .005 rule was the best predictor because it predicted more replication failures.

Performance could be increased by combining prediction markets and the R-Index and only betting on successes when both predictors predicted a success. In particular, the prediction of successes improved to 14/19 = 74%. However, due to the small sample size, it is not clear whether this is a reliable finding.

Non-Social Studies

The remaining k = 56 studies had a higher success rate, 65%. The prediction markets overpredicted success, 92%. The R-Index underpredicted successes, 59%. The p < .005 rule underpredicted successes even more.

This time z-curve made the best prediction with an ERR of 67%, 95%CI = 45% to 86%. The EDR underestimates the replication rate, although the 95%CI is very wide and includes the actual success rate, 5% to 81%. The fact that z-curve overestimated replicability for social psychology, but not for other areas, suggests that hidden moderators may contribute to the replication problems in social psychology.

For predictions of individual outcomes, prediction markets had a success rate of (3 + 31)/49 = 69%. The good performance is largely due to the high base rate of successes; simply betting on success would have produced 32/49 = 65% successes. Predictions of failures had a success rate of 3/4 = 75% and predictions of successes had a success rate of 31/45 = 69%.

The R-Index had a lower success rate of (9 + 21)/49 = 61%. The R-Index was particularly poor at predicting failures, 9/20 = 45%, but was slightly better at predicting successes than the prediction markets, 21/29 = 72%.

The p < .005 rule had a success rate equal to the R-Index, (10 + 20)/49 = 61%, with one more correctly predicted failure and one less correctly predicted success.

Discussion

The present results reproduce the key findings of Gordon et al. (2021). First, prediction markets overestimate the success of actual replication studies. Second, prediction markets have some predictive validity in forecasting the outcome of individual replication studies. Third, a simple rule based on p-values also can forecast replication outcomes.

The present results also extend Gordon et al.’s (2021) findings with additional analyses. First, I compared the performance of prediction markets to z-curve as a method for predicting the success rates of replication outcomes (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). Z-curve overpredicted success rates for all studies and for social psychology, but was very accurate for the remaining studies (economics, cognition). In all three comparisons, z-curve performed better than prediction markets. Z-curve also has several additional advantages over prediction markets. First, it is much easier to code a large set of test statistics than to run prediction markets. As a result, z-curve has already been used to estimate the replication rates for social psychology based on thousands of test statistics, whereas estimates of prediction markets are based on just over 50 studies. Second, z-curve is based on sound statistical principles that link the outcomes of original studies to the outcomes of replication studies (Brunner & Schimmack, 2020). In contrast, prediction markets rest on the unknown knowledge of market participants, which can vary across markets. Third, z-curve estimates come with validated information about the uncertainty in the estimates, whereas prediction markets provide no information about uncertainty, and uncertainty is large because markets tend to be small. In conclusion, z-curve is more efficient and provides better estimates of replication rates than prediction markets.

The main goal of prediction markets is to assess the credibility of individual studies. Ideally, prediction markets would help consumers of published research to distinguish between studies that produced real findings (true positives) and studies that produced false findings (false positives) without the need to run additional studies. The encouraging finding is that prediction markets have some predictive validity and can distinguish between studies that replicate and studies that do not replicate. However, to be practically useful it is necessary to assess the practical usefulness of the information that is provided by prediction markets. Here we need to distinguish the practical consequences of replication failures and successes. Within the statistical framework of nil-hypothesis significance testing, successes and failures have different consequences.

A replication failure increases uncertainty about the original finding. Thus, more research is needed to understand why the results diverged. This is also true for market predictions. Predictions that a study will fail to replicate cast doubt on the original study, but do not provide conclusive evidence that the original study reported a false positive result. Thus, further studies are needed even if a market predicts a failure. In contrast, successes are more informative. Replicating a previous finding successfully strengthens the original finding and provides fairly strong evidence that it was not a false positive result. Unfortunately, the mere prediction that a finding will replicate does not provide the same reassurance because markets only have an accuracy of about 70% when they predict a successful replication. The p < .005 rule is much easier to implement, but its ability to forecast successes is also only around 70%. Thus, neither markets nor a simple statistical rule are accurate enough to avoid actual replication studies.

Meta-Analysis

The main problem for prediction markets and other forecasting projects is that a single study rarely provides evidence strong enough to evaluate theoretical claims. It is therefore not particularly important whether one study can be replicated successfully or not, especially when direct replications are difficult or impossible. For this reason, psychologists have long relied on meta-analyses of similar studies to evaluate theoretical claims.

It is therefore surprising that prediction markets have been used to forecast the outcome of replication studies for findings that had already been replicated many times before. Take the replication of Schwarz, Strack, and Mai (1991) in Many Labs 2 as an example. This study manipulated the order of questions about marital satisfaction and life-satisfaction and suggested that a question about marital satisfaction can prime information that is used in life-satisfaction judgments. Schimmack and Oishi (2005) conducted a meta-analysis of the literature and showed that the results by Schwarz et al. (1991) were unusual and that the actual effect size is much smaller. Apparently, the market participants were unaware of this meta-analysis and predicted that the original result would replicate successfully (probability of success = 72%). Contrary to the market, the study failed to replicate. This example suggests that meta-analyses might be more valuable than prediction markets or the p-value of a single study.

The main obstacle to the use of meta-analyses is that many published meta-analyses fail to take selection for significance into account and therefore overestimate replicability. However, new statistical methods that correct for selection bias may address this problem. The R-Index is a rather simple tool that makes it possible to correct for selection bias in small sets of studies. I use the article by Nairne et al. (2008) that was included in the OSC project as an example. The replication project focused on Study 2, which produced a p-value of .026. Based on this weak evidence alone, the R-Index would predict a replication failure (observed power = .61, inflation = .39, R-Index = .61 – .39 = .22). However, Study 1 produced much more convincing evidence for the effect, p = .0007. If this study had been picked for the replication attempt, the R-Index would have predicted a successful outcome (observed power = .92, inflation = .08, R-Index = .84). A meta-analysis would average across the two power estimates and also predict a successful replication outcome (mean observed power = .77, inflation = .23, R-Index = .53). The actual replication study was significant with p = .007 (observed power = .77, inflation = .23, R-Index = .53). A meta-analysis across all three studies also suggests that the next study will be a successful replication (R-Index = .53), but the R-Index also shows that replication failures are likely because the studies have relatively low power. In short, prediction markets may be useful when only a single study is available, but meta-analyses are likely to be superior predictors of replication outcomes when prior replication studies are available.
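The R-Index arithmetic in this example can be reproduced with a few lines of Python. The sketch below assumes two-sided p-values from z-tests and a success rate of 1 (only significant results are considered); the output matches the values in the text up to rounding.

```python
from scipy import stats

def obs_power(p, alpha=0.05):
    """Observed power implied by a two-sided p-value, assuming a z-test."""
    z_obs = stats.norm.isf(p / 2)
    return stats.norm.sf(stats.norm.isf(alpha / 2) - z_obs)

def r_index(p_values):
    """R-Index = mean observed power minus inflation (success rate is 1.0)."""
    mean_power = sum(obs_power(p) for p in p_values) / len(p_values)
    inflation = 1.0 - mean_power
    return mean_power - inflation

print(round(r_index([0.026]), 2))                  # Study 2 alone
print(round(r_index([0.0007]), 2))                 # Study 1 alone
print(round(r_index([0.026, 0.0007]), 2))          # both original studies
print(round(r_index([0.026, 0.0007, 0.007]), 2))   # including the replication
```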

Conclusion

Gordon et al. (2021) conducted a meta-analysis of 103 studies that were included in prediction markets to forecast the outcome of replication studies. The results show that prediction markets can forecast replication outcomes above chance levels, but the value of this information is limited. Without actual replication studies, it remains unclear which published results can be trusted. Statistical methods that simply focus on the strength of evidence in original studies perform nearly as well and are much easier to use to make sense of thousands of published articles. However, even these methods have a high failure rate. The best solution to this problem is to rely on meta-analyses of studies rather than to predict the outcome of a single study. In addition to meta-analyses, it will be necessary to conduct new studies with high scientific integrity to provide solid empirical foundations for psychology.

Prediction Markets of Replicability

Abstract

I reinvestigate the performance of prediction markets for the Open Science Collaboration replicability project. I show that the performance of the prediction markets varied considerably across the two markets, with the second market failing to replicate the excellent performance of the first market. I also show that the markets did not perform significantly better than a “burn everything to the ground” rule that bets on failure every time. I then suggest a simple rule that can easily be applied to published studies: treat only results with p-values below .005 as significant. Finally, I discuss betting on the outcomes of future studies as a way to optimize the allocation of research resources.

Introduction

For decades, psychologists failed to properly test their hypotheses. Statistically significant results in journals are meaningless because published results are selected for significance. A replication project with 100 studies from three journals found that only 37% (36/97) of the published significant results could be replicated (Open Science Collaboration, 2015).

Unfortunately, it is impossible to rely on actual replication studies to examine the credibility of thousands of findings that have been reported over the years. Dreber, Pfeiffer, Almenberg, Isaksson, Wilson, Chen, Nosek, and Johannesson (2015) proposed prediction markets as a solution to this problem. Prediction markets rely on a small number of traders to bet on the outcome of replication studies. Traders can earn a small amount of real money for betting on studies that actually replicate.

To examine the forecasting abilities of prediction markets, Dreber et al. (2015) conducted two studies. The first study included 23 studies, started in November 2012, and lasted two months (N = 47 participants). The second study included 21 studies and started in October 2014 (N = 45 participants). The two studies are very similar to each other. Thus, we can consider Study 2 a close replication of Study 1.

Studies were selected from the set of 100 studies based on time of completion. To be able to pay participants, studies were chosen that were scheduled to be completed within two months after the completion of the prediction market. It is unclear how completion time may have influenced the type of study that was included or the outcome of the included studies.

The published article reports the aggregated results across the two studies. A market price above 50% was considered a prediction of a successful replication and a market price below 50% was considered a prediction of a replication failure. The key finding was that “the prediction markets correctly predict the outcome of 71% of the replications (29 of 41 studies)” (p. 15344). The authors compare this finding to a proverbial coin flip, which implies an accuracy of 50%, and find that 71% is statistically significantly higher than 50%.

Below I report some additional statistical analyses of the open data. First, we can compare the performance of the prediction markets with a different prediction rule. Given the higher prevalence of replication failures than successes, a simple rule is to use the higher base rate of failures and predict that all studies will fail to replicate. As the replication success rate for the total set of 97 studies was only 37%, this prediction rule has an accuracy of 1 - .37 = 63%. For the studies included in the prediction markets, the success rate of the replication studies was also 37% (15/41). Somewhat surprisingly, the success rates were also close to 37% within each market: 32% (7/22) for Prediction Market 1 and 42% (8/19) for Prediction Market 2.

Compared to this simple prediction rule that nothing published in psychology journals replicates, the prediction markets are no longer statistically significantly better, chi2(1) = 1.82, p = .177.

Closer inspection of the original data also revealed a notable difference in the performance of the two prediction markets. Table 1 shows the results separately for Prediction Markets 1 and 2. Whereas the performance of the first prediction market is nearly perfect, 91% (20/22), the second market performed only at chance levels (flip a coin), 53% (10/19). Despite the small sample size, the accuracy rates of the two markets are statistically significantly different, chi2(1) = 5.78, p = .016.
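The counts in Table 1 are sufficient to reproduce this comparison. The sketch below uses scipy's chi-square test for a 2 x 2 table; the default Yates continuity correction recovers the reported values.

```python
from scipy.stats import chi2_contingency

# correct vs. incorrect market predictions (rows: Market 1, Market 2)
table = [[20, 2],    # Market 1: 20 of 22 correct
         [10, 9]]    # Market 2: 10 of 19 correct

chi2, p, dof, _ = chi2_contingency(table)        # Yates correction by default
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")  # chi2(1) = 5.78, p = 0.016
```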

There is no explanation for this discrepancy. The average result reported in the article can still be considered the best estimate of prediction markets' performance, but trust in their ability is diminished by the fact that a close replication failed to reproduce the excellent performance of the first market. Not reporting the different outcomes of the two separate studies could be considered a questionable reporting decision.

The main appeal of prediction markets over the nihilistic trash-everything rule is that decades of research presumably did produce some replicable findings, and markets can identify at least some of them. However, the disadvantage of prediction markets is that they take a long time, cost money, and their accuracy is currently uncertain. A better solution might be to find rules that can be applied easily to large sets of studies (Yang, Wu, & Uzzi, 2020). One such rule is suggested by the relationship between strength of evidence and replicability. The stronger the evidence against the null-hypothesis (i.e., the lower the p-value), the more likely it is that the original study had high power and that the results will replicate in a replication study. There is no clear criterion for a cut-off point to optimize prediction, but the results of the replication project can be used to validate cut-off points empirically.

One suggestion has been to consider only p-values below .005 as statistically significant (Benjamin et al., 2017). This rule is especially useful when selection bias is present. Selection bias can produce many results with p-values between .05 and .005 that have low power. However, p-values below .005 are more difficult to produce with statistical tricks.

With this rule, the overall prediction accuracy for the 41 studies included in the prediction markets was 63% (26/41), a difference of only 4 studies compared to the markets. The rule also did better for the first market, 81% (18/22), than for the second market, 42% (8/19).

Table 2 shows that the main difference between the two markets was that the first market contained more studies with questionable p-values between .05 and .005 (15 vs. 6). For the second market, the rule overpredicts successes: there are more false (8) than correct (5) predictions of success. This finding is consistent with examinations of the total set of replication studies in the replicability project (Schimmack, 2015). Based on this observation, I recommended a 4-sigma rule, p < .00006. The overall accuracy increases to 68% (28/41), an improvement of 2 studies. However, an inspection of correct predictions of successes shows that the 4-sigma rule only correctly predicts 5 of the 15 successes (33%), whereas the p < .005 rule correctly predicted 10 of the 15 successes (67%). Thus, the p < .005 rule has the advantage that it salvages more studies.
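For readers who want to apply these cutoffs themselves, the sketch below shows both rules. The 4-sigma cutoff corresponds to a two-sided p-value of about .00006 (|z| > 4); the example p-values are hypothetical.

```python
from scipy import stats

# two-sided p-value implied by a 4-sigma z-score (~0.00006)
p_4sigma = 2 * stats.norm.sf(4)
print(f"4-sigma cutoff: p < {p_4sigma:.5f}")

def predict_success(p, cutoff=0.005):
    """Bet on replication success only if the original p-value beats the cutoff."""
    return p < cutoff

# hypothetical original p-values (illustration only)
originals = [0.03, 0.004, 0.0002, 0.00001]
print([predict_success(p) for p in originals])            # p < .005 rule
print([predict_success(p, p_4sigma) for p in originals])  # 4-sigma rule
```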

Conclusion

Meta-scientists are still scientists, and scientists are motivated to present their results in the best possible light. This is also true for Dreber et al.'s (2015) article. The authors enthusiastically recommend prediction markets as a tool “to quickly identify findings that are unlikely to replicate.” Based on two prediction markets with a total of just 41 studies, the authors conclude that “prediction markets are well suited” to assess the replicability of published results in psychology. I am not the only one to point out that this conclusion is exaggerated (Yang, Wu, & Uzzi, 2020). First, prediction markets are not quick at identifying replicable results, especially when we compare the method to a simple check of whether the exact p-value of the original study is below .005 or not. It is telling that nobody has used prediction markets to forecast the outcome of new replication studies. One problem is that a market requires a set of studies, which makes it difficult to use markets to predict the outcome of a single study. It is also unclear how well prediction markets really work. The original article omitted the fact that the method worked extremely well in the first market and not at all in the second market, a statistically significant difference. The outcome seems to depend a lot on the selection of studies in the market. Finally, I showed that a simple statistical rule alone can predict replication outcomes nearly as well as prediction markets.

There is also no reason why betting has to be limited to sets of multiple studies. One could set up betting for individual studies, just like individuals can bet on the outcome of a single match in sports or a single election. Betting might be more usefully employed for the prediction of original studies than for vetting the outcome of replication studies. For example, if everybody bets that a study will produce a significant result, there appears to be little uncertainty about the outcome, and the actual study may be a waste of resources. One concern in psychology is that many studies merely produce significant p-values for obvious predictions. Betting on effect sizes would help to make effect sizes more important. If everybody bets on a very small effect size, a study might not be useful to run because the expected effect size is trivial, even if the effect is greater than zero. Betting on effect sizes could also be used for informal power analyses to determine the sample size of the actual study.

References

Dreber, A., Pfeiffer, T., Almenberg, J., Isaksson, S., Wilson, B., Chen, Y., Nosek, B. A., & Johannesson, M. (2015). Using prediction markets to estimate the reproducibility of scientific research. PNAS Proceedings of the National Academy of Sciences of the United States of America, 112(50), 15343–15347. https://doi.org/10.1073/pnas.1516179112

Bill von Hippel and Ulrich Schimmack discuss Bill’s Replicability Index

Background:

We have never met in person or interacted professionally before we embarked on this joint project. This blog emerged from a correspondence that was sparked by one of Uli’s posts in the Replicability-Index blog. We have carried the conversational nature of that correspondence over to the blog itself.

Bill: Uli, before we dive into the topic at hand, can you provide a super brief explanation of how the Replicability Index works? Just a few sentences for those who might be new to your blog and don’t understand how one can assess the replicability of the findings from a single paper.

Uli: The replicability of a finding is determined by the true power of a study, and true power depends on the sample size and the population effect size. We have to estimate replicability because the population effect size is unknown, but studies with higher power are more likely to produce smaller p-values. We can convert the p-values into a measure of observed power. For a single statistical test this estimate is extremely noisy, but it is the best guess we can make. So, a result with a p-value of .05 (50% observed power) is less likely to replicate than a result with a p-value of .005 (80% observed power). A value of 50% may still look good, but observed power is inflated if we condition on significance. To correct for this inflation, the R-Index takes the difference between the success rate and observed power. For a single test, the success rate is 1 (100%) because a significant result was observed. This means that observed power of 50% produces an R-Index of 50% – (100% – 50%) = 0. In contrast, 80% power still produces an R-Index of 80% – (100% – 80%) = 60%. The main problem is that observed p-values are very variable. It is therefore better to use several p-values and to compute the R-Index based on average power. For larger sets of studies, a more sophisticated method called z-curve can produce actual estimates of power. It can also be used to estimate the false positive risk. Sorry if this is not super brief.
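In code, the single-test version looks like this (a minimal sketch, assuming a two-sided p-value from a z-test and the .05 significance criterion):

```python
from scipy import stats

def r_index_single(p, alpha=0.05):
    """R-Index for a single significant result: observed power minus inflation."""
    z_obs = stats.norm.isf(p / 2)                              # two-sided p to z
    power = stats.norm.sf(stats.norm.isf(alpha / 2) - z_obs)   # observed power
    inflation = 1.0 - power                                    # success rate is 1 (100%)
    return power - inflation

print(round(r_index_single(0.05), 2))    # ~0.00  (50% observed power)
print(round(r_index_single(0.005), 2))   # ~0.60  (80% observed power)
```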

Bill: That wasn’t super brief, but it was super informative. It does raise a worry in me, however. Essentially, your formula stipulates that every p value of .05 is inherently unreliable. Do we have any empirical evidence that the true replicability of p = .05 is functionally zero?

Uli:  The inference is probabilistic. Sometimes p = .05 will occur with high power (80%). However, empirically we know that p = .05 more often occurs with low power. The Open Science Collaboration project showed that results with p > .01 rarely replicated, whereas results with p < .005 replicated more frequently. Thus, in the absence of other information, it is rational to bet on replication failures when the p-value is just or marginally significant.

Bill: Good to know. I’ll return to this “absence of other information” issue later in our blog, but in the meantime, back to our story…I was familiar with your work prior to this year, as Steve Lindsay kept the editorial staff at Psychological Science up to date with your evaluations of the journal. But your work was made much more personally relevant on January 19, when you wrote a blog on “Personalized p-values for social/personality psychologists.”

Initially, I was curious if I would be on the list, hoping you had taken the time to evaluate my work so that I could see how I was doing. Your list was ordered from the highest replicability to the lowest, so when I hadn’t seen my name by the half-way point, my curiosity changed to trepidation. Alas, there I was – sitting very near the bottom of your list with one of the lowest replicability indices of all the social psychologists you evaluated.

I was aghast. I had always thought we maintained good data practices in our lab: We set a desired N at the outset and never analyzed the data until we were done (having reached our N or occasionally run out of participants); we followed up any unexpected findings to be sure they replicated before reporting them, etc. But then I thought about the way we used to run the lab before the replication crisis:

  1. We never reported experiments that failed to support our hypotheses, but rather tossed them in the file drawer and tried again.
  2. When an effect had a p value between .10 and the .05 cut-off, we tried various covariates/control variables to see if they would push our effect over that magical line. Of course we reported the covariates, but we never reported their ad-hoc nature – we simply noted that we included them.
  3. We typically ran studies that were underpowered by today’s standards, which meant that the effects we found were bouncy and could easily be false positives.
  4. When we found an effect on one set of measures and not another, sometimes we didn’t report the measures that didn’t work.

The upshot of these reflections was that I emailed you to get the details on my individual papers to see where the problems were (which led to my next realization; I doubt I would have bothered contacting you if my numbers had been better). So here’s my first question: How many people have contacted you about this particular blog and is there any evidence that people are taking it seriously?

Uli:  I have been working on replicability for 10 years now. The general response to my work is to ignore it with the justification that it is not peer-reviewed. I also recall only two requests. One to evaluate a department and one to evaluate an individual. However, the R-Index analysis is easy to do and Mickey Inzlicht published a blog post about his self-analysis. I don’t know how many researchers have evaluated their work in private. It is harder to evaluate how many people take my work seriously. The main indicator of impact is the number of views of my blogs which has increased from 50,000 in 2015 to 250,000 in 2020. The publication of the z-curve package for R also has generated interest among researchers to conduct their own analyses.

Bill: That’s super impressive. I don’t think my entire body of research has been viewed anywhere near 250K times.

OK, once again, back to our story. When you sent me the data file on my papers, initially I was unhappy that you only used a subset of my empirical articles (24 of the 70 or so empirical papers I’ve published) and that your machine coding had introduced a bit of error into the process. But we decided to turn this into a study of sorts, so we focused on those 24 papers and the differences that would emerge as a function of machine vs. hand-coding and as a function of how many stats we pulled out of each experiment (all the focal statistics vs. just one stat for each experiment). Was that process useful for you? If so, what did you learn from it?

Uli: First of all, I was pleased that you engaged with the results. More important, I was also curious how the results would compare to hand-coding. I had done some comparisons for other social psychologists and had enough confidence in the results to post them, but I am aware that the method is not flawless and can produce misleading results in individual cases. I am also aware that my own hand-coding can be biased. So, your offer to do your own coding was a fantastic opportunity to examine the validity of my results.

Bill: Great. I’m still a little unsure what I’ve learned from this particular proctology exam, so let’s see what we can figure out here. If you’ll humor me, let’s start with our paper that has the lowest replicability index in your subsample – no matter which way we calculate it, we find less than a 50% chance that it will replicate. It was published in 2005 in PSPB and took an evolutionary approach to grandparenting. Setting aside the hypothesis, the relevant methods were as follows:

  1. We recruited all of the participants who were available that year in introductory psychology, so our N was large, but determined by external constraints.
  2. We replicated prior findings that served as the basis of our proposed effect.
  3. The test of our new hypothesis yielded a marginally significant interaction (F(1, 412) = 2.85, p < .10). In decomposing the interaction, we found a simple effect where it was predicted (F(1, 276) = 5.92, p < .02) and no simple effect where it wasn’t predicted (F < 1, ns. – *apologies for the imprecise reporting practices).

Given that: 1) we didn’t exclude any data (a poor practice we sometimes engaged in by reporting some measures and not others, but not in this paper), 2) we didn’t include any ad-hoc control variables (a poor practice we sometimes engaged in, but not in this paper), 3) we didn’t run any failed studies that were tossed out (a poor practice we regularly engaged in, but not in this paper), and 4) we reported the a priori test of our hypothesis exactly as planned…what are we to conclude from the low replicability index? Is the only lesson here that marginally significant interactions are highly unlikely to replicate? What advice would you have given me in 2004, if I had shown you these data and said I wanted to write them up?

Uli: There is a lot of confusion about research methods, the need for preregistration, and the proper interpretation of results. To start, there is nothing wrong with the way you conducted the study. The problems arise when the results are interpreted as a successful replication of prior studies. Here is why. First, we do not know whether prior studies used questionable research practices and reported inflated effect sizes. Second, the new findings are reported without information about effect sizes. What we really would like to know is the confidence interval around the predicted interaction effect, which would be the difference in the effect sizes between the two conditions. With a p-value greater than .05, we know that the 95%CI includes a value of 0. So, we cannot reject the hypothesis that the two conditions do not differ at that level of confidence. We can accept more uncertainty in the conclusion by using a 90% or 80% confidence interval, but we still would want to know what effect sizes we can reject. It would also be important to specify what effect sizes would be considered too small to warrant a theory that predicts this interaction effect. Finally, the results suggest that the sample size of about 400 participants was still too small to have good power to detect and replicate the effect. A conclusive study would require a larger sample.

Bill: Hmm, very interesting. But let me clarify one thing before we go on. In this study, the replication of prior effects that I mentioned wasn’t the marginal interaction that yielded the really low replicability index. Rather, it was a separate main effect, whereby participants felt closest to their mother’s mother, next to their mother’s father, next to their father’s mother, and last to their father’s father. The pairwise comparisons were as follows: “participants felt closer to mothers’ mothers than mothers’ fathers, F(1,464) = 35.88, p < .001, closer to mothers’ fathers than fathers’ mothers, F(1, 424) = 3.96, p < .05, and closer to fathers’ mothers than fathers’ fathers, F(1, 417) = 4.88, p < .03.”

We were trying to build on that prior effect by explaining the difference in feelings toward father’s mothers and mothers’ fathers, and that’s where we found the marginal interaction (which emerged as a function of a third factor that we had hypothesized would moderate the main effect).

I know it’ll slow things down a bit, but I’m inclined to take your advice and rerun the study with a larger sample, as you’ve got me wondering whether this marginal interaction and simple effect are just random junk or meaningful. We could run the study with one or two thousand people on Prolific pretty cheaply, as it only involves a few questions.

Shall we give it a try before we go on? In the spirit of both of us trying to learn something from this conversation, you could let me know what sample size would satisfy you as giving us adequate power to attempt a replication of the simple effect that I quoted above. I suspect that the sample size required to have adequate power for a replication of the marginal interaction would be too expensive, but I believe a study that is sufficiently powered to detect that simple effect will reveal an interaction of at least the magnitude we found in that paper (as I still believe in the hypothesis we were testing).
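For a rough sense of the numbers involved, the reported simple effect, F(1, 276) = 5.92, can be converted into a standardized mean difference and fed into a standard power calculation. The sketch below is only an approximation under assumptions that are mine rather than Bill's or Uli's: a two-group comparison, alpha = .05, and 80% or 90% power targets.

```python
import math
from statsmodels.stats.power import TTestIndPower

# Convert the reported simple effect, F(1, 276) = 5.92, to Cohen's d
# (two-group between-subjects approximation: d = 2*sqrt(F / df_error)).
F, df_error = 5.92, 276
d = 2 * math.sqrt(F / df_error)   # ~0.29

analysis = TTestIndPower()
for target in (0.80, 0.90):
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=target)
    print(f"power {target:.0%}: ~{math.ceil(n_per_group)} per group")
```

Under these assumptions, an effect of about d = .29 needs on the order of 180–190 participants per group for 80% power, well within a Prolific sample of one or two thousand; the marginal interaction, with its smaller implied effect, would need a substantially larger sample.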

If that suits you, I’m happy to post this opener on your blog and then return in a few weeks with the results of the replication effort and the goal of completing our conversation.

Uli:  Testing our different intuitions about this particular finding with empirical data is definitely interesting, but I am a bit puzzled about the direction this discussion has taken. It is surely interesting to see whether this particular finding is real and can be replicated. Let’s assume for the moment that it does replicate. This, unfortunately, increases the chances that some of the other studies in the z-curve are even less likely to be replicated because there is clear evidence of selection bias and a low probability of replication. Think about it as an urn with 9 red and 1 green marble. Red ones do not replicate and green ones do replicate. After we pick the green marble on the first try, there are only red marbles left.

One of the biggest open questions is what researchers actually did to get too many significant results. We have a few accounts of studies with non-significant results that were dropped and anonymous surveys show that a variety of questionable research practices were used. Even though these practices are different from fraud and may have occurred without intent, researchers have been very reluctant to talk about the mistakes they made in the past. Carney walked away from power posing by admitting to the use of statistical shortcuts. I wonder whether you can tell us a bit more about the practices that led to the low EDR estimate for your focal tests. I know it is a big ask, but I also know that young social psychologists would welcome open disclosure of past practices. As Mickey Inzlicht always tells me “Hate the sin. Love the sinner.” As my own z-curve shows, I also have a mediocre z-curve and I am currently in the process of examining my past articles to see which ones I still believe and which ones I no longer believe.

Bill: Fair question Uli – I’ve made more mistakes than I care to remember! But (at least until your blog post) I’ve comforted myself in the belief that peer-review corrected most of them and that the work I’ve published is pretty solid. So forgive me for banging on for so long, but I have a two-part answer to your question. Part 1 refers back to your statement above, about your work being “in the absence of other information”, and also incorporates your statement above about red and green marbles in an urn. And Part 2 builds on Part 1 by digging through my studies with low Replicability indices and letting you know whether (and if so where) I think they were problematic.

Part 1: “In the absence of other information” is a really important caveat. I understand that it’s the basis of your statistical approach, but of course research isn’t conducted in the absence of other information. In my own case, some of my hypotheses were just hunches about the world, based on observations or possible links between other ideas. I have relatively little faith in these hypotheses and have abandoned them frequently in the face of contrary or inconsistent evidence. But some of my hypotheses are grounded in a substantial literature or prior theorizing that strike me as rock solid. The Darwinian Grandparenting paper is just such an example, and thus it seems like a perfect starting point. The logic is so straightforward and sensible that I’d be very surprised if it’s not true. As a consequence, despite the weak statistical support for it, I’m putting my money on it to replicate (and it’s just self-report, so super easy to conduct a replication online).

And this line of reasoning leads me to dispute your “red and green marbles in the urn” metaphor. Your procedure doesn’t really tell us how many marbles are in the urn of these two colors. Rather, your procedure makes a guess about the contents of the urn, and that guess intentionally ignores all other information. Thus, I’d argue that a successful or failed replication of the grandparenting paper tells us nothing at all about the probability of replicating other papers I’ve published, as I’m bringing additional information to bear on the problem by including the theoretical strength of the claims being made in the paper. In other words, I believe your procedure has grossly underestimated the replicability of this paper by focusing only on the relevant statistics and ignoring the underlying theory. That doesn’t mean your procedure has no value, but it does mean that it’s going to make predictable mistakes.

Part 2: Here I’m going to focus on papers that I first-authored, as I don’t think it’s appropriate for me to raise concerns about work that other people led without involving them in this conversation. With that plan in mind, let’s start at the bottom of the replication list you made for me in your collection of 24 papers and work our way up.

  1. Darwinian Grandparenting – discussed above and currently in motion to replicate (Laham, S. M., Gonsalkorale, K., & von Hippel, W. (2005). Darwinian grandparenting: Preferential investment in more certain kin. Personality and Social Psychology Bulletin, 31, 63-72.)
  2. The Chicken-Foot paper – I love this paper but would never conduct it that way now. The sample was way too small and the paper only allowed for a single behavioral DV, which was how strongly participants reacted when they were offered a chicken foot to eat. As a consequence, it was very under-powered. Although we ran that study twice, first as a pilot study in an undergraduate class with casual measurement and then in the lab with hidden cameras, and both studies “worked”, the first one was too informal and the second one was too small and would never be published today. Do I believe it would replicate? The effect itself is consistent with so many other findings that I continue to believe in it, but I would never place my money on replicating this particular empirical demonstration without a huge sample to beat down the inevitable noise (which must have worked in our favor the first time).

(von Hippel, W., & Gonsalkorale, K. (2005). “That is bloody revolting!” Inhibitory control of thoughts better left unsaid. Psychological Science, 16, 497-500.)

  • Stereotyping Against your Will – this paper was wildly underpowered, but I think its low R-index reflects the fact that in our final data set you asked me to choose just a single statistic for each experiment. In this study there were a few key findings with different measures and they all lined up as predicted, which gave me a lot more faith in it. Since its publication 20 years ago, we (and others) have found evidence consistent with it in a variety of different types of studies. I think we’ve failed to find the predicted effect in one or maybe two attempts (which ended up in the circular file, as all my failed studies did prior to the replication crisis), but all other efforts have been successful and are published. When we included all the key statistics from this paper in our Replicability analysis, it has an R-index of .79, which may be a better reflection of the reliability of the results.

Important caveat: With all that said, the original data collection included three or four different measures of stereotyping, only one of which showed the predicted age effect. I never reported the other measures, as the goal of the paper was to see if inhibition would mediate age differences in stereotyping and prejudice. In retrospect that’s clearly problematic, but at the time it seemed perfectly sensible, as I couldn’t mediate an effect that didn’t exist. On the positive side, the experiment included only two measures of prejudice, and both are reported in the paper.

(von Hippel, W., Silver, L. A., & Lynch, M. E. (2000). Stereotyping against your will: The role of inhibitory ability in stereotyping and prejudice among the elderly. Personality and Social Psychology Bulletin, 26, 523-532.)

  • Inhibitory effect of schematic processing on perceptual encoding – given my argument above that your R-index makes more sense when we include all the focal stats from each experiment, I’ve now shifted over to the analysis you conducted on all of my papers, including all of the key stats that we pulled out by hand (ignoring only results with control variables, etc.). That analysis yields much stronger R-indices for most of my papers, but there are still quite a few that are problematic. Sadly, this paper is the second from the bottom on my larger list. I say sadly because it’s my dissertation. But…when I reflect back on it, I remember numerous experiments that failed. I probably ran two failed studies for each successful one. At the time, no one was interested in them, and it didn’t occur to me that I was engaging in poor practices when I threw them in the bin. The main conclusion I came to when I finished the project was that I didn’t want to work on it anymore as it seemed like I spent all my time struggling with methodological details trying to get the experiments to work. Maybe each successful study was the one that found just the right methods and materials (as I thought at the time), but in hindsight I suspect not. And clearly the evidentiary value for the effect is functionally zero if we collapse across all the studies I ran. With that said, the key finding followed from prior theory in a pretty straightforward manner and we later found evidence for the proposed mechanism (which we published in a follow-up paper*). I guess I’d conclude from all this that if other people have found the effect since then, I’d believe in it, but I can’t put any stock in my original empirical demonstration.

(von Hippel, W., Jonides, J., Hilton, J. L., & Narayan, S. (1993). Inhibitory effect of schematic processing on perceptual encoding. Journal of Personality and Social Psychology, 64, 921-935.

*von Hippel, W., & Hawkins, C. (1994). Stimulus exposure time and perceptual memory. Perception and Psychophysics, 56, 525-535.)

  • The Linguistic Intergroup Bias (LIB) as an Indicator of Prejudice. This is the only other paper on which I was first author that gets an R-index of less than .5 when you include all the focal stats in the analysis. I have no doubt it’s because, like all of my work at the time, it was wildly under-powered and the effects weren’t very strong. Nonetheless, we’ve used the LIB many times since, and although we haven’t found the predicted results every time, I believe it works pretty reliably. Of course, I could easily be wrong here, so I’d be very interested if any readers of this blog have conducted studies using the LIB as an indicator of prejudice, and if so, whether it yielded the predicted results.

(von Hippel, W., Sekaquaptewa, D., & Vargas, P. (1997). The Linguistic Intergroup Bias as an implicit indicator of prejudice. Journal of Experimental Social Psychology, 33, 490-509.)

  • All my articles published in the last ten years with a low Rindex – Rather than continuing to torture readers with the details of each study, in this last section I’ve gone back and looked at all my papers with an R-index less than .70 based on all the key statistics (not just a single stat for each experiment) that were published in the last 10 years. This exercise yields 9 out of 25 empirical papers with an R-index ranging from .30 to .62 (with five other papers in which an R-index apparently couldn’t be calculated). The evidentiary value of these 9 papers is clearly in doubt, despite the fact that they were published at a time when we should have known better. So what’s going on here? Six of them were conducted on special samples that are expensive to run or incredibly difficult to recruit (e.g., people who have suffered a stroke, people who inject drugs, studies in an fMRI scanner), and as a result they are all underpowered. Perhaps we shouldn’t be doing that work, as we don’t have enough funding in our lab to run the kind of sample sizes necessary to have confidence in the small effects that often emerge. Or perhaps we should publish the papers anyway, and let readers decide if the effects are sufficiently meaningful to be worthy of further investigation. I’d be curious to hear your thoughts on this Uli. Of the remaining three papers, one of them reports all four experiments we ran prior to publication, but since then has proven difficult to replicate and I have my doubts about it (let’s call that a clear predictive win for the R-Index). Another is well powered and largely descriptive without much hypothesis testing and I’m not sure an R-index makes sense for it. And the last one is underpowered (despite being run on undergraduates), so we clearly should have done better.

What do I conclude from this exercise? A consistent theme in our findings that have a low R-index is that they have small sample sizes and report small effects. Some of those probably reflect real findings, but others probably don’t. I suspect the single greatest threat to their validity (beyond the small sample sizes) was the fact that until very recently we never reported experiments that failed. In addition, sometimes we didn’t report measures we had gathered if they didn’t work out as planned and sometimes we added control variables into our equations in an ad-hoc manner. Failed experiments, measures that don’t work, and impactful ad-hoc controls are all common in science and reflect the fact that we learn what we’re doing as we go. But the capacity for other people to evaluate the work and its evidentiary value is heavily constrained when we don’t report those decisions. In retrospect, I deeply regret placing a greater emphasis on telling a clear story than on telling a transparent and complete story.

Has this been a wildly self-serving tour through the bowels of a social psychology lab whose R-index is in the toilet? Research on self-deception suggests you should probably decide for yourself.

Uli:  Thank you for your candid response. I think for researchers our age (not sure really how old you are) it will be easier to walk away from some articles published in the anything-goes days of psychological science because we still have time to publish some new and better work. As time goes on, it may become easier for everybody to acknowledge mistakes and become less defensive. I hope that your courage and our collaboration encourage more people to realize that the value of a researcher is not measured in terms of number of publications or citations. Research is like art and not every van Gogh is a masterpiece. We are lucky if we make at least one notable contribution to our science. So, losing a few papers to replication failures is normal. Let’s see what the results of the replication study will show.

Bill: I couldn’t agree with you more! (Except for the ‘courage’ part; I’m trepidatious as hell that my work won’t be taken seriously [or funded] anymore. But so be it.) I’ll be in touch as soon as we get ethics approval and run our replication study…

Self-Replications in JPSP: A Meta-Analysis

Preface

This blog post was inspired by my experience of receiving a rejection of a replication manuscript. We replicated Diener et al.’s (1995) JPSP article on the personality structure of affect. For the most part, it was a successful replication and a generalization to non-student samples. The manuscript was desk rejected because the replication study was not close enough in terms of the items and methods that we used. I was shocked that JPSP would reject replication studies, which made me wonder what the criteria for acceptance are.

Abstract

In 2015, JPSP started to publish online-only replication articles. I examined what these articles have revealed about the replicability of articles published in JPSP before 2015. Only 21 such articles were published between 2015 and 2020. Only 7 of these articles reported replications of JPSP articles, and one of them included replications of 3 articles. Out of these 9 replications, 6 were successful and 3 were failures. This finding shows once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. It remains unclear which articles in JPSP can be trusted, and selection for significance may undermine the value of self-replications in JPSP.

Introduction

The past decade has revealed a replication crisis in social psychology. First, Bargh’s famous elderly walking study did not replicate. Then, only 14 out of 55 (25%) significant results could be replicated in an investigation of the replicability of social psychology (Open Science Collaboration, 2015). While some social psychologists tried to dismiss this finding, additional evidence further confirmed that social psychology has a replication crisis (Motyl et al., 2017). A statistical method that corrects for publication bias and other questionable research practices estimates a replication rate of 43% (Schimmack, 2020). This estimate was replicated with a larger dataset of the most cited articles by eminent social psychologists (49%; Schimmack, 2021). However, these statistical estimates assume that it is possible to replicate studies exactly, whereas actual replication studies are often conceptual replications that vary in some attributes. Most often, the populations of the original and replication studies differ. Due to regression to the mean, effect sizes in replication studies are likely to be weaker. Thus, the statistical estimates are likely to overestimate the success rate of actual replication studies (Bartos & Schimmack, 2021). Accordingly, 49% is an upper limit, and we can currently conclude that the actual replication rate is somewhere between 25% and 50%. This is also consistent with analyses of statistical power in social psychology (Cohen, 1962; Sedlmeier & Gigerenzer, 1989).

There are two explanations for the emergence of replication failures in the past decade. One explanation is that social psychologists simply did not realize the importance of replication studies and forgot to replicate their findings. They only learned about the need to replicate findings in 2011, and when they started conducting replication studies, they realized that many of their findings are not replicable. Consistent with this explanation, Nelson, Simmons, and Simonsohn (2019) report that out of over 1,000 curated replication attempts, 96% have been conducted since 2011. The problem with this explanation is that it is not true. Psychologists have conducted replication studies since the beginning of their science. Since the late 1990s, many articles in social psychology reported at least two and sometimes many more conceptual replication studies. Bargh himself reported two close replications of his elderly priming study in an article with four studies (Bargh et al., 1996).

The real reason for the replication crisis is that social psychologists selected studies for significance (Motyl et al., 2017; Schimmack, 2021; Sterling, 1959; Sterling et al., 1995). As a result, only replication studies with significant results were published. What changed in 2011 is that researchers suddenly were able to circumvent censorship at traditional journals and were able to publish replication failures in new journals that were less selective, which in this case was a good thing (Doyen, Klein, Pichon, & Cleeremans, 2012; Ritchie, Wiseman, & French, 2012). The problem with this explanation is that it is true, but it makes psychological scientists look bad. Even undergraduate students with little formal training in philosophy of science realize that selective publishing of successful studies is inconsistent with the goal of searching for the truth (Ritchie, 2021). However, euphemistic descriptions of the research culture before 2011 avoid mentioning questionable research practices (Weir, 2015) or describe these practices as honest (Nelson et al., 2019). Even suggestions that these practices were at best honest mistakes are often met with hostility (Fiske, 2016). Rather than cleaning up the mess that has been created by selection for significance, social psychologists avoid discussion of their past practices to hide replication failures. As a result, not much progress has been made in vetting the credibility of thousands of published articles whose results might not replicate and that therefore provide no solid empirical support for their claims.

In short, social psychology suffers from a credibility crisis. The question is what social psychologists can do to restore credibility and to regain trust in their published results. For new studies this can be achieved by avoiding the pitfalls of the past. For example, studies can be pre-registered and journals may accept articles before the results are known. But what should researchers, teachers, students, and the general public do with the thousands of published results?

One solution to this problem is to conduct replication studies of published findings and to publish the results of these studies whether they are positive or negative. In their fantastic (i.e., imaginative or fanciful; remote from reality) review article, Nelson et al. (2019) proclaim that “top journals [are] routinely publishing replication attempts, both failures and successes” (p. 512). That would be wonderful if it were true, but top journals are considered top journals because they are highly selective in what they publish and they have limited journal space. So, every replication study competes with an article that offers an intriguing, exciting, and groundbreaking new discovery. Editors would need superhuman strength to resist the temptation to publish the sexy new finding and instead publish a replication of an article from 1977 or 1995. Surely, there are specialized journals for this laudable effort, which makes an important contribution to science but unfortunately does not meet the high threshold of a top journal that has to maintain its status as a top journal.

The Journal of Personality and Social Psychology found an ingenious solution to this problem. To avoid competition with groundbreaking new research, replication studies can be published in the journal, but only online. Thus, these extra articles do not count towards the limited page numbers that are needed to ensure high profit margins for the predatory (i.e., for-profit) publisher. Here, I examined which articles JPSP has published as online-only replication articles.

Data

The first e-only replication study was published in 2015. Over the past five years, JPSP has published 21 articles as e-replications (not counting 2021).

In the years from 1965 to 2014, JPSP published 9,428 articles. Thus, the 21 replication articles provide new, credible evidence for only 21/9,428 = 0.22% of the articles that were published before 2015, when selection bias undermined the credibility of the evidence in these articles. Despite the small sample size, it is interesting to examine the nature and the outcome of the studies reported in these 21 articles.

1. SUCCESS

Eschleman, K. J., Bowling, N. A., & Judge, T. A. (2015). The dispositional basis of attitudes: A replication and extension of Hepler and Albarracín (2013). Journal of Personality and Social Psychology, 108(5), e1–e15. https://doi.org/10.1037/pspp0000017

Hepler, J., & Albarracín, D. (2013). Attitudes without objects: Evidence for a dispositional attitude, its measurement, and its consequences. Journal of Personality and Social Psychology, 104(6), 1060–1076. https://doi.org/10.1037/a0032282

The original authors introduced a new measure called the Dispositional Attitude Measure (DAM). The replication study was designed to examine whether the DAM shows discriminant validity compared to an existing measure, the Neutral Objects Satisfaction Questionnaire (NOSQ). The replication studies replicated the previous findings, but also suggested that DAM and NOSQ are overlapping measures of the same construct. If we focus narrowly on replicability, this replication study is a success.

2. FAILURE

Van Dessel, P., De Houwer, J., Roets, A., & Gast, A. (2016). Failures to change stimulus evaluations by means of subliminal approach and avoidance training. Journal of Personality and Social Psychology, 110(1), e1–e15. https://doi.org/10.1037/pspa0000039

This article failed to replicate the evidence, reported by Kawakami et al. (2007), that approach and avoidance training can change stimulus evaluations. Thus, this article counts as a failure.

Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971. http://dx.doi.org/10.1037/0022-3514.92.6.957

Citation counts suggest that the replication failure has reduced citations of the original article, although 4 articles cited it in 2021 alone.

Most worrisome, an Annual Review of Psychology chapter (editor Susan Fiske) perpetuates the idea that subliminal stimuli could reduce prejudice. “Interventions seeking to automate more positive responses to outgroup members may train people to have an “approach” response to Black faces (e.g., by pulling a joystick toward themselves when Black faces appear on a screen; see Kawakami et al. 2007)” (Paluck, Porat, Clark, & Green, 2021, p. 543). The chapter does not cite the replication failure.

3. SUCCESS

Rieger, S., Göllner, R., Trautwein, U., & Roberts, B. W. (2016). Low self-esteem prospectively predicts depression in the transition to young adulthood: A replication of Orth, Robins, and Roberts (2008). Journal of Personality and Social Psychology, 110(1), e16–e22. https://doi.org/10.1037/pspp0000037

The original article used a cross-lagged panel model to claim that low self-esteem causes depression (rather than depression causing low self-esteem).

Orth, U., Robins, R. W., & Roberts, B. W. (2008). Low self-esteem prospectively predicts depression in adolescence and young adulthood. Journal of Personality and Social Psychology, 95(3), 695–708. https://doi.org/10.1037/0022-3514.95.3.695

The replication study showed the same results. In this narrow sense it is a success.

The same year, JPSP also published an “original” article that showed the same results.

Orth, U., Robins, R. W., Meier, L. L., & Conger, R. D. (2016). Refining the vulnerability model of low self-esteem and depression: Disentangling the effects of genuine self-esteem and narcissism. Journal of Personality and Social Psychology, 110(1), 133–149. https://doi.org/10.1037/pspp0000038

Last year, the authors published a meta-analysis of 10 studies that all consistently show the main result.

Orth, U., Clark, D. A., Donnellan, M. B., & Robins, R. W. (2021). Testing prospective effects in longitudinal research: Comparing seven competing cross-lagged models. Journal of Personality and Social Psychology, 120(4), 1013-1034. http://dx.doi.org/10.1037/pspp0000358

The high replicability of the key finding in these articles is not surprising because it is a statistical artifact (Schimmack, 2020). The authors also knew about this because I told them as a reviewer when their first manuscript was under review at JPSP, but neither the authors nor the editor seemed to care. In short, statistical artifacts are highly replicable.

4. SUCCESS

Davis, D. E., Rice, K., Van Tongeren, D. R., Hook, J. N., DeBlaere, C., Worthington, E. L., Jr., & Choe, E. (2016). The moral foundations hypothesis does not replicate well in Black samples. Journal of Personality and Social Psychology, 110(4), e23–e30. https://doi.org/10.1037/pspp0000056

The main focus of this “replication” article was to test the generalizability of the key finding in Graham, Haidt, and Nosek’s (2009) original article to African Americans. They also examined whether the results replicate in White samples.

Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029–1046. https://doi.org/10.1037/a0015141

Study 1 found weak evidence that the relationship between political conservatism and authority differs across racial groups, beta = .25 vs. beta = .47, chi2(1) = 3.92, p = .048. Study 2 replicated this finding, but the p-value was still above .005, beta = .43 vs. beta = .00, chi2(1) = 7.04. While stronger evidence for the moderator effect of race is needed, the study counts as a successful replication of the relationship among White or predominantly White samples.
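As a quick check, the reported chi-square values can be converted into p-values (a small sketch using scipy; df = 1 as reported in the article).

```python
from scipy.stats import chi2

for stat in (3.92, 7.04):
    p = chi2.sf(stat, df=1)          # upper-tail p-value for a chi-square test
    print(f"chi2(1) = {stat}: p = {p:.4f}")
# 3.92 -> p ~ .048; 7.04 -> p ~ .008, which is indeed above .005
```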

5. EXCLUDED

Crawford, J. T., Brandt, M. J., Inbar, Y., & Mallinas, S. R. (2016). Right-wing authoritarianism predicts prejudice equally toward “gay men and lesbians” and “homosexuals”. Journal of Personality and Social Psychology, 111(2), e31–e45. https://doi.org/10.1037/pspp0000070

This article reports replication studies, but the original studies were not published in JPSP. Thus, the results provide no information about the replicability of JPSP.

Rios, K. (2013). Right-wing authoritarianism predicts prejudice against “homosexuals” but not “gay men and lesbians.” Journal of Experimental Social Psychology, 49, 1177–1183. http://dx.doi.org/10.1016/j.jesp.2013.05.013

6. EXCLUDED

Panero, M. E., Weisberg, D. S., Black, J., Goldstein, T. R., Barnes, J. L., Brownell, H., & Winner, E. (2016). Does reading a single passage of literary fiction really improve theory of mind? An attempt at replication. Journal of Personality and Social Psychology, 111(5), e46–e54. https://doi.org/10.1037/pspa0000064

I excluded this article because it did not replicate a JPSP article. The original article was published in Science. Thus, the outcome of this replication study tells us nothing about the replicability of results published in JPSP.

7. EXCLUDED

Twenge, J. M., Carter, N. T., & Campbell, W. K. (2017). Age, time period, and birth cohort differences in self-esteem: Reexamining a cohort-sequential longitudinal study. Journal of Personality and Social Psychology, 112(5), e9–e17. https://doi.org/10.1037/pspp0000122

This article challenges the conclusions of the original article and presents new analyses using the same data. Thus, it is not a replication study.

8. EXCLUDED

Gebauer, J. E., Sedikides, C., Schönbrodt, F. D., Bleidorn, W., Rentfrow, P. J., Potter, J., & Gosling, S. D. (2017). The religiosity as social value hypothesis: A multi-method replication and extension across 65 countries and three levels of spatial aggregation. Journal of Personality and Social Psychology, 113(3), e18–e39. https://doi.org/10.1037/pspp0000104

This article is a successful self-replication of an article by the first two authors. The original article was published in Psychological Science. Thus, it does not provide evidence about the replicability of JPSP articles.

Gebauer, J. E., Sedikides, C., & Neberich, W. (2012). Religiosity, social self-esteem, and psychological adjustment: On the cross-cultural specificity of the psychological benefits of religiosity. Psychological Science, 23, 158–160. http://dx.doi.org/10.1177/0956797611427045

9. EXCLUDED

Siddaway, A. P., Taylor, P. J., & Wood, A. M. (2018). Reconceptualizing Anxiety as a Continuum That Ranges From High Calmness to High Anxiety: The Joint Importance of Reducing Distress and Increasing Well-Being. Journal of Personality and Social Psychology, 114(2), e1–e11. https://doi.org/10.1037/pspp0000128

This article replicates an original study published in Psychological Assessment. Thus, it does not tell us anything about the replicability of research in JPSP.

Vautier, S., & Pohl, S. (2009). Do balanced scales assess bipolar constructs? The case of the STAI scales. Psychological Assessment, 21, 187–193. http://dx.doi.org/10.1037/a0015312

10. EXCLUDED

Hounkpatin, H. O., Boyce, C. J., Dunn, G., & Wood, A. M. (2018). Modeling bivariate change in individual differences: Prospective associations between personality and life satisfaction. Journal of Personality and Social Psychology, 115(6), e12-e29. http://dx.doi.org/10.1037/pspp0000161

This article is a method article. The word replication does not appear once in it.

11. EXCLUDED

Burns, S. M., Barnes, L. N., McCulloh, I. A., Dagher, M. M., Falk, E. B., Storey, J. D., & Lieberman, M. D. (2019). Making social neuroscience less WEIRD: Using fNIRS to measure neural signatures of persuasive influence in a Middle East participant sample. Journal of Personality and Social Psychology, 116(3), e1–e11. https://doi.org/10.1037/pspa0000144

“In this study, we demonstrate one approach to addressing the imbalance by using portable neuroscience equipment in a study of persuasion conducted in Jordan with an Arabic-speaking sample. Participants were shown persuasive videos on various health and safety topics while their brain activity was measured using functional near infrared spectroscopy (fNIRS). Self-reported persuasiveness ratings for each video were then recorded. Consistent with previous research conducted with American subjects, this work found that activity in the dorsomedial and ventromedial prefrontal cortex predicted how persuasive participants found the videos and how much they intended to engage in the messages’ endorsed behaviors.”

This article reports a conceptual replication study: it uses a different population (US vs. Jordan) and a different methodology. Because the key finding did replicate, it might be considered a successful replication, but a failure could have been attributed to the differences in population and methodology. It is also not clear that a failure would have been reported. The study should have been conducted as a registered report.

12. FAILURE

Wilmot, M. P., Haslam, N., Tian, J., & Ones, D. S. (2019). Direct and conceptual replications of the taxometric analysis of type a behavior. Journal of Personality and Social Psychology, 116(3), e12–e26. https://doi.org/10.1037/pspp0000195

This article fails to replicate the claim that Type A and Type B are distinct types rather than extremes on a continuum. Thus, this article counts as a failure.

Strube, M. J. (1989). Evidence for the Type in Type A behavior: A taxometric analysis. Journal of Personality and Social Psychology, 56(6), 972–987. https://doi.org/10.1037/0022-3514.56.6.972

It is difficult to evaluate the impact of this replication failure because the replication study was just published and the original article has received hardly any citations in recent years. Overall, it has received 68 citations since 1989.

13. EXCLUDED

Kim, J., Schlegel, R. J., Seto, E., & Hicks, J. A. (2019). Thinking about a new decade in life increases personal self-reflection: A replication and reinterpretation of Alter and Hershfield’s (2014) findings. Journal of Personality and Social Psychology, 117(2), e27–e34. https://doi.org/10.1037/pspp0000199

This article replicated an original article published in PNAS. It therefore cannot be used to examine the replicability of articles published in JPSP.

Alter, A. L., & Hershfield, H. E. (2014). People search for meaning when they approach a new decade in chronological age. Proceedings of the National Academy of Sciences of the United States of America, 111, 17066–17070. http://dx.doi.org/10.1073/pnas.1415086111

14. EXCLUDED

Mõttus, R., Sinick, J., Terracciano, A., Hřebíčková, M., Kandler, C., Ando, J., Mortensen, E. L., Colodro-Conde, L., & Jang, K. L. (2019). Personality characteristics below facets: A replication and meta-analysis of cross-rater agreement, rank-order stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 117(4), e35–e50. https://doi.org/10.1037/pspp0000202

This article replicates a previous study of personality structure, using the same items and methods in a different sample. The results are a close replication, so it counts as a success, but the article is excluded because the original study was published in 2017 and therefore does not shed light on the replicability of articles published in JPSP before 2015.

Mõttus, R., Kandler, C., Bleidorn, W., Riemann, R., & McCrae, R. R. (2017). Personality traits below facets: The consensual validity, longitudinal stability, heritability, and utility of personality nuances. Journal of Personality and Social Psychology, 112, 474–490. http://dx.doi.org/10.1037/pspp0000100

15. EXCLUDED

van Scheppingen, M. A., Chopik, W. J., Bleidorn, W., & Denissen, J. J. A. (2019). Longitudinal actor, partner, and similarity effects of personality on well-being. Journal of Personality and Social Psychology, 117(4), e51–e70. https://doi.org/10.1037/pspp0000211

This study is a replication and extension of a previous study that examined the influence of personality on well-being in couples. A key finding was that personality similarity explained very little variance in well-being. While evidence for the lack of an effect is important, the replication crisis is about the reporting of too many significant results. A concern could be that the prior article reported a false negative result, but the studies were based on large samples with high power to detect even small effects. Because it replicates a null result rather than a significant finding, the article is excluded.

Dyrenforth, P. S., Kashy, D. A., Donnellan, M. B., & Lucas, R. E. (2010). Predicting relationship and life satisfaction from personality in nationally representative samples from three countries: The relative importance of actor, partner, and similarity effects. Journal of Personality and Social Psychology, 99, 690–702. http://dx.doi.org/10.1037/a0020385

16. EXCLUDED

Buttrick, N., Choi, H., Wilson, T. D., Oishi, S., Boker, S. M., Gilbert, D. T., Alper, S., Aveyard, M., Cheong, W., Čolić, M. V., Dalgar, I., Doğulu, C., Karabati, S., Kim, E., Knežević, G., Komiya, A., Laclé, C. O., Ambrosio Lage, C., Lazarević, L. B., . . . Wilks, D. C. (2019). Cross-cultural consistency and relativity in the enjoyment of thinking versus doing. Journal of Personality and Social Psychology, 117(5), e71–e83. https://doi.org/10.1037/pspp0000198

This article mainly aims to examine the cross-cultural generality of a previous study by Wilson et al. (2014). Moreover, the original study was published in Science. Thus, it does not help to examine the replicability of research published in JPSP before 2015.

Wilson, T. D., Reinhard, D. A., Westgate, E. C., Gilbert, D. T., Ellerbeck, N., Hahn, C., . . . Shaked, A. (2014). Just think: The challenges of the disengaged mind. Science, 345, 75–77. http://dx.doi.org/10.1126/science.1250830

17. MIXED

Yeager, D. S., Krosnick, J. A., Visser, P. S., Holbrook, A. L., & Tahk, A. M. (2019). Moderation of classic social psychological effects by demographics in the U.S. adult population: New opportunities for theoretical advancement. Journal of Personality and Social Psychology, 117(6), e84-e99. http://dx.doi.org/10.1037/pspa0000171

This article reports replications of seven original studies. It also examined whether the results are moderated by age / student status. Each replicated effect is evaluated separately below.

17a. EXCLUDED

Conformity to a simply presented descriptive norm (Asch, 1952; Cialdini, 2003; Sherif, 1936). [None of these references are from JPSP]

17b. SUCCESS

The effect of a content-laden persuasive message on attitudes as moderated by argument quality and need for cognition (e.g., Cacioppo, Petty, & Morris, 1983).

Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818. http://dx.doi.org/10.1037/0022-3514.45.4.805

17c. EXCLUDED

Base-rate underutilization (using the “lawyer/engineer” problem; Kahneman & Tversky, 1973). [not in JPSP]

17d. EXCLUDED

The conjunction fallacy (using the “Linda” problem; Tversky & Kahneman, 1983). [not in JPSP]

17e. EXCLUDED

Underappreciation of the law of large numbers (using the “hospital” problem; Tversky & Kahneman, 1974). [Not in JPSP]

17f. SUCCESS

The false consensus effect (e.g., Ross, Greene, & House, 1977).

Ross, L., Greene, D., & House, P. (1977). The “false consensus effect”: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301. http://dx.doi.org/10.1016/0022-1031(77)90049-X

17g. FAILURE

The effect of “ease of retrieval” on self-perceptions (e.g., Schwarz et al., 1991).

Schwarz, N., Bless, H., Strack, F., Klumpp, G., Rittenauer-Schatka, H., & Simons, A. (1991). Ease of retrieval as information: Another look at the availability heuristic. Journal of Personality and Social Psychology, 61, 195–202. http://dx.doi.org/10.1037/0022-3514.61.2.195

Because the replication failure was just published, it is not possible to examine whether it had any effect on citations.

In sum, the Yeager et al. (2019) article successfully replicated 2 JPSP articles and failed to replicate 1.

18. SUCCESS

Van Dessel, P., De Houwer, J., Gast, A., Roets, A., & Smith, C. T. (2020). On the effectiveness of approach-avoidance instructions and training for changing evaluations of social groups. Journal of Personality and Social Psychology, 119(2), e1–e14. https://doi.org/10.1037/pspa0000189

This is another replication of the Kawakami et al. (2007) article, but it focuses on Experiment 1, which did not use subliminal stimuli. The article reports a successful replication in Study 1, t(61) = 1.72, p = .045 (one-tailed), and in Study 3, t(981) = 2.19, p = .029, and t(362) = 2.76, p = .003. Thus, this article counts as a success. It should be noted, however, that these effects disappear in studies with a delay between the training and testing sessions (Lai et al., 2016).

Kawakami, K., Phills, C. E., Steele, J. R., & Dovidio, J. F. (2007). (Close) distance makes the heart grow fonder: Improving implicit racial attitudes and interracial interactions through approach behaviors. Journal of Personality and Social Psychology, 92, 957–971. http://dx.doi.org/10.1037/0022-3514.92.6.957
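
As a quick check (not part of the original article), the p-values reported above for Van Dessel et al. (2020) can be reproduced from the t statistics and degrees of freedom. This is a minimal sketch; which values are one-tailed versus two-tailed is an inference on my part, since the article only labels the first test explicitly:

```python
from scipy.stats import t

# Study 1: one-tailed p is the upper-tail probability of the t distribution
print(round(t.sf(1.72, df=61), 3))       # 0.045 (one-tailed)

# Study 3, first test: a two-tailed p doubles the upper-tail probability
print(round(2 * t.sf(2.19, df=981), 3))  # 0.029 (two-tailed)

# Study 3, second test: matches a one-tailed p
print(round(t.sf(2.76, df=362), 3))      # 0.003 (one-tailed)
```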

19. EXCLUDED

Aknin, L. B., Dunn, E. W., Proulx, J., Lok, I., & Norton, M. I. (2020). Does spending money on others promote happiness?: A registered replication report. Journal of Personality and Social Psychology, 119(2), e15–e26. https://doi.org/10.1037/pspa0000191

This article replicated a study that was published in Science. It therefore does not tell us anything about the replicability of articles published in JPSP.

Dunn, E. W., Aknin, L. B., & Norton, M. I. (2008). Spending money on others promotes happiness. Science, 319, 1687–1688. http://dx.doi.org/10.1126/science.1150952

20. EXCLUDED

Calderon, S., Mac Giolla, E., Ask, K., & Granhag, P. A. (2020). Subjective likelihood and the construal level of future events: A replication study of Wakslak, Trope, Liberman, and Alony (2006). Journal of Personality and Social Psychology, 119(5), e27–e37. https://doi.org/10.1037/pspa0000214

Although this article reports two replication studies (both failures), the original studies were published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.

Wakslak, C. J., Trope, Y., Liberman, N., & Alony, R. (2006). Seeing the forest when entry is unlikely: Probability and the mental representation of events. Journal of Experimental Psychology: General, 135, 641–653. http://dx.doi.org/10.1037/0096-3445.135.4.641

21. EXCLUDED

Burnham, B. R. (2020). Are liberals really dirty? Two failures to replicate Helzer and Pizarro’s (2011) study 1, with meta-analysis. Journal of Personality and Social Psychology, 119(6), e38–e42. https://doi.org/10.1037/pspa0000238

Although this article reports two replication studies (both failures), the original study was published in a different journal. Thus, the results do not provide information about the replicability of research published in JPSP.

Helzer, E. G., & Pizarro, D. A. (2011). Dirty liberals! Reminders of physical cleanliness influence moral and political attitudes. Psychological Science, 22, 517–522. http://dx.doi.org/10.1177/0956797611402514

Results

Out of the 21 articles published under the e-replication format, only 7 report replications of studies that were published in JPSP before 2015. One of these articles reports replications of three original articles, and two of them report replications of different studies from the same original article (one failure, one success; Kawakami et al., 2007). In total, there are 9 replications with 6 successes and 3 failures. This is a success rate of 67%, 95% CI = 31% to 98%.
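
The confidence interval for this success rate is a simple binomial interval. A minimal sketch of how such an interval can be computed is shown below; the exact bounds depend on the interval method (Clopper-Pearson is used here), so they may differ somewhat from the values reported above:

```python
from scipy.stats import binomtest

successes, n = 6, 9
result = binomtest(successes, n)
# Clopper-Pearson ("exact") interval; other methods (e.g., Wilson) give different bounds
ci = result.proportion_ci(confidence_level=0.95, method="exact")
print(f"success rate = {successes / n:.0%}")
print(f"95% CI = [{ci.low:.0%}, {ci.high:.0%}]")
```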

The first observation is that the number of replications of studies published in JPSP is abysmally low. It is not clear why this is the case. Either researchers are not interested in conducting replication studies or JPSP is not accepting all submissions of replication studies for publication. Only the editors of JPSP know.

The second observation is that there is some evidence that JPSP published more successful than failed replications. This is inconsistent with the results of the Open Science Collaboration project and with predictions based on statistical analyses of the p-values in JPSP articles (Open Science Collaboration, 2015; Schimmack, 2020, 2021). Although this difference may simply reflect sampling error because the sample of replication studies in JPSP is so small, it is also possible that the high success rate reflects systematic factors that select for significance.

First, researchers may be less motivated to conduct studies with a low probability of success, especially in research areas that have been tarnished by the replication crisis. Who still wants to do priming studies in 2021? Thus, bad research that was published before 2015 may simply die out. The problem with this slow death model of scientific self-correction is that old studies continue to be cited as evidence. Thus, JPSP should solicit replication studies of prominent articles with high citations even if these replication studies may produce failures.

Second, it is unfortunately possible that editors at JPSP prefer to publish studies that report successful outcomes rather than replication failures. To reassure consumers of JPSP, editors should make it transparent whether replication studies get rejected and why they get rejected. Given the e-only format, it is not clear why any replication studies would be rejected.

Conclusion

Unfortunately, the results of this meta-analysis show once more that psychologists do everything in their power to appear trustworthy without doing the things that are required to gain or regain trust. While fabulous review articles tout the major reforms that have been made (Nelson et al., 2019), the reality is often much less glamorous. Trying to get social psychologists to openly admit that they made (honest) mistakes in the past and to correct themselves is akin to getting Trump to admit that he lost the 2020 election. Most of the energy is wasted on protecting the collective self-esteem of card-carrying social psychologists in the face of objective, scientific criticism of their past and current practices. It remains unclear which results in JPSP are replicable and provide solid foundations for a science of human behavior and which results are nothing but figments of social psychologists’ imagination. Thus, I can only warn consumers of social psychological research to be as careful as they would be when they are trying to buy a used car. Often the sales pitch is better than the product (Ritchie, 2020; Singal, 2021).