The Race Implicit Association Test Is Biased

This is a preprint (not yet submitted to a journal) of a manuscript that examines the validity of the race IAT as a measure of in-group and out-group attitudes for African and White Americans. We show that research on intergroup relationships and attitudes benefits from the insights of African Americans (insights grounded in being inside the experience) that are often ignored by White psychologists. Data and syntax are available at https://osf.io/rvfz8/.

The Race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes Towards Their In-Group

Ulrich Schimmack
University of Toronto Mississauga

Alicia Howard
Music Wellbeing

Abstract

Explicit ratings of attitudes show an in-group preference for both African Americans and White Americans. However, the average score of African Americans on the race Implicit Association Test (IAT) is close to zero. This finding has been interpreted as evidence that many African Americans have unconsciously internalized negative attitudes towards their group. We conducted a multi-method study of this hypothesis with several implicit measures (Single-Target IAT, Evaluative Priming, Affective Misattribution Procedure) that distinguish between in-group and out-group attitudes. Our main finding is that African Americans have positive attitudes towards their in-group on a latent factor that reflects the valid variance across measures. In addition, African Americans' race IAT scores are unrelated to their in-group and out-group attitudes. Moreover, White Americans' race IAT scores are biased and exaggerate in-group preferences. These findings are discussed in terms of unique aspects of the race IAT that may activate cultural stereotypes. The results have ethical implications for the practice of providing individuals with feedback about their unconscious biases based on an invalid measure. It is harmful to suggest that African Americans unconsciously dislike their own group and to exaggerate the prejudice of White Americans. Ongoing discrimination may be better explained by the explicit prejudice of a minority of White Americans than by pervasive, uncontrollable implicit biases of most White Americans.

Introduction

With 1,277 citations in WebOfScience, Jost, Banaji, and Nosek’s (2004) article “A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo” is easily the most cited article in the journal Political Psychology. The second most cited article has less than half the number of citations (523 citations). The abstract of this influential article states the authors’ main thesis clearly and succinctly. They postulate a general motive to support the existing social order. This motive contributes to internalization of inferiority of disadvantaged groups. Most important for this article is the claim that this internalization of inferiority is “observed most readily at an implicit, nonconscious level of awareness” (p. 881).

The theory is broadly applied to a wide range of stigmatized groups and its validity has to be evaluated for each group individually. Our focus is on the African American community. Jost et al. (2004) assume that system justification theory is applicable to African Americans because they show different evaluations of their in-group on explicit measures and on the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). On explicit measures, like the feeling thermometer, African Americans show higher in-group favoritism than White Americans (standardized mean differences d = .8 vs. .6). However, IAT scores show greater in-group favoritism for White Americans than for African Americans (d = .9 vs. 0).  IAT scores close to zero for African Americans have been interpreted as evidence that “sizable proportions of members of disadvantaged groups – often 40% to 50% or even more exhibit implicit (or indirect) biases against their own group and in favor of more advantaged groups” (Jost, 2019, p. 277).

This pattern of results is based on large samples and has been replicated in several studies. Thus, we are not questioning the empirical facts. Our concern is that Jost and colleagues misinterpret these results. In the early 2000s, it was common to assume that explicit and implicit group evaluations reflect different constructs (Nosek, Greenwald, & Banaji, 2005). This dual-attitude model allows for different evaluations of the in-group at a conscious and an unconscious level. Evidence for this model rested mostly on the finding that race IAT scores and self-ratings are only weakly correlated, r ~ .2 (Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005). However, these studies did not correct for measurement error. After correcting for measurement error, the correlation increases to r = .8 (Schimmack, 2021a). The race IAT also has little incremental predictive validity over explicit measures (Schimmack, 2021b). This new evidence makes it less plausible that explicit and implicit attitudes diverge. In fact, there is no evidence that attitudes are hidden from consciousness. Thus, there may be an alternative explanation for African Americans' scores on the race IAT.

White Psychologists’ Theorizing about African Americans

Before we propose an alternative explanation for African Americans' neutral scores on the race IAT, we would like to note that Jost et al.'s (2004) claims about African Americans follow a long tradition of psychological research on African Americans by mostly White psychologists. This research often ignores the lived experience of African Americans, which leads to false claims (cf. Adams, 2010). For example, since the beginning of psychology, White psychologists assumed that African Americans have low self-esteem and proposed several theories for this seemingly obvious fact. However, in 1986 Rosenberg ironically pointed out that “everything stands solidly in support of this conclusion except the facts.” Since then, decades of research have shown that African Americans have the same or even higher self-esteem than White Americans (Twenge & Crocker, 2002). Just like White theorists' claims about self-esteem, Jost et al.'s claims about African Americans' unconscious are removed from African Americans' own understanding of their culture and identity and disconnected from other findings that conflict with the theory's predictions. The only empirical support for the theory is the neutral score of African Americans on the race IAT.

African Americans' Resilience in a Culture of Oppression

We are skeptical about the claim that most African Americans secretly favor the out-group, based on the lived experience of the second author. Alicia Howard is an African American from a predominantly White, small town in Kentucky. She grew up surrounded by a large family and attended a Black church. Her identity was shaped by role models from this Black in-group and not by some idealized, abstract image of the White out-group. Also, contrary to the famous doll studies, she had White and Black dolls and got excited when a new Black doll came out. Alicia studied classical music at Kentucky State University, a historically Black college and university. Even though her admired composers like Rachmaninov were White, she looked up to Black classical musicians like Andre Watts, Kathleen Battle, Leontyne Price, and Jessye Norman as role models. It is of course possible that her experiences are unique and not representative of African Americans. However, no one in her family or among her Black friends showed signs that they preferred to be White or liked White people more than Black people. In small towns, the lives of Black and White people are also more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group. Although there are Black individuals who seem to struggle with their Black identity, there are also White people who suffer from White guilt or assume a Black identity for other reasons. Thus, from an African American perspective, system justification theory does not seem to characterize most African Americans' attitudes towards their in-group.

The Race IAT Could Be Biased

We are not the first to note that the race IAT may not be a pure measure of attitudes (Olson & Fazio, 2004). The nature of the task may activate cultural stereotypes that are normally not activated when African Americans interact with each other. As a result, the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans. Thus, an alternative explanation for the greater in-group bias for White Americans than for African Americans on the race IAT is that attitudes and cultural stereotypes act together for White Americans, whereas they act in opposite directions for African Americans.

One way to test this hypothesis is to examine in-group biases with alternative implicit measures that do not activate stereotypes. The most widely used alternative implicit measures are the Affective Misattribution Procedure (AMP; Payne, Cheng, Govorun, & Stewart, 2005) and the Evaluative Priming Task (EPT; Fazio, Jackson, Dunton, & Williams, 1995). Only recently has it been noted that these implicit measures produce different results (Teige-Mocigemba, Becker, Sherman, Reichardt, & Klauer, 2017). A study in the United States examined the differences between African American and White respondents on three implicit measures (Figure 1; Bar-Anan & Nosek, 2014).

Known-group differences are much more pronounced for the race IAT than the other two implicit tasks. The authors interpret this finding as evidence that the race IAT has higher validity. That is, under the assumption that (mostly) White participants have a strong preference for their in-group, a positive mean is predicted, and the more positive the mean is, the more valid a measure is. However, alternative explanations are possible. One alternative explanation is that only the race IAT activates cultural stereotypes and produces a high pro-White mean as a result. In contrast, the other tasks are better measures of attitudes and the results show that prejudice is much less pronounced than the race IAT suggests. That is, the race IAT is biased because it activates cultural stereotypes that are not automatically activated with other implicit tasks.

Another limitation of the race IAT is that preferences for the in-group and the out-group are confounded. In contrast, the other two tasks can be scored separately to obtain measures of the strength of preferences for the in-group and the out-group. This is particularly helpful for making sense of the neutral score of African Americans on the race IAT. One explanation for a weaker in-group bias is simply that African Americans are less biased against the out-group than White Americans. Thus, a better test of African Americans' attitudes towards their own group is to examine how positive or negative African Americans' responses are to African American stimuli.

In short, published studies reveal that different implicit tasks produce different results and that the race IAT shows stronger pro-White biases than other tasks. However, it has not been systematically explored whether this finding reveals higher or lower validity of the race IAT. We used Bar-Anan and Nosek’s (2014) data to explore this question.

Method

Data

The data are based on a voluntary online sample. The total sample size is large (N = 23,413). However, each participant completed only a subset of the tasks, which also included implicit measures of political orientation and self-esteem. Table 1 shows the number of African American and White participants for the six measures.

Measures

Race IAT. The race IAT followed the standard Implicit Association Test procedure, although the specific stimuli representing the African American and White American groups differed from those typically used. However, this does not appear to have influenced responses, as indicated by the similar means for African American and White American participants. The race IAT was scored so that higher values represent a pro-White bias for White participants and a pro-Black bias for Black participants.

Single-Target IAT. The single-target IAT (ST-IAT) is a variation of the race IAT. The main difference is that participants classify only one racial group along with positive and negative stimuli. As a result, the ST-IAT reflects evaluations of only one group and provides separate information about evaluations of the in-group and the out-group. It is particularly interesting how Black participants perform on the in-group ST-IAT with Black targets. System justification theory predicts a score close to zero, reflecting an overall neutral attitude, with 50% or more of participants holding negative views of the in-group.

Evaluative Priming Task. The Evaluative Priming Task (EPT) was developed by Fazio et al. (1995). In a practice block, participants classified words as “good” or “bad.” In the next three blocks, the target words were preceded by prime pictures of African Americans and White Americans. In-group bias was computed as the response time to negative words following same-group primes minus the response time to positive words following same-group primes. Out-group bias was computed in the same way for other-group primes.

Affective Misattribution Procedure. The Affective Misattribution Procedure was developed by Payne et al. (2005). Pictures of African Americans or White Americans are quickly followed by a Chinese character and a mask. Participants are instructed to rate the Chinese character as more or less pleasant than the average Chinese character and not to let the pictures influence their evaluation of the target stimuli. The in-group score was the percentage of “more pleasant” responses after an in-group picture. The out-group score was the percentage of “more pleasant” responses after an out-group picture.
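
To make the scoring rules above concrete, the following is a minimal sketch of how such in-group and out-group scores could be computed from a hypothetical trial-level data frame. The column names (prime_group, target_valence, rt_ms, judgment) are illustrative assumptions, not the variable names in Bar-Anan and Nosek's (2014) dataset.

```python
import pandas as pd

def ept_score(trials: pd.DataFrame, prime_group: str) -> float:
    """EPT score for one prime group: mean RT to negative targets minus
    mean RT to positive targets; larger values indicate a more positive
    evaluation of that group."""
    sub = trials[trials["prime_group"] == prime_group]
    rt_negative = sub.loc[sub["target_valence"] == "negative", "rt_ms"].mean()
    rt_positive = sub.loc[sub["target_valence"] == "positive", "rt_ms"].mean()
    return rt_negative - rt_positive

def amp_score(trials: pd.DataFrame, prime_group: str) -> float:
    """AMP score for one prime group: proportion of 'more pleasant'
    judgments of the Chinese character after primes from that group."""
    sub = trials[trials["prime_group"] == prime_group]
    return (sub["judgment"] == "more pleasant").mean()
```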

Feeling Thermometer. Self-reports of in-group and out-group attitudes were measured with feeling thermometers. Participants rated how warm or cold they feel toward the in-group and the out-group on an 11-point scale ranging from 0 = coldest feelings to 10 = warmest feelings.

For all measures, participants' scores were divided by the standard deviation so that means can be interpreted as standardized effect sizes, assuming that a mean of zero reflects a neutral attitude, positive scores reflect positive attitudes, and negative scores reflect negative attitudes.
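
A minimal sketch of this rescaling, under the assumption that raw scores are simply divided by their sample standard deviation (without mean-centering, so that zero remains the neutral point):

```python
import numpy as np

def rescale_to_d_units(scores):
    """Divide raw scores by their standard deviation so that the mean of
    the rescaled scores can be read as a Cohen's d relative to zero."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.std(ddof=1)
```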

Results

The data were analyzed using structural equation modeling with Mplus 8.2 (Muthén & Muthén, 2017). A multi-group model was specified with African Americans and White Americans as separate groups. The model was developed iteratively using the data. Thus, all results are exploratory and require validation in a separate sample. Due to the small number of Black participants, it was not possible to cross-validate the model with half of the sample. Moreover, tests of group differences have low power, and a study with a larger sample of African Americans is needed to test the equivalence of parameters. Because cherry-picking of data, models, and references undermines psychological science, we also constructed an alternative model that assumes that some implicit measures are biased and inflate the in-group attitudes of African Americans. To identify the means of the latent in-group and out-group factors, we chose the single-target IAT because it shows the least positive attitudes of African Americans towards their in-group. We then freed other parameters to maximize model fit. The data, input syntax, and full outputs have been posted online (https://osf.io/rvfz8/).

Preferred Model

The overall fit of the final model meets standard fit criteria (RMSEA < .06, CFI > .95): chi2(78) = 133.37, RMSEA = .012, 90%CI = .009 to .016, CFI = .981. However, models with low coverage (many missing data) may overestimate model fit. A follow-up study that administers all tasks to all participants should be conducted to provide a stronger test of the model. Nevertheless, the model is parsimonious, and there were no modification indices greater than 20. This suggests that there are no major discrepancies between the model and the data.

Figure 2 shows the measurement model for attitudes towards the in-group and the out-group. The key unobserved variables in this model are the attitude towards the in-group factor (ig) and the attitude towards the out-group factor (og). Each factor is measured with four indicators, namely scores on the single-target IAT (satig/satog), the evaluative priming task (epig/epog), the affective misattribution procedure (ampig/ampog), and the explicit feeling thermometer ratings (thermoig/thermoog). For ease of interpretation, Figure 2 shows standardized coefficients that range from -1 to 1.
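
For readers who want to see the basic structure of such a model in code, the following is a minimal single-group sketch using the Python package semopy with lavaan-style syntax. The indicator names mirror the labels in Figure 2, but the snippet is an illustrative assumption, not the authors' multi-group Mplus setup (which also includes the race IAT, method factors, and latent means).

```python
import pandas as pd
import semopy

# Two correlated attitude factors, each with four indicators
MODEL_DESC = """
ig =~ satig + epig + ampig + thermoig
og =~ satog + epog + ampog + thermoog
ig ~~ og
"""

def fit_measurement_model(df: pd.DataFrame):
    model = semopy.Model(MODEL_DESC)
    model.fit(df)                        # maximum-likelihood estimation
    estimates = model.inspect()          # loadings; a squared standardized
                                         # loading is the share of valid
                                         # variance in that measure
    fit_stats = semopy.calc_stats(model) # chi-square, CFI, RMSEA, etc.
    return estimates, fit_stats
```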

The first finding is that the loadings of the measures on the in-group factor (.3-.4) and on the out-group factor (.4) are modest. They suggest that less than 20% of the variance in a single measure is valid variance. However, the model clearly identified latent factors that capture individual differences in attitudes towards the in-group and the out-group for Black and White Americans. The second noteworthy finding is that the loadings for African Americans and White Americans were similar. Thus, the multi-method measurement model was able to identify variation in in-group and out-group attitudes for both groups.

A third finding is that, for White participants, .54^2 = 29% of the variance in race IAT scores reflects attitudes towards African Americans (i.e., prejudice). This is a bit higher than previous estimates, which were in the 10% to 20% range (Schimmack, 2021a). However, the lower limit of the 95%CI overlaps with this range of possible values, .43^2 = 18%.

Most important is the finding that race IAT scores of African Americans were unrelated to the in-group and out-group attitude factors. Thus, scores on the race IAT do not appear to be valid measures of African Americans' attitudes. This finding has important implications for Jost et al.'s (2004) reliance on race IAT scores to make inferences about African Americans' unconscious attitudes towards their in-group. This interpretation assumed that race IAT scores provide valid information about African Americans' attitudes towards the in-group, but no evidence for this assumption was provided. The present results, nearly 20 years later, show that this fundamental assumption is wrong. The race IAT does not provide information about African Americans' attitudes towards the in-group as reflected in other implicit measures.

An additional interesting finding was that in-group and out-group attitudes were unrelated. This suggests that prejudice does not enhance pro-White attitudes for White participants. It also suggests that Black pride does not have to devalue the White outgroup.

Finally, the model shows that three of the methods exhibit strong method variance. These three methods measured in-group and out-group attitudes within a single experimental block. The exception is the single-target IAT, which is administered once with one target (Black) and once with the other target (White). Separating the assessment of in-group and out-group attitudes for the other tasks might reduce the amount of systematic measurement error. However, less systematic measurement error does not seem to translate into more valid variance, as the single-target IAT was not more valid than the other measures. The results for the commonly used feeling thermometer are particularly noteworthy. While this measure shows some modest validity, the present results also show that this single-item measure has poor psychometric properties. An important goal for future research is to develop more valid measures of attitudes towards in-groups and out-groups. Until then, researchers should use a multi-method approach.

Figure 3 shows the model for the means. While standardized coefficients are easier to interpret for the measurement model, means are easier to interpret in the units of the measures, which were scaled so that means can be interpreted as Cohen’s d values.

The most important finding is that African Americans’ mean for the in-group factor is positive, d = 1.07, 95%CI = 0.98 to 1.16. Thus, the data provide no support for the claim that most African Americans evaluate their in-group negatively. With a normal distribution centered at 1.07, only 14% of African Americans would have a negative (below 0) attitude towards the in-group. White Americans also show a positive evaluation of the in-group, but to a lesser extent, d = 0.62; 95%CI = 0.58, 0.66. The confidence intervals are tight and clearly do not overlap, and constraining these two coefficients to be equal reduced model fit, chi2(79) = 228.43, Δchi2(1) = 95.06, p = 1.85e-22.  Thus, this model suggests that African Americans have an even more positive attitude towards their in-group than White Americans.
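
The two derived quantities in this paragraph can be checked with a few lines of code (assuming, as in the text, a normal distribution of the latent in-group factor with a standard deviation of 1 around its mean):

```python
from scipy import stats

# Share of African Americans with a below-zero (negative) in-group attitude
# if the latent factor is normally distributed with mean 1.07 and SD 1:
print(stats.norm.cdf(0, loc=1.07, scale=1))   # ~0.14, i.e., about 14%

# p-value of the chi-square difference test for equal in-group means:
print(stats.chi2.sf(95.06, df=1))             # ~1.9e-22
```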

As expected, out-group attitudes are less positive than in-group attitudes for both groups. Also expected was the finding that the out-group attitudes of African Americans, d = .42, 95%CI, are more favorable than the out-group attitudes of White Americans, d = .20, 95%CI. However, even White Americans' out-group attitudes are on average positive. This finding is in marked contrast to the common finding with the race IAT that most White Americans show a pronounced pro-White bias, which has often been interpreted as evidence of widespread prejudice. However, this interpretation is problematic for two reasons. First, it confounds in-group and out-group attitudes. Prejudice is defined as White Americans' attitude towards African Americans, and the race IAT is not a direct measure of prejudice because it measures relative preferences. Of course, in-group favoritism alone can lead to discrimination and racial disparities when one group is dominant, but these consequences can occur without actual prejudice against African Americans. The present results suggest that African Americans also have an in-group bias. Thus, it is important to distinguish in-group favoritism, which applies to both groups, from prejudice, which applies uniquely to White Americans' attitudes towards African Americans.

The bigger problem for the race IAT is that White Americans’ scores on the race IAT are systematically biased towards a pro-White score, d = .78, whereas African Americans’ scores are only slightly biased towards a pro-Black score, d = -.19. This finding shows that IAT scores provide misleading information about the amount of in-group favoritism. Thus, support for the system justification theory rests on a measurement artifact.

Alternative Model

It is possible that our modeling decisions exaggerated the positivity of African Americans' in-group attitudes. To address this concern, we tried to find an alternative model that fits the data with the lowest amount of in-group bias for African Americans. This alternative model fit the data as well as our preferred model, chi2(77) = 134.24, RMSEA = .013, 90%CI = .009 to .016, CFI = .980. Thus, the data cannot distinguish between these two models. The covariance structure was identical. Thus, we only present the means structure of the model (Figure 4).

The main difference between the models is that African Americans' attitudes towards the in-group are less favorable in the alternative model (d = .54) than in the preferred model (d = 1.07). The discrepancy is explained by the assumptions that African Americans have a positive bias on the feeling thermometer and that African Americans' responses to White targets on the AMP are negatively biased (ampog = -.72). The most important finding is that African Americans' in-group attitudes remain positive, d = .54, although they are now slightly less favorable than White Americans' in-group attitudes, d = .62.

Proponents of system justification theory might argue that attitudes towards the in-group have to be evaluated in relative terms. Viewed from this perspective, the results still show relatively more in-group favoritism for White Americans, d = .62 – .20 = .42, than for African Americans, d = .54 – .40 = .14. However, out-group attitudes contribute more to this difference, d = .40 – .20 = .20, than in-group attitudes, d = .62 – .54 = .08. Thus, one reason for the difference in relative preferences is that African Americans' attitudes towards Whites are more positive than White Americans' attitudes towards African Americans. It would be a mistake to interpret this difference in evaluations of the out-group as evidence that African Americans have internalized negative stereotypes about their in-group.

The alternative model does not alter the fact that scores on the race IAT are biased and provide misleading information about in-group and out-group attitudes.

Discussion

After its introduction in 1998, the Implicit Association Test was quickly accepted as a valid measure of attitudes that individuals are unwilling or unable to report on self-report measures. Mean scores of White Americans were interpreted as evidence that prejudice is much more widespread and severe than self-report measures suggest. Mean scores of African Americans were interpreted as evidence of unconscious self-loathing. The present results suggest that millions of African American and White visitors to the Project Implicit website were given false feedback about their attitudes. For White Americans, the race IAT does appear to reflect individual differences in out-group attitudes (prejudice). However, scoring the IAT in terms of deviations from a value of zero is invalid because the mean is biased towards pro-White scores. Even the amount of valid variance is modest and insufficient to provide individualized feedback.

Implications for African Americans' In-Group and Out-Group Attitudes

Our investigation started with the surprising suggestion that African Americans are motivated to justify racism and are supposed to have internalized negative stereotypes and attitudes towards their group. This view of African Americans is detached from their history and from evidence of high self-esteem among African Americans. The only evidence for this claim was the finding that African Americans do not show a strong in-group preference on the race IAT.

Our results suggest that this finding is due to the low validity of the race IAT as a measure of African Americans' attitudes. African Americans' race IAT scores were unrelated to their in-group attitudes and out-group attitudes as measured by other measures, including the single-target variant of the IAT.

This raises the question of how the race IAT differs from other measures. We are not the first to suggest that the race IAT activates negative cultural stereotypes (Olson & Fazio, 2004). These stereotypes are known to African Americans and may influence their performance on the IAT, even if African Americans do not endorse them and they are rarely activated in real life. Thus, a mean close to zero need not reflect the fact that 50% of African Americans have negative attitudes towards their group. Rather, it is possible that the neutral score reflects a balanced influence of positive attitudes and negative stereotypes.

Another noteworthy difference is that other implicit tasks rely on pictures of individual group members to elicit a valenced response. In contrast, the race IAT focuses on the evaluation of the abstract category “Black.” It is possible that African Americans have more positive attitudes towards (pictures of) members of the group than towards the concept of being “Black,” which is a fuzzy category at best. Similarly, old people seem to have a negative attitude towards the concept of being “old,” but this does not imply that they do not like old people. This has important implications for the predictive validity of the IAT. In everyday life, we encounter individuals, not abstract categories. Thus, even if the race IAT were a valid measure of attitudes towards abstract categories, it would be a weak predictor of actual behaviors.

In sum, the only empirical support for system justification theory was African Americans' neutral score on the race IAT. We show that the race IAT lacks validity and that African Americans have positive attitudes towards their in-group on all other measures. We also find that they have positive attitudes towards the White out-group. This has important implications for the assessment of the racial attitudes of White participants. If most White Americans had negative attitudes towards Black people and these attitudes consistently influenced their behavior, African Americans would experience discrimination from most White Americans. In that case, we would expect negative attitudes towards the out-group. As the data show, this is not the case. This does not mean that discrimination is rare. Rather, it is possible that most acts of discrimination are committed by a relatively small group of White Americans (Campbell & Brauer, 2021).

Implications for White Americans' In-Group and Out-Group Attitudes

Banaji and Greenwald's (2013) popular book was largely responsible for claims that implicit bias is real, widespread, and explains racial discrimination. The book ends with several conclusions. Two of these conclusions are widely accepted among social psychologists and a majority of US Americans, namely that Black disadvantage exists and that racial discrimination at least partially contributes to this disadvantage. However, other conclusions were not generally accepted and were not clearly supported by evidence, namely that attitudes have both a reflective and an automatic form, that people are often unaware of their automatic attitudes, that implicit bias is pervasive, and that implicit racial attitudes contribute to discrimination against Black Americans. The claim that implicit biases are widespread was based entirely on the finding that 75% of US Americans show a clear pro-White bias on the race IAT. The present results suggest that this finding is unique to the race IAT and is not found with other implicit measures.

Once more, we are not the first to point out that scoring of the race IAT may have exaggerated the pervasiveness of racial biases among White Americans (Blanton et al., 2006, 2009, 2015; Oswald et al., 2013, 2015). However, so far this criticism has fallen on deaf ears and Project Implicit continues to provide individuals with feedback about their race IAT scores. Textbooks proudly point out that over 20 million people have received this feedback, as if this number says something about the validity of the test (Myers & Twenge, 2019).

When visitors see a discrepancy between their self-views and their test scores, they are informed that this does not invalidate the test because it measures something that is hidden from self-knowledge. The present results suggest that many visitors to the Project Implicit website were given false feedback about their prejudices because even individuals without any negative attitudes towards African Americans end up with a pro-White bias on the race IAT.

This bias can co-exist with evidence that variation in race IAT scores shows some convergent validity with other explicit and implicit measures of individual differences in attitudes towards African Americans. However, variances and means are two independent statistical constructs, and valid variance does not imply that means are valid. Bar-Anan and Nosek (2014) argued that the race IAT is the most valid measure of attitudes because it shows the largest differences in scores between African Americans and White Americans. However, this argument is only valid if we assume that random measurement error attenuates the differences on other measures. The present study directly tested this assumption and found no evidence for it. Instead, we found that the larger difference between African Americans and White Americans reflects systematic mean differences that are unique to the race IAT. As noted earlier, a plausible explanation for this systematic bias is that the race IAT activates stereotypes, whereas other measures are purer measures of attitudes.

We hope that our direct demonstration of bias will finally end the practice of providing visitors of the Project Implicit website with misleading information about the validity of the race IAT and misleading information about individuals’ prejudice. There is simply no evidence that prejudice is hidden from honest self-reflection or that such hidden biases are revealed by the race IAT (Schimmack, 2021).

Implications for Future Research

Although our article focuses on the race IAT, the results also have implications for the use and interpretation of the other measures. One advantage of the other measures is that they provide separate information about in-group and out-group attitudes because they avoid pitting one group against the other. However, these measures have other problems. Fast reactions to pictures of African Americans and White Americans reflect only first impressions without context. They are also influenced by affective reactions to other attributes such as gender, age, or attractiveness. Thus, these scores may not reflect aspects of attitudes that are activated only in specific contexts. Moreover, the means will depend heavily on the selection of individual pictures. Thus, a lot more work would need to be done to ensure that the picture sets are representative of the whole group. Finally, our results showed that none of the measures had high loadings on the attitude factors. Thus, a single measure has only modest validity.

Unfortunately, psychologists often do not carefully examine the psychometric properties of their measures. Instead, one measure is often arbitrarily chosen and treated as if it were a perfect measure of a construct. Even worse, a specific measure may be chosen from a set of measures because it showed the desired result (John, Loewenstein, & Prelec, 2012). To avoid these problems, we strongly urge intergroup relationship researchers to use a multi-method approach and to use formal measurement models to analyze their data (Schimmack, 2021). This approach will also produce better estimates of effect sizes that are attenuated by random and systematic measurement error.

References

Adams, P. E. (2010). Understanding the Different Realities, Experience, and Use of Self-Esteem Between Black and White Adolescent Girls. Journal of Black Psychology, 36(3), 255–276. https://doi.org/10.1177/0095798410361454

Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden biases of good people. New York, NY: Delacorte Press.

Bar-Anan, Y., & Nosek, B. A. (2014). A comparative investigation of seven indirect attitude measures. Behavior Research Methods, 46(3), 668–688. https://doi.org/10.3758/s13428-013-0410-6

Blanton, H., Jaccard, J., Gonzales, P. M., & Christie, C. (2006). Decoding the implicit association test: Implications for criterion prediction. Journal of Experimental Social Psychology, 42(2), 192–212. https://doi.org/10.1016/j.jesp.2005.07.003

Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567–582.

Blanton, H., Jaccard, J., Strauts, E., Mitchell, G., & Tetlock, P. E. (2015). Toward a meaningful metric of implicit prejudice. Journal of Applied Psychology, 100(5), 1468–1481. https://doi.org/10.1037/a0038379

Campbell, M. R., & Brauer, M. (2021). Is discrimination widespread? Testing assumptions about bias on a university campus. Journal of Experimental Psychology: General, 150(4), 756–777. https://doi.org/10.1037/xge0000983

Fazio, R. H., Jackson, J. R., Dunton, B. C., & Williams, C. J. (1995). Variability in automatic activation as an unobtrusive measure of racial attitudes: A bona fide pipeline? Journal of Personality and Social Psychology, 69(6), 1013–1027. https://doi.org/10.1037/0022-3514.69.6.1013

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Jost, J. T. (2019). A quarter century of system justification theory: Questions, answers, criticisms, and societal applications. British Journal of Social Psychology, 58(2), 263–314. https://doi.org/10.1111/bjso.12297

Jost, J. T., Banaji, M. R., & Nosek, B. A. (2004). A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo. Political Psychology, 25(6), 881–919. https://doi.org/10.1111/j.1467-9221.2004.00402.x

Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369–1385. https://doi.org/10.1177/0146167205275613

Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user's guide (8th ed.). Los Angeles, CA: Muthén & Muthén.

Myers, D. & Twenge, J. (2019). Social psychology (13th edition). McGraw Hill.

Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2005). Understanding and Using the Implicit Association Test: II. Method Variables and Construct Validity. Personality and Social Psychology Bulletin, 31(2), 166–180. https://doi.org/10.1177/0146167204271418

Olson, M. A., & Fazio, R. H. (2004). Reducing the Influence of Extrapersonal Associations on the Implicit Association Test: Personalizing the IAT. Journal of Personality and Social Psychology, 86(5), 653–667. https://doi.org/10.1037/0022-3514.86.5.653

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171–192. https://doi.org/10.1037/a0032734

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2015). Using the IAT to predict ethnic and racial discrimination: Small effect sizes of unknown societal significance. Journal of Personality and Social Psychology, 108(4), 562–571. https://doi.org/10.1037/pspa0000023

Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89(3), 277–293. https://doi.org/10.1037/0022-3514.89.3.277

Rosenberg, M. (1986). Conceiving the self. Malabar, FL: Robert E. Krieger.

Schimmack, U. (2021a). The Implicit Association Test: A Method in Search of a Construct. Perspectives on Psychological Science, 16(2), 396–414. https://doi.org/10.1177/1745691619863798

Schimmack, U. (2021b). Invalid claims about the validity of Implicit Association Tests by prisoners of the implicit social-cognition paradigm. Perspectives on Psychological Science, 16(2), 435–442. https://doi.org/10.1177/1745691621991860

Teige-Mocigemba, S., Becker, M., Sherman, J. W., Reichardt, R., & Klauer, K. C. (2017). The affect misattribution procedure: In search of prejudice effects. Experimental Psychology, 64(3), 215–230. https://doi.org/10.1027/1618-3169/a000364

Twenge, J. M., & Crocker, J. (2002). Race and self-esteem: Meta-analyses comparing Whites, Blacks, Hispanics, Asians, and American Indians and comment on Gray-Little and Hafdahl (2000). Psychological Bulletin, 128(3), 371–408. https://doi.org/10.1037/0033-2909.128.3.371

How to build a Monster Model of Well-being: Part 4

This is part 4 in a mini-series of blogs that illustrate the usefulness of structural equation modeling to test causal models of well-being. The first causal model of well-being was introduced in 1980 by Costa and McCrae. Although hundreds of studies have examined correlates of well-being since then, hardly any progress has been made in theory development. In 1984, Diener (1984) distinguished between top-down and bottom-up theories of well-being, but empirical tests of the different models have not settled this issue. The monster model is a first attempt to develop a causal model of well-being that corrects for measurement error and fits empirical data.

The first part (Part 1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience, whereas positive affect had no direct effect on life-evaluations. In contrast, sadness had a unique negative effect on life-evaluations that was not mediated by life domains.

Part 3 added extraversion to the model. This was a first step towards a test of Costa and McCrae’s assumption that extraversion has a direct effect on positive affect (happiness) and no effect on negative affect (sadness). Without life domains in the model, the results replicated Costa and McCrae’s (1980) results. Yes, personality psychology has replicable findings. However, when domain satisfactions were added to the model, the story changed. Costa and McCrae (1980) assumed that extraversion increases well-being because it has a direct effect on cheerfulness (positive affect) that adds to well-being. However, in the new model, the effect of extraversion on life-satisfaction was mediated by life domains rather than positive affect. The strongest mediation was found for romantic satisfaction. Extraverts tended to have higher romantic satisfaction and romantic satisfaction contributed significantly to overall life-satisfaction. Other domains like recreation and work are also possible mediators, but the sample size was too small to produce more conclusive evidence.

Part 4 is a simple extension of the model in Part 3 that adds the other personality dimensions to the model. I start with neuroticism because it is by far the most consistent and strongest predictor of well-being. Costa and McCrae (1980) assumed that neuroticism is a general disposition to experience more negative affect without any relation to positive affect. However, most studies show that neuroticism has a negative relationship with positive affect as well, although it is not as strong as the relationship with negative affect. Moreover, neuroticism is also related to lower satisfaction in many life domains. Thus, the model simply allowed neuroticism to predict both affects and all domain satisfactions. The only assumption made by this model is that the negative effect of neuroticism on life-satisfaction is fully mediated by domain satisfaction and affect.

Figure 1 shows the model and the path coefficients for neuroticism. The first important finding is that neuroticism has a strong direct effect on sadness that is independent of satisfaction with various life domains. This finding suggests that neuroticism may have a direct effect on individuals' mood rather than interacting with situational factors that are unique to individual life domains. The second finding is that neuroticism has sizeable effects on all life domains, ranging from b = -.19 for satisfaction with housing to b = -.31 for satisfaction with friendships.

Following the various paths from neuroticism to life-satisfaction produces a total effect of b = -.38, which confirms the strong negative effect of neuroticism on well-being. About a quarter of this effect is mediated directly by negative affect (sadness), b = -.09. The rest is mediated by the top-down effect of neuroticism on satisfaction with life domains and the bottom-up effect of life domains on global life-evaluations.
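
For readers unfamiliar with path tracing, a small sketch of how such mediated effects are computed: each indirect effect is the product of the path coefficients along one route, and the total effect is the sum over all routes. The numbers below are placeholders, not the model's estimates.

```python
def indirect_effect(*paths: float) -> float:
    """Product of the path coefficients along a single route
    (e.g., neuroticism -> sadness -> life-satisfaction)."""
    result = 1.0
    for coefficient in paths:
        result *= coefficient
    return result

def total_effect(routes) -> float:
    """Sum of the indirect effects over all routes from predictor to outcome."""
    return sum(indirect_effect(*route) for route in routes)

# Placeholder example: one route through sadness, one through a life domain.
print(total_effect([(0.60, -0.15), (-0.25, 0.40)]))
```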

McCrae and Costa (1991) expanded their model to include the other Big Five factors. They proposed that agreeableness has a positive influence on well-being that is mediated by romantic satisfaction (adding Liebe, love) and that conscientiousness has a positive influence on well-being that is mediated by work satisfaction (adding Arbeit, work). Although this proposal was made three decades ago, it has never been seriously tested because few studies measure domain satisfaction (but see Heller et al., 2004).

To test these hypotheses, I added conscientiousness and agreeableness to the model. Adding both together was necessary because agreeableness and conscientiousness were correlated, as reflected in a large modification index when the two factors were assumed to be independent. This does not necessarily mean that agreeableness and conscientiousness are correlated factors, an issue that is debated among personality psychologists (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). One problem is that secondary loadings can produce spurious correlations among the scale scores that were used for this model. This could be examined with a more complex item-level model in the future. For now, agreeableness and conscientiousness were allowed to correlate. The results showed no direct effects of conscientiousness on positive affect (PA), negative affect (NA), or life-satisfaction (LS). In contrast, agreeableness was a positive predictor of PA and a negative predictor of NA. Most important are the relationships with domain satisfactions.

Confirming McCrae and Costa's (1991) prediction, work satisfaction was predicted by conscientiousness, b = .21, z = 3.4. Also confirming McCrae and Costa, romantic satisfaction was predicted by agreeableness, although the effect size was small, b = .13, z = 2.9. Moreover, conscientiousness was an even stronger predictor of romantic satisfaction, b = .28, z = 6.0. This confirms the old saying that “marriage is work.” Also not predicted by McCrae and Costa, conscientiousness was related to higher housing satisfaction, b = .20, z = 3.7, presumably because conscientious individuals take better care of their homes. The other domains were not significantly related to conscientiousness, |b| < .1.

Also not predicted by McCrae and Costa were additional relationships of agreeableness with other domains, such as health, b = .18, z = 3.7, housing, b = .17, z = 2.9, recreation, b = .25, z = 4.0, and friendships, b = .35, z = 5.9. The only domains that were not predicted by agreeableness were financial satisfaction, b = .05, z = 0.8, and work satisfaction, b = .07, z = 1.3. Some of these relationships could reflect benefits of agreeableness for social relationships beyond romantic relationships. Thus, the results are broadly consistent with McCrae and Costa's assumption that agreeableness is beneficial for well-being.

The total effect of agreeableness in this dataset was b = .21, z = 4.34. All of this effect was mediated by indirect paths, but only the path through romantic satisfaction reached statistical significance, b = .03, z = 2.6, presumably because power was too low to detect the other indirect effects.

The total effect of conscientiousness was b = .18, z = 4.14. Three indirect paths were significant, namely work satisfaction, b = .06, z = 3.3, romantic satisfaction, b = .06, z = 4.2, and housing satisfaction, b = .04, z = 2.51.

Overall, these results confirm previous findings that agreeableness and conscientiousness are also positive predictors of well-being and shed first evidence on potential mediators of these relationships. These results need to be replicated in datasets from other populations.

When openness was added to the model, a modification index suggested a correlation between extraversion and openness, which has been found in several multi-method studies (Anusic et al., 2009; DeYoung, 2006). Thus, the two factors were allowed to correlate. Openness had no direct effects on positive affect, negative affect, or life-satisfaction. Moreover, there were only two weak, just-significant relationships with domain satisfaction, for work, b = .12, z = 2.0, and health, b = .12, z = 2.2. Consistent with meta-analyses, the total effect is negligible, b = .06, z = 1.3. In short, the results are consistent with previous studies and show that openness does not predict higher or lower well-being. To keep the model simple, openness can therefore be omitted from the monster model.

Model Comparisons

At this point, we have built a complex but plausible model that links personality traits to subjective well-being via domain satisfaction and affect. However, just because this model is plausible and fits the data does not mean that it is the right model. An important step in causal modeling is to consider alternative models and to compare them. Overall fit is less important than relatively better fit among alternative models.

The previous model assumed that domain satisfaction causes higher levels of PA and lower levels of NA. Accordingly, affect is a summary of the affect generated in different life domains. This assumption is consistent with bottom-up models of well-being. However, a plausible alternative model assumes that affect is largely influenced by internal dispositions, which in turn color our experiences of different life domains. Accordingly, neuroticism may simply be a disposition to be more often in a negative mood, and this negative mood colors perceptions of marital satisfaction, job satisfaction, and so on. Costa and McCrae (1980) proposed that neuroticism and extraversion are global affective dispositions, so it makes sense to postulate that their influence on domain satisfaction and life satisfaction is mediated by affect. McCrae and Costa (1991) postulated that agreeableness and conscientiousness are not affective dispositions but are instead only instrumental for higher satisfaction in some life domains. Thus, their effects should not be mediated by affect. Consistent with this assumption, conscientiousness showed significant relationships with only some domains, including work satisfaction. However, agreeableness was a positive predictor of all life domains, suggesting that it is also a broad affective disposition. I therefore modeled agreeableness as a third global affective disposition (see Figure 2).

The effect sizes for affect on domain satisfaction are shown in Table 1.

A comparison of the fit indices for the top-down and bottom-up models shows that both models meet standard criteria for global model fit (CFI > .95; RMSEA < .06). In addition, the results show no clear superiority of one model over the other. CFI and RMSEA show slightly better fit for the bottom-up model, but the Bayesian Information Criterion favors the more parsimonious top-down model. Thus, the data are unable to distinguish between the two models.
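
As a reminder of why the indices can disagree, the Bayesian Information Criterion adds a penalty that grows with the number of free parameters, so a slightly worse-fitting but more parsimonious model can still win on BIC. A generic sketch of the standard formula (not tied to the specific Mplus output):

```python
import math

def bic(log_likelihood: float, n_free_parameters: int, n_observations: int) -> float:
    """Bayesian Information Criterion: lower is better. The ln(N) penalty
    on each free parameter is what favors the more parsimonious model."""
    return -2.0 * log_likelihood + n_free_parameters * math.log(n_observations)
```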

Both models assume that conscientiousness is instrumental for higher well-being in only some domains. The key difference between the models is the top-down model's assumption that changes in domain satisfaction have no influence on affective experiences. That is, an increase in relationship satisfaction does not produce higher levels of PA, and a decrease in job satisfaction does not produce a change in NA. These competing predictions can be tested in longitudinal studies.

Conclusion

This concludes Part 4 of the monster model series. As surprising as it may sound, the present results provide one of the first tests of Costa and McCrae's causal theory of well-being (Costa & McCrae, 1980; McCrae & Costa, 1991). Although the present results are consistent with their proposal that agreeableness and conscientiousness are instrumental for higher well-being because they foster higher romantic and job satisfaction, respectively, the results also show that this model is too simplistic. For example, conscientiousness may also increase well-being because it contributes to higher romantic satisfaction (marriage is work).

One limitation of the present model is the focus on the Big Five as a measure of personality traits. The Big Five are higher-order personality traits that subsume more specific traits, often called facets. Facet-level traits may predict additional variance in well-being that is not captured by the Big Five (Schimmack, Oishi, Furr, & Funder, 2004). Part 5 will add the strongest facet predictors to the model, namely the Depressiveness facet of Neuroticism and the Cheerfulness facet of Extraversion (see also Payne & Schimmack, 2020).

Continue here to Part 5.

Stay tuned.

False Positive Causality: Putting Traits into Causal Models of Panel Data

Poster presented at the virtual conference of the Association for Research in Personality (ARP), July 16, 2021.

For a more detailed critique, see “Why most cross-lagged-panel models are false” (R-Index, August, 22, 2020).


Bill von Hippel and Ulrich Schimmack discuss Bill’s Replicability Index: Part 2

Background: A previous blog post shared a conversation between Bill von Hippel and Ulrich Schimmack about Bill’s Replicability Index (Part 1). To recapitulate, I had posted statistical replicability estimates for several hundred social psychologists (Personalized P-Values). Bill’s scores suggested that many of his results with p-values just below .05 might not be replicable. Bill was dismayed by his low R-Index but thought that some of his papers with very low values might be more replicable than the R-Index would indicate. He suggested that we put the R-Index results to an empirical test. He chose his paper with the weakest statistical evidence (interaction p = .07) for a replication study. We jointly agreed on the replication design and sample size. In just three weeks the study was approved, conducted, and the results were analyzed. Here we discuss the results.

…. Three Weeks Later

Bill: Thanks to rapid turnaround at our university IRB, the convenience of modern data collection, and the programming skills of Sam Pearson, we have now completed our replication study on Prolific. We posted the study for 2,000 participants, and 2,031 people signed up. For readers who are interested in a deeper dive, the data file is available at https://osf.io/cu68f/ and the pre-registration at https://osf.io/7ejts.

To cut to the chase, this one is a clear win for Uli’s R-Index. We successfully replicated the standard effect documented in the prior literature (see Figure A), but there was not even a hint of our predicted moderation of that effect, which was the key goal of this replication exercise (see Figure B: Interaction F(1,1167)=.97, p=.325, and the nonsignificant mean differences don’t match predictions). Although I would have obviously preferred to replicate our prior work, given that we failed to do so, I’m pleased that there’s no hint of the effect so I don’t continue to think that maybe it’s hiding in there somewhere. For readers who have an interest in the problem itself, let me devote a few paragraphs to what we did and what we found. For those who are not interested in Darwinian Grandparenting, please skip ahead to Uli’s response.

Previous work has established that people tend to feel closest to their mother’s mother, then their mother’s father, then their father’s mother, and last their father’s father. We replicated this finding in our prior paper and replicated it again here as well. The evolutionary idea underlying the effect is that our mother’s mother knows with certainty that she’s related to us, so she puts greater effort into our care than other grandparents (who do not share her certainty), and hence we feel closest to her. Our mother’s father and father’s mother both have one uncertain link (due to the possibility of cuckoldry), and hence put less effort into our care than our mother’s mother, so we feel a little less close to them. Last on the list is our father’s father, who has two uncertain links to us, and hence we feel least close to him.

The puzzle that motivated our previous work lies in the difference between our mother’s father and father’s mother; although both have one uncertain link, most studies show that people feel closer to their mother’s father than their father’s mother. The explanation we had offered for this effect was based on the idea that our father’s mother often has daughters who often have children, providing her with a more certain outlet for her efforts and affections. According to this possibility, we should only feel closer to our mother’s father than our father’s mother when the latter has grandchildren via daughters, and that is what our prior paper had documented (in the form of a marginally significant interaction and predicted simple effects).

Our clear failure to replicate that finding suggests an alternative explanation for the data in Figure A:

  1. People are closer to their maternal grandparents than their paternal grandparents (possibly for the reasons of genetic certainty outlined above).
  2. People are closer to their grandmothers than their grandfathers (possibly because women tend to be more nurturant than men and more involved in childcare).
  3. As a result of these two main effects, people tend to be closer to their mother’s father than their father’s mother, and this particular difference emerges regardless of the presence or absence of other, more certain kin.

Does our failure to replicate mean that the presence or absence of more certain kin has no impact on grandparenting? Clearly not in the manner I expected, but that doesn’t mean it has no effect. Consider the following (purely exploratory, non-preregistered) analyses of these same data: After failing to find the predicted interaction above, I ran a series of regression analyses, in which closeness to maternal and paternal grandparents were the dependent variables and number of cousins via fathers’ and mothers’ brothers and sisters were the predictor variables. The results are the same whether we’re looking at grandmothers or grandfathers, so for the sake of simplicity, I’ve collapsed the data into closeness to paternal grandparents and closeness to maternal grandparents. Here are the regression tables:

We see three very small but significant findings here (all of which require replication before we have any confidence in them). First, people feel closer to their paternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through fathers’ sisters are associated with less closeness to paternal grandparents). Second, people feel closer to their paternal grandparents to the degree that their maternal grandparents have more grandchildren through daughters other than their mother (i.e., more cousins through mothers’ sisters are associated with more closeness to paternal grandparents). Third, people feel closer to their maternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through mothers’ sisters are associated with less closeness to maternal grandparents). Note that none of these effects emerged via cousins through father’s or mother’s brothers. These findings strike me as worthy of follow-up, as they suggest that the presence or absence of equally or more certain kin does indeed have a (very small) influence on grandparents in a manner that evolutionary theory would predict (even if I didn’t predict it myself).

Uli:  Wow, I am impressed how quickly research with large samples can be done these days. That is good news for the future of social psychology, at least the studies that are relatively easy to do. 

Bill: Agreed! But benefits rarely come without cost and studies on the web are no exception. In this case, the ease of working on the web also distorts our field by pushing us to do the kind of work that is ‘web-able’ (e.g., self-report) or by getting us to wangle the methods to make them work on the web. Be that as it may, this study was a no brainer, as it was my lowest R-Index and pure self-report. Unfortunately, my other papers with really low R-Indices aren’t as easy to go back and retest (although I’m now highly motivated to try).

Uli:  Of course, I am happy that R-Index made the correct prediction, but N = 1 is not that informative. 

Bill: Consider this N+1, as it adds to your prior record.

Uli:  Fortunately, R-Index does make good, although by no means perfect, predictions in general; https://replicationindex.com/2021/05/16/pmvsrindex/.

Bill: Very interesting.

Uli:  Maybe you set yourself up for failure by picking a marginally significant result. 

Bill: That was exactly my goal. I still believed in the finding, so it was a great chance to pit your method against my priors. Not much point in starting with one of my results that we both agree is likely to replicate.

Uli:  The R-Index analysis implied that we should only trust your results with p < .001. 

Bill: That seems overly conservative to me, but of course I’m a biased judge of my own work. Out of curiosity, is that p value better when you analyze all my critical stats rather than just one per experiment? This strikes me as potentially important, because almost none of my papers would have been accepted based on just a single statistic; rather, they typically depend on a pattern of findings (an issue I mentioned briefly in our blog).

Uli:  The rankings are based on automatic extraction of test statistics. Selecting focal tests would only lead to an even more conservative alpha criterion. To evaluate the alpha = .001 criterion, it is not fair to use a single p = .07 result. Looking at the original article about grandparent relationships, I see p < .001 for mother’s mother vs. mother’s father relationships.  The other contrasts are just significant and do not look credible according to R-Index (predicting failure for same N).  However, they are clearly significant in the replication study. So, R-Index made two correct predictions (one failure and one success), and two wrong predictions. Let’s call it a tie. 🙂

Bill: Kind of you, but still a big win for the R-Index. It’s important to keep in mind that many prior papers had found the other contrasts, whereas we were the first to propose and find the specific moderation highlighted in our paper. So a reasonable prior would set the probability much higher to replicate the other effects, even if we accept that many prior findings were produced in an era of looser research standards. And that, in turn, raises the question of whether it’s possible to integrate your R-Index with some sort of Bayesian prior to see if it improves predictive ability.

Your prediction markets v. R-Index blog makes the very good point that simple is better and the R-Index works awfully well without the work involved in human predictions. But when I reflect on how I make such predictions (I happened to be a participant in one of the early prediction market studies and did very well), I’m essentially asking whether the result in question is a major departure from prior findings or an incremental advance that follows from theory. When the former, I say it won’t replicate without very strong statistical evidence. When the latter, I say it will replicate. Would it be possible to capture that sort of Bayesian processing via machine learning and then use it to supplement the R-Index?

Uli:  There is an article that tried to do this. Performance was similar to prediction markets. However, I think it is more interesting to examine the actual predictors that may contribute to the prediction of replication outcomes. For example, we know that cognitive psychology and within-subject designs are more replicable than social psychology and between-subject designs. I don’t think, however, that we will get very far based on single questionable studies. Bias-corrected meta-analysis may be the only way to salvage robust findings from the era of p-hacking.

To broaden the perspective from this single article to your other articles, one problem with the personalized p-values is that they are aggregated across time. This may lead to overly conservative alpha levels (p < .001) for new research that was conducted in accordance with new rules about transparency, while the rules may be too liberal for older studies that were conducted in a time when awareness about the problems of selection for significance was lacking (say before 2013). Inspired by the “loss of confidence project” (Rohrer et al., 2021), I want to give authors the opportunity to exclude articles from their R-Index analysis that they no longer consider credible themselves. To keep track of these loss-of-confidence declarations, I am proposing to use PubPeer (https://pubpeer.com/). Once an author posts a note on PubPeer that declares loss of confidence in the empirical results of an article, the article will be excluded from the R-Index analysis. Thus, authors can improve their standing in the rankings and, more importantly, change the alpha level to a more liberal level (e.g., from .005 to .01) by (a) publicly declaring loss of confidence in a finding and (b) publishing new research with studies that have more power and honestly report non-significant results.

I hope that the incentive to move up in the rankings will increase the low rate of loss of confidence declarations and help us to clean up the published record faster. Declarations could also be partial. For example, for the 2005 article, you could post a note on PubPeer that the ordering of the grandparent relationships was successfully replicated and the results for cousins were not with a link to the data and hopefully eventually a publication. I would then remove this article from the R-Index analysis. What do you think about this idea? 

Bill: I think this is a very promising initiative! The problem, as I see it, is that authors are typically the last ones to lose confidence in their own work. When I read through the recent ‘loss of confidence’ reports, I was pretty underwhelmed by the collection. Not that there was anything wrong with the papers in there, but rather that only a few of them surprised me. 

Take my own case as an example. I obviously knew it was possible my result wouldn’t replicate, but I was very willing to believe what turned out to be a chance fluctuation in the data because it was consistent with my hypothesis. Because I found that hypothesis-consistent chance fluctuation on my first try, I would never have stated I have low confidence in it if you hadn’t highlighted it as highly improbable. In other words, there’s no chance I’d have put that paper on a ‘loss of confidence’ list without your R-Index telling me it was crap and even then it took a failure to replicate for me to realize you were right.

Thus, I would guess that uptake into the ‘loss of confidence’ list would be low if it emphasizes work that people feel was sloppy in the first place, not because people are liars, but because people are motivated reasoners.

With that said, if the collection also emphasizes work that people have subsequently failed to replicate, and hence have lost confidence in it, I think it would be used much more frequently and could become a really valuable corrective. When I look at the Darwinian Grandparenting paper, I see that it’s been cited over 150 times on google scholar. I don’t know how many of those papers are citing it for the key moderation effect that we now know doesn’t replicate, but I hope that no one else will cite it for that reason after we publish this blog. No one wants other investigators to waste time following up their work once they realize the results aren’t reliable.

Uli: (feeling a bit blue today). I am not very optimistic that authors will take note of replication failures. Most studies are not conducted after a careful review of the existing literature or a meta-analysis that takes publication bias into account. As a result, citations in articles are often picked because they help to support a finding in an article. While p-hacking of data may have decreased over the past decade in some areas, cherry-picking of references is still common and widespread. I am not really sure how we can speed up self-correction of science. My main hope is that meta-analyses are going to improve and take publication bias more seriously. Fortunately, new methods show promising results in debiasing effect size estimates (Bartoš, Maier, Wagenmakers, Doucouliagos, & Stanley, 2021). Z-curve is also being used by meta-analysts and we are hopeful that z-curve 2.0 will soon be accepted for publication in Meta-Psychology (Bartos & Schimmack, 2021). Unfortunately, it will take another decade for these methods to become mainstream and meanwhile many resources will be wasted on half-baked ideas that are grounded in a p-hacked literature. I am not optimistic that psychology will become a rigorous science during my lifetime. So, I am trying to make the best of it. Fortunately, I can just do something else when things are too depressing, like sitting in my backyard and watching Germany win at the Euro Cup. Life is good, psychological science not so much.

Bill: I don’t blame you for your pessimism, but I completely disagree. You see a science that remains flawed when we ought to know better, but I see a science that has improved dramatically in the 35 years since I began working in this field. Humans are wildly imperfect actors who did not evolve to be dispassionate interpreters of data. We hope that training people to become scientists will debias them – although the data suggest that it doesn’t – and then we double down by incentivizing scientists to publish results that are as exciting as possible as rapidly as possible.

Thankfully, bias is both the problem and the solution, as other scientists are biased in favor of their theories rather than ours, and out of this messy process the truth eventually emerges. The social sciences are a dicier proposition in this regard, as our ideologies intersect with our findings in ways that are less common in the physical and life sciences. But so long as at least some social scientists feel free to go wherever the data lead them, I think our science will continue to self-correct, even if the process often seems painfully slow.

Uli: Your response to my post is a sign that progress is possible, but 1 out of 400 may just be the exception to the rule to never question your own results. Even researchers who know better become promoters of their own theories, especially when they become popular. I think the only way to curb false enthusiasm is to leave the evaluation of theories (review articles, meta-analyses) to independent scientists. The idea that one scientist can develop and evaluate a theory objectively is simply naive. Leaders of a paradigm are like strikers in soccer. They need to have blinders on to risk failure. We need meta-psychologists to distinguish real contributions from false ones. In this way meta-psychologists are like referees. Referees are not glorious heroes, but they are needed for a good soccer game, and they have the power to call off a goal because a player was offside or used their hands. The problem for science is the illusion that scientists can control themselves.

Reevaluating the Predictive Validity of the Race Implicit Association Test

Over the past two decades, social psychological research on prejudice has been dominated by the implicit cognition paradigm (Meissner, Grigutsch, Koranyi, Müller, & Rothermund, 2019). This paradigm is based on the assumption that many individuals of the majority group (e.g., White US Americans) have an automatic tendency to discriminate against members of a stigmatized minority group (e.g., African Americans). It is assumed that this tendency is difficult to control because many people are unaware of their prejudices.

The implicit cognition paradigm also assumes that biases vary across individuals of the majority group. The most widely used measure of individual differences in implicit biases is the race Implicit Association Test (rIAT; Greenwald, McGhee, & Schwartz, 1998). Like any other measure of individual differences, the race IAT has to meet psychometric criteria to be a useful measure of implicit bias. Unfortunately, the race IAT has been used in hundreds of studies before its psychometric properties were properly evaluated in a program of validation research (Schimmack, 2021a, 2021b).

Meta-analytic reviews of the literature suggest that the race IAT is not as useful for the study of prejudice as it was promised to be (Greenwald et al., 1998). For example, Meissner et al. (2019) concluded that “the predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1).

In response to criticism of the race IAT, Greenwald, Banaji, and Nosek (2015) argued that “statistically small effects of the implicit association test can have societally large effects” (p. 553). At the same time, Greenwald (1975) warned psychologists that they may be prejudiced against the null-hypothesis. To avoid this bias, he proposed that researchers should define a priori a range of effect sizes that are close enough to zero to decide in favor of the null-hypothesis. Unfortunately, Greenwald did not follow his own advice and a clear criterion for a small, but practically significant amount of predictive validity is lacking. This is a problem because estimates have decreased over time from r = .39 (McConnell & Leibold, 2001), to r = .24 in 2009 (Greenwald, Poehlman, Uhlmann, & Banaji, 2009), to r = .148 in 2013 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013), and r = .097 in 2019 (Greenwald & Lai, 2020; Kurdi et al., 2019). Without a clear criterion value, it is not clear how this new estimate of predictive validity should be interpreted. Does it still provide evidence for a small, but practically significant effect, or does it provide evidence for the null-hypothesis (Greenwald, 1975)?

Measures are not Causes

To justify the interpretation of a correlation of r = .1 as small but important, it is important to revisit Greenwald et al.’s (2015) arguments for this claim. Greenwald et al. (2015) interpret this correlation as evidence for an effect of the race IAT on behavior. For example, they write “small effects can produce substantial discriminatory impact also by cumulating over repeated occurrences to the same person” (p. 558). The problem with this causal interpretation of a correlation between two measures is that scores on the race IAT have no influence on individuals’ behavior. This simple fact is illustrated in Figure 1. Figure 1 is a causal model that assumes the race IAT reflects valid variance in prejudice and prejudice influences actual behaviors (e.g., not voting for a Black political candidate). The model makes it clear that the correlation between scores on the race IAT (i.e., the iat box) and scores on a behavioral measure (i.e., the crit box) does not reflect a causal link (i.e., no path leads from the iat box to the crit box). Rather, the two measured variables are correlated because they both reflect the effect of a third variable. That is, prejudice influences race IAT scores and prejudice influences the variance in the criterion variable.

There is general consensus among social scientists that prejudice is a problem and that individual differences in prejudice have important consequences for individuals and society. The effect size of prejudice on a single behavior has not been clearly examined, but to the extent that race IAT scores are not perfectly valid measures of prejudice, the simple correlation of r = .1 is a lower limit of the effect size. Schimmack (2021) estimated that no more than 20% of the variance in race IAT scores is valid variance. With this validity coefficient, a correlation of r = .1 implies an effect of prejudice on actual behaviors of .1 / sqrt(.2) = .22.
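
To make this arithmetic explicit, the disattenuation step can be written out in a couple of lines. This is only a sketch in R; the 20% validity figure is the estimate cited above, not a new result.

# Observed IAT-criterion correlation and the assumed proportion of valid IAT variance
r_observed <- 0.10
valid_variance <- 0.20
# Correlation implied for the valid (prejudice) component of the IAT
r_observed / sqrt(valid_variance)   # ~ 0.22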

Greenwald et al. (2015) correctly point out that effect sizes of this magnitude, r ~ .2, can have practical, real-world implications. The real question, however, is whether predictive validity of .1 justifies the use of the race IAT as a measure of prejudice. This question has to be evaluated in a comparison of predictive validity for the race IAT with other measures of prejudice. Thus, the real question is whether the race IAT has sufficient incremental predictive validity over other measures of prejudice. However, this question has been largely ignored in the debate about the utility of the race IAT (Greenwald & Lai, 2020; Greenwald et al., 2015; Oswald et al., 2013).

Kurdi et al. (2019) discuss incremental predictive validity, but this discussion is not limited to the race IAT and makes the mistake of correcting for random measurement error. As a result, the incremental predictive validity for IATs of b = .14 is a hypothetical estimate for IATs that are perfectly reliable. However, it is well-known that IATs are far from perfectly reliable. Thus, this estimate overestimates the incremental predictive validity. Using Kurdi et al.’s data and limiting the analysis to studies with the race IAT, I estimated incremental predictive validity to be b = .08, 95%CI = .04 to .12. It is difficult to argue that this is a practically significant amount of incremental predictive validity. At the very least, it does not justify the reliance on the race IAT as the only measure of prejudice or the claim that the race IAT is a superior measure of prejudice (Greenwald et al., 2009).

The meta-analytic estimate of b = .1 has to be interpreted in the context of evidence of substantial heterogeneity across studies (Kurdi et al., 2019). Kurdi et al. (2019) suggest that “it may be more appropriate to ask under what conditions the two [race IAT scores and criterion variables] are more or less highly correlated” (p. 575). However, little progress has been made in uncovering moderators of predictive validity. One possible explanation for this is that previous meta-analyses may have overlooked one important source of variation in effect sizes, namely publication bias. Traditional meta-analyses may be unable to reveal publication bias because they include many articles and outcome measures that did not focus on predictive validity. For example, Kurdi’s meta-analysis included a study by Luo, Li, Ma, Zhang, Rao, and Han (2015). The main focus of this study was to examine the potential moderating influence of oxytocin on neurological responses to pain expressions of Asian and White faces. Like many neurological studies, the sample size was small (N = 32), but the study reported 16 brain measures. For the meta-analysis, correlations were computed across N = 16 participants separately for two experimental conditions. Thus, this study provided as many effect sizes as it had participants. Evidently, power to obtain a significant result with N = 16 and r = .1 is extremely low, and adding these 32 effect sizes to the meta-analysis merely introduced noise. This may undermine the validity of meta-analytic results (Sharpe, 1997). To address this concern, I conducted a new meta-analysis that differs from the traditional meta-analyses. Rather than coding as many effects from as many studies as possible, I only include focal hypothesis tests from studies that aimed to investigate predictive validity. I call this a focused meta-analysis.

Focused Meta-Analysis of Predictive Validity

Coding of Studies

I relied on Kurdi et al.’s meta-analysis to find articles. I selected only published articles that used the race IAT (k = 96). The main purpose of including unpublished studies is often to correct for publication bias (Kurdi et al., 2019). However, it is unlikely that only 14 (8%) of all studies that were conducted remained unpublished. Thus, the unpublished studies are not representative and may distort effect size estimates.

Coding of articles in terms of outcome measures that reflect discrimination yielded 60 studies in 45 articles. I examined whether this selection of studies influenced the results by limiting a meta-analysis with Kurdi et al.’s coding of studies to these 60 studies. The weighted average effect size was larger than the reported effect size, a = .167, se = .022, 95%CI = .121 to .212. Thus, Kurdi et al.’s inclusion of a wide range of studies with questionable criterion variables diluted the effect size estimate. However, there remained substantial variability around this effect size estimate using Kurdi et al.’s data, I2 = 55.43%.

Results

The focused coding produced one effect size per study. It is therefore not necessary to model a nested structure of effect sizes, and I used the widely used metafor package to analyze the data (Viechtbauer, 2010). The intercept-only model produced a similar estimate to the results for Kurdi et al.’s coding scheme, a = .201, se = .020, 95%CI = .171 to .249. Thus, focused coding produces a similar effect size estimate to traditional coding. There was also a similar amount of heterogeneity in the effect sizes, I2 = 50.80%.
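
For readers who want to follow along, the basic model can be fitted with a few lines of R code. This is only a sketch: it assumes a data frame dat with one row per study containing the correlation (yi) and its standard error (sei); these column names are placeholders, not the names used in the actual analysis files.

library(metafor)
# Random-effects model with no moderators: the intercept is the average effect size
res0 <- rma(yi = yi, sei = sei, data = dat)
summary(res0)   # reports the intercept and I^2 (heterogeneity)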

However, results for publication bias differed. Whereas Kurdi et al.’s coding shows no evidence of publication bias, focused coding produced a significant relationship between sampling error and effect size, b = 1.83, se = .41, z = 4.54, 95%CI = 1.03 to 2.64. The intercept was no longer significant, a = .014, se = .046, z = 0.31, 95%CI = -.077 to .105. This would imply that the race IAT has no incremental predictive validity. Adding sampling error as a predictor reduced heterogeneity from I2 = 50.80% to 37.71%. Thus, some portion of the heterogeneity is explained by publication bias.

Stanley (2017) recommends accepting the null-hypothesis when the intercept in the previous model is not significant. However, a better criterion is to compare this model to other models. The most widely used alternative model regresses effect sizes on the squared sampling error (Stanley, 2017). This model explained more of the heterogeneity in effect sizes, as reflected in a reduction of unexplained heterogeneity from 50.80% to 23.86%. The intercept for this model was significant, a = .113, se = .023, z = 4.86, 95%CI = .067 to .158.
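
The two bias-adjustment models can be specified as meta-regressions in the same framework. Again a sketch with the same hypothetical dat: the first model uses the standard error as moderator (the PET-type model described above), the second uses the squared standard error (the squared-SE model, often called PEESE). The prediction for a small study shown at the end assumes sampling error is approximated as 1/sqrt(N - 3), which may differ from the metric used in the actual analysis.

pet   <- rma(yi = yi, sei = sei, mods = ~ sei,      data = dat)
peese <- rma(yi = yi, sei = sei, mods = ~ I(sei^2), data = dat)
# The intercepts are the bias-adjusted average effect sizes discussed above
coef(summary(pet))
coef(summary(peese))
# Predicted effect size for a small study (e.g., N = 42), assuming sei ~ 1/sqrt(N - 3)
predict(peese, newmods = (1 / sqrt(42 - 3))^2)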

Figure 2 shows the effect sizes as a function of sampling error and the regression lines for the three models.

Inspection of Figure 2 provides further evidence for the squared-SE model: the red line (squared sampling error) fits the data better than the blue line (sampling error). In particular, for large samples, PET underestimates effect sizes.

The significant relationship between sample size (sampling error) and effect sizes implies that large effects in small studies cannot be interpreted at face value. For example, the most highly cited study of predictive validity had only a sample size of N = 42 participants (McConnell & Leibold, 2001). The squared-sampling-error model predicts an effect size estimate of r = .30, which is close to the observed correlation of r = .39 in that study.

In sum, a focal meta-analysis replicates Kurdi et al.’s (2019) main finding that the average predictive validity of the race IAT is small, r ~ .1. However, the focal meta-analysis also produced a new finding. Whereas the initial meta-analysis suggested that effect sizes are highly variable, the new meta-analysis suggests that a large portion of this variability is explained by publication bias.

Moderator Analysis

I explored several potential moderator variables, namely (a) number of citations, (b) year of publication, (c) whether IAT effects were direct or moderator effects, (d) whether the correlation coefficient was reported or computed based on test statistics, and (e) whether the criterion was an actual behavior or an attitude measure. The only statistically significant result was a weaker correlation in studies that predicted a moderating effect of the race IAT, b = -.11, se = .05, z = 2.28, p = .032. However, the effect would not be significant after correction for multiple comparisons, and heterogeneity remained virtually unchanged, I2 = 27.15%.
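
A single-moderator analysis of this kind takes one extra argument in the same setup. This is a sketch; moderator stands for any of the coded study characteristics, e.g., a 0/1 indicator for moderated versus direct IAT effects.

res_mod <- rma(yi = yi, sei = sei, mods = ~ moderator, data = dat)
res_mod   # the moderator coefficient tests whether the average effect size differs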

During the coding of the studies, the article “Ironic effects of racial bias during interracial interactions” stood out because it reported a counter-intuitive result. In this study, Black confederates rated White participants with higher (pro-White) race IAT scores as friendlier. However, other studies find the opposite effect (e.g., McConnell & Leibold, 2001). If the ironic result was reported because it was statistically significant, it would be a selection effect that is not captured by the regression models and it would produce unexplained heterogeneity. I therefore also tested a model that excluded all negative effects. As bias is introduced by this selection, the model is not a test of publication bias, but it may be better able to correct for publication bias. The effect size estimate was very similar, a = .133, se = .017, 95%CI = .100 to .166. However, heterogeneity was reduced to 0%, suggesting that selection for significance fully explains heterogeneity in effect sizes.

In conclusion, moderator analysis did not find any meaningful moderators and heterogeneity was fully explained by publication bias, including publishing counterintuitive findings that suggest less discrimination by individuals with more prejudice. The finding that publication bias explains most of the variance is extremely important because Kurdi et al. (2019) suggested that heterogeneity is large and meaningful, which would suggest that higher predictive validity could be found in future studies. In contrast, the current results suggest that correlations greater than .2 in previous studies were largely due to selection for significance with small samples, which also explains unrealistically high correlations in neuroscience studies with the race IAT (cf. Schimmack, 2021b).

Predictive Validity of Self-Ratings

The predictive validity of self-ratings is important for several reasons. First, it provides a comparison standard for the predictive validity of the race IAT. For example, Greenwald et al. (2009) emphasized that predictive validity for the race IAT was higher than for self-reports. However, Kurdi et al.’s (2019) meta-analysis found the opposite. Another reason to examine the predictive validity of explicit measures is that implicit and explicit measures of racial attitudes are correlated with each other. Thus, it is important to establish the predictive validity of self-ratings to estimate the incremental predictive validity of the race IAT.

Figure 2 shows the results. The sampling-error model shows a non-zero effect size, but sampling error is large, and the confidence interval includes zero, a = .121, se = .117, 95%CI = -.107 to .350. Effect sizes are also extremely heterogeneous, I2 = 62.37%. The intercept for the squared-sampling-error model is significant, a = .176, se = .071, 95%CI = .036 to .316, but the model does not explain more of the heterogeneity in effect sizes (I2 = 63.33%). To maintain comparability, I use the squared-sampling-error estimate. This confirms Kurdi et al.’s finding that self-ratings have slightly higher predictive validity, but the confidence intervals overlap. For any practical purposes, predictive validity of the race IAT and self-reports is similar. Repeating the moderator analyses that were conducted with the race IAT revealed no notable moderators.

Implicit-Explicit Correlations

Only 21 of the 60 studies reported information about the correlation between the race IAT and self-report measures. There was no indication of publication bias, and the effect size estimates of the three models converge on an estimate of r ~ .2 (Figure 3). Fortunately, this result can be compared with estimates from large internet studies (Axt, 2017) and a meta-analysis of implicit-explicit correlations (Hofmann et al., 2005). These estimates are a bit higher, r ~ .25. Thus, using an estimate of r = .2 is conservative for a test of the incremental predictive validity of the race IAT.

Incremental Predictive Validity

It is straightforward to estimate the incremental predictive validity of the race IAT and self-reports on the basis of the correlations between race IAT, self-ratings, and criterion variables. However, it is a bit more difficult to provide confidence intervals around these estimates. I used a simulated dataset with missing values to reproduce the correlations and sampling error of the meta-analysis. I then regressed the criterion on the implicit and explicit variable. The incremental predictive validity for the race IAT was b = .07, se = .02, 95%CI = .03 to .12. This finding implies that the race IAT on average explains less than 1% unique variance in prejudice behavior. The incremental predictive validity of the explicit measure was b = .165, se = .03, 95%CI = .11 to .23. This finding suggests that explicit measures explain between 1 and 4 percent of the variance in prejudice behaviors.
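
The logic of this step can be illustrated with a simplified simulation (a sketch, not the actual analysis): generate data that reproduce the approximate meta-analytic correlations used above (IAT-criterion ~ .11, explicit-criterion ~ .18, IAT-explicit ~ .20) and regress the criterion on both predictors. Without the missing-data structure and study-level sampling error of the real analysis, the partial coefficients come out near b ~ .08 for the IAT and b ~ .16 for the explicit measure, in the same ballpark as the estimates reported above. Variable names are hypothetical.

library(MASS)
vars <- c("iat", "explicit", "crit")
R <- matrix(c(1.000, 0.200, 0.113,
              0.200, 1.000, 0.176,
              0.113, 0.176, 1.000), nrow = 3, dimnames = list(vars, vars))
set.seed(1)
# empirical = TRUE reproduces the correlation matrix exactly in the simulated data
sim <- as.data.frame(mvrnorm(n = 10000, mu = rep(0, 3), Sigma = R, empirical = TRUE))
coef(lm(crit ~ iat + explicit, data = sim))   # standardized partial regression weights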

Assuming that there is no shared method variance between implicit and explicit measures and criterion variables and that implicit and explicit measures reflect a common construct, prejudice, it is possible to fit a latent variable model to the correlations among the three indicators of prejudice (Schimmack, 2021). Figure 4 shows the model and the parameter estimates.

According to this model, prejudice has a moderate effect on behavior, b = .307, se = .043. This is consistent with general findings about effects of personality traits on behavior (Epstein, 1973; Funder & Ozer, 1983). The loading of the explicit variable on the prejudice factor implies that .582^2 = 34% of the variance in self-ratings of prejudice is valid variance. The loading of the implicit variable on the prejudice factor implies that .353^2 = 12% of the variance in race IAT scores is valid variance. Notably, similar estimates were obtained with structural equation models of data that are not included in this meta-analysis (Schimmack, 2021). Using data from Cunningham et al. (2001), I estimated .43^2 = 18% valid variance. Using Bar-Anan and Vianello (2018), I estimated .44^2 = 19% valid variance. Using data from Axt (2017), I found .44^2 = 19% valid variance, but 8% of the variance could be attributed to group differences between African American and White participants. Thus, the present meta-analytic results are consistent with the conclusion that no more than 20% of the variance in race IAT scores reflects actual prejudice that can influence behavior.
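
A model of this form can be fitted directly to the meta-analytic correlation matrix, for example in lavaan. The sketch below uses the approximate correlations from this meta-analysis and an arbitrary sample size for illustration; fitted this way, the standardized loadings and the path to the criterion come out close to the values reported above (~.58, ~.35, ~.31). It is meant to show the structure of the model, not to reproduce the exact estimates, and the variable names are hypothetical.

library(lavaan)
vars <- c("iat", "explicit", "crit")
R <- matrix(c(1.000, 0.200, 0.113,
              0.200, 1.000, 0.176,
              0.113, 0.176, 1.000), nrow = 3, dimnames = list(vars, vars))
model <- '
  prejudice =~ explicit + iat   # loadings of the two prejudice measures
  crit ~ prejudice              # effect of the prejudice factor on behavior
'
fit <- sem(model, sample.cov = R, sample.nobs = 1000, std.lv = TRUE)
standardizedSolution(fit)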

In sum, incremental predictive validity of the race IAT is low for two reasons. First, prejudice has only modest effects on actual behavior in a specific situation. Second, only a small portion of the variance in race IAT scores is valid.

Discussion

In the 1990s, social psychologists embraced the idea that behavior is often influenced by processes that occur without conscious awareness. This assumption triggered the implicit revolution (Greenwald & Banaji, 2017). The implicit paradigm provided a simple explanation for low correlations between self-ratings of prejudice and implicit measures of prejudice, r ~ .2. Accordingly, many people are not aware of how prejudiced their unconscious is. The Implicit Association Test seemed to support this view because participants showed more prejudice on the IAT than on self-report measures. The first studies of predictive validity also seemed to support this new model of prejudice (McConnell & Leibold, 2001), and the first meta-analysis suggested that implicit bias has a stronger influence on behavior than self-reported attitudes (Greenwald, Poehlman, Uhlmann, & Banaji, 2009, p. 17).

However, the following decade produced many findings that require a reevaluation of the evidence. Greenwald et al. (2009) published the largest test (N = 1057) of predictive validity. This study examined the ability of the race IAT to predict racial bias in the 2008 US presidential election. Although the race IAT was correlated with voting for McCain versus Obama, incremental predictive validity was close to zero and no longer significant when explicit measures were included in the regression model. Then subsequent meta-analyses produced lower estimates of predictive validity and it is no longer clear that predictive validity, especially incremental predictive validity, is high enough to reject the null-hypothesis. Although incremental predictive validity may vary across conditions, no conditions have been identified that show practically significant incremental predictive validity. Unfortunately, IAT proponents continue to make misleading statements based on single studies with small samples. For example, Kurdi et al. claimed that “effect sizes tend to be relatively large in studies on physician–patient interactions” (p. 583). However, this claim was based on a study with just 15 physicians, which makes it impossible to obtain precise effect size estimates about implicit bias effects for physicians.

Beyond Nil-Hypothesis Testing

Just like psychology in general, meta-analyses also suffer from the confusion of nil-hypothesis testing and null-hypothesis testing. The nil-hypothesis is the hypothesis that an effect size is exactly zero. Many methodologists have pointed out that it is rather silly to take the nil-hypothesis at face value because the true effect size is rarely zero (Cohen, 1994). The more important question is whether an effect size is sufficiently different from zero to be theoretically and practically meaningful. As pointed out by Greenwald (1975), effect size estimation has to be complemented with theoretical predictions about effect sizes. However, research on predictive validity of the race IAT lacks clear criteria to evaluate effect size estimates.

As noted in the introduction, there is agreement about the practical importance of statistically small effects for the prediction of discrimination and other prejudiced behaviors. The contentious question is whether the race IAT is a useful measure of dispositions to act prejudiced. Viewed from this perspective, the focus on the race IAT is myopic. The real challenge is to develop and validate measures of prejudice. IAT proponents have often dismissed self-reports as invalid, but the actual evidence shows that self-reports have some validity that is at least equal to the validity of the race IAT. Moreover, even distinct self-report measures like the feeling thermometer and the symbolic racism scale have incremental predictive validity. Thus, prejudice researchers should use a multi-method approach. At present, it is not clear that the race IAT can improve the measurement of prejudice (Greenwald et al., 2009; Schimmack, 2021a).

Methodological Implications

This article introduced a new type of meta-analysis. Rather than trying to find as many vaguely related studies as possible and to code as many outcomes as possible, a focused meta-analysis is limited to the main test of the key hypothesis. This approach has several advantages. First, the classic approach creates a large amount of heterogeneity that is unique to a few studies. This noise makes it harder to find real moderators. Second, the inclusion of vaguely related studies may dilute effect sizes. Third, the inclusion of non-focal studies may mask evidence of publication bias that is present in virtually all literatures. Finally, focused meta-analyses are much easier to do and can produce results much faster than the laborious meta-analyses that psychologists are used to. Even when classic meta-analyses exist, they often ignore publication bias. Thus, an important task for the future is to complement existing meta-analyses with focused meta-analyses to ensure that published effect size estimates are not diluted by irrelevant studies and not inflated by publication bias.

Prejudice Interventions

Enthusiasm about implicit biases has led to interventions that aim to reduce implicit biases. This focus on implicit biases in the real world needs to be reevaluated. First, there is no evidence that prejudice typically operates outside of awareness (Schimmack, 2021a). Second, individual differences in prejudice have only a modest impact on actual behaviors and are difficult to change. Not surprisingly, interventions that focus on implicit bias are not very effective. Rather than focusing on changing individuals’ dispositions, interventions may be more effective when they change situations. In this regard, the focus on internal factors is rather different from the general focus in social psychology on situational factors (Funder & Ozer, 1983). In recent years, it has become apparent that prejudice is often systemic. For example, police training may have a much stronger influence on racial disparities in fatal use of force than individual differences in prejudice of individual officers (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021).

Conclusion

The present meta-analysis of the race IAT provides further support for Meissner et al.’s (2019) conclusion that IATs’ “predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1). The present meta-analysis provides a quantitative estimate of b = .07. Although researchers can disagree about the importance of small effect sizes, I agree with Meissner et al. that the gains from adding a race IAT to the measurement of prejudice are negligible. Rather than looking for specific contexts in which the race IAT has higher predictive validity, researchers should use a multi-method approach to measure prejudice. The race IAT may be included to further explore its validity, but there is no reason to rely on the race IAT as the single most important measure of individual differences in prejudice.

References

Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107–112.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., et al. (2019). Relationship between the implicit association test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586. doi: 10.1037/amp0000364

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://www.jstatsoft.org/v036/i03

Failure to Accept the Null-Hypothesis: A Case Study

The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s framework, the null-hypothesis is typically specified as the absence of an effect in either direction. That is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the observed result, or of even more extreme deviations, is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inferences can be drawn from the finding because there are two explanations for it. Either the null-hypothesis is true, or it is false and the result is a false negative. The probability of such false negative results is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.

Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).

“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).

The reason is simply that it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show that an effect size equals any other precise value (e.g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g., we expected the effect size to be in the range between r = .3 and r = .5, or we expected the correlation to be larger than r = .1).

Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) requires N = 1,035 participants. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypotheses that predict small effects. A solution to this might be to focus on hypotheses that predict large effect sizes. However, to show that a population correlation of r = .4 is greater than r = .3 still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero, N = 138. In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are smaller than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
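
The sample size figures in this paragraph can be reproduced, up to rounding, with the standard Fisher-z approximation for testing a correlation against a non-zero value. A small R sketch:

# Approximate N needed to show that a true correlation r_true exceeds a minimum r_min
n_min_effect <- function(r_true, r_min, alpha = .05, power = .95) {
  z_diff <- atanh(r_true) - atanh(r_min)                # Fisher z difference
  ((qnorm(1 - alpha) + qnorm(power)) / z_diff)^2 + 3    # one-sided test
}
n_min_effect(.2, .1)   # ~ 1,035
n_min_effect(.4, .3)   # ~ 834, i.e., the N = 833 in the text up to rounding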

However, occasionally studies do have large samples and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify a minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT – criterion correlation was r = .236. The explicit – criterion correlation was r = .186, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behaviors.

In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).

A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would be to simply add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias as measured with the race IAT contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .001). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affect Misattribution Procedure, AMP) explained incremental variance in addition to political orientation. In this model “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance for the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most importantly, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.

Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.

They then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism” (p. 247).

To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.

Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to test the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.

Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypothesis. Rather than reporting a straightforward result, they present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.

They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.

The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.

The deceptive analysis of these data has led to many citations that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49)

Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).

Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi with Greenwald as co-author provides new evidence about the typical amount of incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity for the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.’s (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely (p < .05) that it is greater than .12.
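
For transparency, here is how the .084 estimate follows from the three meta-analytic correlations via a simple two-predictor standardized regression; confidence intervals would require the meta-analytic standard errors and are omitted in this sketch.

r_ic <- .100   # implicit-criterion correlation
r_ec <- .127   # explicit-criterion correlation
r_ie <- .139   # implicit-explicit correlation
b_implicit <- (r_ic - r_ec * r_ie) / (1 - r_ie^2)   # ~ .084, as reported above
b_explicit <- (r_ec - r_ic * r_ie) / (1 - r_ie^2)   # ~ .115 for the explicit measure
round(c(b_implicit, b_explicit), 3)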

Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et al., 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.

It is not only ironic that Greenwald's own research illustrates the prejudice against falsification. It also shows that one-sided testing of theories that avoids failures is not merely a matter of inadequate training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once researchers have made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a not-guilty verdict. To make matters worse, science does not have an adversarial system in which a defense lawyer stands up for the defendant (i.e., the null-hypothesis), so no evidence is presented in the defendant's favor.

Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just as we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologists to become meta-psychologists. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.

Incidental Anchoring Bites the Dust

Update: 6/10/21

After I published this post, I learned about a meta-analysis and new studies of incidental anchoring by David Shanks and colleagues that came to the same conclusion (Shanks et al., 2020).

Introduction

“The most expensive car in the world costs $5 million. How much does a new BMW 530i cost?”

According to anchoring theory, information about the most expensive car can lead to higher estimates of the cost of a BMW. Anchoring effects have been demonstrated in many credible studies since the 1970s (Tversky & Kahneman, 1974).

A more controversial claim is that anchoring effects occur even when the numbers are unrelated to the question and presented incidentally (Critcher & Gilovich, 2008). In one study, participants saw a picture of a football player and were asked to estimate the likelihood that the player would record a sack in the next game. The number on the player's jersey was manipulated to be 54 or 94. The study produced a statistically significant result suggesting that a higher jersey number leads people to give higher likelihood judgments. This study started a small literature on incidental anchoring effects. A variation on this theme is to present numbers so briefly on a computer screen that most participants do not actually see them. This is called subliminal priming. Allegedly, subliminal priming also produced anchoring effects (Mussweiler & Englich, 2005).

Since 2011, many psychologists have been skeptical about whether statistically significant results in published articles can be trusted. The reason is that researchers published only results that supported their theoretical claims, even when the claims were outlandish. For example, significant results suggested that extraverts can foresee where pornographic images will be displayed on a computer screen even before the computer randomly selects the location (Bem, 2011). No psychologist, except Bem, believes these findings. More problematic is that many other findings are equally incredible. A replication project found that only 25% of results in social psychology could be replicated (Open Science Collaboration, 2015). So, the question is whether incidental and subliminal anchoring are more like classic anchoring or more like extrasensory perception.

There are two ways to assess the credibility of published results when publication bias is present. One approach is to conduct credible replication studies that are published independent of the outcome of a study. The other approach is to conduct a meta-analysis of the published literature that corrects for publication bias. A recent article used both methods to examine whether incidental anchoring is a credible effect (Kvarven et al., 2020). In this article, the two approaches produced inconsistent results. The replication study produced a non-significant result with a tiny effect size, d = .04 (Klein et al., 2014). However, even with bias-correction, the meta-analysis suggested a significant, small to moderate effect size, d = .40.

Results

The data for the meta-analysis were obtained from an unpublished thesis (Henriksson, 2015). I suspected that the meta-analysis might have coded some studies incorrectly. Therefore, I conducted a new meta-analysis, using the same studies and one new study. The main difference between the two meta-analyses is that I coded studies based on the focal hypothesis test that was used to claim evidence for incidental anchoring. The p-values were then transformed into fisher-z transformed correlations and sampling errors, 1/sqrt(N – 3), based on the sample sizes of the studies.
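
The snippet below sketches one way to implement this conversion in R; the p-values and sample sizes are placeholders, and the snippet assumes two-sided p-values from the focal tests.

```r
# Minimal sketch (placeholder data): convert focal two-sided p-values into
# fisher-z effect sizes and sampling errors for a bias-corrected meta-analysis.
p <- c(.04, .01, .03, .20, .05, .002, .08, .04)  # illustrative p-values
N <- c(40, 120, 60, 80, 55, 200, 35, 90)         # illustrative sample sizes

z_stat <- qnorm(1 - p / 2)    # absolute z-statistic implied by each p-value
se_fz  <- 1 / sqrt(N - 3)     # sampling error of a fisher-z correlation
fz     <- z_stat * se_fz      # implied fisher-z effect size (so that fz / se_fz = z)

data.frame(p, N, z_stat, fz, se_fz)
```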

Whereas the old meta-analysis suggested that there is no publication bias, the new meta-analysis showed a clear relationship between sampling error and effect sizes, b = 1.68, se = .56, z = 2.99, p = .003. Correcting for publication bias produced a non-significant intercept, b = .039, se = .058, z = 0.672, p = .502, suggesting that the real effect size is close to zero.
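
The bias-corrected estimate reported here comes from a meta-regression of effect sizes on sampling error; the call below is a sketch with the metafor package, reusing the placeholder objects from the previous snippet, and is not the actual analysis script.

```r
# Minimal sketch: PET-style meta-regression with sampling error as moderator,
# using the placeholder fz and se_fz objects from the previous snippet.
library(metafor)

pet <- rma(yi = fz, sei = se_fz, mods = ~ se_fz, method = "REML")
summary(pet)
# The slope for se_fz tests for small-study effects (publication bias);
# the intercept estimates the effect size for a study with zero sampling error.
```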

Figure 1 shows the regression line for this model in blue and the results from the replication study in green. We see that the blue and green lines intersect when sampling error is close to zero. As sampling error increases because sample sizes are smaller, the blue and green line diverge more and more. This shows that effect sizes in small samples are inflated by selection for significance.

However, there is some statistically significant variability in the effect sizes, I2 = 36.60%, p = .035. To further examine this heterogeneity, I conducted a z-curve analysis (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). A z-curve analysis converts p-values into z-statistics. The histogram of these z-statistics shows publication bias, when z-statistics cluster just above the significance criterion, z = 1.96.
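
Such an analysis can be run with the zcurve R package; the snippet below is a sketch with simulated z-statistics standing in for the focal test statistics of the coded studies.

```r
# Minimal sketch (simulated data): fit a z-curve model to a set of z-statistics.
library(zcurve)

set.seed(1)
z_stats <- abs(rnorm(200, mean = 2, sd = 1))  # placeholder for |fz / se_fz| of the coded studies

fit <- zcurve(z = z_stats)  # finite mixture model fitted to the significant z-values
summary(fit)                # reports the expected discovery rate (EDR) and replication rate (ERR)
plot(fit)                   # histogram of z-statistics with the fitted curve
```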

Figure 2 shows a big pile of just significant results. As a result, the z-curve model predicts a large number of non-significant results that are absent from the published record. While the published articles have a success rate (observed discovery rate) of 73%, the model estimates that the expected discovery rate is only 6%. That is, for every 100 tests of incidental anchoring, only 6 studies are expected to produce a significant result. To put this estimate in context, with alpha = .05, 5 studies are expected to be significant based on chance alone. The 95% confidence interval around this estimate includes 5% and is limited at 26% at the upper end. Thus, researchers who reported significant results did so based on studies with very low power, and they needed luck or questionable research practices to obtain significant results.

A low discovery rate implies a high false positive risk. With an expected discovery rate of 6%, the false discovery risk is 76%. This is unacceptable. To reduce the false discovery risk, it is possible to lower the alpha criterion for significance. In this case, lowering alpha to .005 produces a false discovery risk of 5%. This leaves 5 studies that are significant.

One notable study with strong evidence, z = 3.70, examined anchoring effects for actual car sales. The data came from an actual auction of classic cars. The incidental anchors were the prices of the previous bid for a different vintage car. Based on sales data of 1,477 cars, the authors found a significant effect, b = .15, se = .04, which translates into a standardized effect size of d = .2 (fz = .087). Thus, while this study provides some evidence for incidental anchoring effects in one context, the effect size estimate is also consistent with the conclusion of the broader meta-analysis that effect sizes of incidental anchors are fairly small. Moreover, the incidental anchor in this study was still in the focus of attention and in some way related to the actual bid. Thus, weaker effects can be expected for anchors that are not related to the question at all (a player's jersey number) or for anchors presented outside of awareness.

Conclusion

There is clear evidence that the published findings on incidental anchoring cannot be trusted at face value. Consistent with questionable research practices in the field more generally, studies on incidental and subliminal anchoring suffer from publication bias that undermines the credibility of the published results. Unbiased replication studies and a bias-corrected meta-analysis suggest that incidental anchoring effects are either very small or zero. Thus, there is currently no empirical support for the notion that irrelevant numeric information can bias numeric judgments. More research on anchoring effects that corrects for publication bias is needed.

Why Are Ease of Retrieval Effects so Hard to Replicate?

Abstract

Social psychology suffers from a replication crisis because publication bias undermines the evidential value of published significant results. Meta-analyses that do not correct for publication bias are themselves biased and cannot be used to estimate effect sizes. Here I show that a meta-analysis of the ease-of-retrieval effect (Weingarten & Hutchinson, 2018) did not fully correct for publication bias and that 200 significant results for the ease-of-retrieval effect can be fully explained by publication bias. This conclusion is consistent with the results of the only registered replication study of ease of retrieval (Groncki et al., 2021). As a result, there is no empirical support for the ease-of-retrieval effect. Implications for the credibility of social psychology are discussed.

Introduction

Until 2011, social psychology appeared to have made tremendous progress. Daniel Kahneman (2011) reviewed many of the astonishing findings in his book “Thinking, Fast and Slow.” His book used Schwarz et al.'s (1991) ease-of-retrieval research as an example of rigorous research on social judgments.

The ease-of-retrieval paradigm is simple. Participants are assigned to one of two groups. In one group, they are asked to recall a small number of examples from memory. The number is chosen to make this easy. In the other group, participants are asked to recall a larger number of examples. The number is chosen so that it is hard to come up with the requested number of examples. This task is used to elicit a feeling of ease or difficulty. Hundreds of studies have used this paradigm to study the influence of ease of retrieval on a variety of judgments.

In the classic studies that introduced the paradigm, participants were asked to retrieve a few or many examples of assertiveness behaviors before answering a question about their assertiveness. Three studies suggested that participants based their personality judgments on the ease of retrieval.

However, this straightforward finding is not always found. Kahneman points out that participants sometimes do not rely on the ease of retrieval. Paradoxically, they sometimes rely on the number of examples they retrieved even though the number was given by the experimenter. What made ease-of-retrieval a strong theory was that ease of retrieval researchers seemed to be able to predict the conditions that made people use ease as information and the conditions when they would use other information. “The proof that you truly understand a pattern of behavior is that you know how to reverse it” (Kahneman, 2011).

This success story had one problem. It was not true. In 2011, it became apparent that social psychologists used questionable research practices to produce significant results. Thus, rather than making amazing predictions about the outcome of studies, they searched for statistical significance and then claimed that they had predicted these effects (John, Loewenstein, & Prelec, 2012; Kerr, 1998). Since 2011, it has become clear that only a small percentage of results in social psychology can be replicated without questionable practices (Open Science Collaboration, 2015).

I had my doubts about the ease-of-retrieval literature because I had heard rumors that researchers were unable to replicate these effects, but it was common not to publish these replication failures. My suspicions appeared to be confirmed when John Krosnick gave a talk about a project that replicated 12 experiments in a large, nationally representative sample. All but one experiment were successfully replicated. The exception was the ease-of-retrieval study, a direct replication of Schwarz et al.'s (1991) assertiveness studies. These results were published several years later (Yeager et al., 2019).

I was surprised when Weingarten and Hutchinson (2018) published a detailed and comprehensive meta-analysis of published and unpublished ease-of-retrieval studies and found evidence for a moderate effect size (d ~ .4) even after correcting for publication bias. This conclusion, based on many small studies, seemed inconsistent with the replication failure in the large, nationally representative sample (Yeager et al., 2019). Moreover, the first pre-registered direct replication of Schwarz et al. (1991) also produced a replication failure (Groncki et al., 2021). One possible explanation for the discrepancy between the meta-analytic results and the replication results could be that the meta-analysis did not fully correct for publication bias. To test this hypothesis, I used the openly shared data to examine the robustness of the effect size estimate. I also conducted a new meta-analysis that included studies published after 2014, using a different coding of studies that codes only one focal hypothesis test per study. The results show that the effect size estimate in Weingarten and Hutchinson's (2018) meta-analysis is not robust and depends heavily on outliers. I also find that their coding scheme attenuates the detection of bias, which leads to inflated effect size estimates. The new meta-analysis shows an effect size estimate close to zero. It also shows that heterogeneity is fully explained by publication bias.

Reproducing the Original Meta-Analysis

All effect sizes are Fisher-z transformed correlation coefficients. The predictor is the standard error; 1/sqrt(N – 3). Figure 1 reproduces the funnel plot in Weingarten and Hutchinson (2018), with the exception that sampling error is plotted on the x-axis and effect sizes are plotted on the y-axis.

Figure 1 also includes the predictions (regression lines) of three models. The first model is an unweighted average. This model assumes that there is no publication bias. The straight orange line shows that this model assumes an average effect size of z = .23 for all sample sizes. The second model assumes that there is publication bias and that bias increases in a linear fashion with sampling error. The slope of the blue regression line is significant and suggests that publication bias is present. The intercept of this model can be interpreted as the unbiased effect size estimate (Stanley, 2017). The intercept is z = .115 with a 95% confidence interval that ranges from .036 to .193. These results reproduce the results in Weingarten and Hutchinson (2018) closely, but not exactly, r = .104, 95%CI = .034 to .172. Simulation studies suggest that this effect size estimate underestimates the true effect size when the intercept is significantly different from zero (Stanley, 2017). In this case, it is recommended to use the variance (sampling error squared) as the predictor of publication bias. The red curve shows the predictions of this model. Most important, the intercept is now nearly at the same level as in the model without publication bias, z = .221, 95%CI = .174 to .267. Once more, these results closely reproduce the published results, r = .193, 95%CI = .153 to .232.
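
The unweighted PET and PEESE models described here can be written as ordinary regressions; the snippet below is a sketch with placeholder data, where PET uses the standard error and PEESE uses the squared standard error (the variance) as predictor.

```r
# Minimal sketch (placeholder data): unweighted PET and PEESE regressions.
fz    <- c(.45, .30, .25, .40, .15, .10, .20, .05)  # illustrative fisher-z effect sizes
se_fz <- c(.35, .25, .20, .30, .10, .08, .15, .05)  # illustrative sampling errors

pet   <- lm(fz ~ se_fz)        # PET: linear effect of sampling error
peese <- lm(fz ~ I(se_fz^2))   # PEESE: effect of sampling variance

coef(summary(pet))["(Intercept)", ]    # bias-corrected estimate under PET
coef(summary(peese))["(Intercept)", ]  # bias-corrected estimate under PEESE
# Conventional PET-PEESE practice: report the PEESE intercept only if the PET
# intercept is significantly greater than zero; otherwise report the PET intercept.
```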

The problem with unweighted models is that data points from small studies are given the same weight as data points from large studies. In this particular case, small studies are given even more weight than larger studies because studies with extremely small sample sizes (N < 20) are outliers, and outliers exert a strong influence in regression analysis. Inspection of the scatter plot shows that 7 studies with sample sizes less than 10 (5 per condition) have a strong influence on the regression line. As a result, all three regression lines in Figure 1 overestimate effect sizes for studies with more than 100 participants. Thus, the intercept overestimates the effect sizes for large studies, including Yeager et al.'s (2019) study with N = 1,323 participants. In short, the effect size estimate in the meta-analysis is strongly influenced by 7 data points that represent fewer than 100 participants.

A simple solution to this problem is to weight observations by sample size so that larger samples are given more weight. This is the default option in many meta-analysis programs, such as the metafor package in R (Viechtbauer, 2010). Thus, I reran the same analyses with observations weighted by sample size. Figure 2 shows the results. In Figure 2, the size of the data points reflects their weight. The most important difference in the results is that the intercept for the model with a linear effect of sampling error is practically zero and not statistically significant, z = .006, 95%CI = -.040 to .052. The confidence interval is small enough to infer that the typical effect size is close enough to zero to accept the null-hypothesis.

Proponents of ease-of-retrieval will, however, not be satisfied with this answer. First, inspection of Figure 2 shows that the intercept is now strongly influenced by a few large samples. Moreover, the model does show heterogeneity in effect sizes, I2 = 33.38%, suggesting that at least some of the significant results were based on real effects.

Coding of Studies

Effect size meta-analysis evolved without serious consideration of publication bias. Although publication bias has been documented since before meta-analysis was invented (Sterling, 1959), it was often an afterthought rather than part of the meta-analytic model (Rosenthal, 1979). Without having to think about publication bias, it became common practice to code individual studies without a focus on the critical test that was used to publish a study. This practice obscures the influence of publication bias and may lead to an overestimation of the average effect size. To illustrate this, I am going to focus on the 7 data points in Figure 1 that were coded with sample sizes less than 10.

Six of the observations stem from an unpublished dissertation by Bares (2007) that was supervised by Norbert Schwarz. The dissertation reported a study with children. The design had the main manipulation of ease of retrieval (few vs. many) as a between-subjects factor. Additional factors were gender, age (kindergartners vs. second graders), and 5 content domains (books, shy, friendly, nice, mean). The key dependent variables were frequency estimates. The total sample size was 198, with 98 participants in the easy condition and 100 in the difficult condition. The hypothesis was that ease of retrieval would influence judgments independent of gender or content. However, rather than testing the overall main effect across all participants, the dissertation presents analyses separately for different ages and contents. This led to the coding of this study, with a reasonable sample size of N = 198, as 20 effects with sample sizes of N = 7 to 9. Only six of these effects were included in the meta-analysis. Thus, the meta-analysis added 6 studies with non-significant results, when in fact only one study with non-significant results was put in the file drawer. As a result, the meta-analysis no longer represents the amount of publication bias in the ease-of-retrieval literature. Adding these six effects makes the data look less biased and attenuates the regression of effect sizes on sampling error, which in turn leads to a higher intercept. Thus, traditional coding of effect sizes in meta-analyses can lead to inflated effect size estimates even in models that aim to correct for publication bias.

An Updated Meta-Analysis of Ease-of-Retrieval

Building on Weingarten and Hutchinson’s (2018) meta-analysis, I conducted a new meta-analysis that relied on test statistics that were reported to test ease-of-retrieval effects. I only used published articles because the only reason to search for unpublished studies is to correct for publication bias. However, Weingarten and Hutchinson’s meta-analysis showed that publication bias is still present even with a diligent attempt to obtain all data. I extended the time frame of the meta-analysis by searching for new publications since the last year that was included in Weingarten and Hutchinson’s meta-analysis (i.e., 2014). For each study, I looked for the focal hypothesis test of the ease-of-retrieval effect. In some studies, this was a main effect. In other studies, it was a post-hoc test following an interaction effect. The exact p-values were converted into t-values and t-values were converted into fisher-z scores as effect sizes. Sampling error was based on the sample size of the study or the subgroup in which the ease of retrieval effect was predicted. For the sake of comparability, I again show unweighted and weighted results.
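
The snippet below sketches this conversion for a single study; the p-value and sample size are placeholders, and the degrees of freedom assume a simple two-group comparison.

```r
# Minimal sketch (placeholder values): convert a reported two-sided p-value and
# sample size into a t-value, a correlation, and a fisher-z effect size.
p  <- .03   # focal two-sided p-value
N  <- 60    # sample size of the study or subgroup
df <- N - 2 # degrees of freedom for a two-group comparison

t_val <- qt(1 - p / 2, df)           # |t| implied by the p-value
r     <- t_val / sqrt(t_val^2 + df)  # convert t into a correlation
fz    <- atanh(r)                    # fisher-z transformation
se_fz <- 1 / sqrt(N - 3)             # sampling error of the fisher-z score
c(t = t_val, r = r, fz = fz, se = se_fz)
```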

The effect size estimate for the random effects model that ignores publication bias is z = .340, 95%CI = .317 to .363. This would be a moderate effect size (d ~ .6). The model also shows a moderate amount of heterogeneity, I2 = 33.48%. Adding sampling error as a predictor dramatically changes the results. The effect size estimate is now practically zero, z = .020, and the 95%CI is small enough to conclude that any effect would be small, 95%CI = -.048 to .088. Moreover, publication bias fully explains the heterogeneity, I2 = 0.00%. Based on this finding, it is not recommended to use the variance as a predictor (Stanley, 2017). However, for the sake of comparison, Figure 3 also shows the results for this model. The red curve shows that the model makes similar predictions in the middle, but overestimates effect sizes for large samples and for small samples. Thus, the intercept is not a reasonable estimate of the average effect size, z = .183, 95%CI = .144 to .222. In conclusion, the new coding shows clearer evidence of publication bias, and even the unweighted analysis shows no evidence that the average effect size differs from zero.

Figure 4 shows that the weighted models produce very similar results to the unweighted results.

The key finding is that the intercept is not significantly different from zero, z = -.016, 95%CI = -.053 to .022. The upper bound of the 95%CI corresponds to an effect size of r = .022 or d = .04. Thus, the typical ease of retrieval effect is practically zero and there is no evidence of heterogeneity.

Individual Studies

Meta-analysis treats individual studies as interchangeable tests of a single hypothesis. This makes sense when all studies are more or less direct replications of the same experiment. However, meta-analyses in psychology often combine studies that vary in important details, such as the population (adults vs. children) and the dependent variables (frequency judgments vs. attitudes). Even if a meta-analysis showed a significant average effect size, it would remain unclear which particular conditions show the effect and which ones do not. This is typically examined in moderator analyses, but when publication bias is strong and effect sizes are dramatically inflated, moderator analyses have low power to detect signals in the noise.

In Figure 4, real moderators would produce systematic deviations from the blue regression line. As these residuals are small and strongly influenced by sampling error, finding a moderator is like looking for a needle in a haystack. To do so, it is useful to look for individual studies that produced more credible results than the average study. A new tool that can be used for this purpose is z-curve (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020).

Z-curve does not decompose p-values into separate effect size and sampling error components. Rather, it converts p-values into z-scores and models the distribution of z-scores with a finite mixture model. The results provide complementary information about publication bias that does not rely on variation in sample sizes. As correlations between sampling error and effect sizes can be produced by other factors, z-curve provides a more direct test of publication bias.

Z-curve also provides information about the false positive risk in individual studies. If a literature has a low discovery rate (many studies produce non-significant results), the false discovery risk is high (Soric, 1989). Z-curve estimates the size of the file drawer and provides a corrected estimate of the expected discovery rate. To illustrate z-curve, I fitted z-curve to the ease-of-retrieval studies in the new meta-analysis (Figure 5).

Visual inspection shows that most z-statistics are just above the criterion for statistical significance z = 1.96. This corresponds to the finding that most effect sizes are about 2 times the magnitude of sampling error, which produces a just significant result. The z-curve shows a rapid decline of z-statistics as z-values increase. The z-curve model uses the shape of this distribution to estimate the expected discovery rate; that is, the proportion of significant results that are observed if all tests that were conducted were available. The estimate of 8% implies that most ease-of-retrieval tests are extremely underpowered and can only produce significant results with the help of sampling error. Thus, most of the observed effect size estimates in Figure 4 reflect sampling error rather than any population effect sizes.

The expected discovery rate can be compared to the observed discovery rate to assess the amount of publication bias. The observed discovery rate is simply the percentage of significant results for the 128 studies. The observed discovery rate is 83% and would be even higher if marginally significant results, p < .10, z > 1.65, were counted as significant. Thus, the observed discovery rate is 10 times higher than the expected discovery rate. This shows massive publication bias.

The difference between the expected and observed discovery rate is also important for the assessment of the false positive risk. As Soric (1989) showed, the risk of false positives increases as the discovery rate decreases. The observed discovery rate of 83% implies that the false positive risk is very small (1%). Thus, readers of journals are given the illusion that ease-of-retrieval effects are robust and that researchers have a very good understanding of the conditions that produce the effect. Hence Kahneman's praise of researchers' ability to show the effect and to reverse it seemingly at will. The z-curve results show that this is an illusion because researchers only publish results when a study was successful. With an expected discovery rate of 8%, the false discovery risk is 61%. Thus, there is a high chance that studies with large samples will produce effect size estimates close to zero, which is consistent with the meta-analytic effect size estimates reported above.
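
Soric's (1989) upper bound on the false discovery risk can be computed directly from a discovery rate and the alpha level; the sketch below reproduces the figures in the text.

```r
# Soric's (1989) maximum false discovery risk for a given discovery rate and alpha.
soric_fdr <- function(discovery_rate, alpha = .05) {
  (1 / discovery_rate - 1) * (alpha / (1 - alpha))
}

soric_fdr(.83)  # observed discovery rate of 83% -> false positive risk of about 1%
soric_fdr(.08)  # expected discovery rate of 8%  -> false discovery risk of about 61%
```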

One solution to reduce the false-positive risk is to lower the significance criterion (Benjamin et al., 2017). Z-curve can be fitted with different alpha-levels to examine the influence on the false positive risk. By setting alpha to .005, the false positive risk is below 5% (Figure 6).

This leaves 36 studies that may have produced a real effect. A true positive result does not mean that a direct replication study will produce a significant result. To estimate replicability, we can select only the studies with p < .005 (z > 2.8) and fit z-curve to these studies using the standard significance criterion of .05. The false discovery risk inches up a bit, but may be considered acceptable at 8%. However, the expected replication rate with the same sample sizes is only 47%. Thus, replication studies need to increase sample sizes to avoid false negative results.

Five of the studies with strong evidence are by Sanna and colleagues. This is noteworthy because Sanna retracted 8 articles, including an article with ease-of-retrieval effects under suspicion of fraud (Yong, 2012). It is therefore unlikely that these studies provide credible evidence for ease-of-retrieval effects.

An article with three studies reported consistently strong evidence (Ofir et al., 2008). All studies manipulated the ease of recall of products and found that recalling a few low priced items made participants rate a store as less expensive than recalling many low priced items. It seems simple enough to replicate this study to test the hypothesis that ease of retrieval effects influence judgments of stores. Ease of retrieval may have a stronger influence for these judgments because participants may have less chronically accessible and stable information to make these judgments. In contrast, assertiveness judgments may be harder to move because people have highly stable self-concepts that show very little situational variance (Eid & Diener, 2004; Anusic & Schimmack, 2016).

Another article that provided three studies examined willingness to pay for trips to England (Sinha & Naykankuppam, 2013). A major difference to other studies was that this study supplied participants with information about tourist destinations in England and after a delay used recall of this information to manipulate ease of retrieval. Once more, ease-of-retrieval may have had an effect in these studies because participants had little chronically accessible information to make willingness-to-pay judgments.

A third, three study article with strong evidence found that participants rated the quality of their memory for specific events (e.g., New Year’s Eve) worse when they were asked to recall many (vs. few) facts about the event (Echterhoff & Hirst, 2006). These results suggest that ease-of-retrieval is used for judgments about memory, but may not influence other judgments.

The focus on individual studies shows why moderator analyses in effect-size meta-analysis often produce non-significant results. Most of the moderators that can be coded are not relevant, whereas moderators that are relevant can be limited to a single article and are not coded.

The Original Paradigm

It is not clear why Schwarz et al. (1991) chose personality ratings of assertiveness as the target of their manipulation. A look at the personality literature suggests that these judgments are made quickly and with high temporal stability. Thus, they seem a challenging target for demonstrating situational influences.

It was also risky to conduct these studies with small sample sizes that require large effect sizes to produce significant results. Nevertheless, the first study with 36 participants produced an encouraging, marginally significant result, F(1, 34), p = .07. Study 2 followed up on this result with a larger sample to boost power and did produce a significant result, F(1, 142) = 6.35, p = .01. However, observed power (70%) was still below the recommended level of 80%. Thus, the logical next step would have been to test the effect again with an even larger sample. Instead, the authors tested a moderator hypothesis in a smaller sample, which surprisingly produced a significant three-way interaction, F(1, 70) = 9.75, p < .01. Despite this strong interaction, the predicted ease-of-retrieval effects were not statistically significant because the cell sizes were very small, assertive: t(18) = 1.55, p = .14, unassertive: t(18) = 1.91, p = .07.

It is unlikely that three underpowered studies would all produce supportive evidence (Schimmack, 2012), suggesting that the reported results were selected from a larger set of tests. This hypothesis can be tested with the Test of Insufficient Variance (TIVA), a bias test for small sets of studies (Renkewitz & Keiner, 2019; Schimmack, 2015). TIVA shows that the variation in the p-values is less than expected, but the evidence is not conclusive. Nevertheless, even if the authors were just lucky, future studies would be expected to produce non-significant results unless sample sizes were increased considerably. However, most direct replication studies of the original design used equally small sample sizes, yet reported successful outcomes.
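
The logic of TIVA can be sketched in a few lines of R; the p-values in the example are placeholders, not the values from these studies.

```r
# Minimal sketch of the Test of Insufficient Variance (TIVA): if studies are
# reported without selection, their z-scores should have a variance of at
# least 1; a much smaller variance suggests selection for significance.
tiva <- function(p_values) {
  z <- qnorm(1 - p_values / 2)                   # two-sided p-values to z-scores
  k <- length(p_values)
  var_z <- var(z)
  p_tiva <- pchisq((k - 1) * var_z, df = k - 1)  # left-tail chi-square test
  c(variance = var_z, p = p_tiva)
}

tiva(c(.07, .01, .14))  # illustrative p-values only
```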

Yahalom and Schul (2016) reported a successful replication in another small sample (N = 20), with an inflated effect size estimate, t(18) = 2.58, p < .05, d = 1.15. Rather than showing the robustness of the effect, this strengthens the evidence that bias is present, TIVA p = .05. Another study in the same article found evidence for the effect again, but only when participants were instructed to hear some background noise and not when they were instructed to listen to background noise, t(25) = 2.99, p = .006. The bias test remains significant, TIVA p = .05. Kuehnen did not find the effect, but claimed an interaction with item-order for questions about ease-of-retrieval and assertiveness. A non-significant trend emerged when ease-of-retrieval questions were asked first, which was not reported, t(34) = 1.10, p = .28. The bias test remains significant, TIVA p = .08. More evidence from small samples comes from Caruso (2008). In Study 1a, 30 participants showed an ease-of-retrieval effect, F(1,56) = 6.91, p = .011. The bias test remains significant, TIVA p = .06. In Study 1b with more participants (N = 55), the effect was not significant, F(1, 110) = 1.05, p = .310. The bias test remains significant despite the non-significant result, TIVA p = .08. Tormala et al. (2007) added another just significant result with 79 participants, t(72) = 1.97, p = .05. This only strengthens the evidence of bias, TIVA p = .05. Yahalom and Schul (2013) also found a just significant effect with 130 students, t(124) = 2.54, which again strengthens the evidence of bias, TIVA p = .04. Study 2 reduced the number of participants to 40, yet reported a significant result, F(1,76) = 8.26, p = .005. Although this p-value nearly reached the .005 level, there is no theoretical explanation why this particular direct replication of the original finding should have produced a real effect. Evidence for bias remains significant, TIVA p = .05. Study 3 reverted back to a marginally significant result that only strengthens the evidence of bias, t(114) = 1.92, p = .06, TIVA p = .02. Greifeneder and Bless (2007) manipulated cognitive load and found the predicted trend only in the low-load condition, t(76) = 1.36, p = .18. Evidence for bias remained unchanged, TIVA p = .02.

In conclusion, from 1991 to 2016 published studies appeared to replicate the original findings, but this evidence is not credible because there is evidence of publication bias. Not a single one of these studies produced a p-value below .005, which has been suggested as a significance level that keeps the type-I error rate at an acceptable level (Benjamin et al., 2017).

Even meta-analyses of these small studies that correct for bias are inconclusive because sampling error is large and effect size estimates are imprecise. The only way to provide strong and credible evidence is to conduct a transparent and ideally pre-registered replication study with a large sample. One study like this was published by Yeager et al. (2019). With N = 1,325 participants the study failed to show a significant effect, F(1, 1323) = 1.31, p = .25. Groncki et al. (2021) conducted the first pre-registered replication study with N = 659 participants. They also ended up with a non-significant result, F(1, 657) = 1.34, p = .25.

These replication failures are only surprising if the inflated observed discovery rate is used to predict the outcome of future studies. In that case, we would expect an 80% probability of obtaining significant results, and an even higher probability given the larger sample sizes. However, when we take publication bias into account, the expected discovery rate is only 8%, and even large samples will not produce significant results if the true effect size is close to zero.

In conclusion, the clear evidence of bias and the replication failures in two large replication studies suggest that the original findings were only obtained with luck or with questionable research practices. However, naive interpretation of these results created a literature with over 200 published studies without a real effect. In this regard, ease of retrieval is akin to the ego-depletion literature that is now widely considered invalid (Inzlicht, Werner, Briskin, & Roberts, 2021).

Discussion

The year 2011 was a watershed moment in the history of social psychology. It split social psychology into two camps. One camp denies that questionable research practices undermine the validity of published results and continues to rely on published studies as credible empirical evidence (Schwarz & Strack, 2016). The other camp assumes that most published results are false positives and trusts only new studies that are published following open science practices, with badges for sharing of materials and data, and ideally pre-registration.

Meta-analyses can help to find a middle ground by examining carefully whether published results can be trusted, even if some publication bias is present. To do so, meta-analyses have to take publication bias seriously. Given the widespread use of questionable practices in social psychology, we have to assume that bias is present (Schimmack, 2020). Published meta-analyses that did not properly correct for publication bias can at best provide an upper limit for effect sizes, but they cannot establish that an effect exists or that the effect size has practical significance.

Weingarten and Hutchinson (2018) tried to correct for publication bias by using the PET-PEESE approach (Stanley, 2017). This is currently the best bias-correction method, but it is by no means perfect (Hong & Reed, 2021; Stanley, 2017). Here I demonstrated one pitfall in the use of PET-PEESE. Coding of studies that does not match the bias in the original articles can obscure the amount of bias and lead to inflated effect size estimates, especially if the PET model is incorrectly rejected and the PEESE results are accepted at face value. As a result, the published effect size of r = .2 (d = .4) was dramatically inflated and new results suggest that the effect size is close to zero.

I also showed in a z-curve analysis that the false positive risk for published ease-of-retrieval studies is high because the expected discovery rate is low and the file drawer of unpublished studies is large. To reduce the false positive risk, I recommend adjusting the significance level to alpha = .005, which is consistent with other calls for more stringent criteria to claim discoveries (Benjamin et al., 2017). Based on this criterion, neither the original studies nor any direct replications of the original studies were significant. A few variations of the paradigm may have produced real effects, but pre-registered replication studies are needed to examine this question. For now, ease of retrieval is a theory without credible evidence.

For many social psychologists, these results are shocking and hard to believe. However, the results are by no means unique to the ease-of-retrieval literature. It has been estimated that only 25% to 50% of published results in social psychology can be replicated (Schimmack, 2020). Other large literatures such as implicit priming, ego-depletion, and facial feedback have also been questioned by rigorous meta-analyses and large replication studies.

For methodologists, the replication crisis in psychology is not a surprise. They have warned for decades that selection for significance renders significant results insignificant (Sterling, 1959) and that sample sizes are too low (Cohen, 1962). To avoid similar mistakes in the future, researchers should conduct continuous power analyses and bias tests. As demonstrated here for the assertiveness paradigm, bias tests ring the alarm bells from the start and continue to show bias. In the future, we do not need to wait 40 years before we realize that researchers are chasing an elusive phenomenon. Sample sizes need to be increased or the research needs to stop. Amassing a literature of 200 studies with a median sample size of N = 53 and 8% power is a mistake that should not be repeated.

Social psychologists should be the least surprised that they fooled themselves into believing their results. After all, they have demonstrated with credible studies that confirmation bias has a strong influence on human information processing. They should therefore embrace open science, bias checks, and replication studies as safeguards that are necessary to minimize confirmation bias and to make scientific progress.

References

Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1–48. https://www.jstatsoft.org/v036/i03

Aber bitte ohne Sanna

Abstract

Social psychologists have failed to clean up their act and their literature. Here I show unusually high effect sizes in non-retracted articles by Sanna, who has had several articles retracted. I point out that non-retraction does not equal credibility, and I show that co-authors like Norbert Schwarz lack any motivation to correct the published record. The inability of social psychologists to acknowledge and correct their mistakes renders social psychology a para-science that lacks credibility. Even meta-analyses cannot be trusted because they do not properly correct for the use of questionable research practices.

Introduction

When I grew up, a popular German Schlager was the song “Aber bitte mit Sahne.” The song is about Germans' love of desserts with whipped cream. So, when I saw articles by Sanna, I had to think of whipped cream, which is delicious. Unfortunately, articles by Sanna are the exact opposite. In the early 2010s, it became apparent that Sanna had fabricated data. However, unlike the thorough investigation of a similar case in the Netherlands, the extent of Sanna's fraud remains unclear (Retraction Watch, 2012). The latest count of Sanna's retracted articles was 8 (Retraction Watch, 2013).

WebOfScience shows 5 retraction notices for 67 articles, which means 62 articles have not been retracted. The question is whether these articles can be trusted to provide valid scientific information. The answer matters because Sanna's articles are still being cited at a rate of over 100 citations per year.

Meta-Analysis of Ease of Retrieval

The data are also being used in meta-analyses (Weingarten & Hutchinson, 2018). Fraudulent data are particularly problematic for meta-analyses because fabricated results tend to be unusually strong and can inflate effect size estimates. Here I report the results of my own investigation, which focuses on the ease-of-retrieval paradigm that was developed by Norbert Schwarz and colleagues (Schwarz et al., 1991).

The meta-analysis included 7 studies from 6 articles. Two studies produced independent effect size estimates for 2 conditions for a total of 9 effect sizes.

Sanna, L. J., Schwarz, N., & Small, E. M. (2002). Accessibility experiences and the hindsight bias: I knew it all along versus it could never have happened. Memory & Cognition, 30(8), 1288–1296. https://doi.org/10.3758/BF03213410 [Study 1a, 1b]

Sanna, L. J., Schwarz, N., & Stocker, S. L. (2002). When debiasing backfires: Accessible content and accessibility experiences in debiasing hindsight. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(3), 497–502. https://doi.org/10.1037/0278-7393.28.3.497
[Study 1 & 2]

Sanna, L. J., & Schwarz, N. (2003). Debiasing the hindsight bias: The role of accessibility experiences and (mis)attributions. Journal of Experimental Social Psychology, 39(3), 287–295. https://doi.org/10.1016/S0022-1031(02)00528-0 [Study 1]

Sanna, L. J., Chang, E. C., & Carter, S. E. (2004). All Our Troubles Seem So Far Away: Temporal Pattern to Accessible Alternatives and Retrospective Team Appraisals. Personality and Social Psychology Bulletin, 30(10), 1359–1371. https://doi.org/10.1177/0146167204263784
[Study 3a]

Sanna, L. J., Parks, C. D., Chang, E. C., & Carter, S. E. (2005). The Hourglass Is Half Full or Half Empty: Temporal Framing and the Group Planning Fallacy. Group Dynamics: Theory, Research, and Practice, 9(3), 173–188. https://doi.org/10.1037/1089-2699.9.3.173 [Study 3a, 3b]

Carter, S. E., & Sanna, L. J. (2008). It’s not just what you say but when you say it: Self-presentation and temporal construal. Journal of Experimental Social Psychology, 44(5), 1339–1345. https://doi.org/10.1016/j.jesp.2008.03.017 [Study 2]

When I examined Sanna's results, I found that all 9 of these effect sizes were extremely large, with effect size estimates exceeding one standard deviation. A logistic regression analysis that predicted authorship (with Sanna vs. without Sanna) from effect size showed that the large effect sizes in Sanna's articles were unlikely to be due to sampling error alone, b = 4.6, se = 1.1, t(184) = 4.1, p = .00004 (1 / 24,642).
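
The snippet below sketches this kind of analysis with placeholder data; the variable names and values are illustrative, not the actual coded effect sizes.

```r
# Minimal sketch (placeholder data): does the size of a coded fisher-z effect
# predict whether the study was co-authored by Sanna?
ease_meta <- data.frame(
  fz    = c(1.1, 1.3, 0.9, 0.5, 0.3, 0.6, 0.2, 0.4, 0.7, 0.1),  # illustrative effect sizes
  sanna = c(1,   1,   1,   1,   0,   0,   0,   0,   0,   0)     # 1 = Sanna co-authored
)

fit <- glm(sanna ~ fz, family = binomial, data = ease_meta)
summary(fit)  # a large positive slope indicates unusually strong effects in Sanna's studies
```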

These results show that Sanna's effect sizes are not typical for the ease-of-retrieval literature. As one of his retracted articles used the ease-of-retrieval paradigm, it is possible that these articles are equally untrustworthy. As many other studies have investigated ease-of-retrieval effects, it seems prudent to exclude articles by Sanna from future meta-analyses.

These articles should also not be cited as evidence for specific claims about ease-of-retrieval effects under the specific conditions that were used in these studies. As the meta-analysis shows, there have been no credible replications of these studies, and it remains unknown how much of a role ease of retrieval plays under the conditions specified in Sanna's articles.

Discussion

This blog post is also a warning for young scientists and students of social psychology that they cannot trust researchers who became famous with the help of questionable research practices that produced too many significant results. As the reference list shows, several articles by Sanna were co-authored by Norbert Schwarz, the inventor of the ease-of-retrieval paradigm. It is most likely that he was unaware of Sanna's fraudulent practices. However, he seemed to lack any concern that the results might be too good to be true. After all, he encountered replication failures in his own lab:

“[O]f course, we had studies that remained unpublished. Early on we experimented with different manipulations. The main lesson was: if you make the task too blatantly difficult, people correctly conclude the task is too difficult and draw no inference about themselves. We also had a couple of studies with unexpected gender differences” (Schwarz, email communication, 5/18/21).

So, why was he not suspicious when Sanna produced only successful results? I was wondering whether Schwarz, with the benefit of hindsight, had developed some doubts about these studies. After all, a decade or more later, we know that Sanna committed fraud in some articles on this topic, we know about replication failures in larger samples (Yeager et al., 2019), and we know that the true effect sizes are much smaller than Sanna's reported effect sizes (Weingarten & Hutchinson, 2018).

Hi Norbert, 
   thank you for your response. I am doing my own meta-analysis of the literature as I have some issues with the published one by Evan. More about that later. For now, I have a question about some articles that I came across, specifically Sanna, Schwarz, and Small (2002). The results in this study are very strong (d ~ 1).  Do you think a replication study powered for 95% power with d = .4 (based on meta-analysis) would produce a significant result? Or do you have concerns about this particular paradigm and do not predict a replication failure?
Best, Uli (email communication)

His response shows that he is unwilling or unable to even consider the possibility that Sanna used fraud to produce the results in this article that he co-authored.

Uli, that paper has 2 experiments, one with a few vs many manipulation and one with a facial manipulation.  I have no reason to assume that the patterns won’t replicate. They are consistent with numerous earlier few vs many studies and other facial manipulation studies (introduced by Stepper & Strack,  JPSP, 1993). The effect sizes always depend on idiosyncracies of topic, population, and context, which influence accessible content and accessibility experience. The theory does not make point predictions and the belief that effect sizes should be identical across decades and populations is silly — we’re dealing with judgments based on accessible content, not with immutable objects.  

This response is symptomatic of social psychologists' responses to decades of research that has produced questionable results that often fail to replicate (see Schimmack, 2020, for a review). Even when there is clear evidence of questionable practices, journals are reluctant to retract articles that make false claims based on invalid data (Kitayama, 2020). And social psychologist Daryl Bem would rather be remembered as a loony para-psychologist than as a real scientist (Bem, 2021).

The problem with these social psychologists is not that they made mistakes in the way they conducted their studies. The problem is their inability to acknowledge and correct their mistakes. While they are clinging to their CVs and H-Indices to protect their self-esteem, they are further eroding trust in psychology as a science and forcing junior scientists who want to improve things out of academia (Hilgard, 2021). After all, the key feature that distinguishes science from ideology is the ability to correct itself. A science that shows no signs of self-correction is a para-science, not a real science. Thus, social psychology is currently a para-science (i.e., “a broad category of academic disciplines, that are outside the scope of scientific study,” Wikipedia).

The only hope for social psychology is that young researchers are unwilling to play by the old rules and start a credibility revolution. However, the incentives still favor conformists who suck up to the old guard. Thus, it is unclear if social psychology will ever become a real science. A first sign of improvement would be to retract articles that make false claims based on results that were produced with questionable research practices. Instead, social psychologists continue to write review articles that ignore the replication crisis (Schwarz & Strack, 2016) as if repression can bend reality.

Nobody should believe them.