“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).
See Reference List at the end for peer-reviewed publications.
The purpose of the R-Index blog is to increase the replicability of published results in psychological science and to alert consumers of psychological research about problems in published articles.
I have used these tools to demonstrate that several claims in psychological articles are incredible (a.k.a., untrustworthy), starting with Bem’s (2011) outlandish claims of time-reversed causal pre-cognition (Schimmack, 2012). This article triggered a crisis of confidence in the credibility of psychology as a science.
Over the past decade it has become clear that many other seemingly robust findings are also highly questionable. For example, I showed that many claims in Nobel Laureate Daniel Kahneman’s book “Thinking: Fast and Slow” are based on shaky foundations (Schimmack, 2020). An entire book on unconscious priming effects, by John Bargh, also ignores replication failures and lacks credible evidence (Schimmack, 2017). The hypothesis that willpower is fueled by blood glucose and easily depleted is also not supported by empirical evidence (Schimmack, 2016). In general, many claims in social psychology are questionable and require new evidence to be considered scientific (Schimmack, 2020).
Each year I post new information about the replicability of research in 120 Psychology Journals (Schimmack, 2021). I also started providing information about the replicability of individual researchers and provide guidelines how to evaluate their published findings (Schimmack, 2021).
Replication is essential for an empirical science, but it is not sufficient. Psychology also has a validation crisis (Schimmack, 2021). That is, measures are often used before it has been demonstrate how well they measure something. For example, psychologists have claimed that they can measure individuals’ unconscious evaluations, but there is no evidence that unconscious evaluations even exist (Schimmack, 2021a, 2021b).
If you are interested in my story how I ended up becoming a meta-critic of psychological science, you can read it here (my journey).
Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4, MP.2018.874, 1-22 https://doi.org/10.15626/MP.2018.874
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566 http://dx.doi.org/10.1037/a0029487
Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne, 61(4), 364–376. https://doi.org/10.1037/cap0000246
The first five parts built a model that related personality traits with well-being. Part six added sex (male/female) to the model. It may not come as a surprise that part 7 adds age to the model because sex and age are two commonly measured demographic variables.
Age and Wellbeing
Diener et al.’s (1999) review article pointed out that early views of old age as a period of poor health and misery was not supported by empirical studies. Since then, some studies with national representative samples have found a U-shaped relationship between age and well-being. Accordingly, well-being decreases from young adulthood to middle age and then increases again into old age before well-being declines at the end of life. Thus, there is some evidence for a mid-life crisis (Blanchflower, 2021).
The present dataset cannot examine this U-shaped pattern because data are based on students and their parents, but the U-shaped pattern would predict that students have higher well-being than their middle-aged parents.
McAdams, Lucas, and Donnellan (2012) found that the relationship between age and life-satisfaction was explained by effects of age on life-domains. According to their findings in a British sample, health satisfaction decreased with age, but housing satisfaction increased with age. The average trend across domains mirrored the pattern for life-satisfaction judgments.
Based on these findings, I expected that age was a negative predictor of life-satisfaction and that this negative relationship is mediated by domain satisfaction. To test this prediction I added age as a predictor variable. As for sex, age is an exogeneous variable because age can influence personality and well-being, but personality cannot influence (biological) age. Although age was added as a predictor for all factors in the model, overall model fit decreased, chi2(1478) = 2198, CFI = .973, RMSEA = .019. This can happen when a new variable is also related to the unique variances of indicators. Inspection of the modification indices showed some additional relationships with self-ratings that suggested older respondents have a positive bias in their self-ratings. To allow for this possibility, I allowed all self-ratings to be influenced by age. This modification substantially increased model fit, chi2(1462) = 1970, CFI = .981, RMSEA = .016. I will further examine this positivity bias in the next model. Here I focus on the findings for age and well-being.
As expected, age was a negative predictor of life-satisfaction, b = -.21, se = .04, Z = 5.5. This effect was fully mediated. The direct effect of age on life-satisfaction was close to zero and not significant, b = -.01, se = .04, Z = 0.34. Age also had no direct effect on positive affect (happy), b = .00, se = .00, Z = 0.44, and only a small effect on negative affect (sadness), b = -.03, se = .01, Z = 2.5. Yet, the sign of this relationship shows lower levels of sadness in middle age, which does not explain the lower level of life-satisfaction. In contrast, age was a negative predictor of average domain satisfaction (DSX) and the effect size was close to the effect size for life-satisfaction, b = -.20, se = .05, Z = 4.1. This results replicates McAdams et al.’s (2012) finding that domain satisfaction mediates the effect of age on life-satisfaction.
However, the monster model shows that domain satisfaction is influenced by personality traits. Thus, it is possible that some of the age effects on domain satisfaction are not only influenced by objective domain aspects, but also by top-down effects of personality traits. To examine this, I traced the indirect effects of age on average domain satisfaction.
Age was a notable negative predictor of cheerfulness, b = -.29, se = .04, Z = 7.5. This effect was partially mediated by extraversion, b = -.07, se = 02, Z = 3.5 and agreeableness, b = -.08, se = .02, Z = 4.5, while some of the effect was direct, b = -.14, se = .03, Z = 4.4. There was no statistically significant effect of age on depressiveness, b = .07, se = 04, Z = 1.9.
Age also had direct relationships with some life domains. Age was a positive predictor of romantic satisfaction, b = .36, se = .04, Z = 8.2. Another strong relationship emerged for health satisfaction, b = -.36, se = .04, Z = 8.4. Another negative relationship was observed for work, b = -.26, se = .04, Z = 6.4, reflecting the difference between studying and working. Age was also a negative predictor of housing satisfaction, b = -.10, se = .04, Z = 2.8, recreation satisfaction, b = -.15, se = .05, Z = 3.4, financial satisfaction, b = -.10, se = .05, Z = 2.1, and friendship satisfaction, b = -.09, se = .04, Z = 2.1. In short, age was a negative predictor of satisfaction with al life domains even after controlling for the effects of age on cheerfulness.
The only positive effect of age was an increase in conscientiousness, b = .15, se = .04, Z = 3.7, which is consistent with the personality literature (Roberts, Walton, & Viechtbauer, 2006). However, the indirect positive effect on life-satisfaction is small, b = .04
In conclusion, the present results replicate that well-being decreases from young adulthood to middle age. The effect is mainly explained by a decrease in cheerfulness and decreasing satisfaction with a broad range of life domains. The only exception was a positive effect on romantic satisfaction. These results have to be interpreted in the context of the specific sample. Younger participants were students. It is possible that young adults who already join the workforce have lower well-being than students. The higher romantic satisfaction for parents may also be due to the recruitment of parents who remained married with children. Singles and divorced middle-aged individuals show lower life-satisfaction. The fact that age effects were fully mediated shows that studies of age and well-being can benefit from the inclusion of personality measures and the measurement of domain satisfaction (McAdams et al., 2012).
The first five parts of this series built a model that related the Big Five personality traits as well as the depressiveness facet of neuroticism and the cheerfulness facet of extraversion to well-being. In this model, well-being is conceptualized as a weighted average of satisfaction with life domains and experiences of happiness and sadness (Part 5).
Part 6 adds sex/gender to the model. Although gender is a complex construct, most individuals identify as either male or female. As sex is frequently assessed as a demographic characteristic, the simple correlations of sex with personality and well-being are fairly well known and were reviewed by Diener et al. (1999).
A somewhat surprising finding is that life-satisfaction judgments show hardly any sex differences. Diener et al. (1999) point out that this finding seems to be inconsistent with findings that women report higher levels of neuroticism (neuroticism is a technical term for a disposition to experience more negative affects and does not imply a mental illness), negative affect, and depression. Accordingly, gender could have a negative effect on well-being that is mediated by neuroticism and depressiveness. To explain the lack of a sex difference in well-being, Diener et al. proposed that women also experience more positive emotions. Another possible mediator is agreeableness. Women consistently score higher in agreeableness and agreeableness is a positive predictor of well-being. Part 5 showed that most of the positive effect of agreeableness was mediated by cheerfulness. Thus, agreeableness may partially explain higher levels of cheerfulness for women. To my knowledge, these mediation hypotheses have never been formally tested in a causal model.
Adding sex to the monster model is relatively straightforward because sex is an exogeneous variable. That is causal paths can originate from sex, but no causal path can be pointed at sex. After all, we know that sex is determined by the genetic lottery at the moment of conception. It is therefore possible to add sex as a cause to all factors in the model. Despite adding all causal pathways, model fit decreased a bit, chi2(1432) = 2068, CFI = .976, RMSEA = .018. The main reason for reduced fit would be that sex predicts some of the unique variances in individual indicators. Inspection of modification indices showed that sex was related to higher student ratings of neuroticism and lower ratings of neuroticism by mothers’ as informants. While freeing these parameters improved model fit, the effect on sex differences in neuroticism were opposite. Assuming (!) that mothers’ underestimate neuroticism, increased sex differences in neuroticism from d = .69, se = .07 to d = .81, se = .07. Assuming that students’ overestimate neuroticism resulted in a smaller sex difference of d = .54, se = .08. Thus, the results suggest that sex differences in neuroticism are moderate to large (d = .5 to .8), but there is uncertainty due to some rating biases in ratings of neuroticism. A model that allowed for both biases had even better fit and produced the compromise effect size estimate of d = .67, se = .08. Overall fit was now only slightly lower than for the model without sex, chi2(1430) = 2024, CFI = .978, RMSEA = .017. Figure 2 shows the theoretically significant direct effects of sex with effect sizes in units of standard deviations (Cohen’s d).
The model not only replicated sex differences in neuroticism. It also replicated sex differences in agreeableness, although the effect size was small, d = .29, se = .08, Z = 3.7. Not expected was the finding that women also scored higher in extraversion, d = .38, se = .07, Z = 5.6, and conscientiousness, d = .36, se = .07, Z = 5.0. The only life domain with a notable sex difference was romantic relationships, d = -.41, se = .08, Z = 5.4. The only other statistically significant difference was found for recreation, d = -.19, se = .08, Z = 2.4. Thus, life domains do not contribute substantially to sex differences in well-being. Even the sex difference for romantic satisfaction is not consistently found in studies of marital satisfaction.
The model indirect results replicated the finding that there are no notable sex differences in life-satisfaction, total effect d = -.07, se = .06, Z = 1.1. Thus, tracing the paths from sex to life-satisfaction provides valuable insights into the paradox that women tend to have higher levels of neuroticism, but not lower life-satisfaction.
Consistent with prior studies, women had higher levels of depressiveness and the effect size was small, d = .24, se = .08, Z = 3.0. The direct effect was not significant, d = .06, se = .08, Z = 0.8. The only positive effect was mediated by neuroticism, d = .42, se = .06, Z = 7.4. Other indirect effects reduced the effect of sex on depressiveness. Namely, women’s higher conscientiousness (in this sample) reduced depressiveness, d = -.14, as did women’s higher agreeableness, d = -.06, se = .02, Z = 2.7, and women’s higher extraversion, d = -.04, se = .02, Z = 2.4. These results show the problem of focusing on neuroticism as a predictor of well-being. While neuroticism shows a moderate to strong sex difference, it is not a strong predictor of well-being. In contrast, depressiveness is a stronger predictor of well-being, but has a relatively small sex difference. This small sex difference partially explains why women can have higher levels of neuroticism without lower levels of well-being. Men and women are nearly equally disposed to suffer from depression. Consistent with this finding, men are actually more likely to commit suicide than women.
Consistent with Diener et al.’s (1999) hypothesis, cheerfulness also showed a positive relationship with sex. The total effect size was larger than for depressiveness, d = .50, se = .07, Z = 7.2. The total effect was partially explained by a direct effect of sex on cheerfulness, d = .20, se = .06, Z = 3.6. Indirect effects were mediated by extraversion, d = .27, se = .05, Z = 5.8, agreeableness d = .11, se = .03, Z = 3.6, and conscientiousness, d = .05, se = .02, Z = 3.2. However, neuroticism reduced the effect size by d = -.12, se = .03, Z = 4.4.
The effects of gender on depressiveness and cheerfulness produced corresponding differences in experiences of NA (sadness) and PA (happiness), without additional direct effects of gender on the sadness or happiness factors. The effect on happiness was a bit stronger, d = .35, se = .08, Z = 4.6 than the effect on sadness, d = .28, se = .07, Z = 4.1.
In conclusion, the results provide empirical support for Diener et al.’s hypothesis that sex differences in well-being are small because women have higher levels of positive affect and negative affect. The relatively large difference in neuroticism is also deceptive because neuroticism is not a direct predictor of well-being and gender differences in depressiveness are weaker than gender differences in neuroticism or anxiety. In the present sample, women also benefited from higher levels of agreeableness and conscientiousness that are linked to higher cheerfulness and lower depressiveness.
The present study also addresses concerns that self-report biases may distort gender differences in measures of affect and well-being (Diener et al., 1999). In the present study, well-being of mothers and fathers was not just measured by their self-reports, but also by students’ reports of their parents’ well-being. I have also asked students in my well-being course whether their mother or father has higher life-satisfaction. The answers show pretty much a 50:50 split. Thus, at least subjective well-being does not appear to differ substantially between men and women. This blog post showed a theoretical model that explains why men and women have similar levels of well-being.
This is Part 5 of the blog series on the monster model of well-being. The first parts developed a model of well-being that related life-satisfaction judgments to affect and domain satisfaction. I then added the Big Five personality traits to the model (Part 4). The model confirmed/replicated the key finding that neuroticism has the strongest relationship with life-satisfaction, b ~ .3. It also showed notable relationships with extraversion, agreeableness, and conscientiousness. The relationship with openness was practically zero. The key novel contribution of the monster model is to trace the effects of the Big Five personality traits on well-being. The results showed that neuroticism, extraversion, and agreeableness had broad effects on various life domains (top-down effects) that mediated the effect on global life-satisfaction (bottom-up effect). In contrast, conscientiousness was only instrumental for a few life domains.
The main goal of Part 5 is to examine the influence of personality traits at the level of personality facets. Various models of personality assume a hierarchy of traits. While there is considerable disagreement about the number of levels and the number of traits on each level, most models share a basic level of traits that correspond to traits in the everyday language (talkative, helpful, reliable, creative) and a higher-order level that represents covariations among basic traits. In the Five factor model, the Big Five traits are five independent higher-order traits. Costa and McCrae’s influential model of the Big Five recognizes six basic-level traits called facets for each of the Big Five traits. Relatively few studies have conducted a comprehensive examination of personality and well-being at the facet level (Schimmack, Oishi, Furr, & Funder, 2004). A key finding was that the depressiveness facet of neuroticism was the only facet with unique variance in the prediction of life-satisfaction. Similarly, the cheerfulness facet of extraversion was the only extraversion facet that predicted unique variance in life-satisfaction. Thus, the Mississauga family study included measures of these two facets in addition to the Big Five items.
In Part 5, I add these two facets to the monster model of well-being. Consistent with Big Five theory, I allowed for causal effects of Extraversion on Cheerfulness and from Neuroticism to Depressiveness. Strict hierarchical models would assume that each facet is related to only one broad factor. However, in reality basic-level traits can be related to multiple higher-order factors, but not much attention has been paid to secondary loadings of the depressiveness and cheerfulness facets on the other Big Five factors. In one study that controlled for evaluative bias, I found that depressiveness had a negative loading on conscientiousness (Schimmack, 2019). This relationship was confirmed in this dataset. However, additional relations improved model fit. Namely, cheerfulness was related to lower neuroticism and higher agreeableness and depressiveness was related to lower extraversion and agreeableness. Some of these relations were weak and might be spurious due to the use of short three-item scales to measure the Big Five.
The monster model combines two previous mediation models that link the Big Five personality traits to well-being. Schimmack, Diener, and Oishi (2002) proposed that affective experiences mediate the effects of extraversion and neuroticism. Schimmack, Oishi, Furr, and Funder (2004) suggested that the Depressiveness and Cheerfulness facets mediate the effects of Extraversion and Neuroticism. The monster model proposes that extraversion’s effect is mediated by trait cheerfulness which influences positive experiences, whereas neuroticism’s effect is mediated by trait depressiveness which in turn influences experiences of sadness.
When this model was fitted to the data, depressiveness and cheerfulness fully mediated the effect of extraversion and neuroticism. However, extraversion became a negative predictor of well-being. While it is possible that the unique aspects of extraversion that are not shared with cheerfulness have a negative effect on well-being, there is little evidence for such a negative relationship in the literature. Another possible explanation for this finding is that cheerfulness and positive affect (happy) share some method variance that inflates the correlation between these two factors. As a result, the indirect effect of extraversion is overestimated. When this shared method variance is fixed to zero and extraversion is allowed to have a direct effect, SEM will use the free parameter to compensate for the overestimation of the indirect path. The ability to model shared method variance is one of the advantages of SEM over mediation tests that rely on manifest variables and assume perfect measurement of constructs. Figure 1 shows the correlation between measures of trait PA (cheerfulness) and experienced PA (happy) as a curved arrow. A similar shared method effect was allowed for depressiveness and experienced sadness (sad), although it turned out be not significant.
Exploratory analysis showed that cheerfulness and depressiveness did not fully mediate all effects on well-being. Extraversion, agreeableness, and conscientiousness had additional direct relationships on some life-domains that contribute to well-being. The final model remained good overall fit and modification indices did not show notable additional relationships for the added constructs, chi2(1387) = 1914, CFI = .980, RMSEA = .017.
The standardized model indirect effects were used to quantify the effect of the facets on well-being and to quantify indirect and direct effects of the Big Five on well-being. The total effect of Depressiveness was b = -.47, Z = 8.8. About one-third of this effect was directly mediated by sadness, b = -.19. Follow-up research needs to examine how much of this relationship might be explained by risk factors for mood disorders as compared to normal levels of depressive moods. Valuable new insights can emerge from integrating the extensive literature on depression and life-satisfaction. The remaining effects were mediated by top-down effects of depressiveness on domain satisfactions (Payne & Schimmack, 2020). The present results show that it is important to control for these top-down effects in studies that examine the bottom-up effects of life domains on life-satisfaction.
The total effect of cheerfulness was as large as the effect of depressiveness, b = .44, Z = 6.6. Contrary to depressiveness, the indirect effect through happiness was weak, b = .02, Z = 0.6 because happy did not make a significant unique contribution to life-satisfaction. Thus, all of the effects were mediated by domain satisfaction.
In sum, the results for depressiveness and cheerfulness are consistent with integrated bottom-up-top-down models that postulate top-down effects of affective dispositions on domain satisfaction and bottom-up effects from domain satisfaction to life-satisfaction. The results are only partially consistent with models that assume affective experiences mediate the effect (Schimmack, Diener, & Oishi, 2002).
The effect of neuroticism on well-being, b = -.36, Z = 10.7, was fully mediated by depressiveness, b = -.28 and cheerfulness, b = -.08. Causality is implied by the assumption that neuroticism is a common cause of specific dispositions for anger, anxiety, depressiveness and other negative affects that is made in hierarchical models of personality traits. If this assumption were false, neuroticism would only be a correlate of well-being and it would be even more critical to focus on depressiveness as the more important personality trait related to well-being. Thus, future research on personality and well-being needs to pay more attention to the depressiveness facet of neuroticism. Too many short neuroticism measures focus exclusively or predominantly on anxiety.
Following Costa and McCrae (1980), extraversion has often been considered a second important personality trait that influences well-being. However, quantitatively the effect of extraversion on well-being is relatively small, especially in studies that control for shared method variance. The effect size for this sample was b = .12, a statistically small effect, and a much smaller effect than for its cheerfulness facets. The weak effect was a combination of a moderate positive effect mediated by cheerfulness, b = .32, and a negative effect that was mediated by direct effects of extraversion on domain satisfactions, b = -.23. These results show how important it is to examine the relationship between extraversion and well-being at the facet level. Whereas cheerfulness explains why extraversion has positive effects on well-being, the relationship of other facets with well-being require further investigation. The present results make it clear that a simple reason for positive relationships between extraversion and well-being is the cheerfulness facet. The finding that individuals with a cheerful disposition evaluate their lives more positively may not be surprising or may even appear to be trivial, but it would be a mistake to omit cheerfulness from a causal theory of well-being. Future research needs to uncover the determinants of individual differences in cheerfulness.
Agreeableness had a moderate effect on well-being, b = .21, Z = 5.8. Importantly, the positive effect of agreeableness was fully mediated by cheerfulness, b = .17 and depressiveness, b = .09, with a small negative direct effect on domain satisfactions, b = -.05, which was due to lower work satisfaction for individuals high in agreeableness. These results replicate Schimmack et al.’s (2004) findings that agreeableness was not a predictor of life-satisfaction, when cheerfulness and depressiveness were added to the model. This finding has important implications for theories of well-being that see a relationship between morality, empathy, and prosociality and well-being. The present results do not support this interpretation of the relationship between agreeableness and well-being. The results also show the importance of taking second order relationships more seriously. Hierarchical models consider agreeableness to be unrelated to cheerfulness and depressiveness, but simple hierarchical models do not fit actual data. Finally, it is important to examine the causal relationship between agreeableness and affective facets. It is possible that cheerfulness influences agreeableness rather than agreeableness influencing cheerfulness. In this case, agreeableness would be a predictor but not a cause of higher well-being. However, it is also possible that an agreeable disposition contributes to a cheerful disposition because agreeableness people may be more easily satisfied with reality. In any case, future studies of agreeableness and related traits and well-being need to take potential relationships with cheerfulness and depressiveness into account.
Conscientiousness also has a moderate effect on well-being, b = .19, Z = 5.9. A large portion of this effect is mediated by the Depressiveness facet of Neuroticism, b = .15. Although a potential link between Conscientiousness and Depressiveness is often omitted from hierarchical models of personality, neuropsychological research is consistent with the idea that conscientiousness may help to regulate negative affective experiences. Thus, this relationship deserves more attention in future research. If causality were reversed, conscientiousness would have only a trivial causal effect on well-being.
In short, adding cheerfulness and depressiveness facets to the model provided several new insights. First of all, the results replicated prior findings that these two facets are strong predictors of well-being. Second, the results showed that Big Five predictors are only weak unique predictors of well-being when their relationship with Cheerfulness and Depressiveness is taken into account. Omitting these important predictors from theories of well-being is a major problem of studies that focus on personality traits at the Big Five level. It also makes theoretical sense that cheerfulness and depressiveness are related to well-being. These traits influence the emotional evaluation of people’s lives. Thus, even when objective life circumstances are the same, a cheerful individual is likely to look at the bright side and see the their lives with rose colored glasses. In contrast, depression is likely to color live evaluations negatively. Longitudinal studies confirm that depressive symptoms, positive affect, and negative affect are influenced by stable traits (Anusic & Schimmack, 2016; Desai et al., 2012). Furthermore, twin studies show that shared genes contribute to the correlation between life-satisfaction judgments and depressive symptoms (Nes et al., 2013). Future research needs to examine the biopsychosocial factors that cause stable variation in dispositional cheerfulness and depressiveness that contribute to individual differences in well-being.
This is a preprint (not yet submitted to a journal) of a manuscript that examines the validity of the race IAT as a measure of in-group and out-group attitudes for African and White Americans. We show that research on intergroup relationships and attitudes benefits from insights (insights by means of being inside the experience) by African Americans that are often ignored by White psychologists. Data and Syntax are here (https://osf.io/rvfz8/)
The Race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes Towards Their In-Group
Ulrich Schimmack University of Toronto Mississauga
Explicit ratings of attitudes show a preference for the in-group for African Americans and White participants. However, the average score of African Americans on the race Implicit Association Test is close to zero. This finding has been interpreted as evidence that many African Americans have unconsciously internalized negative attitudes towards their group. We conducted a multi-method study of this hypothesis with various implicit measures (Single-Target IAT, Evaluative Priming, Affective Misattribution Procedure) that distinguish between in-group and out-group attitudes. Our main finding is that African Americans have positive attitudes towards their in-group on a latent factor that reflects the valid variance across measures. In addition, the race IAT scores of African Americans are unrelated to in-group and out-group attitudes. Moreover, White American’s race IAT scores are biased and exaggerate in-group preferences. These findings are discussed in terms of the unique aspects of the race IAT that may activate cultural stereotypes. The results have ethical implications for the practice of providing individuals with feedback about their unconscious biases with an invalid measure. It is harmful to African Americans to suggest that they unconsciously dislike African Americans and to exaggerate prejudice of White Americans. Ongoing discrimination may be better explained by explicit prejudice of a minority of White Americans than pervasive, uncontrollable implicit biases of most White Americans.
With 1,277 citations in WebOfScience, Jost, Banaji, and Nosek’s (2004) article “A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo” is easily the most cited article in the journal Political Psychology. The second most cited article has less than half the number of citations (523 citations). The abstract of this influential article states the authors’ main thesis clearly and succinctly. They postulate a general motive to support the existing social order. This motive contributes to internalization of inferiority of disadvantaged groups. Most important for this article is the claim that this internalization of inferiority is “observed most readily at an implicit, nonconscious level of awareness” (p. 881).
The theory is broadly applied to a wide range of stigmatized groups and its validity has to be evaluated for each group individually. Our focus is on the African American community. Jost et al. (2004) assume that system justification theory is applicable to African Americans because they show different evaluations of their in-group on explicit measures and on the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). On explicit measures, like the feeling thermometer, African Americans show higher in-group favoritism than White Americans (standardized mean differences d = .8 vs. .6). However, IAT scores show greater in-group favoritism for White Americans than for African Americans (d = .9 vs. 0). IAT scores close to zero for African Americans have been interpreted as evidence that “sizable proportions of members of disadvantaged groups – often 40% to 50% or even more exhibit implicit (or indirect) biases against their own group and in favor of more advantaged groups” (Jost, 2019, p. 277).
This pattern of results is based on large samples and has been replicated in several studies. Thus, we are not questioning the empirical facts. Our concern is that Jost and colleagues misinterpret these results. In the early 2000s, it was common to assume that explicit and implicit group evaluations reflect different constructs (Nosek, Greenwald, & Banaji, 2005). This dual-attitude model allows for different evaluations of the in-group at a conscious and an unconscious level. Evidence for this model rested mostly on the finding that race IAT scores and self-ratings are only weekly correlated, r ~ .2 (Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005). However, these studies did not correct for measurement error. After correcting for measurement error, the correlation increases to r = .8 (Schimmack, 2021a). The race IAT also has little incremental predictive validity over explicit measures (Schimmack, 2021b). This new evidence renders it less likely that explicit and implicit attitudes can diverge. In fact, there exists no evidence that attitudes are hidden from consciousness. Thus, there may be an alternative explanation for African Americans’ scores on the race IAT.
White Psychologists’ Theorizing about African Americans
Before we propose an alternative explanation for African Americans’ neutral scores on the race IAT, we would like to make the observation that Jost et al.’s (2004) claims about African Americans follow a long tradition of psychological research on African Americans by mostly White psychologists. Often this research ignores the lived experience of African Americans, which often leads to false claims (cf. Adams, 2010). For example, since the beginning of psychology, White psychologists assumed that African Americans have low self-esteem and proposed several theories for this seemingly obvious fact. However, in 1986 Rosenberg ironically pointed out that “everything stands solidly in support of this conclusion except the facts.” Since then, decades of research have shown that African Americans have the same or even higher self-esteem than White Americans (Twenge & Crocker, 2002). Just like White theorists’ claims about self-esteem, Jost et al.’s claims about African Americans’ unconscious are removed from African Americans’ own understanding of their culture and identity and disconnected from other findings that are in conflict with the theory’s predictions. The only empirical support for the theory is the neutral score of African Americans on the race IAT.
African American’s Resilience in a Culture of Oppression
We are skeptical about the claim that most African-Americans secretly favor the out-group based on the lived experience of the second author. Alicia Howard is an African-American from a predominantly White, small town in Kentucky. She grew up surrounded by a large family and attended a Black church. Her identity was shaped by role-models from this Black in-group and not by some idealized abstract image of the White out-group. Also, contrary to the famous doll-studies from the 1960s, she had White and Black dolls and got excited when a new Black doll came out. Alicia studied classical music at the historically Black college and university Kentucky State University. Even though her admired composers like Rachmaninov were White, she looked up to Black classical musicians like Andre Watts, Kathleen Battle, Leontyne Price, and Jesse Norman as role models. It is of course possible that her experiences are unique and not representative of African-Americans. However, no one in her family or among her Black friends showed signs that they preferred to be White or liked White people more than Black people. In small towns, the lives of Black and White people are also more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group. Although there are Black individuals who seem to struggle with their Black identity, there are also White people who suffer from White guilt or assume a Black identity for other reasons. Thus, from an African American perspective, system justification theory does not seem to characterize most African Americans’ attitudes to their in-group.
The Race IAT Could Be Biased
We are not the first to note that the race IAT may not be a pure measure of attitudes (Olson & Fazio, 2004). The nature of the task may activate cultural stereotypes that are normally not activated when African Americans interact with each other. As a result, the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans. Thus, an alternative explanation for the greater in-group bias for White Americans than for African Americans on the race IAT is that attitudes and cultural stereotypes act together for White Americans, whereas they act in opposite directions for African Americans.
One way to test this hypothesis is to examine in-group biases with alternative implicit measures that do not activate stereotypes. The most widely used alternative implicit measures are the Affective Misattribution Procedure (AMP; Payne, Cheng, Govorun, & Stewart, 2005) and the evaluative priming task (EPT, Fazio, Jackson, Dunton, & Williams, 2005). Only recently it has been noted that these implicit measures produce different results (Teige-Mocigemba, Becker, Sherman, Reichardt, & Klauer, 2017). A study in the United States, examined the differences between African American and White respondents on three implicit measures (Figure 1, Bar-Anan & Nosek, 2014).
Known-group differences are much more pronounced for the race IAT than the other two implicit tasks. The authors interpret this finding as evidence that the race IAT has higher validity. That is, under the assumption that (mostly) White participants have a strong preference for their in-group, a positive mean is predicted, and the more positive the mean is, the more valid a measure is. However, alternative explanations are possible. One alternative explanation is that only the race IAT activates cultural stereotypes and produces a high pro-White mean as a result. In contrast, the other tasks are better measures of attitudes and the results show that prejudice is much less pronounced than the race IAT suggests. That is, the race IAT is biased because it activates cultural stereotypes that are not automatically activated with other implicit tasks.
Another limitation of the race IAT is that preferences for the in-group and the out-group are confounded. In contrast, the other two tasks can be scored separately to obtain measures of the strength of preferences for the in-group and the out-group. This is particularly helpful to make sense of the neutral score of African Americans on the race IAT. One explanation for a weaker in-group bias is simply that African Americans are less biased against the out-group than White Americans. Thus, a better test of African Americans’ attitudes towards their own group is to examine how positive or negative African American’s responses are to African American stimuli.
In short, published studies reveal that different implicit tasks produce different results and that the race IAT shows stronger pro-White biases than other tasks. However, it has not been systematically explored whether this finding reveals higher or lower validity of the race IAT. We used Bar-Anan and Nosek’s (2014) data to explore this question.
The data are based on a voluntary online sample. The total sample size is large (N = 23,413). However, participants completed only some of the tasks that included implicit measures of political orientation and self-esteem. Table 1 shows the number of African American and White participants for six measures.
Race IAT. The race IAT is the standard Implicit Association Test, although the specific stimuli that represent the African American group and the White American group were different. However, this does not appear to have influenced responses as seen by similar means for African American and White American participants. The race IAT was scored so that higher values represented a pro-White bias for White participants and a pro-Black bias for Black participants.
Single Target IAT. The single-target IAT (ST-IAT) is a variation of the race IAT. The main difference is that participants only have to classify one racial group along with classifications of positive and negative stimuli. As a result, the ST-IAT reflects only evaluations of one group and provides distinct information about evaluations of the in-group and out-group. It is particularly interesting how Black participants perform on the in-group ST-IAT with Black targets. System justification theory predicts a score close to zero that would reflect an over all neutral attitude and at least 50% of participants who may hold negative views of the in-group.
Evaluative Priming Task. The Evaluative Priming Task (EPT) was developed by Fazio et al. (1995). In a practice block, participants classified words as “good” or “bad.” In the next three blocks, target stimuli were primed with pictures of African American and White Americans. In-group bias was the response time to same-group primes for negative words minus response times to same-group primes for positive words. Out-group bias was the response time to other-group primes for negative words minus response times to other-group primes for positive words.
Affective Misattribution Procedure. The Affective Misattribution was invented by Payne et al. (2005). Pictures of African Americans or White Americans are quickly followed by a Chinese character and a mask. Participants are instructed to rate the Chinese character as more or less pleasant than the average Chinese character. They were instructed not to let the pictures influence their evaluation of the target stimuli. The in-group score was the percentage of more pleasant responses after an in-group picture. The out-group score was the percentage of more pleasant responses after an out-group picture.
Feeling Thermometer. Self-reports of in-group and out-group attitudes were measured with feeling thermometers. Participants rated how warm or cold they feel toward the in-group and the out-group on an 11-point scale ranging from 0 = coldest feelings to 10 = warmest feelings.
For all measures, participants scores were divided by the standard deviation so that means can be interpreted as standardized effect sizes assuming that a mean of zero reflects a neutral attitude, positive scores reflect positive attitudes, and negative scores reflect negative attitudes.
The data were analyzed using structural equation modeling with MPLUS8.2 (Muthen & Muthen (2017), A multi-group model was specified with African Americans and White Americans as separate groups. The model was developed iteratively using the data. Thus, all results are exploratory and require validation in a separate sample. Due to the small number of Black participants, it was not possible to cross-validate the model with half of the sample. Moreover, tests of group differences have low power and a study with a larger sample of African Americans is needed to test equivalence of parameters. Cherry picking of data, models, and references undermines psychological science. To avoid this problem, we also constructed a model that assumes some implicit measures are biased and inflate in-group attitudes of African Americans. To identify the means of the latent in-group and out-group factors, we chose the single-target IAT because it shows the least positive attitudes of African Americans towards their in-group. We then freed other parameters to maximize model fit. We then freed other parameters to maximize model fit. The data, input syntax, and the full outputs have been posted online (https://osf.io/rvfz8/).
Overall fit of the final model meets standard fit criteria (RMSEA < .06, CFI > .95), CFI (78) = 133.37, RMSEA = .012, 90%CI = .009 to .016, CFI = .981. However, models with low coverage (many missing data) may overestimate model fit. A follow-up study that administers all tasks to all participants should be conducted to provide a stronger test of the model. Nevertheless, the model is parsimonious and there were no modification indices greater than 20. This suggests that there are no major discrepancies between the model and the data.
Figure 2 shows a measurement of attitudes towards the in-group and out-group. The key unobserved variables in this model are the attitude towards the in-group factor (ig) and the attitude towards the out-group factor (og). Each construct is measured with four indicators, namely scores on the single-target IAT (satig/satog), scores on the evaluative priming task (epig, epog), scores on the affective misattribution procedure (ampig/ampog), and scores on the explicit feeling thermometer ratings (thermoig/thermoog). For ease of interpretation, Figure 2 shows standardized coefficients that range from -1 to 1.
The first finding is that loadings of the measures on the IG factor (.3-.4) and on the outgroup factor (.4) are modest. They suggest that less than 20% of the variance in a single measure is valid variance. However, the model clearly identified latent factors that show individual differences in attitudes towards in-group and out-group for Black and White Americans. The second noteworthy finding is that loadings for African Americans and White Americans were similar. Thus, the multi-method measurement model was able to identify variation in in-group and out-group attitudes for both groups.
A third finding is that for White participants.54^2 = 29% of the variance in race IAT reflects attitudes towards African Americans (i.e., prejudice). This is a bit higher than previous estimates, which were in the 10% to 20% range (Schimmack, 2021). However, the lower limit of the 95%CI overlapped with this range of possible values, .43^2 = 18%.
Most important is the finding that race IAT scores for African Americans were unrelated to the attitudes towards the in-group and out-group factors. Thus, scores on the race IAT do not appear to be valid measures of African Americans’ attitudes. This finding has important implications for Jost et al.’s (2021) reliance on race IAT scores to make inferences about African Americans’ unconscious attitudes towards their in-group. This interpretation assumed that race IAT scores do provide valid information about African American’s attitudes towards the in-group, but no evidence for this assumption was provided. The present results show 20 years later that this fundamental assumption is wrong. The race-IAT does not provide information about African Americans’ attitudes towards the in-group as reflected in other implicit measures.
An additional interesting finding was that in-group and out-group attitudes were unrelated. This suggests that prejudice does not enhance pro-White attitudes for White participants. It also suggests that Black pride does not have to devalue the White outgroup.
Finally, the model shows that three methods show strong method variance. All three methods measured in-group and out-group attitudes within a single experimental block. The main difference is the single-target IAT that is conducted once with one target (Black) and once with the other target (White). Separating the assessment of in-group and out-group attitudes for the other tasks might reduce the amount of systematic measurement error. However, less systematic measurement error does not seem to translate into more valid variance as the single-target IAT was not more valid than the other measures. The results for the commonly used feeling thermometer are particularly noteworthy. While this measure shows some modest validity, the present results also show that this single-item measure has poor psychometric properties. An important goal for future research is to develop more valid measures of attitudes towards in-groups and out-groups. Until then, researchers should use a multi-method approach.
Figure 3 shows the model for the means. While standardized coefficients are easier to interpret for the measurement model, means are easier to interpret in the units of the measures, which were scaled so that means can be interpreted as Cohen’s d values.
The most important finding is that African Americans’ mean for the in-group factor is positive, d = 1.07, 95%CI = 0.98 to 1.16. Thus, the data provide no support for the claim that most African Americans evaluate their in-group negatively. With a normal distribution centered at 1.07, only 14% of African Americans would have a negative (below 0) attitude towards the in-group. White Americans also show a positive evaluation of the in-group, but to a lesser extent, d = 0.62; 95%CI = 0.58, 0.66. The confidence intervals are tight and clearly do not overlap, and constraining these two coefficients to be equal reduced model fit, chi2(79) = 228.43, Δchi2(1) = 95.06, p = 1.85e-22. Thus, this model suggests that African Americans have an even more positive attitude towards their in-group than White Americans.
As expected, out-group attitudes are less positive than in-group attitudes for both groups. Also expected was the finding that out-group attitudes of African Americans, d = .42, 95%CI , are more favorable than out-group attitudes of White Americans, d = .20, 95%CI. However, even White Americans’ out-group attitudes are on average positive. This finding is in marked contrast to the common finding with the race IAT that most White Americans show a pronounced pro-White bias, which has often been interpreted as evidence of widespread prejudice. However, this interpretation is problematic for two reasons. First, it confounds in-group and out-group attitudes. Prejudice is defined as White American’s attitude towards African Americans. The race IAT is not a direct measure of prejudice because it measures relative preferences. Of course, in-group favoritism alone can lead to discrimination and racial disparities when one group is dominant, but these consequences can occur without actual prejudice against African Americans. The present results suggest that African American also have an in-group bias. Thus, it is important to distinguish between in-group favoritism, which applies to both groups, from prejudice which applies uniquely to White Americans towards African Americans.
The bigger problem for the race IAT is that White Americans’ scores on the race IAT are systematically biased towards a pro-White score, d = .78, whereas African Americans’ scores are only slightly biased towards a pro-Black score, d = -.19. This finding shows that IAT scores provide misleading information about the amount of in-group favoritism. Thus, support for the system justification theory rests on a measurement artifact.
It is possible that our modeling decisions exaggerated the positivity of African Americans’ in-group attitudes. To address this concern, we tried to find an alternative model that fits the data with the lowest amount of African American’s in-group bias. This alternative model fit the data as well as our preferred model, CFI (77) = 134.24, RMSEA = .013, 90%CI = .009 to .016, CFI = .980. Thus, the data cannot distinguish between these two models. The covariance structure was identical. Thus, we only present the means structure of the model (Figure 4).
The main difference between the models is that African Americans’ attitudes towards the ingroup are less favorable (d = 1.07 vs. d = .54). The discrepancy is explained by the assumptions that African Americans have a positive bias on the feeling-thermometer and by assuming that African Americans’ responses to White targets on the AMP are negatively biased (ampog = -.72). The most important finding is that African Americans’ in-group attitudes remain positive, d = .54, although they are now slightly less favorable than White Americans’ in-group attitudes, d = .62.
Proponents of system justification theory might argue that attitudes towards the in-group have to be evaluated in relative terms. Viewed from this perspective, the results still show relatively more in-group favoritism for White Americans, d = .62 – .20 = .42 than African Americans, d = .54 – .40 = .14. However, out-group attitudes contribute more to this difference, d = .40 = .20 = .20, than in-group differences, d = .62 – .54 = .08. Thus, one reason for the difference in relative preferences is that African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans. It would be a mistake to interpret this difference in evaluations of the out-group as evidence that African Americans have internalized negative stereotypes about their in-group.
The alternative model does not alter the fact that scores on the race IAT are biased and provide misleading information about in-group and out-group attitudes.
After its introduction in 1998, the Implicit Association Test has been quickly accepted as a valid measure of attitudes that individuals are unwilling or unable to report on self-report measures. Mean scores of White Americans were interpreted as evidence that prejudice is much more widespread and severe than self-report measures suggest. Mean scores of African Americans were interpreted as evidence of unconscious self-loathing. The present results suggest that millions of African American and White visitors of the Project Implicit website were given false feedback about their attitudes. For White Americans, the race IAT does appear to reflect individual differences in out-group attitudes (prejudice). However, the scoring of the IAT in terms of deviations from a value of zero is invalid because the mean is biased towards pro-White scores. Even the amount of valid variation is modest and insufficient to provide individualized feedback.
Implications for African American’s In-Group and Out-Group Attitudes
Our investigation started with the surprising suggestions that African Americans are motivated to justify racism and are supposed to have internalized negative stereotypes and attitudes towards their group. This view of African Americans is detached from their history and evidence of high self-esteem among African Americans. The only evidence for this claim was the finding that African Americans do not show a strong in-group preference on the race IAT.
Our results suggest that this finding is due to the low validity of the race IAT as a measure of African Americans’ attitudes. African American’s race IAT scores were unrelated to their in-group attitudes and out-group attitudes as measured by other measures, including the single-target variant of the IAT.
This raises the question in which way the race IAT differs from other measures. We are not the first to suggest that the race IAT activates negative cultural stereotypes (Olson & Fazio, 2004). These stereotypes are known to African Americans and may influence their performance on the IAT, even if African Americans do not endorse these stereotypes and these stereotypes are rarely activated in real life. Thus, the mean close to zero may not reflect the fact that 50% of African Americans have negative attitudes towards their group. Rather, it is possible that the neutral score reflects a balanced influence of positive attitudes and negative stereotypes.
Another noteworthy difference between other implicit tasks and the race IAT is that other tasks rely on pictures of individual members to elicit a valenced response. In contrast, the race IAT focuses on the evaluation of the abstract category “Black.” It is possible that African Americans have more positive attitudes to (pictures of) members of the group than to the concept of being “Black,” which is a fuzzy category at best. Similarly, old people seem to have a negative attitude to the concept of being “old,” but this does not imply that they do not like old people. This has important implications for the predictive validity of the IAT. In everyday life, we encounter individuals and not abstract categories. Thus, even if the race IAT were a valid measure of attitudes towards abstract categories, it would be a weak predictor of actual behaviors.
In sum, the only empirical support for system justification theory was African Americans’ neutral score on the race IAT. We show that the race IAT lacks validity and that African Americans have positive attitudes towards their in-group on all other measures. We also find that they have positive attitudes towards the White outgroup. This has important implications for the assessment of racial attitudes of White participants. If most White participants have negative attitudes towards Black people and these attitudes consistently influence White Americans behaviors, African Americans would experience discrimination from most White Americans. In this case, we would expect negative attitudes towards the out-group. As the data show, this is not the case. This does not mean that discrimination is rare. Rather, it is possible that most acts of discrimination are committed by a relatively small group of White Americans (Campbell & Brauer, 2021).
Implications for White American’s In-Group and Out-Group Attitudes
Banaji and Greenwald’s (2013) popular book was largely responsible for claims that implicit bias is real, widespread, and explains racial discrimination. The book ends with several conclusions. Two conclusions are widely accepted among social psychologists and a majority of US Americans, namely Black disadvantage exists and racial discrimination at least partially contributes to this disadvantage. However, other conclusions were not generally accepted and were not clearly supported by evidence, namely attitudes have both reflective and automatic form, people are often unaware of their automatic attitudes, and implicit bias is pervasive, and implicit racial attitudes contribute to discrimination against Black Americans. The claim that implicit biases are widespread was based entirely on the finding that 75% of US Americans show a clear pro-White bias on the race IAT. The present results suggest that this finding is unique to the race IAT and not found with other implicit measures.
Once more, we are not the first to point out that scoring of the race IAT may have exaggerated the pervasiveness of racial biases among White Americans (Blanton et al., 2006, 2009, 2015; Oswald et al., 2013, 2015). However, so far this criticism has fallen on deaf ears and Project Implicit continues to provide individuals with feedback about their race IAT scores. Textbooks proudly point out that over 20 million people have received this feedback, as if this number says something about the validity of the test (Myers & Twenge, 2019).
When visitors might see a discrepancy between their self-views and the test scores, they are informed that this does not invalidate the test because it measures something that is hidden from self-knowledge. The present results suggest that many visitors of the Project Implicit website were given false feedback about their prejudices because even individuals without any negative attitudes towards African Americans end up with a pro-White bias on the race IAT.
This bias can co-exist with evidence that variation in race IAT scores shows some convergent validity with other explicit and implicit measures of individual differences in attitudes towards African Americans. However, variances and means are two independent statistical constructs, and valid variance does not imply that means are valid. Nosek and Bar-Anan (2014) argued that the race IAT is the most valid measure of attitudes because it shows the largest differences in scores between African Americans and White Americans. However, this argument is only valid, if we assume that random measurement error attenuates the differences on other measures. The present study directly tested this assumption and found no evidence for the assumption. Instead, we found that the larger differences between African Americans and White Americans reflects some systematic mean differences that are unique to the race IAT. As noted earlier, a plausible explanation for this systematic bias is that the race IAT activates stereotypes, whereas other measures are purer measures of attitudes.
We hope that our direct demonstration of bias will finally end the practice of providing visitors of the Project Implicit website with misleading information about the validity of the race IAT and misleading information about individuals’ prejudice. There is simply no evidence that prejudice is hidden from honest self-reflection or that such hidden biases are revealed by the race IAT (Schimmack, 2021).
Implications for Future Research
Although our article focuses on the race IAT, the results also have implications for the use and interpretation of the other measures. One advantage of the other measures is that they provide separate information about in-group and out-group attitudes because they avoid the pitting of one group against the other. However, these measures have other problems. Fast reactions to pictures of African Americans and White Americans reflect only first impressions without context. They are also influenced by affective reactions to other aspects such as gender, age, or attractiveness. Thus, these scores may not reflect other aspects of attitudes that are activated in specific contexts. Moreover, the means will depend heavily on the selection of individual pictures. Thus, a lot more work would need to be done to ensure that the picture sets are representative of the whole group. Finally, our results showed that none of the measures had high loadings on the attitude factors. Thus, a single measure has only modest validity.
Unfortunately, psychologists often do not carefully examine the psychometric properties of their measures. Instead, one measure is often arbitrarily chosen and treated as if it were a perfect measure of a construct. Even worse, a specific measure may be chosen from a set of measures because it showed the desired result (John, Loewenstein, & Prelec, 2012). To avoid these problems, we strongly urge intergroup relationship researchers to use a multi-method approach and to use formal measurement models to analyze their data (Schimmack, 2021). This approach will also produce better estimates of effect sizes that are attenuated by random and systematic measurement error.
Adams, P. E. (2010). Understanding the Different Realities, Experience, and Use of Self-Esteem Between Black and White Adolescent Girls. Journal of Black Psychology, 36(3), 255–276. https://doi.org/10.1177/0095798410361454
Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden biases of good people. New York, NY: Delacorte Press.
Blanton, H., Jaccard, J., Gonzales, P. M., & Christie, C. (2006). Decoding the implicit association test: Implications for criterion prediction. Journal of Experimental Social Psychology, 42(2), 192–212. https://doi.org/10.1016/j.jesp.2005.07.003
Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567–582.
Blanton, H., Jaccard, J., Strauts, E., Mitchell, G., & Tetlock, P. E. (2015). Toward a meaningful metric of implicit prejudice. Journal of Applied Psychology, 100(5), 1468–1481. https://doi.org/10.1037/a0038379
Campbell, M. R., & Brauer, M. (2021). Is discrimination widespread? Testing assumptions about bias on a university campus. Journal of Experimental Psychology: General, 150(4), 756–777. https://doi.org/10.1037/xge0000983
Fazio, R. H., Jackson, J. R., Dunton, B. C., & Williams, C. J. (1995). Variability in automatic activation as an unobtrusive measure of racial attitudes: A bona fide pipeline? Journal of Personality and Social Psychology, 69(6), 1013–1027. https://doi.org/10.1037/0022-3518.104.22.1683
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Jost, J. T. (2019). A quarter century of system justification theory: Questions, answers, criticisms, and societal applications. British Journal of Social Psychology, 58(2), 263–314. https://doi.org/10.1111/bjso.12297
Jost, J. T., Banaji, M. R., & Nosek, B. A. (2004). A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo. Political Psychology, 25(6), 881–919. https://doi.org/10.1111/j.1467-9221.2004.00402.x
Hofmann, W., Gawronski, B., Geschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369–1385. doi:10.1177/0146167205275613
Muthén, L.K. and Muthén, B.O. (1998-2017). Mplus User’s Guide. Eighth Edition. Los Angeles, CA: Muthén & Muthén
Myers, D. & Twenge, J. (2019). Social psychology (13th edition). McGraw Hill.
Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2005). Understanding and Using the Implicit Association Test: II. Method Variables and Construct Validity. Personality and Social Psychology Bulletin, 31(2), 166–180. https://doi.org/10.1177/0146167204271418
Olson, M. A., & Fazio, R. H. (2004). Reducing the Influence of Extrapersonal Associations on the Implicit Association Test: Personalizing the IAT. Journal of Personality and Social Psychology, 86(5), 653–667. https://doi.org/10.1037/0022-3522.214.171.1243
Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171–192. https://doi.org/10.1037/a0032734
Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2015). Using the IAT to predict ethnic and racial discrimination: Small effect sizes of unknown societal significance. Journal of Personality and Social Psychology, 108(4), 562–571. https://doi.org/10.1037/pspa0000023
Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89(3), 277–293. https://doi.org/10.1037/0022-35126.96.36.1997
Rosenberg, M. (1986). Conceiving the self. Malabar, FL: Robert E. Krieger.
Schimmack, U. (2021). Invalid Claims About the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm. Perspectives on Psychological Science, 16(2), 435–442. https://doi.org/10.1177/1745691621991860
Teige-Mocigemba, S., Becker, M., Sherman, J. W., Reichardt, R., & Christoph Klauer, K. (2017). The affect misattribution procedure: In search of prejudice effects. Experimental Psychology, 64(3), 215–230. https://doi.org/10.1027/1618-3169/a000364
Twenge, J. M., & Crocker, J. (2002). Race and self-esteem: Meta-analyses comparing Whites, Blacks, Hispanics, Asians, and American Indians and comment on Gray-Little and Hafdahl (2000). Psychological Bulletin, 128(3), 371–408. https://doi.org/10.1037/0033-2909.128.3.371
This is part 4 in a mini-series of blogs that illustrate the usefulness of structural equation modeling to test causal models of well-being. The first causal model of well-being was introduced in 1980 by Costa and McCrae. Although hundreds of studies have examined correlates of well-being since then, hardly any progress has been made in theory development. In 1984, Diener (1984) distinguished between top-down and bottom-up theories of well-being, but empirical tests of the different models have not settled this issue. The monster model is a first attempt to develop a causal model of well-being that corrects for measurement error and fits empirical data.
The first part (Part1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life-domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience, whereas positive affect had no direct effect on life-evaluations. In contrast, sadness had a unique negative effect on life-evaluations that was not mediated by life domains.
Part 3 added extraversion to the model. This was a first step towards a test of Costa and McCrae’s assumption that extraversion has a direct effect on positive affect (happiness) and no effect on negative affect (sadness). Without life domains in the model, the results replicated Costa and McCrae’s (1980) results. Yes, personality psychology has replicable findings. However, when domain satisfactions were added to the model, the story changed. Costa and McCrae (1980) assumed that extraversion increases well-being because it has a direct effect on cheerfulness (positive affect) that adds to well-being. However, in the new model, the effect of extraversion on life-satisfaction was mediated by life domains rather than positive affect. The strongest mediation was found for romantic satisfaction. Extraverts tended to have higher romantic satisfaction and romantic satisfaction contributed significantly to overall life-satisfaction. Other domains like recreation and work are also possible mediators, but the sample size was too small to produce more conclusive evidence.
Part 4 is a simple extension of the model in part 3 by adding the other personality dimensions to the model. I start with neuroticism because it is by far the most consistent and strongest predictor of well-being. Costa and McCrae (1980) assumed that neuroticism is a general disposition to experience more negative affect without any relation to positive affect. However, most studies show that neuroticism has a negative relationship with positive aspect as well, although it is not as strong as the relationship with negative affect. Moreover, neuroticism is also related to lower satisfaction in many life domains. Thus, the model simply allowed for neuroticism to be a predictor of both affects and all domain satisfaction. The only assumption made by this model is that the negative effect of neuroticism on life-satisfaction is fully mediated by domain satisfaction and affect.
Figure 1 shows the model and the path coefficients for neuroticism. The first important finding is that neuroticism has a strong direct effect on sadness that is independent of satisfaction with various life domains. This finding suggests that neuroticism may have a direct effect on individuals’ mood rather than interacting with situational factors that are unique to individual life domains. The second finding is that neuroticism has sizeable effects on all life domains ranging from b = -.19 for satisfaction with housing to -31 for satisfaction with friendships.
Following the various paths from neuroticism to life-satisfaction produces a total effect of b = -.38, which confirms the strong negative effect of neuroticism on well-being. About a quarter of this effect is directly mediated by negative affect (sadness), b = -.09. The rest is mediated by the top-down effect of neuroticism on satisfaction with life domains and the bottom-up effect of life domains on global life-evaluations.
McCrae and Costa (1991) expanded their model to include the other Big Five factors. They proposed that agreeableness has a positive influence on well-being that is mediated by romantic satisfaction (adding Liebe) and that conscientiousness has a positive influence on well-being that is mediated by work satisfaction (adding Arbeit). Although this proposal was made three decades ago, it has never been seriously tested because few studies measure domain satisfaction (but see Heller et al., 2004).
To test these hypotheses, I added conscientiousness and agreeableness to the model. Adding both together was necessary because agreeableness and conscientiousness were correlated as reflected in a large modification index when the two factors were assumed to be independent. This does not mean that agreeableness and conscientiousness are correlated factors, an issue that is debated among personality psychologists (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). One problem is that secondary loadings can produce spurious correlations among scale scores that were used for this model. This could be examined by using a more complex item-level model in the future. For now, agreeableness and conscientiousness were allowed to correlate. The results showed no direct effects of conscientiousness on PA, NA, and LS. In contrast, agreeableness was a positive predictor of PA and a negative predictor of NA. Most important are the relationships with domain satisfactions.
Confirming McCrae and Costa’s (1991) prediction, work satisfaction was predicted by conscientiousness, b = .21, z = 3.4. Also confirming McCrae and Costa, romantic satisfaction was predicted by agreeableness, although the effect size was small, b = .13, z = 2.9. Moreover, conscientiousness was an even stronger predictor, b =.28, z = 6.0. This confirms the old saying “marriage is work.” Also not predicted by McCrae and Costa was that conscientiousness is related to higher housing satisfaction, b = .20, z = 3.7, presumably because conscientious individuals take better care of their houses. The other domains were not significantly related to conscientiousness, |b| < .1.
Also not predicted by McCrae and Costa are additional relationships of agreeableness with other domains such as health, b = .18, z = 3.7, housing, a = .17, z = 2.9, recreation, b = .25, z = 4.0, and friendships, b = .35, z = 5.9. The only domains that were not predicted by agreeableness were financial satisfaction, b = .05, z = 0.8, and work satisfaction, b = .07, z = 1.3. Some of these relationships could reflects benefits for social relationships aside from romantic relationships. Thus, the results are broadly consistent with McCrae and Costa’s assumption that agreeableness is beneficial for well-being.
The total effect of agreeableness in this dataset was b = .21, z = 4.34. All of this effect was mediated by indirect paths, but only the path through romantic satisfaction achieved statistical significance due to a lack of power, b = .03, z = 2.6.
The total effect of conscientiousness was b = .18, z = 4.14. Three indirect paths were significant, namely work, b = .06, z = 3.3. romantic satisfaction, b = .06, z = 4.2, and housing satisfaction, b = .04, z = 2.51.
Overall, these results confirm previous findings that agreeableness and conscientiousness are also positive predictors of well-being and shed first evidence on potential mediators of these relationships. These results need to be replicated in datasets from other populations.
When openness was added to the model, a modification index suggested a correlation between extraversion and openness, which has been found in several multi-method studies (Anusic et al., 2009; DeYoung, 2006). Thus, the two factors were allowed to correlate. Openness had no direct effects on positive affect, negative affect, or life-satisfaction. Moreover, there were only two, weak, just significant relationships with domain satisfaction for work, b = .12, z = 2.0, and health, b = .12, z = 2.2. Consistent with meta-analysis, the total effect is negligible, b = .06, z = 1.3. In short, the results are consistent with previous studies and show that openness is not a predictor of higher or lower well-being. To keep the model simple, it is therefore possible to omit openness from the monster model.
At this point, we have built a complex, but plausible model that links personality traits to subjective well-being by means of domain satisfaction and affect. However, just because this model is plausible and fits the data, does not ensure that it is the right model. An important step in causal modeling is to consider alternative models and to do model comparisons. Overall fit is less important than relatively better fit among alternative models.
The previous model assumed that domain satisfaction causes higher levels of PA and lower levels of NA. Accordingly, affect is a summary of the affect generated in different life domains. This assumption is consistent with bottom-up models of well-being. However, a plausible alternative model assumes that affect is largely influenced by internal dispositions which in turn color our experiences of different life domains. Accordingly neuroticism may simply be a disposition to be more often in a negative mood and this negative mood colors perception of marital satisfaction, job satisfaction, and so on. Costa and McCrae (1980) proposed that neuroticism and extraversion are global affective dispositions. So, it makes sense to postulate that their influence on domain satisfaction and life satisfaction is mediated by affect. McCrae and Costa (1991) postulated that agreeableness and conscientiousness are not affective dispositions, but rather only instrumental for higher satisfaction in some life domains. Thus, their effects should not be mediated by affect. Consistent with this assumption, conscientiousness showed only significant relationships with some domains, including work satisfaction. However, agreeableness was a positive predictor of all life domains, suggesting that it is also a broad affective disposition. I thus modeled agreeableness as a third global affective disposition (see Figure 2).
The effect sizes for affect on domain satisfaction are shown in Table 1.
A comparison of the fit indices for the top-down and bottom-up models shows that both models meet standard criteria for global model fit (CFI > .95; RMSEA < .06). In addition, the results show no clear superiority of one model over the other. CFI and RMSEA show slightly better fit for the bottom-up model, but the Bayesian Information Criterion favors the more parsimonious top-down model. Thus, the data are unable to distinguish between the two models.
Both model assume that conscientiousness is instrumental for higher well-being in only some domains. The key difference between the models is the assumption of the top-down model that changes in domain satisfaction have no influence on affective experiences. That is, an increase in relationship satisfaction does not produce higher levels of PA or a decrease in job satisfaction does not produce a change in NA. These competing predictions can be tested in longitudinal studies.
To conclude part 4 of the monster model series. As surprising as it may sound, the present results provide one of the first tests of McCrae and Costa’s causal theory of well-being (Costa & McCrae, 1980, McCrae & Costa, 1991). Although the present results are consistent with their proposal that agreeableness and conscientiousness are instrumental for higher well-being because they foster higher romantic and job satisfaction, respectively, the present results also show that this model is too simplistic. For example, conscientiousness may also increase well-being because it contributes to higher romantic satisfaction (marriage is work).
One limitation of the present model is the focus on the Big Five as a measure of personality traits. The Big Five are higher-order personality traits of more specific personality traits that are often called facets. Facet level traits may predict additional variance in well-being that is not captured by the Big Five (Schimmack ,Oishi, Furr, & Funder, 2004). Part 5 will add the strongest facet predictors to the model, namely the Depressiveness facet of Neuroticism and the Cheerfulness facet of Extraversion (see also Payne & Schimmack, 2020).
Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766-781. https://doi.org/10.1037/pspp0000066
Campbell, D. T. (1963). From description to experimentation: Interpreting trends as quasi-experiments. In C. W. Harris (Ed.), Problems in measuring change. Madison: University of Wisconsin Press.
Hamaker, E. L., Kuiper, R. M., & Grasman, R. P. P. P. (2015). A critique of the cross-lagged panel model. Psychological Methods, 20(1), 102–116. https://doi.org/10.1037/a0038889
Heise, D. R. (1970). Causal inference from panel data. Sociological Methodology, 2, 3–27.
Orth, U., Clark, D. A., Donnellan, M. B., & Robins, R. W. (2021). Testing prospective effects in longitudinal research: Comparing seven competing cross-lagged models. Journal of Personality and Social Psychology, 120(4), 1013-1034. http://dx.doi.org/10.1037/pspp0000358
Orth, U., Robins, R. W., & Roberts, B. W. (2008). Low self-esteem prospectively predicts depression in adolescence and young adulthood. Journal of Personality and Social Psychology, 95, 695–708. http://dx.doi.org/10.1037/0022-35188.8.131.525
Pelz, D. C., & Andrews, F. (1964). Detecting causal priorities in panel study data, American Sociological Review, 29, 836-848.
Background: A previous blog post shared a conversation between Bill von Hippel and Ulrich Schimmack about Bill’s Replicability Index (Part 1). To recapitulate, I had posted statistical replicability estimates for several hundred social psychologists (Personalized P-Values). Bill’s scores suggested that many of his results with p-values just below .05 might not be replicable. Bill was dismayed by his low R-Index but thought that some of his papers with very low values might be more replicable than the R-Index would indicate. He suggested that we put the R-Index results to an empirical test. He chose his paper with the weakest statistical evidence (interaction p = .07) for a replication study. We jointly agreed on the replication design and sample size. In just three weeks the study was approved, conducted, and the results were analyzed. Here we discuss the results.
…. Three Weeks Later
Bill: Thanks to rapid turnaround at our university IRB, the convenience of modern data collection, and the programming skills of Sam Pearson, we have now completed our replication study on Prolific. We posted the study for 2,000 participants, and 2,031 people signed up. For readers who are interested in a deeper dive, the data file is available at https://osf.io/cu68f/ and the pre-registration at https://osf.io/7ejts.
To cut to the chase, this one is a clear win for Uli’s R-Index. We successfully replicated the standard effect documented in the prior literature (see Figure A), but there was not even a hint of our predicted moderation of that effect, which was the key goal of this replication exercise (see Figure B: Interaction F(1,1167)=.97, p=.325, and the nonsignificant mean differences don’t match predictions). Although I would have obviously preferred to replicate our prior work, given that we failed to do so, I’m pleased that there’s no hint of the effect so I don’t continue to think that maybe it’s hiding in there somewhere. For readers who have an interest in the problem itself, let me devote a few paragraphs to what we did and what we found. For those who are not interested in Darwinian Grandparenting, please skip ahead to Uli’s response.
Previous work has established that people tend to feel closest to their mother’s mother, then their mother’s father, then their father’s mother, and last their father’s father. We replicated this finding in our prior paper and replicated it again here as well. The evolutionary idea underlying the effect is that our mother’s mother knows with certainty that she’s related to us, so she puts greater effort into our care than other grandparents (who do not share her certainty), and hence we feel closest to her. Our mother’s father and father’s mother both have one uncertain link (due to the possibility of cuckoldry), and hence put less effort into our care than our mother’s mother, so we feel a little less close to them. Last on the list is our father’s father, who has two uncertain links to us, and hence we feel least close to him.
The puzzle that motivated our previous work lies in the difference between our mother’s father and father’s mother; although both have one uncertain link, most studies show that people feel closer to their mother’s father than their father’s mother. The explanation we had offered for this effect was based on the idea that our father’s mother often has daughters who often have children, providing her with a more certain outlet for her efforts and affections. According to this possibility, we should only feel closer to our mother’s father than our father’s mother when the latter has grandchildren via daughters, and that is what our prior paper had documented (in the form of a marginally significant interaction and predicted simple effects).
Our clear failure to replicate that finding suggests an alternative explanation for the data in Figure A:
People are closer to their maternal grandparents than their paternal grandparents (possibly for the reasons of genetic certainty outlined above).
People are closer to their grandmothers than their grandfathers (possibly because women tend to be more nurturant than men and more involved in childcare).
As a result of these two main effects, people tend to be closer to their mothers’ father than their father’s mother, and this particular difference emerges in the presence or absence of other more certain kin.
Does our failure to replicate mean that the presence or absence of more certain kin has no impact on grandparenting? Clearly not in the manner I expected, but that doesn’t mean it has no effect. Consider the following (purely exploratory, non-preregistered) analyses of these same data: After failing to find the predicted interaction above, I ran a series of regression analyses, in which closeness to maternal and paternal grandparents were the dependent variables and number of cousins via fathers’ and mothers’ brothers and sisters were the predictor variables. The results are the same whether we’re looking at grandmothers or grandfathers, so for the sake of simplicity, I’ve collapsed the data into closeness to paternal grandparents and closeness to maternal grandparents. Here are the regression tables:
We see three very small but significant findings here (all of which require replication before we have any confidence in them). First, people feel closer to their paternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through fathers’ sisters are associated with less closeness to paternal grandparents). Second, people feel closer to their paternal grandparents to the degree that their maternal grandparents have more grandchildren through daughters other than their mother (i.e., more cousins through mothers’ sisters are associated with more closeness to paternal grandparents). Third, people feel closer to their maternal grandparents to the degree that those grandparents are not also maternal grandparents to someone else (i.e., more cousins through mothers’ sisters are associated with less closeness to maternal grandparents). Note that none of these effects emerged via cousins through father’s or mother’s brothers. These findings strike me as worthy of follow-up, as they suggest that the presence or absence of equally or more certain kin does indeed have a (very small) influence on grandparents in a manner that evolutionary theory would predict (even if I didn’t predict it myself).
Uli: Wow, I am impressed how quickly research with large samples can be done these days. That is good news for the future of social psychology, at least the studies that are relatively easy to do.
Bill: Agreed! But benefits rarely come without cost and studies on the web are no exception. In this case, the ease of working on the web also distorts our field by pushing us to do the kind of work that is ‘web-able’ (e.g., self-report) or by getting us to wangle the methods to make them work on the web. Be that as it may, this study was a no brainer, as it was my lowest R-Index and pure self-report. Unfortunately, my other papers with really low R-Indices aren’t as easy to go back and retest (although I’m now highly motivated to try).
Uli: Of course, I am happy that R-Index made the correct prediction, but N = 1 is not that informative.
Bill: Consider this N+1, as it adds to your prior record.
Uli: Maybe you set yourself up for failure by picking a marginally significant result.
Bill: That was exactly my goal. I still believed in the finding, so it was a great chance to pit your method against my priors. Not much point in starting with one of my results that we both agree is likely to replicate.
Uli: The R-Index analysis implied that we should only trust your results with p < .001.
Bill: That seems overly conservative to me, but of course I’m a biased judge of my own work. Out of curiosity, is that p value better when you analyze all my critical stats rather than just one per experiment? This strikes me as potentially important, because almost none of my papers would have been accepted based on just a single statistic; rather, they typically depend on a pattern of findings (an issue I mentioned briefly in our blog).
Uli: The rankings are based on automatic extraction of test statistics. Selecting focal tests would only lead to an even more conservative alpha criterion. To evaluate the alpha = .001 criterion, it is not fair to use a single p = .07 result. Looking at the original article about grandparent relationships, I see p < .001 for mother’s mother vs. mother’s father relationships. The other contrasts are just significant and do not look credible according to R-Index (predicting failure for same N). However, they are clearly significant in the replication study. So, R-Index made two correct predictions (one failure and one success), and two wrong predictions. Let’s call it a tie. 🙂
Bill: Kind of you, but still a big win for the R-Index. It’s important to keep in mind that many prior papers had found the other contrasts, whereas we were the first to propose and find the specific moderation highlighted in our paper. So a reasonable prior would set the probability much higher to replicate the other effects, even if we accept that many prior findings were produced in an era of looser research standards. And that, in turn, raises the question of whether it’s possible to integrate your R-Index with some sort of Bayesian prior to see if it improves predictive ability.
Your prediction markets v. R-Index blog makes the very good point that simple is better and the R-Index works awfully well without the work involved in human predictions. But when I reflect on how I make such predictions (I happened to be a participant in one of the early prediction market studies and did very well), I’m essentially asking whether the result in question is a major departure from prior findings or an incremental advance that follows from theory. When the former, I say it won’t replicate without very strong statistical evidence. When the latter, I say it will replicate. Would it be possible to capture that sort of Bayesian processing via machine learning and then use it to supplement the R-Index?
Uli: There is an article that tried to do this. Performance was similar to prediction markets. However, I think it is more interesting to examine the actual predictors that may contribute to the prediction of replication outcomes. For example, we know cognitive psychology and within-subject designs are more replicable than social psychology and between-subject designs. I don’t think, however, we will get very far based on single questionable studies. Bias-corrected meat-analysis may be the only way to salvage robust findings from the era of p-hacking.
To broaden the perspective from this single article to your other articles, one problem with the personalized p-values is that they are aggregated across time. This may lead to overly conservative alpha levels (p < .001) for new research that was conducted in accordance with new rules about transparency, while the rules may be too liberal for older studies that were conducted in a time when awareness about the problems of selection for significance was lacking (say before 2013). Inspired by the “loss of confidence project” (Rohrer et al., 2021), I want to give authors the opportunity to exclude articles from their R-Index analysis that they no longer consider credible themselves. To keep track of these loss-of-confidence declaration, I am proposing to use PubPeer (https://pubpeer.com/). Once an author posts a note on PubPeer that declares loss of confidence in the empirical results of an article, the article will be excluded from the R-Index analysis. Thus, authors can improve their standing in the rankings and, more importantly, change the alpha level to a more liberal level (e.g., from .005 to .01) by (a) publicly declaring loss of confidence in a finding and (b) publishing new research with studies that have more power and honestly report non-significant results.
I hope that the incentive to move up in the rankings will increase the low rate of loss of confidence declarations and help us to clean up the published record faster. Declarations could also be partial. For example, for the 2005 article, you could post a note on PubPeer that the ordering of the grandparent relationships was successfully replicated and the results for cousins were not with a link to the data and hopefully eventually a publication. I would then remove this article from the R-Index analysis. What do you think about this idea?
Bill: I think this is a very promising initiative! The problem, as I see it, is that authors are typically the last ones to lose confidence in their own work. When I read through the recent ‘loss of confidence’ reports, I was pretty underwhelmed by the collection. Not that there was anything wrong with the papers in there, but rather that only a few of them surprised me.
Take my own case as an example. I obviously knew it was possible my result wouldn’t replicate, but I was very willing to believe what turned out to be a chance fluctuation in the data because it was consistent with my hypothesis. Because I found that hypothesis-consistent chance fluctuation on my first try, I would never have stated I have low confidence in it if you hadn’t highlighted it as highly improbable. In other words, there’s no chance I’d have put that paper on a ‘loss of confidence’ list without your R-Index telling me it was crap and even then it took a failure to replicate for me to realize you were right.
Thus, I would guess that uptake into the ‘loss of confidence’ list would be low if it emphasizes work that people feel was sloppy in the first place, not because people are liars, but because people are motivated reasoners.
With that said, if the collection also emphasizes work that people have subsequently failed to replicate, and hence have lost confidence in it, I think it would be used much more frequently and could become a really valuable corrective. When I look at the Darwinian Grandparenting paper, I see that it’s been cited over 150 times on google scholar. I don’t know how many of those papers are citing it for the key moderation effect that we now know doesn’t replicate, but I hope that no one else will cite it for that reason after we publish this blog. No one wants other investigators to waste time following up their work once they realize the results aren’t reliable.
Uli: (feeling a bit blue today). I am not very optimistic that authors will take note of replication failures. Most studies are not conducted after a careful review of the existing literature or a meta-analysis that takes publication bias into account. As a result, citations in articles are often picked because they help to support a finding in an article. While p-hacking of data may have decreased over the past decade in some areas, cherry-picking of references is still common and widespread. I am not really sure how we can speed up self-correction of science. My main hope is that meta-analyses are going to improve and take publication bias more seriously. Fortunately, new methods show promising results in debiasing effect sizes estimates (Bartoš, Maier, Wagenmakers, Doucouliagos, & Stanley, 2021). Z-curve is also being used by meta-analysists and we are hopeful that z-curve 2.0 will soon be accepted for publication in Meta-Psychology (Bartos & Schimmack, 2021). Unfortunately, it will take another decade for these methods to become mainstream and meanwhile many resources will be wasted on half-baked ideas that are grounded in a p-hacked literature. I am not optimistic that psychology will become a rigorous science during my lifetime. So, I am trying to make the best of it. Fortunately, I can just do something else when things are too depressing, like sitting in my backyard and watch Germany win at the Euro cup. Life is good, psychological science not so much.
Bill: I don’t blame you for your pessimism, but I completely disagree. You see a science that remains flawed when we ought to know better, but I see a science that has improved dramatically in the 35 years since I began working in this field. Humans are wildly imperfect actors who did not evolve to be dispassionate interpreters of data. We hope that training people to become scientists will debias them – although the data suggest that it doesn’t – and then we double down by incentivizing scientists to publish results that are as exciting as possible as rapidly as possible.
Thankfully bias is the both the problem and the solution, as other scientists are biased in favor of their theories rather than ours, and out of this messy process the truth eventually emerges. The social sciences are a dicier proposition in this regard, as our ideologies intersect with our findings in ways that are less common in the physical and life sciences. But so long as at least some social scientists feel free to go wherever the data lead them, I think our science will continue to self-correct, even if the process often seems painfully slow.
Uli: Your response to my post is a sign that progress is possible, but 1 out of 400 may just be the exception to the rule to never question your own results. Even researchers who know better become promoters of their own theories, especially when they become popular. I think the only way to curb false enthusiasm is to leave the evaluation of theories (review articles, meta-analysis) to independent scientists. The idea that one scientist can develop and evaluate a theory objectively is simply naive. Leaders of a paradigm are like strikers in soccer. They need to have blinders on to risk failure. We need meta-psychologists to distinguish real contributions from false ones. In this way meta-psychologists are like referees. Referees are not glorious heroes, but they are needed for a good soccer game, and they have the power to call of a goal because a player was offside or used their hands. The problem for science is the illusion that scientists can control themselves.
Over the past two decades, social psychological research on prejudice has been dominated by the implicit cognition paradigm (Meissner, Grigutsch, Koranyi, Müller, & Rothermund, 2019). This paradigm is based on the assumption that many individuals of the majority group (e.g., White US Americans) have an automatic tendency to discriminate against members of a stigmatized minority group (e.g., African Americans). It is assumed that this tendency is difficult to control because many people are unaware of their prejudices.
The implicit cognition paradigm also assumes that biases vary across individuals of the majority group. The most widely used measure of individual differences in implicit biases is the race Implicit Association Test (rIAT; Greenwald, McGhee, & Schwartz, 1998). Like any other measure of individual differences, the race IAT has to meet psychometric criteria to be a useful measure of implicit bias. Unfortunately, the race IAT has been used in hundreds of studies before its psychometric properties were properly evaluated in a program of validation research (Schimmack, 2021a, 2021b).
Meta-analytic reviews of the literature suggest that the race IAT is not as useful for the study of prejudice as it was promised to be (Greenwald et al., 1998). For example, Meissner et al. (2019) concluded that “the predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1).
In response to criticism of the race IAT, Greenwald, Banaji, and Nosek (2015) argued that “statistically small effects of the implicit association test can have societally large effects” (p. 553). At the same time, Greenwald (1975) warned psychologists that they may be prejudiced against the null-hypothesis. To avoid this bias, he proposed that researchers should define a priori a range of effect sizes that are close enough to zero to decide in favor of the null-hypothesis. Unfortunately, Greenwald did not follow his own advice and a clear criterion for a small, but practically significant amount of predictive validity is lacking. This is a problem because estimates have decreased over time from r = .39 (McConnell & Leibold, 2001), to r = .24 in 2009 ( Greenwald, Poehlman, Uhlmann, and Banaji, 2009), to r = .148 in 2013 (Oswald, Mitchell, Blanton, Jaccard, & Tetlock (2013), and r = .097 in 2019 (Greenwald & Lai, 2020; Kurdi et al., 2019). Without a clear criterion value, it is not clear how this new estimate of predictive validity should be interpreted. Does it still provide evidence for a small, but practically significant effect, or does it provide evidence for the null-hypothesis (Greenwald, 1975)?
Measures are not Causes
To justify the interpretation of a correlation of r = .1 as small but important, it is important to revisit Greenwald et al.’s (2015) arguments for this claim. Greenwald et al. (2015) interpret this correlation as evidence for an effect of the race IAT on behavior. For example, they write “small effects can produce substantial discriminatory impact also by cumulating over repeated occurrences to the same person” (p. 558). The problem with this causal interpretation of a correlation between two measures is that scores on the race IAT have no influence on individuals’ behavior. This simple fact is illustrated in Figure 1. Figure 1 is a causal model that assumes the race IAT reflects valid variance in prejudice and prejudice influences actual behaviors (e.g., not voting for a Black political candidate). The model makes it clear that the correlation between scores on the race IAT (i.e., the iat box) and scores on a behavioral measures (i.e., the crit box) do not have a causal link (i.e., no path leads from the iat box to the crit box). Rather, the two measured variables are correlated because they both reflect the effect of a third variable. That is, prejudice influences race IAT scores and prejudice influences the variance in the criterion variable.
There is general consensus among social scientists that prejudice is a problem and that individual differences in prejudice have important consequences for individuals and society. The effect size of prejudice on a single behavior has not been clearly examined, but to the extent that race IAT scores are not perfectly valid measures of prejudice, the simple correlation of r = .1 is a lower limit of the effect size. Schimmack (2021) estimated that no more than 20% of the variance in race IAT scores is valid variance. With this validity coefficient, a correlation of r = .1 implies an effect of prejudice on actual behaviors of .1 / sqrt(.2) = .22.
Greenwald et al. (2015) correctly point out that effect sizes of this magnitude, r ~ .2, can have practical, real-world implications. The real question, however, is whether predictive validity of .1 justifies the use of the race IAT as a measure of prejudice. This question has to be evaluated in a comparison of predictive validity for the race IAT with other measures of prejudice. Thus, the real question is whether the race IAT has sufficient incremental predictive validity over other measures of prejudice. However, this question has been largely ignored in the debate about the utility of the race IAT (Greenwald & Lai, 2020; Greenwald et al., 2015; Oswald et al., 2013).
Kurdi et al. (2019) discuss incremental predictive validity, but this discussion is not limited to the race IAT and makes the mistake to correct for random measurement error. As a result, the incremental predictive validity for IATs of b = .14 is a hypothetical estimate for IATs that are perfectly reliable. However, it is well-known that IATs are far from perfectly reliable. Thus, this estimate overestimates the incremental predictive validity. Using Kurdi et al.’s data and limiting the analysis to studies with the race IAT, I estimated incremental predictive validity to be b = .08, 95%CI = .04 to .12. It is difficult to argue that this a practically significant amount of incremental predictive validity. At the very least, it does not justify the reliance on the race IAT as the only measure of prejudice or the claim that the race IAT is a superior measure of prejudice (Greenwald et al., 2009).
The meta-analytic estimate of b = .1 has to be interpreted in the context of evidence of substantial heterogeneity across studies (Kurdi et al., 2019). Kurdi et al. (2019) suggest that “it may be more appropriate to ask under what conditions the two [race IAT scores and criterion variables] are more or less highly correlated” (p. 575). However, little progress has been made in uncovering moderators of predictive validity. One possible explanation for this is that previous meta-analysis may have overlooked one important source of variation in effect sizes, namely publication bias. Traditional meta-analyses may be unable to reveal publication bias because they include many articles and outcome measures that did not focus on predictive validity. For example, Kurdi’s meta-analysis included a study by Luo, Li, Ma, Zhang, Rao, and Han (2015). The main focus of this study was to examine the potential moderating influence of oxytocin on neurological responses to pain expressions of Asian and White faces. Like many neurological studies, the sample size was small (N = 32), but the study reported 16 brain measures. For the meta-analysis, correlations were computed across N = 16 participants separately for two experimental conditions. Thus, this study provided as many effect sizes as it had participants. Evidently, power to obtain a significant result with N = 16 and r = .1 is extremely low, and adding these 32 effect sizes to the meta-analysis merely introduced noise. This may undermine the validity of meta-analytic results ((Sharpe, 1997). To address this concern, I conducted a new meta-analysis that differs from the traditional meta-analyses. Rather than coding as many effects from as many studies as possible, I only include focal hypothesis tests from studies that aimed to investigate predictive validity. I call this a focused meta-analysis.
Focused Meta-Analysis of Predictive Validity
Coding of Studies
I relied on Kurdi et al.’s meta-analysis to find articles. I selected only published articles that used the race IAT (k = 96). The main purpose of including unpublished studies is often to correct for publication bias (Kurdi et al., 2019). However, it is unlikely that only 14 (8%) studies that were conducted remained unpublished. Thus, the unpublished studies are not representative and may distort effect size estimates.
Coding of articles in terms of outcome measures that reflect discrimination yielded 60 studies in 45 articles. I examined whether this selection of studies influenced the results by limiting a meta-analysis with Kurdi et al.’s coding of studies to these 60 articles. The weighted average effect size was larger than the reported effect size, a = .167, se = .022, 95%CI = .121 to .212. Thus, Kurdi et al.’s inclusion of a wide range of studies with questionable criterion variables diluted the effect size estimate. However, there remained substantial variability around this effect size estimate using Kurdi et al.’s data, I2 = 55.43%.
The focused coding produced one effect-size per study. It is therefore not necessary to model a nested structure of effect sizes and I used the widely used metafor package to analyze the data (Viechtbauer, 2010). The intercept-only model produced a similar estimate to the results for Kurdi et al.’s coding scheme, a = .201, se = .020, 95%CI = .171 to .249. Thus, focal coding does seem to produce the same effect size estimate as traditional coding. There was also a similar amount of heterogeneity in the effect sizes, I2 = 50.80%.
However, results for publication bias differed. Whereas Kurdi et al.’s coding shows no evidence of publication bias, focused coding produced a significant relationship emerged, b = 1.83, se = .41, z = 4.54, 95%CI = 1.03 to 2.64. The intercept was no longer significant, a = .014, se = .0462, z = 0.31, 95%CI = -.077 to 95%CI = .105. This would imply that the race IAT has no incremental predictive validity. Adding sampling error as a predictor reduced heterogeneity from I2 = 50.80% to 37.71%. Thus, some portion of the heterogeneity is explained by publication bias.
Stanley (2017) recommends to accept the null-hypothesis when the intercept in the previous model is not significant. However, a better criterion is to compare this model to other models. The most widely used alternative model regresses effect sizes on the squared sampling error (Stanley, 2017). This model explained more of the heterogeneity in effect sizes as reflected in a reduction of unexplained heterogeneity from 50.80% to 23.86%. The intercept for this model was significant, a = .113, se = .0232, z = 4.86, 95%CI = .067 to .158.
Figure 2 shows the effect sizes as a function of sampling error and the regression lines for the three models.
Inspection of Figure 1 provides further evidence that the squared-SE model. The red line (squared sampling error) fits the data better than the blue line (sampling error) model. In particular for large samples, PET underestimates effect sizes.
The significant relationship between sample size (sampling error) and effect sizes implies that large effects in small studies cannot be interpreted at face value. For example, the most highly cited study of predictive validity had only a sample size of N = 42 participants (McConnell & Leibold, 2001). The squared-sampling-error model predicts an effect size estimate of r = .30, which is close to the observed correlation of r = .39 in that study.
In sum, a focal meta-analysis replicates Kurdi et al.’s (2019) main finding that the average predictive validity of the race IAT is small, r ~ .1. However, the focal meta-analysis also produced a new finding. Whereas the initial meta-analysis suggested that effect sizes are highly variable, the new meta-analysis suggests that a large portion of this variability is explained by publication bias.
I explored several potential moderator variables, namely (a) number of citations, (b) year of publication, (c) whether IAT effects were direct or moderator effects, (d) whether the correlation coefficient was reported or computed based on test statistics, and (e) whether the criterion was an actual behavior or an attitude measure. The only statistically significant result was a weaker correlation in studies that predicted a moderating effect of the race IAT, b = -.11, se = .05, z = 2.28, p = .032. However, the effect would not be significant after correction for multiple comparison and heterogeneity remained virtually unchanged, I2 = 27.15%.
During the coding of the studies, the article “Ironic effects of racial bias during interracial interactions” stood out because it reported a counter-intuitive result. in this study, Black confederates rated White participants with higher (pro-White) race IAT scores as friendlier. However, other studies find the opposite effect (e.g., McConnell & Leibold, 2001). If the ironic result was reported because it was statistically significant, it would be a selection effect that is not captured by the regression models and it would produce unexplained heterogeneity. I therefore also tested a model that excluded all negative effect. As bias is introduced by this selection, the model is not a test of publication bias, but it may be better able to correct for publication bias. The effect size estimate was very similar, a = .133, se = .017, 95%CI = .010 to .166. However, heterogeneity was reduced to 0%, suggesting that selection for significance fully explains heterogeneity in effect sizes.
In conclusion, moderator analysis did not find any meaningful moderators and heterogeneity was fully explained by publication bias, including publishing counterintuitive findings that suggest less discrimination by individuals with more prejudice. The finding that publication bias explains most of the variance is extremely important because Kurdi et al. (2019) suggested that heterogeneity is large and meaningful, which would suggest that higher predictive validity could be found in future studies. In contrast, the current results suggest that correlations greater than .2 in previous studies were largely due to selection for significance with small samples, which also explains unrealistically high correlations in neuroscience studies with the race IAT (cf. Schimmack, 2021b).
Predictive Validity of Self-Ratings
The predictive validity of self-ratings is important for several reasons. First, it provides a comparison standard for the predictive validity of the race IAT. For example, Greenwald et al. (2009) emphasized that predictive validity for the race IAT was higher than for self-reports. However, Kurdi et al.’s (2019) meta-analysis found the opposite. Another reason to examine the predictive validity of explicit measures is that implicit and explicit measures of racial attitudes are correlated with each other. Thus, it is important to establish the predictive validity of self-ratings to estimate the incremental predictive validity of the race IAT.
Figure 2 shows the results. The sampling-error model shows a non-zero effect size, but sampling error is large, and the confidence interval includes zero, a = .121, se = .117, 95%CI = -.107 to .350. Effect sizes are also extremely heterogeneous, I2 = 62.37%. The intercept for the squared-sampling-error model is significant, a = .176, se = .071, 95%CI = .036 to .316, but the model does not explain more of the heterogeneity in effect sizes than the squared-sampling-error model, I2 = 63.33%. To remain comparability, I use the squared-sampling error estimate. This confirms Kurdi et al.’s finding that self-ratings have slightly higher predictive validity, but the confidence intervals overlap. For any practical purposes, predictive validity of the race IAT and self-reports is similar. Repeating the moderator analyses that were conducted with the race IAT revealed no notable moderators.
Only 21 of the 60 studies reported information about the correlation between the race IAT and self-report measures. There was no indication of publication bias, and the effect size estimates of the three models converge on an estimate of r ~ .2 (Figure 3). Fortunately, this result can be compared with estimates from large internet studies (Axt, 2017) and a meta-analysis of implicit-explicit correlations (Hofmann et al., 2005). These estimates are a bit higher, r ~ .25. Thus, using an estimate of r = .2 is conservative for a test of the incremental predictive validity of the race IAT.
Incremental Predictive Validity
It is straightforward to estimate the incremental predictive validity of the race IAT and self-reports on the basis of the correlations between race IAT, self-ratings, and criterion variables. However, it is a bit more difficult to provide confidence intervals around these estimates. I used a simulated dataset with missing values to reproduce the correlations and sampling error of the meta-analysis. I then regressed, the criterion on the implicit and explicit variable. The incremental predictive validity for the race IAT was b = .07, se = .02, 95%CI = .03 to .12. This finding implies that the race IAT on average explains less than 1% unique variance in prejudice behavior. The incremental predictive validity of the explicit measure was b = .165, se = .03, 95%CI = .11 to .23. This finding suggests that explicit measures explain between 1 and 4 percent of the variance in prejudice behaviors.
Assuming that there is no shared method variance between implicit and explicit measures and criterion variables and that implicit and explicit measures reflect a common construct, prejudice, it is possible to fit a latent variable model to the correlations among the three indicators of prejudice (Schimmack, 2021). Figure 4 shows the model and the parameter estimates.
According to this model, prejudice has a moderate effect on behavior, b = .307, se = .043. This is consistent with general findings about effects of personality traits on behavior (Epstein, 1973; Funder & Ozer, 1983). The loading of the explicit variable on the prejudice factor implies that .582^2 = 34% of the variance in self-ratings of prejudice is valid variance. The loading of the implicit variable on the prejudice factor implies that .353^2 = 12% of the variance in race IAT scores is valid variance. Notably, similar estimates were obtained with structural equation models of data that are not included in this meta-analysis (Schimmack, 2021). Using data from Cunningham et al., (2001) I estimated .43^2 = 18% valid variance. Using Bar-Anan and Vianello (2018), I estimated .44^2 = 19% valid variance. Using data from Axt, I found .44^2 = 19% valid variance, but 8% of the variance could be attributed to group differences between African American and White participants. Thus, the present meta-analytic results are consistent with the conclusion that no more than 20% of the variance in race IAT scores reflects actual prejudice that can influence behavior.
In sum, incremental predictive validity of the race IAT is low for two reasons. First, prejudice has only modest effects on actual behavior in a specific situation. Second, only a small portion of the variance in race IAT scores is valid.
In the 1990s, social psychologists embraced the idea that behavior is often influenced by processes that occur without conscious awareness. This assumption triggered the implicit revolution (Greenwald & Banaji, 2017). The implicit paradigm provided a simple explanation for low correlations between self-ratings of prejudice and implicit measures of prejudice, r ~ .2. Accordingly, many people are not aware how prejudice their unconscious is. The Implicit Association Test seemed to support this view because participants showed more prejudice on the IAT than on self-report measures. First studies of predictive validity also seemed to support this new model of prejudice (McConnell & Leibold, 2001), and the first meta-analysis suggested that implicit bias has a stronger influence on behavior than self-reported attitudes (Greenwald, Poehlman, Uhlmann, & Banaji, 2009, p. 17).
However, the following decade produced many findings that require a reevaluation of the evidence. Greenwald et al. (2009) published the largest test (N = 1057) of predictive validity. This study examined the ability of the race IAT to predict racial bias in the 2008 US presidential election. Although the race IAT was correlated with voting for McCain versus Obama, incremental predictive validity was close to zero and no longer significant when explicit measures were included in the regression model. Then subsequent meta-analyses produced lower estimates of predictive validity and it is no longer clear that predictive validity, especially incremental predictive validity, is high enough to reject the null-hypothesis. Although incremental predictive validity may vary across conditions, no conditions have been identified that show practically significant incremental predictive validity. Unfortunately, IAT proponents continue to make misleading statements based on single studies with small samples. For example, Kurdi et al. claimed that “effect sizes tend to be relatively large in studies on physician–patient interactions” (p. 583). However, this claim was based on a study with just 15 physicians, which makes it impossible to obtain precise effect size estimates about implicit bias effects for physicians.
Beyond Nil-Hypothesis Testing
Just like psychology in general, meta-analyses also suffer from the confusion of nil-hypothesis testing and null-hypothesis testing. The nil-hypothesis is the hypothesis that an effect size is exactly zero. Many methodologists have pointed out that it is rather silly to take the nil-hypothesis at face value because the true effect size is rarely zero (Cohen, 1994). The more important question is whether an effect size is sufficiently different from zero to be theoretically and practically meaningful. As pointed out by Greenwald (1975), effect size estimation has to be complemented with theoretical predictions about effect sizes. However, research on predictive validity of the race IAT lacks clear criteria to evaluate effect size estimates.
As noted in the introduction, there is agreement about the practical importance of statistically small effects for the prediction of discrimination and other prejudiced behaviors. The contentious question is whether the race IAT is a useful measure of dispositions to act prejudiced. Viewed from this perspective, focus on the race IAT is myopic. The real challenge is to develop and validate measures of prejudice. IAT proponents have often dismissed self-reports as invalid, but the actual evidence shows that self-reports have some validity that is at least equal to the validity of the race IAT. Moreover, even distinct self-report measures like the feeling thermometer and the symbolic racism have incremental predictive validity. Thus, prejudice researchers should use a multi-method approach. At present it is not clear that the race IAT can improve the measurement of prejudice (Greenwald et al., 2009; Schimmack, 2021a).
This article introduced a new type of meta-analysis. Rather than trying to find as many vaguely related studies and to code as many outcomes as possible, focused meta-analysis is limited to the main test of the key hypothesis. This approach has several advantages. First, the classic approach creates a large amount of heterogeneity that is unique to a few studies. This noise makes it harder to find real moderators. Second, the inclusion of vaguely related studies may dilute effect sizes. Third, the inclusion of non-focal studies may mask evidence of publication bias that is virtually present in all literatures. Finally, focal meta-analysis are much easier to do and can produce results much faster than the laborious meta-analyses that psychologists are used to. Even when classic meta-analysis exist, they often ignore publication bias. Thus, an important task for the future is to complement existing meta-analysis with focal meta-analysis to ensure that published effect sizes estimates are not diluted by irrelevant studies and not inflated by publication bias.
Enthusiasm about implicit biases has led to interventions that aim to reduce implicit biases. This focus on implicit biases in the real world needs to be reevaluated. First, there is no evidence that prejudice typically operates outside of awareness (Schimmack, 2021a). Second, individual differences in prejudice have only a modest impact on actual behaviors and are difficult to change. Not surprisingly, interventions that focus on implicit bias are not very infective. Rather than focusing on changing individuals’ dispositions, interventions may be more effective by changing situations. In this regard, the focus on internal factors is rather different from the general focus in social psychology on situational factors (Funder & Ozer, 1983). In recent years, it has become apparent that prejudice is often systemic. For example, police training may have a much stronger influence on racial disparities in fatal use of force than individual differences in prejudice of individual officers (Andersen, Di Nota, Boychuk, Schimmack, & Collins, 2021).
The present meta-analysis of the race IAT provides further support for Meissner et al.’s (2019) conclusion that IATs “predictive value for behavioral criteria is weak and their incremental validity over and above self-report measures is negligible” (p. 1). The present meta-analysis provides a quantitative estimate of b = .07. Although researchers can disagree about the importance of small effect sizes, I agree with Meissner that the gains from adding a race IAT to the measurement of prejudice is negligible. Rather than looking for specific contexts in which the race IAT has higher predictive validity, researchers should use a multi-method approach to measure prejudice. The race IAT may be included to further explore its validity, but there is no reason to rely on the race IAT as the single most important measure of individual differences in prejudice.
Funder, D.C., & Ozer, D.J. (1983). Behavior as a function of the situation. Journal of Personality and Social Psychology, 44, 107–112.
Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., et al. (2019). Relationship between the implicit association test and intergroup behavior: a meta-analysis. American Psychologist. 74, 569–586. doi: 10.1037/amp0000364
The past decade has revealed many flaws in the way psychologists conduct empirical tests of theories. The key problem is that psychologists lacked an accepted strategy to conclude that a prediction was not supported. This fundamental flaw can be traced back to Fisher’s introduction of significance testing. In Fisher’s the null-hypothesis is typically specified as the absence of an effect in either direction. That is, the effect size is exactly zero. Significance testing examines how much empirical results deviate from this prediction. If the probability of the result or even more extreme deviations is less than 5%, the null-hypothesis is rejected. However, if the p-value is greater than .05, no inferences can be drawn from the finding because there are two explanations for this finding. Either the null-hypothesis is true or it is false and the result is a false negative result. The probability of this false negative results is unspecified in Fisher’s framework. This asymmetrical approach to significance testing continues to dominate psychological science.
Criticism of this one-sided approach to significance testing is nearly as old as nil-hypothesis significance testing itself (Greenwald, 1975; Sterling, 1959). Greenwald’s (1975) article is notable because it provided a careful analysis of the problem and it pointed towards a solution to this problem that is rooted in Neyman-Pearson’s alternative to Fisher’s significance testing. Greenwald (1975) showed how it is possible to “Accept the Null-Hypothesis Gracefully” (p. 16).
“Use a range, rather than a point, null hypothesis. The procedural recommendations to follow are much easier to apply if the researcher has decided, in advance of data collection, just what magnitude of effect on a dependent measure or measure of association is large enough not to be considered trivial. This decision may have to be made somewhat arbitrarily but seems better to be made somewhat arbitrarily before data collection than to be made after examination of the data.” (p. 16).
The reason is simply that it is impossible to provide evidence for the nil-hypothesis that an effect size is exactly zero, just like it is impossible to show than an effect size equals any other precise value (e..g., r = .1). Although Greenwald made this sensible suggestion over 40 years ago, it is nearly impossible to find articles that specify a range of effect sizes a priori (e.g.., we expected the effect size to be in the range between r = .3 and r = .5 or we expected the correlation to be larger than r = .1).
Bad training continues to be a main reason for the lack of progress in psychological science. However, other factors also play a role. First, specifying effect sizes a priori has implications for the specification of sample sizes. A researcher who declares that effect sizes as small as r = .1 are meaningful and expected needs large samples to obtain precise effect size estimates. For example, assuming the population correlation is r = .2 and a researcher wants to show that it is at least r = .1, a one-sided test with alpha = .05 and 95% power (i.e., the probability of a successful outcome) is N = 1,035. As most sample sizes in psychology are below N = 200, most studies simply lack the precision to test hypothesis that predict small effects. A solution to this might be to focus on hypotheses that predict large effect sizes. However, to show that a population correlation of r = .4 is greater than r = .3, still requires N = 833 participants. In fact, most studies in psychology barely have enough power to demonstrate that moderate correlations, r = .3, are greater than zero, N = 138. In short, most studies are too small to provide evidence for the null-hypothesis that effect sizes are small than a minimum effect size. Not surprisingly, psychological theories are rarely abandoned because empirical results seemed to support the null-hypothesis.
However, occasionally studies do have large samples and it would be possible to follow Greenwald’s (1975) recommendation to specify a minimum effect size a priori. For example, Greenwald and colleagues conducted a study with N = 1,411 participants who reported their intentions to vote for Obama or McCain in the 2008 US elections. The main hypothesis was that implicit measures of racial attitudes like the race IAT would add to the prediction because some White Democrats might not vote for a Black Democratic candidate. It would have been possible to specify an minimum effect size based on a meta-analysis that was published in the same year. This meta-analysis of smaller studies suggested that the average race IAT – criterion correlation was r = .236. The explicit – criterion correlation was r = .186, effect, and the explicit-implicit correlation was only r = .117. Given the lower estimates for the explicit measures and the low explicit-implicit correlation, a regression analysis would only slightly reduce the effect size for the incremental predictive validity of the race IAT, b = .225. Thus, it would have been possible to test the hypothesis that the effect size is at least b = .1, which would imply that adding the race IAT as a predictor explains at least 1% additional variance in voting behaviors.
In reality, the statistical analyses were conducted with prejudice against the null-hypothesis. First, Greenwald et al. (2009) noted that “conservatism and symbolic racism were the two strongest predictors of voting intention (see Table 1)” (p. 247).
A straightforward way to test the hypothesis that the race IAT contributes to the prediction of voting would simply add the standardized race IAT as an additional predictor and use the regression coefficient to test the prediction that implicit bias as measured with the race IAT contributes to voting against Obama. A more stringent test of incremental predictive validity would also include the other explicit prejudice measures because measurement error alone can produce incremental predictive validity for measures of the same construct. However, this is not what the authors did. Instead, they examined whether the four racial attitude measures jointly predicted variance in addition to political orientation. This was the case, with 2% additional explained variance (p < .0010). However, this result does not tell us anything about the unique contribution of the race IAT. The unique contributions of the four measures were not reported. Instead, another regression model tested whether the race IAT and a second implicit measure (the Affective Misattribution Task) explained incremental variance in addition to political orientation. In this model “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247). This model also does not tell us anything about the importance of the race IAT because it was not reported how much of the joint contribution was explained by the race IAT alone. The inclusion of the AMP also makes it impossible to test the statistical significance for the race IAT because most of the prediction may come from the shared variance between the two implicit measures, r = .218. Most important, the model does not test whether the race IAT predicts voting above and beyond explicit measures, including symbolic racism.
Another multiple regression analysis entered symbolic racism and the two implicit measures. In this analysis, the two implicit measures combined explained an additional 0.7% of the variance, but this was not statistically significant, p = .07.
They then fitted the model with all predictor variables. In this model, the four attitude measures explained an additional 1.3% of the variance, p = .01, but no information is provided about the unique contribution of the race IAT or the joint contribution of the two implicit measures. The authors merely comment that “among the four race attitude measures, the thermometer difference measure was the strongest incremental predictor and was also the only one of the four that was individually statistically significant in their simultaneous entry after both symbolic racism and conservatism (p. 247).
To put it mildly, the presented results carefully avoid reporting the crucial result about the incremental predictive validity of the race IAT after explicit measures of prejudice are entered into the equation. Adding the AMP only creates confusion because the empirical question is how much the race IAT adds to the prediction of voting behavior. Whether this variance is shared with another implicit measure or not is not relevant.
Table 1 can be used to obtain the results that were not reported in the article. A regression analysis shows a standardized effect size estimate of 0.000 with a 95%CI that ranges from -.047 to .046. The upper limit of this confidence interval is below the minimum effect size of .1 that was used to specify a reasonable null-hypothesis. Thus, the only study that had sufficient precision to the incremental predictive validity of the race IAT shows that the IAT does not make a meaningful, notable, practically significant contribution to the prediction of racial bias in voting. In contrast, several self-report measures did show that racial bias influenced voting behavior above and beyond the influence of political orientation.
Greenwald et al.’s (2009) article illustrates Greenwald’s (1975) prejudice against the null-hypotheses. Rather than reporting a straightforward result, they present several analyses that disguise the fact that the race IAT did not predict voting behavior. Based on these questionable analyses, the authors misrepresent the findings. For example, they claim that “both the implicit and explicit (i.e., self-report) race attitude measures successfully predicted voting.” They omit that this statement is only correct when political orientation and symbolic racism are not used as predictors.
They then argue that their results “supplement the substantial existing evidence that race attitude IAT measures predict individual behavior (reviewed by Greenwald et al., 2009)” (p. 248). This statement is false. The meta-analysis suggested that incremental predictive validity of the race IAT is r ~ .2, whereas this study shows an effect size of r ~ 0 when political orientation is taken into account.
The abstract, often the only information that is available or read, further misleads readers. “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242). Careful reading of the results section shows that the statement refers to separate analyses in which implicit measures are tested controlling for explicit attitude ratings OR political orientation OR symbolic racism. The new results presented here show that the race IAT does not predict voting controlling for explicit attitudes AND political orientation AND symbolic racism.
The deceptive analysis of these data has led to many citations that the race IAT is an important predictor of actual behavior. For example, in their popular book “Blindspot” Banaji and Greenwald list this study as an example that “the Race IAT predicted racially discriminatory behavior. A continuing stream of additional studies that have been completed since publication of the meta-analysis likewise supports that conclusion. Here are a few examples of race-relevant behaviors that were predicted by automatic White preference in these more recent studies: voting for John McCain rather than Barack Obama in the 2008 U.S. presidential election” (p. 49)
Kurdi and Banaji (2017) use the study to claim that “investigators have used implicit race attitudes to predict widely divergent outcome measures” (p. 282), without noting that even the reported results showed less than 1% incremental predictive validity. A review of prejudice measures features this study as an example of predictive validity (Fiske & North, 2014).
Of course, a single study with a single criterion is insufficient to accept the null-hypothesis that the race IAT lacks incremental predictive validity. A new meta-analysis by Kurdi with Greenwald as co-author provides new evidence about the typical amount of incremental predictive validity of the incremental predictive validity of the race IAT. The only problem is that this information is not provided. I therefore analyzed the open data to get this information. The meta-analytic results suggest an implicit-criterion correlation of r = .100, se = .01, an explicit-criterion correlation of r = .127, se = .02, and an implicit-explicit correlation of of r = .139, se = .022. A regression analysis yields an estimate of the incremental predictive validity for the race IAT of .084, 95%CI = .040 to .121. While this effect size is statistically significant in a test against the nil-hypothesis, it is also statistically different from Greenwald et al.s’ (2009) estimate of b = .225. Moreover, the point estimate is below .1, which could be used to affirm the null-hypothesis, but the confidence interval includes a value of .1. Thus, there is a 20% chance (an 80%CI would not include .1) that the effect size is greater than .1, but it is unlikely(p < .05) that it is greater than .12.
Greenwald and Lai (2020) wrote an Annual Review article about implicit measures. It mentions that estimates of the predictive validity of IATs have decreased from r = .274 (Greenwald et all, 2009) to r = .097 (Kurdi et al., 2019). No mention is made of a range of effect sizes that would support the null-hypothesis that implicit measures do not add to the prediction of prejudice because they do not measure an implicit cause of behavior that is distinct from causes of prejudice that are reflected in self-report measures. Thus, Greenwald fails to follow the advice of his younger self to provide a strong test of a theory by specifying effect sizes that would provide support for the null-hypothesis and against his theory of implicit cognitions.
It is not only ironic to illustrate the prejudice against falsification with Greenwald’s own research. It also shows that the one-sided testing of theories that avoids failures is not only a lack of proper training in statistics or philosophy of science. After all, Greenwald demonstrated that he is well aware of the problems with nil-hypothesis testing. Thus, only motivated biases can explain the one-sided examination of the evidence. Once a researcher has made a name for themselves, they are no longer neutral observers like judges or juries. They are more like prosecutors who will try as hard as possible to get a conviction and ignore evidence that may support a non-guilty verdict. To make matters worse, science does not really have an adversarial system where a defense lawyer stands up for the defendant (i.e., the null-hypothesis) and no evidence can be presented to support the defendant.
Once we realize the power of motivated reasoning, it is clear that we need to separate the work of theory development and theory evaluation. We cannot let researchers who developed a theory conduct meta-analyses and write review articles, just like we cannot ask film directors to write their own movie reviews. We should leave meta-analyses and reviews to a group of theoretical psychologists who do not conduct original research. As grant money for original research is extremely limited and a lot of time and energy is wasted on grant proposals, there is ample capacity for psychologist to become meta-psychologist. Their work also needs to be evaluated differently. The aim of meta-psychology is not to make novel discoveries, but to confirm that claims by original researchers about their discoveries are actually robust, replicable, and credible. Given the well-documented bias in the published literature, a lot of work remains to be done.