I published a critique of the Implicit Association Test (IAT). Using structural equation modeling of multi-method studies, I found low convergent validity among implicit measures of attitudes (prejudice, self-esteem, political orientation) and low discriminant validity between explicit and implicit measures. The latter finding is reflected in high correlations between the factor that captures the shared variance among explicit measures and the factor that captures the shared variance among implicit measures. Using factor loadings to quantify validity, I found that the controversial race IAT has at most 20% valid variance in capturing racial attitudes. Most if not all of this variance is shared with explicit measures. Thus, there is no evidence that IAT scores reflect a distinct form of implicit prejudice that may operate outside of conscious awareness.
This article elicited a commentary by Vianello and Bar-Anan (ref.) and another by Kurdi, Ratliff, and Cunningham (pdf). Here is a draft of my response to their commentaries. As you will see, there is little common ground; even the term “validity” is not clearly defined, which makes any discussion about the validity of the IAT meaningless. To make progress as a science (or to become a science), psychologists need to have a common understanding of psychological measurement and of methods that can be used to evaluate the validity of measures quantitatively.
Just like pre-publication peer reviews, the two post-publication commentaries have remarkably little overlap. While Vianello and Bar-Anan (VBA) question my statistical analyses, Kurdi, Ratliff, and Cunningham (KRC) accept my statistical results but argue that these results do not challenge the validity of the IAT.
VBA’s critique is clearer and therefore easier to refute by means of objective model comparisons. The key difference between VBA’s model and my model is the modelling of method variance. VBA’s model assumes that all implicit measures of different constructs are influenced by a single method factor. In contrast, my model assumes that implicit measures of prejudice (e.g., the standard race IAT and the Brief Implicit Association Test with the same racial stimuli) share additional method variance. Because the two models are nested, it is possible to test them directly against each other. The results show that a model with content-specific method variance fits the data better (Schimmack, 2020a). The standard inference from a model comparison test is that the model with the worse fit is not an adequate model of the data. VBA nevertheless ignored the poorer fit of their model and presented a revised model that does not model method variance properly and therefore produces misleading results. Thus, VBA’s commentary is just another demonstration of the power of motivated reasoning that undermines the idealistic notion of a self-correcting science.
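Nested models of this kind are commonly compared with a chi-square difference (likelihood ratio) test. The following is a minimal Python sketch of that test, not the analysis reported in Schimmack (2020a). The fit statistics are hypothetical placeholders, and the function names `chi2_sf` and `chi2_difference_test` are illustrative.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the closed-form series for integer degrees of freedom,
    so only the standard library is required.
    """
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series
    term, total = 1.0, 0.0
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    tail = math.erfc(math.sqrt(x / 2))
    return tail + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total


def chi2_difference_test(chi2_restricted, df_restricted, chi2_general, df_general):
    """Compare a restricted model against the more general model it is nested in."""
    delta_chi2 = chi2_restricted - chi2_general
    delta_df = df_restricted - df_general
    if delta_df <= 0 or delta_chi2 < 0:
        raise ValueError("the restricted model must have more df and no better fit")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# Hypothetical fit statistics, for illustration only:
# restricted = single method factor for all implicit measures,
# general    = adds content-specific method variance.
delta_chi2, delta_df, p_value = chi2_difference_test(250.0, 100, 200.0, 98)
print(f"delta chi2 = {delta_chi2:.1f}, delta df = {delta_df}, p = {p_value:.2e}")
```

A small p-value indicates that the restricted model fits significantly worse than the more general model. With robust estimators, a scaled difference test would be needed instead of the plain difference used here.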
KRC ask whether my results imply that the IAT cannot be a valid measure of automatic cognition. To provide a meaningful answer to this question, it is important to define the terms valid, measure, automatic, and cognition. The main problem with KRC’s comment is that these terms remain undefined. Without precise definitions, it is impossible to make scientific progress. This is true even for the concept of validity, which has no clear meaning in psychological measurement (Schimmack, 2020c). KRC ignore that I clearly define validity as the correlation between IAT scores and a latent variable that represents the actual variation in constructs such as attitudes towards race, political parties, and the self. My main finding was that IAT scores have only modest validity (i.e., low correlations with the latent variable, or low factor loadings) as measures of racial preferences, no validity as a measure of self-esteem, and no proven validity as measures of implicit constructs that are distinct from the attitudes reflected in self-report measures. Instead of addressing these findings, KRC consistently mischaracterize them when they write that “the reanalyses reported by Schimmack find high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content.” This statement is simply false and confuses correlations of measures with correlations of latent variables. The high correlations between the latent factors that represent the shared variance among explicit measures and the shared variance among implicit measures provide evidence of low discriminant validity, not evidence of high validity. Moreover, the modest loadings of the race IAT on the implicit race factor show low validity of the IAT as a measure of racial attitudes.
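The difference between correlations of measures and correlations of latent variables can be made concrete with a small calculation. In a simple model with two correlated factors, the expected correlation between an implicit and an explicit measure is the product of the two factor loadings and the factor correlation. The sketch below uses the upper bound of 20% valid variance for the race IAT stated above. The explicit loading (.70) and the factor correlation (.80) are hypothetical values chosen only for illustration, and the function name is illustrative as well.

```python
import math


def implied_measure_correlation(loading_a, loading_b, factor_correlation):
    """Expected correlation between two measures that load on different,
    correlated factors (no shared method variance assumed)."""
    return loading_a * factor_correlation * loading_b


iat_loading = math.sqrt(0.20)   # 20% valid variance implies a loading of about .45
explicit_loading = 0.70         # hypothetical value
factor_correlation = 0.80       # hypothetical value

r_measures = implied_measure_correlation(iat_loading, explicit_loading, factor_correlation)
print(f"IAT loading: {iat_loading:.2f}")                          # 0.45
print(f"Factor correlation: {factor_correlation:.2f}")            # 0.80
print(f"Implied IAT-explicit correlation: {r_measures:.2f}")      # 0.25
```

Under these assumptions, a factor correlation of .80 corresponds to a correlation of only about .25 between the observed scores. A high correlation between factors therefore says nothing about high correlations between measures.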
After mischaracterizing my results, KRC go on to claim that my results do “not cast any doubt on the ability of IATs to index attitudes or to do so in an automatic fashion” (p. 5). However, the low convergent validity among implicit measures remains a problem for any claims that the IAT and other implicit measures measure a common construct with good validity. KRC simply ignore this key finding even though factor loadings provide objective and quantitative information about the construct validity of IAT scores.
The IAT is not the only research instrument with questionable construct validity. However, the IAT is unique because it became a popular measure of individual differences without critical evaluation of its psychometric properties. This is particularly problematic when people are given feedback about their IAT scores on the Project Implicit website, especially for IATs that demonstrably have no validity, like the self-esteem IAT. The developers of the IAT and KRC defend this practice by arguing that taking an IAT can be educational: “At this stage in its development, it is preferable to use the IAT mainly as an educational tool to develop awareness of implicit preferences and stereotypes.” However, it is not clear how a test with invalid results can be educational. How educational would it be to provide individuals with randomly generated feedback about their intelligence? If this sounds unethical, it is not clear why it is acceptable to provide individuals with misleading feedback about their racial attitudes or self-esteem. As a community, psychologists should take a closer look at the practice of providing online feedback with tests that have low validity because this practice may undermine trust in psychological science.
KRC’s commentary also fails to address important questions about the sources of stability and change in IAT scores over time. KRC suggest that “the jury is still out on whether variation in responding on the IAT mostly reflects individual differences or mostly reflects the effects of the situation” (p. 4). The reason why two decades of research have failed to answer this important question is that social cognition researchers focus on brief laboratory experiments that have little ecological validity and that are unable to demonstrate stability of individual differences over time. However, two longitudinal studies suggest that IAT scores measure stable attitudes rather than context-dependent automatic cognitions. Wil Cunningham, one of the commentators, provided the first evidence that variance in IAT scores reflects mostly random measurement error and stable trait variance, with no evidence of situation-specific state variance (Cunningham et al., 2001). Interestingly, KRC ignore the implications of this study. This year, an impressive study examined this question with repeated measures over a six-year period (Onyeador et al., 2020; Schimmack, 2020b). The results confirmed that even over this long time period, variance in IAT scores mostly reflects measurement error and a stable trait, without notable variance due to changes in situations.
Another important topic that I could only mention briefly in my original article is incremental predictive validity. KRC mention Kurdi et al.’s (2019) meta-analysis as evidence that the IAT and self-report measures tap different constructs. They fail to mention that the conclusions of this meta-analysis are undermined by the lack of credible, high-powered studies that can demonstrate incremental predictive validity. To quote Kurdi et al.’s abstract, “most studies were vastly underpowered” (p. 569). The authors conducted tests of publication bias but did not find evidence for it. The reason could be that they used tests that have low power to detect publication bias. Some studies included in the meta-analysis are likely to have reported inflated effect sizes due to selection for significance, especially costly fMRI studies with tiny sample sizes. For example, Phelps et al. (2000) report a correlation of r(12) = .58 between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Even if we assume that 20% of the variance in IAT scores is valid, the validity-corrected correlation would be r = .58 / √.20 ≈ 1.30, which exceeds the maximum possible value of 1. In other words, this correlation is implausible given the low validity of race IAT scores. The correlation is also much stronger than the predictive validity of the IAT in Kurdi et al.’s meta-analysis. The most plausible explanation for this result is that researchers’ degrees of freedom in fMRI studies inflated this correlation (Vul et al., 2009). Consistent with this argument, effect sizes in studies with larger sample sizes are much smaller, and evidence of incremental predictive validity can be elusive, as in Greenwald et al.’s study of the 2008 election. At present, there is no pre-registered, high-powered study that provides clear evidence of incremental predictive validity. Thus, IAT proponents have failed to respond to Blanton et al.’s (2009) critique of the IAT.
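The validity correction applied to the Phelps et al. (2000) correlation can be reproduced with a few lines of Python. This is a minimal sketch of the standard correction for attenuation. It assumes, as the most favorable case, that the amygdala activation difference is measured with perfect validity; the function name and the `valid_variance_y` parameter are illustrative.

```python
import math


def validity_corrected_correlation(r_observed, valid_variance_x, valid_variance_y=1.0):
    """Divide an observed correlation by the product of the validity
    coefficients (square roots of the valid variance proportions)."""
    if not (0 < valid_variance_x <= 1 and 0 < valid_variance_y <= 1):
        raise ValueError("valid variance proportions must be in (0, 1]")
    return r_observed / math.sqrt(valid_variance_x * valid_variance_y)


r_observed = 0.58        # reported correlation, r(12) = .58
iat_valid_variance = 0.20  # upper bound for the race IAT

r_corrected = validity_corrected_correlation(r_observed, iat_valid_variance)
is_admissible = -1.0 <= r_corrected <= 1.0

print(f"Validity-corrected correlation: {r_corrected:.2f}")  # 1.30
print(f"Within the possible range of a correlation: {is_admissible}")  # False
```

Any measurement error in the amygdala measure (`valid_variance_y` below 1) would make the corrected value even larger, so the implausibility does not depend on the assumption of a perfectly valid criterion.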
Responses to my renewed criticism suggest that IAT researchers are unable or unwilling to respond to valid scientific criticism of the IAT with active coping. Instead, they prefer to engage in emotion-focused, repressive coping that makes IAT researchers feel better without addressing substantive measurement problems.
In conclusion, my critique of the IAT literature and the response by IAT researchers show a wider problem in psychology that I have called the validation crisis (Schimmack, 2020c). Although measurement is at the core of any empirical science, many psychologists lack formal training in psychological measurement. As a result, they create and use measures of unknown validity. This is particularly true for social psychologists because social psychologists in the 1970s and 1980s actively rejected the idea that characteristics within individuals are important for the understanding of human behavior (“the power of the situation”). However, when the cognitive revolution started, the focus shifted from observable situations and behaviors to mental states and processes. Studying phenomena that are not directly observable requires valid measures, just like telescopes need to be validated to observe planets in distant galaxies. The problem is that social cognition researchers developed methods like the IAT to make claims about cognitive processes that are not accessible to outside observers or to introspection, without taking the time to validate these measures. To make progress, the next generation of social psychologists needs to distinguish clearly between constructs and measures and between random and systematic measurement error. As all measures are contaminated by both sources of measurement error, constructs need to be measured with multiple, independent methods that show convergent validity (Campbell & Fiske, 1959; Cronbach & Meehl, 1955). Psychology also needs to move from empty qualitative statements like “the IAT can be valid” to empirically based statements about the amount of validity of a specific IAT in specific populations in clearly defined situations. This requires a new program of research with larger samples, ecologically valid situations, and meaningful criterion variables.
Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94, 567–582. https://doi.org/10.1037/a0014665
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. https://doi.org/10.1037/h0046016
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. https://doi.org/10.1037/h0040957
Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12(2), 163–170. https://doi.org/10.1111/1467-9280.00328
Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F., Perry, S. P., Hardeman, R. R., … van Ryn, M. (2020). The Value of Interracial Contact for Reducing Anti-Black Bias Among Non-Black Physicians: A Cognitive Habits and Growth Evaluation (CHANGE) Study Report. Psychological Science, 31(1), 18–30. https://doi.org/10.1177/0956797619879139
Schimmack, U. (2020a). Open Communication about the invalidity of the race IAT. https://replicationindex.com/2019/09/15/open-communication-about-the-invalidity-of-the-race-iat/
Schimmack, U. (2020b). Racial bias as a trait. https://replicationindex.com/2019/11/28/racial-bias-as-a-trait/ (retrieved 4/21/20)
Schimmack, U. (2020c). The validation crisis. Meta-Psychology (blog)
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x