
Invalid Claims about the Validity of Implicit Association Tests

Schimmack, U. (2021). Invalid Claims About the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm. Perspectives on Psychological Science, 16(2), 435–442.

This post was revised on March 12, 2021, to make it consistent with the published version of my response to the commentaries by Vianello and Bar-Anan and by Kurdi, Ratliff, and Cunningham on my target article about the lack of construct validity of IATs (Schimmack, 2021).

Invalid Claims about the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm

In a prior publication, I used structural equation modeling of multimethod data to examine the construct validity of Implicit Association Tests. The results showed no evidence that IATs measure implicit constructs (e.g., implicit self-esteem, implicit racial bias). This critique of IATs elicited several responses by implicit social-cognition researchers, who tried to defend the validity and usefulness of IATs. I carefully examine these arguments and show that they lack validity. IAT proponents consistently ignore or misrepresent facts that challenge the validity of IATs as measures of individual differences in implicit cognitions. One response suggests that IATs can be useful even if they merely measure the same constructs as self-report measures, but I find no support for the claim that IATs have practically significant incremental predictive validity. In conclusion, IATs are widely used without psychometric evidence of construct or predictive validity.

implicit attitudes, Implicit Association Test, validity, prejudice, suicide, mental health

Greenwald and colleagues (1998) introduced Implicit Association Tests (IATs) as a new method to measure individual differences in implicit cognitions. Twenty years later, IATs are widely used for this purpose, but their construct validity has not been established. Even their creators are no longer sure what IATs measure. Whereas Banaji and Greenwald (2013) confidently described IATs as “a method that gives the clearest window now available into a region of the mind that is inaccessible to question-asking methods” (p. xiii), they now claim that IATs merely measure “the strengths of associations among concepts” (Cvencek et al., 2020, p. 187). This is akin to saying that an old-fashioned thermometer measures the expansion of mercury: It is true, but it has little to do with thermometers’ purpose of measuring temperature.

Fortunately, we do not need Greenwald or Banaji to define the constructs that IATs are supposed to measure. Twenty years of research with IATs makes it clear what researchers believe they are measuring with IATs. A self-esteem IAT is supposed to measure implicit self-esteem (Greenwald & Farnham, 2000). A race IAT is supposed to measure implicit prejudice (Cunningham et al., 2001), and a suicide IAT is supposed to measure implicit suicidal tendencies that can predict suicidal behaviors above and beyond self-reports (Kurdi et al., 2021). The empirical question is whether IATs are any good at measuring these constructs. I concluded that most IATs are poor measures of their intended constructs (Schimmack, 2021). This conclusion elicited one implicit and two explicit responses.

Implicit Response

The implicit response is to simply ignore criticism and to make invalid claims about the construct validity of IATs (Greenwald & Lai, 2020). For example, a 2020 article coauthored by Nosek, Greenwald, and Banaji (among others) claimed that “available evidence for validity of IAT measures of self-esteem is limited (Bosson et al., 2000; Greenwald & Farnham, 2000), with some of the strongest evidence coming from empirical tests of the balance-congruity principle” (Cvencek et al., 2020, p. 191). This statement is as valid as Donald Trump’s claim that an honest count of votes would make him the winner of the 2020 election. Over the past 2 decades, several articles have concluded that self-esteem IATs lack validity (Buhrmester et al., 2011; Falk et al., 2015; Walker & Schimmack, 2008). It is unscientific to omit these references from a literature review.

The balance-congruity principle is also not a strong test of the claim that the self-esteem IAT is a valid measure of individual differences in implicit self-esteem. In contrast, the lack of convergent validity with informant ratings and even other implicit measures of self-esteem provides strong evidence that self-esteem IATs are invalid (Bosson et al., 2000; Falk et al., 2015). Finally, supporting evidence is surprisingly weak. For example, Greenwald and Farnham’s (2000) highly cited article tested predictive validity of the self-esteem IAT with responses to experimentally manipulated successes and failures (n = 94). They did not even report statistical results. Instead, they suggested that even nonsignificant results should be counted as evidence for the validity of the self-esteem IAT:

Although p values for these two effects straddled the p = .05 level that is often treated as a boundary between noteworthy and ignorable results, any inclination to dismiss these findings should be tempered by noting that these two effects agreed with prediction in both direction and shape. (Greenwald & Farnham, 2000, p. 1032)

Twenty years later, this finding has not been replicated, and psychologists have learned to distrust p values that are marginally significant (Benjamin et al., 2018; Schimmack, 2012, 2020). In conclusion, conflict of interest and motivated biases undermine the objectivity of Greenwald and colleagues in evaluations of IATs’ validity.

Explicit Response 1

Vianello and Bar-Anan (2021) criticized my structural equation models of their data. They also presented a new model that appeared to show incremental predictive validity for implicit racial bias and implicit political orientation. I thought it would be possible to resolve some of the disagreement in a direct and open communication with the authors because the disagreement is about modeling of the same data. I was surprised when the authors declined this offer, given that Bar-Anan coauthored an article that praised the virtues of open scientific communication (Nosek & Bar-Anan, 2012). Readers therefore have to reconcile conflicting viewpoints for themselves. To ensure full transparency, I published syntax, outputs, and a detailed discussion of the different modeling assumptions on OSF.

In brief, a comparison of the models shows that mine is more parsimonious and has better fit than their model. Because the model is more parsimonious, better fit cannot be attributed to overfitting of the data. Rather, the model is more consistent with the actual data, which in most sciences is considered a good reason to favor a model. Vianello and Bar-Anan’s model also produced unexplained, surprising results. For example, the race IAT has only a weak positive loading on the IAT method factor, and the political-orientation IAT even has a moderate negative loading. It is not clear how a method can have negative loadings on a method factor, and Vianello and Bar-Anan provided no explanation for this surprising finding.

The two models also produce different results regarding incremental predictive validity (Table 1). My model shows no incremental predictive validity for implicit factors. It is also surprising that Vianello and Bar-Anan found incremental predictive validity for voting behaviors, because the explicit and implicit factors correlated at r = .9. This high correlation leaves little room for variance in implicit political orientation that is distinct from political orientation measured with self-ratings.
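To illustrate why a factor correlation of .9 leaves little room for incremental validity, the following minimal sketch applies the standard two-predictor identity for the squared semipartial correlation. Only the .9 correlation comes from the text; the function name and the criterion correlations (.60 and .55) are hypothetical values chosen for illustration and are not estimates from either model.

```python
def incremental_r2(r_criterion_explicit, r_criterion_implicit, r_factors):
    """Criterion variance uniquely explained by the implicit factor
    after the explicit factor is already in the regression
    (squared semipartial correlation for two predictors)."""
    unique_share = 1.0 - r_factors ** 2  # implicit variance not shared with explicit
    residual_validity = r_criterion_implicit - r_criterion_explicit * r_factors
    return residual_validity ** 2 / unique_share


r_factors = 0.9                            # factor correlation reported in the text
unique_share = 1.0 - r_factors ** 2        # share of implicit variance that is unique
# Hypothetical criterion correlations, for illustration only.
delta_r2 = incremental_r2(0.60, 0.55, r_factors)

print(f"unique implicit variance: {unique_share:.2f}")
print(f"incremental R^2 (hypothetical inputs): {delta_r2:.4f}")
```

With these made-up inputs, less than a fifth of the implicit factor's variance is unique, and the incremental variance explained is well below 1%. Different criterion correlations would give different numbers, so the sketch shows the logic of the argument rather than a result.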

In conclusion, Vianello and Bar-Anan failed to challenge my conclusion that implicit and explicit measures measure mostly the same constructs and that low correlations between explicit and implicit measures reflect measurement error rather than some hidden implicit processes.

Explicit Response 2

The second response (Kurdi et al., 2021) is a confusing 7,000-word article that is short of facts, filled with false claims, and requires more fact-checking than a Trump interview.

False fact 1

The authors begin with the surprising statement that my findings are “not at all incompatible with the way that many social cognition researchers have thought about the construct of (implicit) evaluation” (p. 423). This statement is misleading. For 3 decades, social-cognition researchers have pursued the idea that many social-cognitive processes that guide behavior occur outside of awareness. For example, Nosek et al. (2011) claim “most human cognition occurs outside conscious awareness or conscious control” (p. 152) and go on to claim that IATs “measure something different from self-report” (p. 153). And just last year, Greenwald and Lai (2020) claimed that “in the last 20 years, research on implicit social cognition has established that social judgments and behavior are guided by attitudes and stereotypes of which the actor may lack awareness” (p. 419).

Social psychologists have also been successful in making the term implicit bias a common term in public discussions of social behavior. The second author, Kathy Ratliff, is director of Project Implicit, which “has a mission to develop and deliver methods for investigating and applying phenomena of implicit social cognition, including especially phenomena of implicit bias based on age, race, gender or other factors” (Kurdi et al., 2021, p. 431). It is not clear what this statement means if we do not make a distinction between traditional research on prejudice with self-report measures and the agenda of Project Implicit to study implicit biases with IATs.
In addition, all three authors have published recent articles that allude to IATs as measures of implicit cognitions.

In a highly cited American Psychologist article, Kurdi and coauthors (2019) claim “in addition to dozens of studies that have established construct validity . . . investigators have asked to what extent, and under what conditions, individual differences in implicit attitudes, stereotypes, and identity are associated with variation in behavior toward individuals as a function of their social group membership” (p. 570). The second author coauthored an article with the claim that “Black participants’ implicit attitudes reflected no ingroup/outgroup preference . . . Black participants’ explicit attitudes reflected an ingroup preference” (Jiang et al., 2019). In 2007, Cunningham wrote that the “distinction between automatic and controlled processes now lies at the heart of several of the most influential models of evaluative processing” (Cunningham & Zelazo, 2007, p. 97). And Cunningham coauthored a review article with the claim that “a variety of tasks have been used to reflect implicit psychopathology associations, with the IAT (Greenwald et al., 1998) used most widely” (Teachman et al., 2019). Finally, many users of IATs assume that they are measuring implicit constructs that are distinct from constructs that are measured with self-ratings. It is therefore a problem for the construct validity of IATs if they lack discriminant validity. At the least, Kurdi et al. fail to explain why anybody should use IATs if they merely measure the same constructs that can be measured with cheaper self-ratings. In short, the question whether IATs and explicit measures reflect the same constructs or different constructs has theoretical and empirical relevance, and lack of discriminant validity is a problem for many theories of implicit cognitions (but see Cunningham & Zelazo, 2007).

False fact 2

A more serious false claim is that I found “high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content, as indexed by a variety of self-report scales” (p. 423). Table 2 shows some of the correlations among implicit and explicit measures in Vianello and Bar-Anan’s data. Only one of these correlations meets the standard criterion of a high correlation (i.e., r = .5; Cohen, 1988). The other correlations are small to moderate. These correlations show at best moderate convergent validity and no evidence of discriminant validity (i.e., higher implicit-implicit than implicit-explicit correlations). Similar results have been reported since the first IATs were created (Bosson et al., 2000). For 20 years, IAT researchers have ignored these low correlations and made grand claims about the validity of IATs. Kurdi et al. are doubling down on this misinformation by falsely describing these correlations as high.

False fact 3

The third false claim is that “plenty of evidence in favor of dissociations between direct and indirect measures exists” (p. 428). To support this claim, Kurdi et al. cite a meta-analysis of incremental predictive validity (Kurdi et al., 2019). There are several problems with this claim. First, the meta-analysis corrects only for random measurement error and not systematic measurement error. To the extent that systematic measurement error is present, incremental validity will shrink because explicit and implicit factors are very highly correlated when both sources of error are controlled (Schimmack, 2021). Second, Kurdi et al. fail to mention effect sizes. The meta-analysis suggests that a perfectly reliable IAT would explain about 2% unique variance. However, IATs have only modest reliability. Thus, manifest IAT scores would explain even less unique variance. Finally, even this estimate has to be interpreted with caution because the meta-analysis did not correct for publication bias and included some questionable studies. For example, Phelps et al. (2003) report, among 12 participants, a correlation of .58 between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Assuming 20% valid variance in the IAT scores (Schimmack, 2021), the validation-corrected correlation would be 1.30. In other words, a correlation of .58 is impossible given the low validity of race-IAT scores. It is well known that correlations in functional MRI studies with small samples are not credible (Vul et al., 2009). Moreover, brain activity is not a social behavior. It is therefore unclear why studies like this were included in Kurdi et al.’s (2019) meta-analysis.
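The validation-corrected correlation of 1.30 follows from the usual correction for attenuation: the observed correlation is divided by the square root of the proportion of valid variance. The sketch below reproduces that arithmetic. The function name is hypothetical, and the calculation assumes that the criterion (amygdala activation) is measured without error, which is the most favorable assumption for the reported correlation.

```python
import math


def validity_corrected_correlation(r_observed, valid_variance_share):
    """Correct an observed correlation for the limited validity of one
    measure; the other measure is assumed to be error-free."""
    return r_observed / math.sqrt(valid_variance_share)


# Values taken from the text: r = .58 and 20% valid variance in IAT scores.
corrected = validity_corrected_correlation(0.58, 0.20)
print(round(corrected, 2))  # about 1.30, which exceeds the maximum possible value of 1
```

Because a correlation cannot exceed 1, a corrected value above 1 indicates that the observed correlation, the assumed validity, or both cannot be right.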

Kurdi et al. also used suicides as an important outcome that can be predicted with suicide and death IATs. They cited two articles to support this claim. Fact checking shows that one article reported a statistically significant result (p = .013; Barnes et al., 2017), whereas the other one did not (p > .50; Glenn et al., 2019). I conducted a meta-analysis of all studies that reported incremental predictive validity of suicide or death IATs. The criterion was suicide attempts in the next 3 to 6 months (Table 3). I found eight studies, but six of them came from a single lab (Matthew K. Nock). Nock was also the first one to report a significant result in an extremely underpowered study that included only two suicide attempts (Nock & Banaji, 2007). Five of the eight studies showed a statistically significant result (63%), but the average observed power to achieve significance was only 42%. This discrepancy suggests the presence of publication bias (Schimmack, 2012). Moreover, significant results are all clustered around .05, and none of the p values meets the stricter criterion of .005 that has been suggested by Nosek and others to claim a discovery (Benjamin et al., 2018). Thus, there is no conclusive evidence to suggest that suicide IATs have incremental predictive validity in the prediction of suicides. This is not surprising because most of the studies were underpowered and unlikely to detect small effects. Moreover, effect sizes are bound to be small because the convergent validity between suicide and death IATs is low (r = .21; Chiurliza et al., 2018), suggesting that most of the variance in these IATs is measurement error.
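The comparison of the success rate with observed power can be sketched as follows. This is an illustration under stated assumptions, not the exact computation behind Table 3: it assumes two-sided tests with alpha = .05 and a normal approximation, and the function names are hypothetical. The inputs (p = .013, five of eight significant results, mean observed power of 42%) are the values reported in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power(p_two_sided, alpha=0.05):
    """Power of a two-sided z-test if the true effect equaled the observed one."""
    z_observed = STD_NORMAL.inv_cdf(1 - p_two_sided / 2)
    z_critical = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper_tail = 1 - STD_NORMAL.cdf(z_critical - z_observed)
    lower_tail = STD_NORMAL.cdf(-z_critical - z_observed)
    return upper_tail + lower_tail


def success_rate_inflation(n_significant, n_studies, mean_observed_power):
    """Gap between the share of significant results and mean observed power."""
    return n_significant / n_studies - mean_observed_power


power_barnes = observed_power(0.013)              # p value reported for Barnes et al. (2017)
inflation = success_rate_inflation(5, 8, 0.42)    # counts and mean power from the text

print(f"observed power for p = .013: {power_barnes:.2f}")
print(f"success rate minus mean observed power: {inflation:.3f}")
```

A result that is just significant (p = .05) has an observed power of about 50%, so a set of studies with p values clustered around .05 cannot plausibly produce significant results much more than half of the time.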

In conclusion, 20 years of research with IATs has produced no credible and replicable evidence that IATs have incremental predictive validity over explicit measures. Even if there is some statistically significant incremental predictive validity, the amount of explained variance may lack practical significance (Kurdi et al., 2019).

False fact 4

Kurdi et al. (2021) object (p. 424) to my claim that “most researchers regard the IAT as a valid measure of enduring attitudes that vary across individuals” (Schimmack, 2021, p. 397). They claim that “the overwhelming theoretical consensus in the community of attitude researchers . . . is that attitudes emerge from an interaction of persons and situations” (p. 425). It is instructive to compare this surprising claim with Cunningham and Zelazo’s (2007) definition of attitudes as “relatively stable ideas about whether something is good or bad” (p. 97). Kurdi and Banaji (2017) wrote that “differences in implicit attitudes . . . may arise because of multiple components, including relatively stable components [emphasis added]” (p. 286). Rae and Greenwald (2017) stated that it is a “widespread assumption . . . that implicit attitudes are characteristics of people, almost certainly more so than a property of situations” (p. 297). Greenwald and Lai (2020) stated that test–retest reliability “places an upper limit on correlational tests of construct validity” (p. 425). This statement makes sense only if we assume that the construct to be measured is stable over the retest interval. It is also not clear how it would be ethical to provide individuals with feedback about their IAT scores on the Project Implicit website, if IAT scores were merely a product of the specific situation at the moment they are taking the test. Finally, how can the suicide IAT be a useful predictor of suicide if it cannot measure some stable dispositions related to suicidal behaviors?

In conclusion, Kurdi et al.’s definition of attitudes is inconsistent with the common definition of attitudes as relatively enduring evaluations. That being said, the more important question is whether IATs measure stable attitudes or momentary situational effects. Ironically, some of the best evidence comes from Cunningham. Cunningham et al. (2001) measured prejudice four times over a 3-month period with multiple measures, including the race IAT. Cunningham et al. (2001) modeled the data with a single trait factor that explained all of the covariation among different measures of racial attitudes. Thus, Cunningham et al. (2001) provided the first evidence that most of the valid variance in race IAT scores is perfectly stable over a 3-month period and that person-by-situation interactions had no effect on racial attitudes. There have been few longitudinal studies with IATs since Cunningham et al.’s (2001) seminal study. However, last year, an article examined stability over a 6-year interval (Onyeador et al., 2020). Racial attitudes of more than 3,000 medical students were measured in the first year of medical school, the fourth year of medical school, and the second year of medical residency.

Table 4 shows the correlations for the explicit feeling thermometer (FT) and the IAT scores. The first observation is that the Time-1-to-Time-3 correlation for the IAT scores is not smaller than the Time-1-to-Time-2 or the Time-2-to-Time-3 correlations. This pattern shows that a single trait factor can capture the shared variance among the repeated IAT measures. The second observation is that the bold correlations between explicit ratings and IAT scores on the same occasion are only slightly higher than the correlations for different measurement occasions. This finding shows that there is very little occasion-specific variance in racial attitudes. The third observation is that IAT correlations over time are higher than the corresponding FT-IAT correlations over time. This finding points to IAT-specific method variance that is revealed in studies with multiple implicit measures (Cunningham et al., 2001; Schimmack, 2021). These findings extend Cunningham et al.’s (2001) findings to a 6-year period and show that most of the valid variance in race IAT scores is stable over long periods of time.
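The logic of the first observation can be illustrated by comparing the retest correlations implied by two simple models of three measurement occasions. The sketch below is not the analysis behind Table 4. The function names and all parameter values are hypothetical; the point is only that a stable trait implies equal correlations across all pairs of occasions, whereas gradual change (a first-order autoregressive model) implies a smaller Time-1-to-Time-3 correlation.

```python
def trait_model_correlations(trait_variance_share):
    """Observed score = stable trait + occasion-specific noise.
    Every pair of occasions shares only the trait, so all retest
    correlations equal the trait's share of the variance."""
    return {
        "r12": trait_variance_share,
        "r23": trait_variance_share,
        "r13": trait_variance_share,
    }


def autoregressive_model_correlations(reliability, stability):
    """True scores change gradually between adjacent occasions, so the
    correlation decays with the number of steps between occasions."""
    return {
        "r12": reliability * stability,
        "r23": reliability * stability,
        "r13": reliability * stability ** 2,
    }


# Hypothetical parameter values, for illustration only.
trait = trait_model_correlations(trait_variance_share=0.25)
changing = autoregressive_model_correlations(reliability=0.5, stability=0.5)

print("trait model:         ", trait)
print("autoregressive model:", changing)
```

A Time-1-to-Time-3 correlation that is as large as the correlations between adjacent occasions is therefore the pattern expected under the trait model and not under the gradual-change model.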

In conclusion, Kurdi et al.’s claims about person-by-situation effects are not supported by evidence.

Conclusion

Like presidential debates, the commentaries and my response present radically different views of reality. In one world, IATs are valid and useful tools that have led to countless new insights into human behavior. In the other world, IATs are noisy measures that add nothing to the information we already get from cheaper self-reports. Readers not well versed in the literature are likely to be confused rather than informed by these conflicting accounts. Although we may expect such vehement disagreement in politics, we should not expect it among scientists.
A common view of scientists is that they are able to resolve disagreement by carefully looking at data and drawing logical conclusions from empirical facts. However, this model of scientists is naive and wrong.

A major source of disagreement among psychologists is that psychology lacks an overarching paradigm; that is, a set of fundamentally shared assumptions and facts. Psychology does not have one paradigm, but many paradigms. The IAT was developed within the implicit social-cognition paradigm that gained influence in the 1990s (Bargh et al., 1996; Greenwald & Banaji, 1995; Nosek et al., 2011). Over the past decade, it has become apparent that the empirical foundations of this paradigm are shaky (Doyen et al., 2012; D. Kahneman quoted in Yong, 2012, Supplemental Material; Schimmack, 2020). It took a long time to see the problems because paradigms are like prisons that make it impossible to see the world from the outside. A key force that prevents researchers within a paradigm from noticing problems is publication bias. Publication bias ensures that studies that are consistent with a paradigm are published, cited, and highlighted in review articles to provide false evidence in support of a paradigm (Greenwald & Lai, 2020; Kurdi et al., 2021).

Over the past decade, it has become apparent how pervasive these biases have been, especially in social psychology (Schimmack, 2020). The responses to my critique of IATs merely confirm how powerful paradigms and conflicts of interest can be. It is therefore necessary to allocate more resources to validation projects by independent researchers. In addition, validation studies should be preregistered and properly powered, and results need to be published whether they show validity or not. Conducting validation studies of widely used measures could be an important role for the emerging field of meta-psychology that is not focused on new discoveries, but rather on evaluating paradigmatic research from an outsider, meta-perspective (Carlsson et al., 2017). Viewed from this perspective, many IATs that are in use lack credible evidence of construct validity.

References

*References marked with an asterisk report studies included in
the suicide IAT meta-analysis.

Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden
biases of good people. Delacorte Press.

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity
of social behavior: Direct effects of trait construct and
stereotype activation on action. Journal of Personality
and Social Psychology, 71(2), 230–244.

*Barnes, S. M., Bahraini, N. H., Forster, J. E., Stearns-Yoder, K. A.,
Hostetter, T. A., Smith, G., Nagamoto, H. T., & Nock,
M. K. (2017). Moving beyond self-report: Implicit associations
about death/life prospectively predict suicidal
behavior among veterans. Suicide and Life-Threatening
Behavior, 47, 67–77.

Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A.,
Wagenmakers, E.-J., Berk, R., Bollen, K. A., Brembs, B.,
Brown, L., Camerer, C., Cesarini, D., Chambers, C. D.,
Clyde, M., Cook, T. D., De Boeck, P., Dienes, Z., Dreber,
A., Easwaran, K., Efferson, C., . . . Johnson, V. E. (2018).
Redefine statistical significance. Nature Human Behaviour,
2, 6–10.

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000).
Stalking the perfect measure of implicit self-esteem:
The blind men and the elephant revisited? Journal of
Personality and Social Psychology, 79, 631–643.

Buhrmester, M. D., Blanton, H., & Swann, W. B., Jr. (2011).
Implicit self-esteem: Nature, measurement, and a new way
forward. Journal of Personality and Social Psychology,
100(2), 365–385.

Carlsson, R., Danielsson, H., Heene, M., Innes-Ker, Å., Lakens,
D., Schimmack, U., Schönbrodt, F. D., van Assen, M., &
Weinstein, Y. (2017). Inaugural editorial of Meta-Psychology.
Meta-Psychology, 1.

Chiurliza, B., Hagan, C. R., Rogers, M. L., Podlogar, M. C., Hom,
M. A., Stanley, I. H., & Joiner, T. E. (2018). Implicit measures
of suicide risk in a military sample. Assessment, 25(5),

Cohen, J. (1988). Statistical power analysis for the behavioral
sciences (2nd ed.). Erlbaum.

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001).
Implicit attitude measures: Consistency, stability, and
convergent validity. Psychological Science, 12(2), 163–170.

Cunningham, W. A., & Zelazo, P. D. (2007). Attitudes and
evaluations: A social cognitive neuroscience perspective.
Trends in Cognitive Sciences, 11, 97–104.

Cvencek, D., Meltzoff, A. N., Maddox, C. D., Nosek, B. A.,
Rudman, L. A., Devos, T., Dunham, Y., Baron, A. S.,
Steffens, M. C., Lane, K., Horcajo, J., Ashburn-Nardo, L.,
Quinby, A., Srivastava, S. B., Schmidt, K., Aidman, E.,
Tang, E., Farnham, S., Mellott, D. S., . . . Greenwald, A. G.
(2020). Meta-analytic use of balanced identity theory to
validate the Implicit Association Test. Personality and
Social Psychology Bulletin, 47(2), 185–200.

Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012).
Behavioral priming: It’s all in the mind, but whose mind?
PLOS ONE, 7(1), Article e29081.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu,
C. (2015). Are implicit self-esteem measures valid for
assessing individual and cultural differences? Journal of
Personality, 83, 56–68.

*Glenn, C. R., Millner, A. J., Esposito, E. C., Porter, A. C.,
& Nock, M. K. (2019). Implicit identification with death
predicts suicidal thoughts and behaviors in adolescents.
Journal of Clinical Child & Adolescent Psychology, 48,

Greenwald, A. G., & Banaji, M. R. (1995). Implicit social cognition:
Attitudes, self-esteem, and stereotypes. Psychological
Review, 102(1), 4–27.

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit
Association Test to measure self-esteem and self-concept.
Journal of Personality and Social Psychology, 79, 1022–1038.

Greenwald, A. G., & Lai, C. K. (2020). Implicit social cognition.
Annual Review of Psychology, 71, 419–445.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998).
Measuring individual differences in implicit cognition:
The Implicit Association Test. Journal of Personality and
Social Psychology, 74, 1464–1480.

*Harrison, D. P., Stritzke, W. G. K., Fay, N., & Hudaib, A.-R.
(2018). Suicide risk assessment: Trust an implicit probe
or listen to the patient? Psychological Assessment, 30(10),

Jiang, C., Vitiello, C., Axt, J. R., Campbell, J. T., & Ratliff, K. A.
(2019). An examination of ingroup preferences among
people with multiple socially stigmatized identities. Self
and Identity. Advance online publication.

Kurdi, B., & Banaji, M. R. (2017). Reports of the death of
the individual difference approach to implicit social cognition
may be greatly exaggerated: A commentary on Payne,
Vuletich, and Lundberg. Psychological Inquiry, 28,

Kurdi, B., Ratliff, K. A., & Cunningham, W. A. (2021). Can
the Implicit Association Test serve as a valid measure of
automatic cognition? A response to Schimmack (2021).
Perspectives on Psychological Science, 16(2), 422–434.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan,
A., Kaushik, N., Tomezsko, D., Greenwald, A. G., &
Banaji, M. R. (2019). Relationship between the Implicit
Association Test and intergroup behavior: A meta-analysis.
American Psychologist, 74(5), 569–586.

*Millner, A. J., Augenstein, T. M., Visser, K. H., Gallagher, K.,
Vergara, G. A., D’Angelo, E. J., & Nock, M. K. (2019). Implicit
cognitions as a behavioral marker of suicide attempts in
adolescents. Archives of Suicide Research, 23(1), 47–63.

*Nock, M. K., & Banaji, M. R. (2007). Prediction of suicide ideation
and attempts among adolescents using a brief performance-
based test. Journal of Consulting and Clinical
Psychology, 75(5), 707–715.

*Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L.,
Dour, H. J., & Banaji, M. R. (2010). Measuring the suicidal
mind: Implicit cognition predicts suicidal behavior.
Psychological Science, 21(4), 511–517.

Nosek, B. A., & Bar-Anan, Y. (2012). Scientific utopia: I. Opening
scientific communication. Psychological Inquiry, 23(3),

Nosek, B. A., Hawkins, C. B., & Frazier, R. S. (2011). Implicit
social cognition: From measures to mechanisms. Trends
in Cognitive Sciences, 15(4), 152–159.

Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F.,
Perry, S. P., Hardeman, R. R., Dyrbye, L. N., Herrin, J.,
Phelan, S. M., & van Ryn, M. (2020). The value of interracial
contact for reducing anti-Black bias among non-Black
physicians: A Cognitive Habits and Growth Evaluation
(CHANGE) study report. Psychological Science, 31(1),

Phelps, E. A., Cannistraci, C. J., & Cunningham, W. A. (2003).
Intact performance on an indirect measure of race bias
following amygdala damage. Neuropsychologia, 41(2),

Rae, J. R., & Greenwald, A. G. (2017). Persons or situations?
Individual differences explain variance in aggregated
implicit race attitudes. Psychological Inquiry, 28, 297–300.

*Randall, J. R., Rowe, B. H., Dong, K. A., Nock, M. K., &
Colman, I. (2013). Assessment of self-harm risk using
implicit thoughts. Psychological Assessment, 25(3), 714–721.

Schimmack, U. (2012). The ironic effect of significant results
on the credibility of multiple-study articles. Psychological
Methods, 17(4), 551–566.

Schimmack, U. (2020). A meta-psychological perspective on
the decade of replication failures in social psychology.
Canadian Psychology/Psychologie canadienne, 61(4),

Schimmack, U. (2021). The Implicit Association Test: A method
in search of a construct. Perspectives on Psychological Science, 16(2), 396–414.

Teachman, B. A., Clerkin, E. M., Cunningham, W. A., Dreyer-
Oren, S., & Werntz, A. (2019). Implicit cognition and
psychopathology: Looking back and looking forward.
Annual Review of Clinical Psychology, 15, 123–148.

*Tello, N., Harika-Germaneau, G., Serra, W., Jaafari, N., &
Chatard, A. (2020). Forecasting a fatal decision: Direct
replication of the predictive validity of the Suicide–
Implicit Association Test. Psychological Science, 31(1),

Vianello, M., & Bar-Anan, Y. (2021). Can the Implicit Association
Test measure automatic judgment? The validation continues.
Perspectives on Psychological Science, 16(2), 415–421.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009).
Puzzlingly high correlations in fMRI studies of emotion,
personality, and social cognition. Perspectives on
Psychological Science, 4(3), 274–290.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness
implicit association test as a measure of subjective wellbeing.
Journal of Research in Personality, 42, 490–497.

Yong, E. (2012, October 12). Nobel laureate challenges
psychologists to clean up their act. Nature.

Racial Bias as a Trait

Prejudice is an important topic in psychology that can be examined from various perspectives. Nevertheless, prejudice is typically studied by social psychologists. As a result, research has focused on social cognitive processes that are activated in response to racial stimuli (e.g., pictures of African Americans) and experimental manipulations of the situation (e.g., race of experimenter). Other research has focused on cognitive processes that can lead to the formation of racial bias (e.g., the minimal group paradigm). Sometimes this work has been based on a model of prejudice that assumes racial bias is a common attribute of all people (Devine, 1989) and that individuals only differ in their willingness or ability to act on their racial biases.

An alternative view is that racial biases vary across individuals and are shaped by experiences with out-group members. The most prominent theory is contact theory, which postulates that contact with out-group members reduces racial bias. In social psychology, individual differences in racial biases are typically called attitudes, where attitudes are broad dispositions to respond to a class of attitude objects in a consistent manner. For example, individuals with positive attitudes towards African Americans are more likely to have positive thoughts, feelings, and behaviors in interactions with African Americans.

The notion of attitudes as general dispositions shows that attitudes play the same role in social psychology that traits play in personality psychology. For example, extraversion is a general disposition to have more positive thoughts and feelings and to engage more in social interactions. One important research question in personality psychology concerns the causes of variation in personality. Why are some people more extraverted than others? A related question is how stable personality traits are. If the causes of extraversion are environmental factors, extraversion should change when the environment changes. If the causes of extraversion are within the person (e.g., early childhood experiences, genetic differences), extraversion should be stable. Thus, the stability of personality traits over time is an empirical question that can only be answered in longitudinal studies that measure personality traits repeatedly. A meta-analysis shows that the Big Five personality traits are highly stable over time (Anusic & Schimmack, 2016).

In comparison, the stability of attitudes has received relatively little attention in social psychology because stable individual differences are often neglected in social cognitive models of attitudes. This is unfortunate because the origins of racial bias are important for understanding racial bias and for designing interventions that help individuals to reduce their racial biases.

How stable are racial biases?

The lack of data has not stopped social psychologists from speculating about the stability of racial biases. “It’s not as malleable as mood and not as reliable as a personality trait. It’s in between the two–a blend of both a trait and a state characteristic” (Nosek in Azar, 2008). In 2019, Nosek was less certain about the stability of racial biases. “One is does that mean we have have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much” (The Psychology Podcast, 2019).

Other social psychologists have made stronger claims about the stability of racial bias. Payne argued that racial bias is a state because implicit bias measures show higher internal consistency than retest correlations (Payne, 2017). However, the comparison of internal consistency and retest correlations is problematic because situational factors may simply produce situation-specific measurement errors rather than reflecting real changes in the underlying trait; a problem that is well recognized in personality psychology. To examine this question more thoroughly, it is necessary to obtain multiple retests and decompose the variances into trait, state, and error variances (Anusic & Schimmack, 2016). Even this approach cannot distinguish between state variance and systematic measurement error, which requires multi-method data (Schimmack, 2019).
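The need for multiple retests can be illustrated with a minimal sketch. It assumes a simple trait-state-error model in which the state component follows a first-order autoregressive process. All variance components and the stability value below are hypothetical numbers chosen for illustration, not estimates from any study.

```python
def retest_correlation(trait, state, error, stability, lag):
    """Model-implied retest correlation at a given lag.

    trait, state, error: variance components (hypothetical values).
    stability: autoregressive stability of the state component per lag.
    """
    total = trait + state + error
    return (trait + state * stability ** lag) / total


# Two hypothetical measures that imply the same retest correlation at lag 1.
pure_trait = dict(trait=0.3, state=0.0, error=0.7, stability=0.5)
pure_state = dict(trait=0.0, state=0.6, error=0.4, stability=0.5)

for lag in (1, 2, 3):
    r_trait = retest_correlation(lag=lag, **pure_trait)
    r_state = retest_correlation(lag=lag, **pure_state)
    print(f"lag {lag}: trait model r = {r_trait:.3f}, state model r = {r_state:.3f}")
```

With a single retest, both hypothetical measures show the same correlation of about .30. Only the additional retests separate them: the trait model stays flat across lags, whereas the state model decays toward zero.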

A Longitudinal Multi-Method Study of Racial Bias

A recent article reported the results of an impressive longitudinal study of racial bias with over 3,000 medical students who completed measures of racial bias and inter-group contact three times over a period of six years (first year of medical school, fourth year of medical school, second year of residency) (Onyeador et al., 2019). I used the openly shared data to fit a multi-method state-trait-error model to the data.

The model integrates several theoretical assumptions that are consistent with previous research (Schimmack, 2019). First, the model assumes that explicit ratings of racial bias (feeling thermometer) and implicit measures of racial bias (Implicit Association Test) are complementary measures of individual differences in racial bias. Second, the model assumes that one source of variance in racial bias is a stable trait. Third, the model assumes that racial bias differs across racial groups, in that Black individuals have more favorable attitudes towards Black people than members of other groups. Fourth, the model assumes that contact is negatively correlated with racial bias without making a strong causal assumption about the direction of this relationship. The model also assumes that Black individuals have more contact with Black individuals and that contact partially explains why Black individuals have less racial bias.

The new hypotheses that could be explored with these data concerned the presence of state variance in racial bias. First, state variance should produce correlations between the occasion specific variances of the two methods. That is, after statistically removing trait variance, residual state variance in feeling thermometer scores should be correlated with residual variances in IAT scores. For example, as medical students interact more with Black staff and patients in residency, their racial biases could change and this would produce changes in explicit ratings and in IAT scores. Second, state variance is expected to be somewhat stable over shorter time intervals because environments tend to be stable over shorter time intervals.

The model in Figure 1 met standard criteria of model fit, CFI = .997, RMSEA = .016.

Describing the model from left to right, race (0 = Black, 1 = White) has the expected relationship with quantity of contact (quant1) in year 1 (reflecting everyday interactions with Black individuals) and with the racial bias (att) factor. In addition, more contact is related to less pro-White bias (-.28). The attitude factor is a stronger predictor of the explicit trait factor (.78; ft; White feeling-thermometer – Black feeling-thermometer) than of the implicit trait factor (.60, iat). The influence of the explicit trait factor on measures on the three occasions (.58-.63) suggests that about one-third of the variance in these measures is trait variance. The same is true for individual IATs (.59-.62). The effect of the attitude factor on individual IATs (.60 * .60 = .36; .36^2 = .13) suggests that less than 20% of the variance in an individual IAT reflects racial bias. This estimate is consistent with the results from multi-method studies (Schimmack, 2019). However, these results suggest that the amount of valid trait variance can increase up to 36% by aggregating scores of several IATs. In sum, these results provide first evidence that racial bias is stable over a period of six years and that both explicit ratings and implicit ratings capture trait variance in racial bias.
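The arithmetic behind the estimate for a single IAT can be retraced in a few lines. This is only a sketch of the tracing rule for standardized path coefficients. The value of .60 for the path from the implicit trait factor to a single IAT is an assumed round value within the reported range of .59 to .62, and the variable names are illustrative.

```python
# Standardized path coefficients taken from the description of the model.
attitude_to_iat_trait = 0.60    # attitude factor -> implicit trait factor
iat_trait_to_single_iat = 0.60  # implicit trait factor -> one IAT (assumed value in the .59-.62 range)

# Tracing rule: the effect along a chain of paths is the product of the coefficients.
attitude_to_single_iat = attitude_to_iat_trait * iat_trait_to_single_iat

# Squaring a standardized effect gives the proportion of explained variance.
single_iat_valid_variance = attitude_to_single_iat ** 2

# Aggregating many IATs removes occasion-specific error, so the ceiling is
# set by the path from the attitude factor to the implicit trait factor.
aggregate_valid_variance = attitude_to_iat_trait ** 2

print(f"attitude -> single IAT: {attitude_to_single_iat:.2f}")
print(f"valid variance, single IAT: {single_iat_valid_variance:.2f}")
print(f"valid variance, aggregated IATs: {aggregate_valid_variance:.2f}")
```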

Turning to the bottom part of the model, there is weak evidence to suggest that residual variances (that are not trait variance) in explicit and implicit ratings are correlated. Although the correlation of r = .06 at time 1 is statistically significant, the correlations at time 2 (r = .03) and time 3 (r = .00) are not. This finding suggests that most of the residual variance is method specific measurement error rather than state-variance in racial bias. There is some evidence that the explicit ratings capture more than occasion-specific measurement error because state variance at time 1 predicts state variance at time 2 (r = .25) and from time 2 to time 3 (r = .20). This is not the case for the IAT scores. Finally, contact with Black medical staff at time 2 is a weak, but significant predictor of explicit measures of racial bias at time 2 and time 3, but it does not predict IAT scores at time 2 and 3. These findings do not support the hypothesis that changes in racial bias measures reflect real changes in racial biases.

The results are consistent with the only other multi-method longitudinal study of racial bias that covered only a brief period of three months. In this study, even implicit measures showed no convergent validity for the state (non-trait) variance on the same occasion (Cunningham, Preacher, & Banaji, 2001).


Examining predictors of individual differences in racial bias is important to understand the origins of racial biases and to develop interventions that help individuals to reduce their racial biases. Examining the stability of racial bias in longitudinal studies shows that these biases are stable dispositions and there is little evidence that they change with changing life-experiences. One explanation is that only close contact may be able to shift attitudes and that few people have close relationships with outgroup members. Thus stable environments may contribute to stability in racial bias.

Given the trait-like nature of racial bias, interventions that target attitudes and general dispositions may be relatively ineffective, as Onyeador et al.’s (2019) article suggested. Thus, it may be more effective to target and assess actual behaviors in diversity training. Expecting diversity training to change general dispositions may be misguided and lead to false conclusions about the effectiveness of diversity training programs.

Anti-Black Bias on the IAT predicts Pro-Black Bias in Behavior

Over 20 years ago, Anthony Greenwald and colleagues introduced the Implicit Association Test (IAT) as a measure of individual differences in implicit bias (Greenwald et al., 1998). The assumption underlying the IAT is that individuals can harbour unconscious, automatic, hidden, or implicit racial biases. These implicit biases are distinct from explicit bias. Somebody could be consciously unbiased, while their unconscious is prejudiced. Theoretically, the opposite would also be possible, but taking IAT scores at face value, the unconscious is more prejudiced than conscious reports of attitudes imply. It is also assumed that these implicit attitudes can influence behavior in ways that bypass conscious control of behavior. As a result, implicit bias in attitudes leads to implicit bias in behavior.

The problem with this simple model of implicit bias is that it lacks scientific support. In a recent review of validation studies, I found no scientific evidence that the IAT measures hidden or implicit biases outside of people’s awareness (Schimmack, 2019a). Rather, it seems to be a messy measure of consciously accessible attitudes.

Another contentious issue is the predictive validity of IAT scores. It is commonly implied that IAT scores predict bias in actual behavior. This prediction is so straightforward that the IAT is routinely used in implicit bias training (e.g., at my university) with the assumption that individuals who show bias on the IAT are likely to show anti-Black bias in actual behavior.

Even though the link between IAT scores and actual behavior is crucial for the use of the IAT in implicit bias training, this important question has been examined in relatively few studies and many of these studies had serious methodological limitations (Schimmack, 2019b).

To make things even more confusing, a couple of papers even suggested that White individuals’ unconscious is not always biased against Black people: “An unintentional, robust, and replicable Pro-Black bias in social judgment” (Axt, Ebersole, & Nosek, 2016; Axt, 2017).

I used the open data of these two articles to examine more closely the relationship between scores on the attitude measures (the Brief Implicit Association Test & a direct explicit rating on a 7-point scale) and performance on a task where participants had to accept or reject 60 applicants into an academic honor society. Along with pictures of applicants, participants were provided with information about academic performance. These data were analyzed with signal-detection theory to obtain a measure of bias. Pro-White bias would be reflected in a lower admission standard for White applicants than for Black applicants. However, despite pro-White attitudes, participants showed a pro-Black bias in their admissions to the honor society.

Figure 1 shows the results for the Brief IAT. The blue lines mark the coordinates with 0 scores (no bias) on both tasks. The decreasing red line shows the linear relationship between BIAT scores on the x-axis and bias in admission decisions on the y-axis. The decreasing trend shows that, as expected, respondents with more pro-White bias on the BIAT are less likely to accept Black applicants. However, the picture also shows that participants with no bias on the BIAT have a bias to select more Black than White applicants. Most important, the vertical red line shows the behavior of participants with the average performance on the BIAT. Even though these participants are considered to have a moderate pro-White bias, they show a pro-Black bias in their acceptance rates. Thus, there is no evidence that IAT scores are a predictor of discriminatory behavior. In fact, even the most extreme IAT scores fail to identify participants who discriminate against Black applicants.

A similar picture emerges for the explicit ratings of racial attitudes.

The next analysis examines convergent and predictive validity of the BIAT in a latent variable model (Schimmack, 2019). In this model, the BIAT and the explicit measure are treated as complementary measures of a single attitude for two reasons. First, multi-method studies fail to show that the IAT and explicit measures tap different attitudes (Schimmack, 2019a). Second, it is impossible to model systematic method variance in the BIAT in studies that use only a single implicit measure of attitudes.

The model also includes a group variable that distinguishes the convenience samples in Axt et al.’s studies (2016) and the sample of educators in Axt (2017). The grouping variable is coded with 1 for educators and 0 for the comparison samples.

The model meets standard criteria of model fit, CFI = .996, RMSEA = .002.

Figure 3 shows the y-standardized results so that relationships with the group variable can be interpreted as Cohen’s d effect sizes. The results show a notable difference (d = -.59) in attitudes between the two samples, with less pro-White attitudes for educators. In addition, educators have a small bias to favor Black applicants in their acceptance decisions (d = .19).

The model also shows that racial attitudes influence acceptance decisions with a moderate effect size, r = -.398. Finally, the model shows that the BIAT and the single-item explicit rating have modest validity as measures of racial attitudes, r = .392, .429, respectively. The results for the BIAT are consistent with other estimates that a single IAT has no more than 20% (.392^2 = 15%) valid variance. Thus, the results here are entirely consistent with the view that explicit and implicit measures tap a single attitude and that there is no need to postulate hidden, unconscious attitudes that can have an independent influence on behavior.

Based on their results, Axt et al. (2016) caution readers that the relationship between attitudes and behaviors is more complex than the common narrative of implicit bias assumes.

The authors “suggest that the prevailing emphasis on pro-White biases in judgment and behavior in the existing literature would improve by refining the theoretical understanding of under what conditions behavior favoring dominant or minority groups will occur.” (p. 33).


For two decades, the developers of the IAT have argued that the IAT measures a distinct type of attitudes that reside in individuals’ unconscious and can influence behavior in ways that bypass conscious control. As a result, even individuals who aim to be unbiased might exhibit prejudice in their behavior. Moreover, the finding that the majority of White people show a pro-White bias in their IAT scores was used to explain why discrimination and prejudice persist. This narrative is at the core of implicit bias training.

The problem with this story is that it is not supported by scientific evidence. First, there is no evidence that IAT scores reflect some form of unconscious or implicit bias. Rather, IAT scores seem to tap the same cognitive and affective processes that influence explicit ratings. Second, there is no evidence that processes that influence IAT scores can bypass conscious control of behavior. Third, there is no evidence that a pro-White bias in attitudes automatically produces a pro-White bias in actual behaviors. Not even Freud assumed that unconscious processes would have this effect on behavior. In fact, he postulated that various defense mechanisms may prevent individuals from acting on their undesirable impulses. Thus, the prediction that attitudes are sufficient to predict behavior is too simplistic.

Axt et al. (2016) speculate that “bias correction can occur automatically and without awareness” (p. 32). While this is an intriguing hypothesis, there is little evidence for such smart automatic control processes. This model also implies that it is impossible to predict actual behaviors from attitudes because correction processes can alter the influence of attitudes on behavior. This implies that only studies of actual behavior can reveal the ability of IAT scores to predict actual behavior. For example, only studies of actual behavior can demonstrate whether police officers with pro-White IAT scores show racial bias in the use of force. The problem is that 20 years of IAT research have uncovered no robust evidence that IAT scores actually predict important real-world behaviors (Schimmack, 2019b).

In conclusion, the results of Axt’s studies suggest that the use of the IAT in implicit bias training needs to be reconsidered. Test scores are not only highly variable and often provide false information about individuals’ attitudes; they also do not predict actual discriminatory behavior. It is wrong to assume that individuals who show a pro-White bias on the IAT are bound to act on these attitudes and discriminate against Black people or other minorities. Therefore, the focus on attitudes in implicit bias training may be misguided. It may be more productive to focus on factors that do influence actual behaviors and to provide individuals with clear guidelines that help them to act in accordance with these norms. The belief that this is not sufficient is based on an unsupported model of unconscious forces that can bypass awareness.

This conclusion is not totally new. In 2008, Blanton criticized the use of the IAT in applied settings (IAT: Fad or fabulous?)

“There’s not a single study showing that above and below that cutoff people differ in any way based on that score,” says Blanton.

And Brian Nosek agreed.

Guilty as charged, says the University of Virginia’s Brian Nosek, PhD, an IAT developer.

However, this admission of guilt has not changed behavior. Nosek and other IAT proponents continue to support Project Implicit, which has provided millions of visitors with false information about their attitudes or mental health issues based on a test with poor psychometric properties. A true admission of guilt would be to stop this unscientific and unethical practice.


Axt, J.R. (2017). An unintentional pro-Black bias in judgement among educators. British Journal of Educational Psychology, 87, 408-421.

Axt, J.R., Ebersole, C.R. & Nosek, B.A. (2016). An unintentional, robust, and replicable pro-Black bias in social judgment. Social Cognition, 34, 1-39.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Schimmack, U. (2019a). The Implicit Association Test: A Method in Search of a Construct. Perspectives on Psychological Science.

Schimmack, U. (2019b). The race IAT: A Case Study of The Validity Crisis in Psychology.

Open Communication about the invalidity of the race IAT

In the old days, most scientific communication occurred behind closed doors, when reviewers provided anonymous peer reviews that determined the fate of manuscripts. In the old days, rejected manuscripts could not contribute to scientific communication because nobody would know about them.

All of this has changed with the birth of open science. Now authors can share manuscripts on pre-print servers and researchers can discuss merits of these manuscripts on social media. The benefit of this open scientific communication is that more people can join in and contribute to the communication.

Yoav Bar-Anan co-authored an article with Brian Nosek titled “Scientific Utopia: I. Opening Scientific Communication.” In this spirit of openness, I would like to have an open scientific communication with Yoav and his co-author Michelangelo Vianello about their 2018 article “A Multi-Method Multi-Trait Test of the Dual-Attitude Perspective.”

I have criticized their model in an in-press article in Perspectives on Psychological Science (Schimmack, 2019). In a commentary, Yoav and Michelangelo argue that their model is “compatible with the logic of an MTMM investigation” (Campbell & Fiske, 1959). They argue that it is important to have multiple traits to identify method variance in a matrix with multiple measures of multiple traits. They then propose that I lost the ability to identify method variance by examining one attitude (i.e., race, self-esteem, political orientation) at a time. They also point out that I did not include all measures, that I included the Modern Racism Scale as an indicator of political orientation, and that I did not provide a reason for these choices. While this is true, Yoav and Michelangelo had access to the data and could have tested whether these choices made any difference. They do not. This is obvious for the Modern Racism Scale, which can be eliminated from the measurement model without any changes in the overall model.

To cut to the chase, the main source of disagreement is the modelling of method variance in the multi-trait-multi-method data set. The issue is clear when we examine the original model published in Bar-Anan and Vianello (2018).

In this model, method variance in IATs and related tasks like the Brief IAT is modelled with the INDIRECT METHOD factor. The model assumes that all of the method variance that is present in implicit measures is shared across attitude domains and across all implicit measures. The only way for this model to allow for different amounts of method variance in different implicit measures is by assigning different loadings to the various methods. Moreover, the loadings provide information about the nature of the shared variance and the amount of method variance in the various methods. Although this is valuable and important information, the authors never discuss this information and its implications.

Many of these loadings are very small. For example, the loadings of the race IAT and the brief race IAT are .11 and .02. In other words, the correlation between these two measures is inflated by .11 * .02 = .0022 points. This means that the correlation of r = .52 between these two measures is r = .5178 after we remove the influence of method variance.
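The size of this inflation follows from the usual rule that two indicators of the same method factor share variance equal to the product of their loadings on that factor. A minimal sketch of the calculation, using the values reported above (the variable names are illustrative):

```python
# Loadings on the shared method factor, as reported in the text.
method_loading_race_iat = 0.11
method_loading_brief_race_iat = 0.02

# Correlation between the two measures, as reported in the text.
observed_r = 0.52

# Contribution of the shared method factor to the correlation:
# the product of the two standardized loadings.
method_contribution = method_loading_race_iat * method_loading_brief_race_iat

# Correlation that remains after removing the method contribution.
r_without_method = observed_r - method_contribution

print(f"method contribution: {method_contribution:.4f}")
print(f"correlation without method variance: {r_without_method:.4f}")
```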

It makes absolutely no sense to accuse me of separating the models, when there is no evidence of implicit method variance that is shared across attitudes. The remaining parameter estimates are not affected if a factor with low loadings is removed from a model.

Here I show that examining one attitude at a time produces exactly the same results as the full model. I focus on the most controversial IAT: the race IAT. After all, there is general agreement that there is little evidence of discriminant validity for political orientation (r = .91, in the Figure above), and there is little evidence for any validity in the self-esteem IAT based on several other investigations of this topic with a multi-method approach (Bosson et al., 2000; Falk et al., 2015).

Model 1 is based on Yoav and Michelangelo’s model that assumes that there is practically no method variance in IAT-variants. Thus, we can fit a simple dual-attitude model to the data. In this model, contact is regressed onto implicit and explicit attitude factors to see the unique contribution of the two factors without making causal assumptions. The model has acceptable fit, CFI = .952, RMSEA = .013.

The correlation between the two factors is .66, while it is r = .69 in the full model in Figure 1. The loading of the race IAT on the implicit factor is .66, while it is .62 in the full model in Figure 1. Thus, as expected based on the low loadings on the IMPLICIT METHOD factor, the results are no different when the model is fitted only to the measure of racial attitudes.

Model 2 makes the assumption that IAT-variants share method variance. Adding the method factor to the model increased model fit, CFI = .973, RMSEA = .010. As the models are nested, it is also possible to compare model fit with a chi-square test. With five degrees of freedom difference, chi-square changed from 167.19 to 112.32. Thus, the model comparison favours the model with a method factor.
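The nested-model comparison can be retraced with a short sketch. It assumes that the two reported chi-square values are ordinary (unscaled) maximum-likelihood statistics, so that their simple difference is itself chi-square distributed. With a robust estimator, a scaled difference test would be needed instead. The helper `chi2_sf` is a hypothetical stand-in written for this example, not a function from the original scripts.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    # Odd df: two-sided normal tail plus a finite correction series.
    z = math.sqrt(x)
    tail = math.erfc(z / math.sqrt(2))
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, series = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return tail + 2 * density * series


# Values reported in the text.
chi2_without_method_factor = 167.19
chi2_with_method_factor = 112.32
df_difference = 5

delta_chi2 = chi2_without_method_factor - chi2_with_method_factor
p_value = chi2_sf(delta_chi2, df_difference)

print(f"chi-square difference = {delta_chi2:.2f}, df = {df_difference}, p = {p_value:.2e}")
```

Under these assumptions, the chi-square difference of 54.87 with five degrees of freedom is far beyond the conventional .05 critical value of about 11.07.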

The main difference between the models is that the evidence is less supportive of a dual-attitude model and that the amount of valid variance in the race IAT decreases from .66^2 = 43% to .47^2 = 22%.

In sum, the 2018 article made strong claims about the race IAT. These claims were based on a model that implied that there is no systematic measurement error in IAT scores. I showed that this assumption is false and that a model with a method factor for IATs and IAT-variants fits the data better than a model without such a factor. It also makes no theoretical sense to postulate that there is no systematic method variance in IATs, when several previous studies have demonstrated that attitudes are only one source of variance in IAT scores (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007).

How is it possible that the race IAT and other IATs are widely used in psychological research and on public websites to provide individuals with false feedback about their hidden attitudes without any evidence of its validity as an individual difference measure of hidden attitudes that influence behaviour outside of awareness?

The answer is that most of these studies assumed that the IAT is valid rather than testing its validity. Another reason is that psychological research is focused on providing evidence that confirms theories rather than subjecting theories to empirical tests that they may fail. Finally, psychologists ignore effect sizes. As a result, the finding that IAT scores have incremental predictive validity of less than 4% variance in a criterion is celebrated as evidence for the validity of IATs, but even this small estimate is based on underpowered studies and may shrink in replication studies (cf. Kurdi et al., 2019).

It is understandable that proponents of the IAT respond with defiant defensiveness to my critique of the IAT. However, I am not the first to question the validity of the IAT; earlier criticisms were simply ignored. At least Banaji and Greenwald recognized in 2013 that they do “not have the luxury of believing that what appears true and valid now will always appear so” (p. xv). It is time to face the facts. It may be painful to accept that the IAT is not what it was promised to be 21 years ago, but that is what the current evidence suggests. There is nothing wrong with my models and their interpretation, and it is time to tell visitors of the Project Implicit website that they should not attach any meaning to their IAT scores. A more productive way to counter my criticism of the IAT would be to conduct a proper validation study with multiple methods and validation criteria that are predicted to be uniquely related to IAT scores in a preregistered study.


Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

Klauer, K. C., Voss, A., Schmitz, F., & Teige-Mocigemba, S. (2007). Process components of the Implicit Association Test: A diffusion-model analysis. Journal of Personality and Social Psychology, 93, 353–368.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586.

The Implicit Association Test: A Measure in Search of a Construct (in press, PoPS)

Here is a link to the manuscript, data, and MPLUS scripts for reproducibility.


Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition.  This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim.  Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit constructs. I examine discriminant validity in several multi-method studies and find no or weak evidence for discriminant validity. I also show that validity of the IAT as a measure of attitudes varies across constructs. Validity of the self-esteem IAT is low, but estimates vary across studies.  About 20% of the variance in the race IAT reflects racial preferences. The highest validity is obtained for measuring political orientation with the IAT (64% valid variance).  Most of this valid variance stems from a distinction between individuals with opposing attitudes, while reaction times contribute less than 10% of variance in the prediction of explicit attitude measures.  In all domains, explicit measures are more valid than the IAT, but the IAT can be used as a measure of sensitive attitudes to reduce measurement error by using a multi-method measurement model.

Keywords:  Personality, Individual Differences, Social Cognition, Measurement, Construct Validity, Convergent Validity, Discriminant Validity, Structural Equation Modeling


Despite its popularity, relatively little is known about the construct validity of the IAT.

As Cronbach (1989) pointed out, construct validation is better examined by independent experts than by authors of a test because “colleagues are especially able to refine the interpretation, as they compensate for blind spots and capitalize on their own distinctive experience” (p. 163).

It is of utmost importance to determine how much of the variance in IAT scores is valid variance and how much of the variance is due to measurement error, especially when IAT scores are used to provide individualized feedback.

There is also no consensus in the literature whether the IAT measures something different from explicit measures.

In conclusion, while there is general consensus to make a distinction between explicit measures and implicit measures, it is not clear what the IAT measures.

To complicate matters further, the validity of the IAT may vary across attitude objects. After all the IAT is a method, just like Likert scales are a method, and it is impossible to say that a method is valid (Cronbach, 1971).

At present, relatively little is known about the contribution of these three parameters to observed correlations in hundreds of mono-method studies.

A Critical Review of Greenwald et al.’s (1998) Original Article

In conclusion, the seminal IAT article introduced the IAT as a measure of implicit constructs that cannot be measured with explicit measures, but it did not really test this dual-attitude model.

Construct Validity in 2007

In conclusion, the 2007 review of construct validity revealed major psychometric challenges for the construct validity of the IAT, which explains why some researchers have concluded that the IAT cannot be used to measure individual differences (Payne et al., 2017). It also revealed that most studies were mono-method studies that could not examine convergent and discriminant validity.

Cunningham, Preacher and Banaji (2001)

Another noteworthy finding is that a single factor accounted for correlations among all measures on the same occasion and across measurement occasions. This finding shows that there were no true changes in racial attitudes over the course of this two-month study.  This finding is important because Cunningham et al.’s (2001) study is often cited as evidence that implicit attitudes are highly unstable and malleable (e.g., Payne et al., 2017). This interpretation is based on the failure to distinguish random measurement error and true change in the construct that is being measured (Anusic & Schimmack, 2016).  While Cunningham et al.’s (2001) results suggest that the IAT is a highly unreliable measure, the results also suggest that the racial attitudes that are measured with the race IAT are highly stable over periods of weeks or months. 
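The difference between random measurement error and true change can be illustrated with a small simulation. This is only a sketch under stated assumptions: the attitude is perfectly stable across two occasions, and the proportion of reliable variance per occasion (`reliable_share = 0.30`) is a hypothetical value, not an estimate from Cunningham et al. (2001). All names are illustrative.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

random.seed(1)
n_people = 20000
reliable_share = 0.30  # hypothetical proportion of reliable variance per occasion

# The true attitude is identical at both occasions (no true change).
true_attitude = [random.gauss(0, 1) for _ in range(n_people)]

def measure(true_scores):
    """Observed score = weighted true score + fresh random error."""
    w_true = math.sqrt(reliable_share)
    w_error = math.sqrt(1 - reliable_share)
    return [w_true * t + w_error * random.gauss(0, 1) for t in true_scores]

time1 = measure(true_attitude)
time2 = measure(true_attitude)

retest_r = pearson_r(time1, time2)
print(f"observed retest correlation: {retest_r:.2f}")
print(f"stability after correcting for unreliability: {retest_r / reliable_share:.2f}")
```

Under these assumptions the observed retest correlation is low even though nobody's attitude changed, which is why a low retest correlation alone cannot be read as evidence of malleability.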

Bar-Anan & Vianello, 2018

This large study of construct validity also provides little evidence for the original claim that the IAT measures a new construct that cannot be measured with explicit measures, and confirms the estimate from Cunningham et al. (2001) that about 20% of the variance in IAT scores reflects variance in racial attitudes.

Greenwald et al. (2009)

“When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p=.001,” and when political conservativism was also included in the model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (Greenwald et al., 2009, p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. I contacted Anthony Greenwald, who provided the raw data, but I was unable to recreate the sample size of N = 1,057. Instead I obtained a similar sample size of N = 1,035.  Performing the analysis on this sample also produced non-significant results (IAT: b = -.003, se = .044, t = .070, p = .944; AMP: b = -.014, se = .042, t = 0.344, p = .731).  Thus, there is no evidence for incremental predictive validity in this study.

Axt (2018)

With N = 540,723 respondents, sampling error is very small, σ = .002, and parameter estimates can be interpreted as true scores in the population of Project Implicit visitors. A comparison of the factor loadings shows that explicit ratings are more valid than IAT scores. The factor loading of the race IAT on the attitude factor once more suggests that about 20% of the variance in IAT scores reflects racial attitudes.

Falk, Heine, Zhang, and Hsu (2015)

Most important, the self-esteem IAT and the other implicit measures have low and non-significant loadings on the self-esteem factor. 

Bar-Anan & Vianello (2018)

Thus, low validity contributes considerably to low observed correlations between IAT scores and explicit self-esteem measures.

Bar-Anan & Vianello (2018) – Political Orientation

More important, the factor loading of the IAT on the implicit factor is much higher than for self-esteem or racial attitudes, suggesting over 50% of the variance in political orientation IAT scores is valid variance, π = .79, σ = .016. The loading of the self-report on the explicit ratings was also higher, π = .90, σ = .010.

Variation of Implicit – Explicit Correlations Across Domains

This suggests that the IAT is good at classifying individuals into opposing groups, but it has low validity as a measure of individual differences in the strength of attitudes.

What Do IATs Measure?

The present results suggest that measurement error alone is often sufficient to explain these low correlations.  Thus, there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures. 
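How measurement error alone can produce low implicit-explicit correlations can be sketched with the classical attenuation formula: the expected observed correlation is the latent correlation multiplied by the square root of the product of the two measures' proportions of valid variance. The 20% figure for the race IAT comes from the text; the 50% figure for the explicit measure and the function name are assumptions made only for illustration.

```python
import math

def expected_observed_r(latent_r, valid_var_a, valid_var_b):
    """Observed correlation implied by a latent correlation and the
    proportion of valid variance in each measure (classical attenuation)."""
    return latent_r * math.sqrt(valid_var_a * valid_var_b)

iat_valid = 0.20       # from the text: about 20% valid variance in the race IAT
explicit_valid = 0.50  # hypothetical value for a self-report measure

for latent_r in (1.0, 0.8, 0.5):
    r = expected_observed_r(latent_r, iat_valid, explicit_valid)
    print(f"latent r = {latent_r:.1f} -> expected observed r = {r:.2f}")
```

With these inputs, even two measures of exactly the same attitude (latent r = 1) would be expected to correlate only about .32, so a low observed correlation does not by itself show that two different constructs are being measured.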

For 21 years the lack of discriminant validity has been overlooked because psychologists often fail to take measurement error into account and do not clearly distinguish between measures and constructs.

In the future, researchers need to be more careful when they make claims about constructs based on a single measure like the IAT because measurement error can produce misleading results.

Researchers should avoid terms like implicit attitude or implicit preferences that make claims about constructs simply because attitudes were measured with an implicit measure.

Recently, Greenwald and Banaji (2017) also expressed concerns about their earlier assumption that IAT scores reflect unconscious processes. “Even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning, they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 862).

How Well Does the IAT Measure What it Measures?

Studies with the IAT can be divided into applied studies (A-studies) and basic studies (B-studies).  B-studies employ the IAT to study basic psychological processes.  In contrast, A-studies use the IAT as a measure of individual differences. Whereas B-studies contribute to the understanding of the IAT, A-studies require that IAT scores have construct validity.  Thus, B-studies should provide quantitative information about the psychometric properties for researchers who are conducting A-studies. Unfortunately, 21 years of B-studies have failed to do so. For example, after an exhaustive review of the IAT literature, de Houwer et al. (2009) conclude that “IAT effects are reliable enough to be used as a measure of individual differences” (p. 363).  This conclusion is not helpful for the use of the IAT in A-studies because (a) no quantitative information about reliability is given, and (b) reliability is necessary but not sufficient for validity.  Height can be measured reliably, but it is not a valid measure of happiness. 

This article provides the first quantitative information about validity of three IATs.  The evidence suggests that the self-esteem IAT has no clear evidence of construct validity (Falk et al., 2015).  The race-IAT has about 20% valid variance and even less valid variance in studies that focus on attitudes of members from a single group.  The political orientation IAT has over 40% valid variance, but most of this variance is explained by group-differences and overlaps with explicit measures of political orientation.  Although validity of the IAT needs to be examined on a case by case basis, the results suggest that the IAT has limited utility as a measurement method in A-studies.  It is either invalid or the construct can be measured more easily with direct ratings.

Implications for the Use of IAT scores in Personality Assessment

I suggest replacing the reliability coefficient with the validity coefficient. For example, if we assume that 20% of the variance in scores on the race IAT is valid variance, the 95% CI for IAT scores from Project Implicit (Axt, 2018), using the D-scoring method, with a mean of .30 and a standard deviation of .46 ranges from -.51 to 1.11. Thus, participants who score at the mean level could have an extreme pro-White bias (Cohen’s d = 1.11/.46 = 2.41), but also an extreme pro-Black bias (Cohen’s d = -.51/.46 = -1.10). Thus, it seems problematic to provide individuals with feedback that their IAT score may reveal something about their attitudes that is more valid than their beliefs.
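The interval reported above can be reproduced with a few lines of code. This is a minimal sketch that assumes the interval was obtained from the classical standard error of measurement, SD × sqrt(1 − reliability), with the proportion of valid variance substituted for reliability; the function name is illustrative.

```python
import math

def validity_based_interval(score, sd, valid_variance, z=1.96):
    """Return (lower, upper) bounds around an observed score.

    Assumption: the classical standard error of measurement,
    SD * sqrt(1 - reliability), is applied with the proportion of
    valid variance in place of the reliability coefficient.
    """
    sem = sd * math.sqrt(1.0 - valid_variance)
    return score - z * sem, score + z * sem

# Values from the text: mean D-score .30, SD .46, 20% valid variance.
lower, upper = validity_based_interval(0.30, 0.46, 0.20)
print(round(lower, 2), round(upper, 2))                # bounds in D-score units
print(round(lower / 0.46, 2), round(upper / 0.46, 2))  # bounds in SD units
```

Raising `valid_variance` shows how quickly the interval narrows, which makes the cost of low validity for individual feedback concrete.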


Social psychologists have always distrusted self-report, especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes and other constructs with indirect methods. The IAT was a major breakthrough because it has relatively high reliability compared to other methods. Thus, creating the IAT was a major achievement that should not be underestimated just because the IAT lacks construct validity as a measure of implicit constructs. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes (Greenwald & Banaji, 1995). Implicit measures were based on this work and it seemed reasonable to assume that they might provide a window into the unconscious (Banaji & Greenwald, 2013). However, the processes that are involved in the measurement of attitudes with implicit measures are not the personality characteristics that are being measured. There is nothing implicit about being a Republican or Democrat, gay or straight, or having low self-esteem. Conflating implicit processes in the measurement of attitudes with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of attitudes with varying validity. It is not a window into people’s unconscious feelings, cognitions, or attitudes.

The Validation Crisis in Psychology

Most published psychological measures are unvalid.  (subtitle)
*unvalid = the validity of the measure is un-known.

This blog post served as a first draft for a manuscript that is currently under review at Meta-Psychology. You can find the latest version here (pdf).


Eight years ago, psychologists started to realize that they have a replication crisis. Many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true or false.

The replication crisis is sometimes attributed to the lack of replication studies before 2011. However, this is not the case. Most published results were replicated successfully. However, these successes were entirely predictable from the fact that only successful replications would be published (Sterling, 1959). These sham replication studies provided illusory evidence for theories that have been discredited over the past eight years by credible replication studies.

New initiatives that are called open science are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow.

This blog post addresses another problem in psychological science. I call it the validation crisis. Replicability is only one necessary feature of a healthy science. Another necessary feature of a healthy science is the use of valid measures. This feature of a healthy science is as obvious as the need for replicability. To test theories that relate theoretical constructs to each other (e.g., construct A influences construct B for individuals drawn from population P under conditions C), it is necessary to have valid measures of constructs. However, it is unclear which criteria a measure has to fulfill to have construct validity. Thus, even successful and replicable tests of a theory may be false because the measures that were used lacked construct validity.

Construct Validity

The classic article on “Construct Validity” was written by two giants in psychology, Cronbach and Meehl (1955). Every graduate student of psychology and surely every psychologist who has published a psychological measure should be familiar with this article.

The article was the result of an APA task force that tried to establish criteria, now called psychometric properties, for tests to be published. The result of this project was the creation of the construct “Construct validity.”

The chief innovation in the Committee’s report was the term construct validity. (p. 281).

Cronbach and Meehl provide their own definition of this construct.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” (p. 282).

In modern language, construct validity is the relationship between variation in observed test scores and a latent variable that reflects corresponding variation in a theoretical construct (Schimmack, 2010).

Thinking about construct validity in this way makes it immediately obvious why it is much easier to demonstrate predictive validity, which is the relationship between observed test scores and observed criterion scores, than to establish construct validity, which is the relationship between observed test scores and a latent, unobserved variable. To demonstrate predictive validity, one can simply obtain scores on a measure and a criterion and compute the correlation between the two variables. The correlation coefficient shows the amount of predictive validity of the measure. However, because constructs are not observable, it is impossible to use simple correlations to examine construct validity.

The problem of construct validation can be illustrated with the development of IQ scores. IQ scores can have predictive validity (e.g., performance in graduate school) without making any claims about the construct that is being measured (IQ tests measure whatever they measure and what they measure predicts important outcomes). However, IQ tests are often treated as measures of intelligence. For IQ tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed IQ scores are related to unobserved variation in intelligence. Thus, construct validation requires clear definitions of constructs that are independent of the measure that is being validated. Without a clear definition of constructs, the meaning of a measure reverts essentially to “whatever the measure is measuring,” as in the old saying “Intelligence is whatever IQ tests are measuring.” This saying shows the problem of research with measures that have no clear construct and no construct validity.

In conclusion, the challenge in construct validation research is to relate a specific measure to a well-defined construct and to establish that variation in test scores is related to variation in the construct.

What are Constructs

Construct validation starts with an assumption. Individuals are assumed to have an attribute; today we might say a personality trait. Personality traits are typically not directly observable (e.g., kindness rather than height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). The first step is to develop a measure of this attribute (e.g., a self-report measure “How kind are you?”). If the test is valid, variation in the observed scores on the measure should be related to the personality trait.

A construct is some postulated attribute of people, assumed to be reflected in test performance (p. 283).

The term “reflected” is consistent with a latent variable model, where unobserved traits are reflected in observable indicators. In fact, Cronbach and Meehl argue that factor analysis (not principal component analysis!) provides very important information for construct validity.

We depart from Anastasi at two points. She writes, “The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration.” We, however, regard such analysis as a most important type of validation. (p. 286).

Factor analysis is useful because factors are unobserved variables and factor loadings show how strongly an observed measure is related to variation in an unobserved variable, the factor. If multiple measures of a construct are available, they should be positively correlated with each other and factor analysis will extract a common factor. For example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (Schimmack, 2010).

In conclusion, factor analysis provides useful information about construct validity of measures because factors represent the construct and factor loadings show how strongly an observed measure is related to the construct.

It is clear that factors here function as constructs (p. 287).

Convergent Validity

The term convergent validity was introduced a few years later in another seminal article on validation research by Campbell and Fiske (1959). However, the basic idea of convergent validity was specified by Cronbach and Meehl (1955) in the section “Correlation matrices and factor analysis.”

If two tests are presumed to measure the same construct, a correlation between them is predicted (p. 287).

If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated (p. 288).

Cronbach and Meehl realize the problem of using just two observed measures to examine convergent validity. For example, self-informant correlations are often used in personality psychology to demonstrate validity of self-ratings. However, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. The correlation could reflect very high validity of self-ratings and modest validity of informant ratings or the opposite could be true.

If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. (p. 300)

A multi-method approach avoids this problem and factor loadings on a common factor can be interpreted as validity coefficients. More valid measures should have higher loadings than less valid measures. Factor analysis requires a minimum of three observed variables, but more is better. Thus, construct validation requires a multi-method assessment.
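Why two measures are not enough, and why three are, can be shown with a little arithmetic. In a one-factor model with uncorrelated residuals, the correlation between two measures equals the product of their loadings, so a single correlation of .40 is consistent with loadings of .8 and .5, of .5 and .8, or any other pair with that product. With three measures, each loading is determined by the three correlations. The sketch below uses hypothetical correlations chosen for illustration; the function name is also an assumption.

```python
import math

def loadings_from_three(r12, r13, r23):
    """Solve a one-factor model for three measures with uncorrelated residuals.

    Under that model r_ij = loading_i * loading_j, so each loading follows
    from the three observed correlations. All correlations must be positive.
    """
    l1 = math.sqrt(r12 * r13 / r23)
    l2 = math.sqrt(r12 * r23 / r13)
    l3 = math.sqrt(r13 * r23 / r12)
    return l1, l2, l3

# Hypothetical correlations among a self-rating (1) and two informants (2, 3).
l1, l2, l3 = loadings_from_three(r12=0.40, r13=0.48, r23=0.30)
for name, loading in (("self", l1), ("informant A", l2), ("informant B", l3)):
    print(f"{name}: loading = {loading:.2f}, valid variance = {loading ** 2:.0%}")
```

The squared loading is the proportion of variance in a measure that is attributable to the common factor. The solution is only as trustworthy as the assumption that the three measures share nothing but the construct.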

Discriminant Validity

The term discriminant validity was also introduced later by Campbell and Fiske (1959). However, Cronbach and Meehl already point out that high or low correlations can support construct validity. Crucial for construct validity is that the correlations are consistent with theoretical expectations.

For example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. In contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem solving skills and intelligence is an ability to solve problems faster or better.

Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity (p. 288).

Quantifying Construct Validity

It is rare to see quantitative claims about construct validity. Most articles that claim construct validity of a measure simply state that the measure has demonstrated construct validity, as if a test is either valid or invalid. However, the previous discussion already made it clear that construct validity is a quantitative construct because construct validity is the relation between variation in a measure and variation in the construct, and this relation can vary. If we use standardized coefficients like factor loadings to assess the construct validity of a measure, construct validity can range from -1 to 1.

Contrary to the current practices, Cronbach and Meehl assumed that most users of measures would be interested in a “construct validity coefficient.”

There is an understandable tendency to seek a “construct validity coefficient.” A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis (p. 289).

Cronbach and Meehl are well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because the factor may not correspond perfectly with the construct.

Rarely will it be possible to estimate definite “construct saturations,” because no factor corresponding closely to the construct will be available (p. 289).

And nobody today seems to remember Cronbach and Meehl’s (1955) warning that rejection of the null hypothesis that the test has zero validity is not the end goal of validation research.

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (p. 290).

The problem is not to conclude that the test “is valid” for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have (p. 290).

One reason why psychologists may not follow this sensible advice is that estimates of construct validity for many tests are likely to be low (Schimmack, 2010).

The Nomological Net – A Structural Equation Model

Some readers may be familiar with the term “nomological net” that was popularized by Cronbach and Meehl. In modern language a nomological net is essentially a structural equation model.

The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These “laws” may be statistical or deterministic.

It is probably no accident that at the same time as Cronbach and Meehl started to think about constructs as separate from observed measures, structural equation modeling was developed as a combination of factor analysis, which made it possible to relate observed variables to variation in unobserved constructs, and path analysis, which made it possible to relate variation in constructs to each other. Although laws in a nomological network can take on more complex forms than linear relationships, a structural equation model is a nomological network (but a nomological network is not necessarily a structural equation model).

As proper construct validation requires a multi-method approach and demonstration of convergent and discriminant validity, SEM is ideally suited to examine whether the observed correlations among measures in a multi-trait-multi-method matrix are consistent with theoretical expectations. In this regard, SEM is superior to factor analysis. For example, it is possible to model shared method variance, which is impossible with factor analysis.
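The role of shared method variance can be illustrated without fitting a model. The sketch below does not estimate an SEM; it only computes the correlations implied by a model in which each measure loads on a trait factor and, for two of the three measures, on an orthogonal method factor. It then shows what a one-factor solution would recover if the method factor were ignored. All loadings and names are hypothetical.

```python
import math

def implied_r(trait_loadings, method_loadings, i, j):
    """Model-implied correlation between measures i and j when both a
    trait factor and an orthogonal method factor contribute."""
    return (trait_loadings[i] * trait_loadings[j]
            + method_loadings[i] * method_loadings[j])

# Hypothetical standardized loadings for three measures of one trait.
# Measures 0 and 1 share a method (e.g., both self-reports); measure 2 does not.
trait_loadings = [0.6, 0.6, 0.6]
method_loadings = [0.5, 0.5, 0.0]

r01 = implied_r(trait_loadings, method_loadings, 0, 1)
r02 = implied_r(trait_loadings, method_loadings, 0, 2)
r12 = implied_r(trait_loadings, method_loadings, 1, 2)
print(f"r01 = {r01:.2f}, r02 = {r02:.2f}, r12 = {r12:.2f}")

# Loadings a one-factor model would recover if method variance were ignored.
naive_same_method = math.sqrt(r01 * r02 / r12)
naive_other_method = math.sqrt(r02 * r12 / r01)
print(f"naive loading, same-method measure: {naive_same_method:.2f} (true 0.60)")
print(f"naive loading, other-method measure: {naive_other_method:.2f} (true 0.60)")
```

In this example, ignoring the method factor overstates the validity of the measures that share a method and understates the validity of the measure that does not, even though all three are equally valid by construction.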

Cronbach and Meehl also realize that constructs can change as more information becomes available. It may also occur that the data fail to provide evidence for a construct. In this sense, construct validation is an ongoing process of improved understanding of unobserved constructs and how they are related to observable measures.

Ideally this iterative process would start with a simple structural equation model that is fitted to some data. If the model does not fit, the model can be modified and tested with new data. Over time, the model would become more complex and more stable because core measures of constructs would establish the construct validity, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed.

When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network (p. 290).

Too often psychologists use SEM only to confirm an assumed nomological network and it is often considered inappropriate to change a nomological network to fit observed data. However, SEM is as much testing of an existing construct as exploration of a new construct.

The example from the natural sciences was the initial definition of gold as having a golden color. However, later it was discovered that the pure metal gold is actually silver or white and that the typical yellow color comes from copper impurities. In the same way, scientific constructs of intelligence can change depending on the data that are observed. For example, the original theory may assume that intelligence is a unidimensional construct (g), but empirical data could show that intelligence is multi-faceted with specific intelligences for specific domains.

However, given the lack of construct validation research in psychology, psychology has seen little progress in the understanding of basic constructs such as extraversion, self-esteem, or wellbeing. Often these constructs are still assessed with measures that were originally proposed as measures of these constructs, as if divine intervention led to the creation of the best measure of these constructs and future research only confirmed their superiority.

Instead, many claims about construct validity are based on conjecture rather than empirical support by means of nomological networks. This was true in 1955. Unfortunately, it is still true over 50 years later.

For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences (p. 291).

Given the difficulty of defining constructs and finding measures for them, even measures that show promise in the beginning might fail to demonstrate construct validity later and new measures should show higher construct validity than the early measures. However, psychology shows no development in measures of the same construct. The most widely used measure of self-esteem is still Rosenberg’s scale from 1965 and the most widely used measure of wellbeing is still Diener et al.’s scale from 1985. It is not clear how psychology can make progress, if it doesn’t make progress in the development of nomological networks that provide information about constructs and about the construct validity of measures.

Cronbach and Meehl are clear that nomological networks are needed to claim construct validity.

To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist (p. 291).

However, there are few attempts to examine construct validity with structural equation models (Connelly & Ones, 2010; Zou, Schimmack, & Gere, 2013). [please share more if you know some]

One possible reason is that construct validation research may reveal that authors’ initial constructs need to be modified or that their measures have modest validity. For example, McCrae, Zonderman, Costa, Bond, and Paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of Big Five measures because it failed to support their conception of the Big Five as orthogonal dimensions with simple structure.

Recommendations for Users of Psychological Measures

The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test (p. 296).

It is very unlikely that all hunches by psychologists lead to the discovery of useful constructs and development of valid tests of these constructs. Given the lack of knowledge about the mind, it is rather more likely that many constructs turn out to be non-existent and that measures have low construct validity.

However, the history of psychological measurement has only seen development of more and more constructs and more and more measures to measure this increasing universe of constructs. Since the 1990s, constructs have doubled because every construct has been split into an explicit and an implicit version of the construct. Presumably, there is even implicit political orientation or gender identity.

The proliferation of constructs and measures is not a sign of a healthy science. Rather it shows the inability of empirical studies to demonstrate that a measure is not valid or that a construct may not exist. This is mostly due to self-serving biases and motivated reasoning of test developers. The gains from a measure that is widely used are immense. Thus, weak evidence is used to claim that a measure is valid and consumers are complicit because they can use these measures to make new discoveries. Even when evidence shows that a measure may not work as intended (e.g., Bosson et al., 2000), it is often ignored (Greenwald & Farnham, 2001).


Just as psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures. Some of the measures that are currently used in psychology are likely to have insufficient construct validity. If this was the decade of replication, the 2020s may become the decade of validation, and maybe the 2030s may produce the first replicable studies with valid measures. Maybe this is overly optimistic, given the lack of improvement in validation research since Cronbach and Meehl (1955) outlined a program of construct validation research. Ample citations show that they were successful in introducing the term, but they failed in establishing rigorous criteria of construct validity. The time to change this is now.