Category Archives: Implicit Association Test

Racial Bias as a Trait

Prejudice is an important topic in psychology that can be examined from various perspectives. Nevertheless, prejudice is typically studied by social psychologists. As a result, research has focused on social cognitive processes that are activated in response to racial stimuli (e.g., pictures of African Americans) and experimental manipulations of the situation (e.g., race of experimenter). Other research has focused on cognitive processes that can lead to the formation of racial bias (e.g., the minimal group paradigm). Sometimes this work has been based on a model of prejudice that assumes racial bias is a common attribute of all people (Devine, 1989) and that individuals only differ in their willingness or ability to act on their racial biases.

An alternative view is that racial biases vary across individuals and are shaped by experiences with out-group members. The most prominent theory is contact theory, which postulates that contact with out-group members reduces racial bias. In social psychology, individual differences in racial biases are typically called attitudes, where attitudes are broad dispositions to respond to a class of attitude objects in a consistent manner. For example, individuals with positive attitudes towards African Americans are more likely to have positive thoughts, feelings, and behaviors in interactions with African Americans.

The notion of attitudes as general dispositions shows that attitudes play the same role in social psychology that traits play in personality psychology. For example, extraversion is a general disposition to have more positive thoughts and feelings and to engage more in social interactions. One important research question in personality psychology concerns the causes of variation in personality. Why are some people more extraverted than others? A related question is how stable personality traits are. If the causes of extraversion are environmental factors, extraversion should change when the environment changes. If the causes of extraversion are within the person (e.g., early childhood experiences, genetic differences), extraversion should be stable. Thus, the stability of personality traits over time is an empirical question that can only be answered in longitudinal studies that measure personality traits repeatedly. A meta-analysis shows that the Big Five personality traits are highly stable over time (Anusic & Schimmack, 2016).

In comparison, the stability of attitudes has received relatively little attention in social psychology because stable individual differences are often neglected in social cognitive models of attitudes. This is unfortunate because knowing the origins of racial bias is important for understanding racial bias and for designing interventions that help individuals to reduce their racial biases.

How stable are racial biases?

The lack of data has not stopped social psychologists from speculating about the stability of racial biases. “It’s not as malleable as mood and not as reliable as a personality trait. It’s in between the two–a blend of both a trait and a state characteristic” (Nosek in Azar, 2008). In 2019, Nosek was less certain about the stability of racial biases. “One is does that mean we have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much” (The Psychology Podcast, 2019).

Other social psychologists have made stronger claims about the stability of racial bias. Payne argued that racial bias is a state because implicit bias measures show higher internal consistency than retest correlations (Payne, 2017). However, the comparison of internal consistency and retest correlations is problematic because situational factors may simply produce situation-specific measurement errors rather than reflecting real changes in the underlying trait; a problem that is well recognized in personality psychology. To examine this question more thoroughly, it is necessary to obtain multiple retests and decompose the variances into trait, state, and error variances (Anusic & Schimmack, 2016). Even this approach cannot distinguish between state variance and systematic measurement error, which requires multi-method data (Schimmack, 2019).
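Why multiple retests are needed can be illustrated with a minimal sketch. It assumes a simplified trait-state-error decomposition in which standardized scores are the sum of a perfectly stable trait, an autoregressive state, and occasion-specific error. The function name and all numbers are hypothetical and are not estimates from any study.

```python
def expected_retest_r(trait, state, state_stability, lag):
    """Expected retest correlation for a standardized measure.

    trait, state: proportions of total variance (error = 1 - trait - state).
    state_stability: assumed autoregressive stability of the state per wave.
    lag: number of waves between the two measurements (>= 1).
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    if trait < 0 or state < 0 or trait + state > 1:
        raise ValueError("variance proportions must be non-negative and sum to at most 1")
    # The trait contributes fully at every lag; the state contribution decays.
    return trait + state * state_stability ** lag


# Two hypothetical measures with the same lag-1 retest correlation of .40
mostly_trait = dict(trait=0.40, state=0.00, state_stability=0.0)
mostly_state = dict(trait=0.10, state=0.50, state_stability=0.6)

for lag in (1, 2, 3):
    r_trait = expected_retest_r(lag=lag, **mostly_trait)
    r_state = expected_retest_r(lag=lag, **mostly_state)
    print(f"lag {lag}: trait-dominated r = {r_trait:.2f}, state-dominated r = {r_state:.2f}")
```

With a single retest, the two hypothetical measures cannot be told apart. They only diverge at longer lags, which is why at least three waves are needed to separate trait variance from state variance.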

A Longitudinal Multi-Method Study of Racial Bias

A recent article reported the results of an impressive longitudinal study of racial bias with over 3,000 medical students who completed measures of racial bias and inter-group contact three times over a period of six years (first year of medical school, fourth year of medical school, second year of residency) (Onyeador et al., 2019). I used the openly shared data to fit a multi-method state-trait-error model to the data (https://osf.io/78cqx/).

The model integrates several theoretical assumptions that are consistent with previous research (Schimmack, 2019). First, the model assumes that explicit ratings of racial bias (feeling thermometer) and implicit measures of racial bias (Implicit Association Test) are complementary measures of individual differences in racial bias. Second, the model assumes that one source of variance in racial bias is a stable trait. Third, the model assumes that racial bias differs across racial groups, in that Black individuals have more favorable attitudes towards Black people than members of other groups. Fourth, the model assumes that contact is negatively correlated with racial bias without making a strong causal assumption about the direction of this relationship. The model also assumes that Black individuals have more contact with Black individuals and that contact partially explains why Black individuals have less racial bias.

The new hypotheses that could be explored with these data concerned the presence of state variance in racial bias. First, state variance should produce correlations between the occasion specific variances of the two methods. That is, after statistically removing trait variance, residual state variance in feeling thermometer scores should be correlated with residual variances in IAT scores. For example, as medical students interact more with Black staff and patients in residency, their racial biases could change and this would produce changes in explicit ratings and in IAT scores. Second, state variance is expected to be somewhat stable over shorter time intervals because environments tend to be stable over shorter time intervals.

The model in Figure 1 met standard criteria of model fit, CFI = .997, RMSEA = .016.

Describing the model from left to right, race (0 = Black, 1 = White) has the expected relationship with quantity of contact (quant1) in year 1 (reflecting everyday interactions with Black individuals) and with the racial bias (att) factor. In addition, more contact is related to less pro-White bias (-.28). The attitude factor is a stronger predictor of the explicit trait factor (.78; ft; White feeling-thermometer – Black feeling-thermometer) than of the implicit trait factor (.60, iat). The influence of the explicit trait factor on measures on the three occasions (.58-.63) suggests that about one-third of the variance in these measures is trait variance. The same is true for individual IATs (.59-.62). The effect of the attitude factor on individual IATs (.60 * .60 = .36; .36^2 = .13) suggests that less than 20% of the variance in an individual IAT reflects racial bias. This estimate is consistent with the results from multi-method studies (Schimmack, 2019). However, these results suggest that the amount of valid trait variance can increase up to 36% by aggregating scores of several IATs. In sum, these results provide first evidence that racial bias is stable over a period of six years and that both explicit ratings and implicit ratings capture trait variance in racial bias.
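The arithmetic in this paragraph follows the usual tracing rule for standardized path models: the effect of a higher-order factor on an indicator is the product of the loadings along the path, and squaring a standardized coefficient gives the proportion of variance explained. A minimal sketch using the rounded coefficients reported above (the variable names are mine, and .60 stands in for the .59-.62 range of IAT loadings):

```python
# Rounded standardized coefficients reported in the text
att_to_iat_trait = 0.60   # attitude factor -> implicit (IAT) trait factor
iat_trait_to_iat = 0.60   # IAT trait factor -> a single IAT (reported range .59-.62)

# Tracing rule: multiply loadings along the path, then square for variance
att_to_single_iat = att_to_iat_trait * iat_trait_to_iat
valid_variance_single_iat = att_to_single_iat ** 2

# Upper bound for an aggregate of many IATs: the aggregate approaches the IAT
# trait factor itself, so only the first loading limits the valid variance
valid_variance_aggregate = att_to_iat_trait ** 2

# Share of trait variance in single explicit ratings (loadings .58-.63)
ft_trait_share = [round(loading ** 2, 2) for loading in (0.58, 0.63)]

print(f"attitude -> single IAT: {att_to_single_iat:.2f}")
print(f"valid variance, single IAT: {valid_variance_single_iat:.2f}")
print(f"valid variance, aggregated IATs (upper bound): {valid_variance_aggregate:.2f}")
print(f"trait share in feeling-thermometer ratings: {ft_trait_share}")
```

The upper bound for aggregated IATs reflects my reading of the model: averaging over occasions removes occasion-specific variance but cannot remove the method-specific part of the IAT trait factor.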

Turning to the bottom part of the model, there is weak evidence to suggest that residual variances (that are not trait variance) in explicit and implicit ratings are correlated. Although the correlation of r = .06 at time 1 is statistically significant, the correlations at time 2 (r = .03) and time 3 (r = .00) are not. This finding suggests that most of the residual variance is method specific measurement error rather than state-variance in racial bias. There is some evidence that the explicit ratings capture more than occasion-specific measurement error because state variance at time 1 predicts state variance at time 2 (r = .25) and from time 2 to time 3 (r = .20). This is not the case for the IAT scores. Finally, contact with Black medical staff at time 2 is a weak, but significant predictor of explicit measures of racial bias at time 2 and time 3, but it does not predict IAT scores at time 2 and 3. These findings do not support the hypothesis that changes in racial bias measures reflect real changes in racial biases.

The results are consistent with the only other multi-method longitudinal study of racial bias, which covered only a brief period of three months. In this study, even implicit measures showed no convergent validity for the state (non-trait) variance on the same occasion (Cunningham, Preacher, & Banaji, 2001).

Conclusion

Examining predictors of individual differences in racial bias is important to understand the origins of racial biases and to develop interventions that help individuals to reduce their racial biases. Examining the stability of racial bias in longitudinal studies shows that these biases are stable dispositions and there is little evidence that they change with changing life-experiences. One explanation is that only close contact may be able to shift attitudes and that few people have close relationships with outgroup members. Thus stable environments may contribute to stability in racial bias.

Given the trait-like nature of racial bias, interventions that target attitudes and general dispositions may be relatively ineffective, as Onyeador et al.’s (2019) article suggested. Thus, it may be more effective to target and assess actual behaviors in diversity training. Expecting diversity training to change general dispositions may be misguided and lead to false conclusions about the effectiveness of diversity training programs.

Anti-Black Bias on the IAT predicts Pro-Black Bias in Behavior

Over 20 years ago, Anthony Greenwald and colleagues introduced the Implicit Association Test (IAT) as a measure of individual differences in implicit bias (Greenwald et al., 1998). The assumption underlying the IAT is that individuals can harbour unconscious, automatic, hidden, or implicit racial biases. These implicit biases are distinct from explicit bias. Somebody could be consciously unbiased, while their unconscious is prejudiced. Theoretically, the opposite would also be possible, but taking IAT scores at face value, the unconscious is more prejudiced than conscious reports of attitudes imply. It is also assumed that these implicit attitudes can influence behavior in ways that bypass conscious control of behavior. As a result, implicit bias in attitudes leads to implicit bias in behavior.

The problem with this simple model of implicit bias is that it lacks scientific support. In a recent review of validation studies, I found no scientific evidence that the IAT measures hidden or implicit biases outside of people’s awareness (Schimmack, 2019a). Rather, it seems to be a messy measure of consciously accessible attitudes.

Another contentious issue is the predictive validity of IAT scores. It is commonly implied that IAT scores predict bias in actual behavior. This prediction is so straightforward that the IAT is routinely used in implicit bias training (e.g., at my university) with the assumption that individuals who show bias on the IAT are likely to show anti-Black bias in actual behavior.

Even though the link between IAT scores and actual behavior is crucial for the use of the IAT in implicit bias training, this important question has been examined in relatively few studies, and many of these studies had serious methodological limitations (Schimmack, 2019b).

To make things even more confusing, a couple of papers even suggested that White individuals’ unconscious is not always biased against Black people: “An unintentional, robust, and replicable Pro-Black bias in social judgment” (Axt, Ebersole, & Nosek, 2016; Axt, 2017).

I used the open data of these two articles to examine more closely the relationship between scores on the attitude measures (the Brief Implicit Association Test & a direct explicit rating on a 7-point scale) and performance on a task where participants had to accept or reject 60 applicants into an academic honor society. Along with pictures of applicants, participants were provided with information about academic performance. These data were analyzed with signal-detection theory to obtain a measure of bias. Pro-White bias would be reflected in a lower admission standard for White applicants than for Black applicants. However, despite pro-White attitudes, participants showed a pro-Black bias in their admissions to the honor society.
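For readers unfamiliar with signal-detection scoring, the following is a minimal sketch of how a criterion (response bias) can be computed separately for White and Black applicants. It uses the textbook formulas d' = z(H) - z(FA) and c = -(z(H) + z(FA)) / 2. The counts are hypothetical, the 0.5 correction for extreme rates is one common convention, and the sketch is not meant to reproduce the exact scoring used by Axt et al.

```python
from statistics import NormalDist

_z = NormalDist().inv_cdf  # inverse of the standard normal CDF


def sdt_indices(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) from counts of accept/reject decisions.

    'Hit' = accepting a qualified applicant; 'false alarm' = accepting a
    less qualified applicant. Adding 0.5 to each cell keeps the rates away
    from 0 and 1 so that the z-transform stays finite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = _z(hit_rate) - _z(fa_rate)
    criterion = -0.5 * (_z(hit_rate) + _z(fa_rate))
    return d_prime, criterion


# Hypothetical participant: 15 qualified and 15 less qualified applicants per group
_, c_white = sdt_indices(hits=10, misses=5, false_alarms=4, correct_rejections=11)
_, c_black = sdt_indices(hits=12, misses=3, false_alarms=6, correct_rejections=9)

# A higher criterion means a stricter admission standard.
# Positive difference = stricter standard for White applicants = pro-Black bias.
bias = c_white - c_black
print(f"criterion White = {c_white:.2f}, Black = {c_black:.2f}, difference = {bias:.2f}")
```

Scoring the criterion separately from sensitivity (d') keeps a participant's general leniency or skill at judging qualifications from being mistaken for a racial bias in admission standards.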

Figure 1 shows the results for the Brief IAT. The blue lines mark the coordinates with 0 scores (no bias) on both tasks. The decreasing red line shows the linear relationship between BIAT scores on the x-axis and bias in admission decisions on the y-axis. The decreasing trend shows that, as expected, respondents with more pro-White bias on the BIAT are less likely to accept Black applicants. However, the picture also shows that participants with no bias on the BIAT have a bias to select more Black than White applicants. Most important, the vertical red line shows the behavior of participants with the average performance on the BIAT. Even though these participants are considered to have a moderate pro-White bias, they show a pro-Black bias in their acceptance rates. Thus, there is no evidence that IAT scores are a predictor of discriminatory behavior. In fact, even the most extreme IAT scores fail to identify participants who discriminate against Black applicants.

A similar picture emerges for the explicit ratings of racial attitudes.

The next analysis examines the convergent and predictive validity of the BIAT in a latent variable model (Schimmack, 2019). In this model, the BIAT and the explicit measure are treated as complementary measures of a single attitude for two reasons. First, multi-method studies fail to show that the IAT and explicit measures tap different attitudes (Schimmack, 2019a). Second, it is impossible to model systematic method variance in the BIAT in studies that use only a single implicit measure of attitudes.

The model also includes a group variable that distinguishes the convenience samples in Axt et al.’s studies (2016) and the sample of educators in Axt (2017). The grouping variable is coded with 1 for educators and 0 for the comparison samples.

The model meets standard criteria of model fit, CFI = .996, RMSEA = .002.

Figure 3 shows the y-standardized results so that relationships with the group variable can be interpreted as Cohen’s d effect sizes. The results show a notable difference (d = -.59) in attitudes between the two samples, with less pro-White attitudes for educators. In addition, educators have a small bias to favor Black applicants in their acceptance decisions (d = .19).

The model also shows that racial attitudes influence acceptance decisions with a moderate effect size, r = -.398. Finally, the model shows that the BIAT and the single-item explicit rating have modest validity as measures of racial attitudes, r = .392, .429, respectively. The results for the BIAT are consistent with other estimates that a single IAT has no more than 20% (.392^2 = 15%) valid variance. Thus, the results here are entirely consistent with the view that explicit and implicit measures tap a single attitude and that there is no need to postulate hidden, unconscious attitudes that can have an independent influence on behavior.

Based on their results, Axt et al. (2016) caution readers that the relationship between attitudes and behaviors is more complex than the common narrative of implicit bias assumes.

The authors “suggest that the prevailing emphasis on pro-White biases in judgment and behavior in the existing literature would improve by refining the theoretical understanding of under what conditions behavior favoring dominant or minority groups will occur.” (p. 33).

Implications

For two decades, the developers of the IAT have argued that the IAT measures a distinct type of attitudes that reside in individuals’ unconscious and can influence behavior in ways that bypass conscious control. As a result, even individuals who aim to be unbiased might exhibit prejudice in their behavior. Moreover, the finding that the majority of White people show a pro-White bias in their IAT scores was used to explain why discrimination and prejudice persist. This narrative is at the core of implicit bias training.

The problem with this story is that it is not supported by scientific evidence. First, there is no evidence that IAT scores reflect some form of unconscious or implicit bias. Rather, IAT scores seem to tap the same cognitive and affective processes that influence explicit ratings. Second, there is no evidence that processes that influence IAT scores can bypass conscious control of behavior. Third, there is no evidence that a pro-White bias in attitudes automatically produces a pro-White bias in actual behaviors. Not even Freud assumed that unconscious processes would have this effect on behavior. In fact, he postulated that various defense mechanisms may prevent individuals from acting on their undesirable impulses. Thus, the prediction that attitudes are sufficient to predict behavior is too simplistic.

Axt et al. (2016) speculate that “bias correction can occur automatically and without awareness” (p. 32). While this is an intriguing hypothesis, there is little evidence for such smart automatic control processes. This model also implies that it is impossible to predict actual behaviors from attitudes because correction processes can alter the influence of attitudes on behavior. This implies that only studies of actual behavior can reveal the ability of IAT scores to predict actual behavior. For example, only studies of actual behavior can demonstrate whether police officers with pro-White IAT scores show racial bias in the use of force. The problem is that 20 years of IAT research have uncovered no robust evidence that IAT scores actually predict important real-world behaviors (Schimmack, 2019b).

In conclusion, the results of Axt’s studies suggest that the use of the IAT in implicit bias training needs to be reconsidered. Not only are test scores highly variable and often provide false information about individuals’ attitudes; they also do not predict actual behavior of discrimination. It is wrong to assume that individuals who show a pro-White bias on the IAT are bound to act on these attitudes and discriminate against Black people or other minorities. Therefore, the focus on attitudes in implicit bias training may be misguided. It may be more productive to focus on factors that do influence actual behaviors and to provide individuals with clear guidelines that help them to act in accordance with these norms. The belief that this is not sufficient is based on an unsupported model of unconscious forces that can bypass awareness.

This conclusion is not totally new. In 2008, Blanton criticized the use of the IAT in applied settings (IAT: Fad or fabulous?)

“There’s not a single study showing that above and below that cutoff people differ in any way based on that score,” says Blanton.

And Brian Nosek agreed.

Guilty as charged, says the University of Virginia’s Brian Nosek, PhD, an IAT developer.

However, this admission of guilt has not changed behavior. Nosek and other IAT proponents continue to support Project Implicit that provided millions of visitors with false information about their attitudes or mental health issues based on a test with poor psychometric properties. A true admission of guilt would be to stop this unscientific and unethical practice.

References

Axt, J.R. (2017). An unintentional pro-Black bias in judgement among educators. British Journal of Educational Psychology, 87, 408-421.

Axt, J.R., Ebersole, C.R., & Nosek, B.A. (2016). An unintentional, robust, and replicable pro-Black bias in social judgment. Social Cognition, 34, 1-39.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Schimmack, U. (2019). The Implicit Association Test: A Method in Search of a Construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Schimmack, U. (2019). The race IAT: A Case Study of The Validity Crisis in Psychology.
https://replicationindex.com/2019/02/06/the-race-iat-a-case-study-of-the-validity-crisis-in-psychology/

Open Communication about the invalidity of the race IAT

In the old days, most scientific communication occurred behind closed doors, when reviewers provided anonymous peer reviews that determined the fate of manuscripts. In the old days, rejected manuscripts could not contribute to scientific communication because nobody would know about them.

All of this has changed with the birth of open science. Now authors can share manuscripts on pre-print servers and researchers can discuss merits of these manuscripts on social media. The benefit of this open scientific communication is that more people can join in and contribute to the communication.

Yoav Bar-Anan co-authored an article with Brian Nosek titled “Scientific Utopia: I. Opening Scientific Communication.” In this spirit of openness, I would like to have an open scientific communication with Yoav and his co-author Michelangelo Vianello about their 2018 article “A Multi-Method Multi-Trait Test of the Dual-Attitude Perspective.”

I have criticized their model in an in-press article in Perspectives on Psychological Science (Schimmack, 2019). In a commentary, Yoav and Michelangelo argue that their model is “compatible with the logic of an MTMM investigation (Campbell & Fiske, 1959).” They argue that it is important to have multiple traits to identify method variance in a matrix with multiple measures of multiple traits. They then propose that I lost the ability to identify method variance by examining one attitude (i.e., race, self-esteem, political orientation) at a time. They also point out that I did not include all measures, that I included the Modern Racism Scale as an indicator of political orientation, and that I did not provide a reason for these choices. While this is true, Yoav and Michelangelo had access to the data and could have tested whether these choices made any difference. They do not make a difference. This is obvious for the Modern Racism Scale, which can be eliminated from the measurement model without any changes in the overall model.

To cut to the chase, the main source of disagreement is the modelling of method variance in the multi-trait-multi-method data set. The issue is clear when we examine the original model published in Bar-Anan and Vianello (2018).

In this model, method variance in IATs and related tasks like the Brief IAT is modelled with the INDIRECT METHOD factor. The model assumes that all of the method variance that is present in implicit measures is shared across attitude domains and across all implicit measures. The only way for this model to allow for different amounts of method variance in different implicit measures is by assigning different loadings to the various methods. Moreover, the loadings provide information about the nature of the shared variance and the amount of method variance in the various methods. Although this is valuable and important information, the authors never discuss this information and its implications.

Many of these loadings are very small. For example, the loadings of the race IAT and the brief race IAT are .11 and .02. In other words, the correlation between these two measures is inflated by .11 * .02 = .0022 points. This means that the correlation of r = .52 between these two measures is r = .5178 after we remove the influence of method variance.
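The size of this correction follows from the tracing rule: in a standardized model, the part of a correlation between two indicators that is due to a shared method factor equals the product of their loadings on that factor. A short check with the values quoted above (the variable names are mine):

```python
# Values quoted in the text
loading_race_iat = 0.11    # race IAT on the shared method factor
loading_race_biat = 0.02   # brief race IAT on the shared method factor
observed_r = 0.52          # correlation between the two measures

# Correlation implied by the shared method factor alone
method_share = loading_race_iat * loading_race_biat

# Correlation that remains once the method factor's contribution is removed
r_without_method = observed_r - method_share

print(f"method contribution: {method_share:.4f}")
print(f"correlation without method variance: {r_without_method:.4f}")
```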

It makes absolutely no sense to accuse me of separating the models, when there is no evidence of implicit method variance that is shared across attitudes. The remaining parameter estimates are not affected if a factor with low loadings is removed from a model.

Here I show that examining one attitude at a time produces essentially the same results as the full model. I focus on the most controversial IAT: the race IAT. After all, there is general agreement that there is little evidence of discriminant validity for political orientation (r = .91, in the Figure above), and there is little evidence for any validity in the self-esteem IAT based on several other investigations of this topic with a multi-method approach (Bosson et al., 2000; Falk et al., 2015).

Model 1 is based on Yoav and Michelangelo’s model that assumes that there is practically no method variance in IAT-variants. Thus, we can fit a simple dual-attitude model to the data. In this model, contact is regressed onto implicit and explicit attitude factors to see the unique contribution of the two factors without making causal assumptions. The model has acceptable fit, CFI = .952, RMSEA = .013.

The correlation between the two factors is .66, while it is r = .69 in the full model in Figure 1. The loading of the race IAT on the implicit factor is .66, while it is .62 in the full model in Figure 1. Thus, as expected based on the low loadings on the IMPLICIT METHOD factor, the results are no different when the model is fitted only to the measure of racial attitudes.

Model 2 makes the assumption that IAT-variants share method variance. Adding the method factor to the model increased model fit, CFI = .973, RMSEA = .010. As the models are nested, it is also possible to compare model fit with a chi-square test. With five degrees of freedom difference, chi-square changed from 167.19 to 112.32. Thus, the model comparison favours the model with a method factor.
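The chi-square difference test can be checked by hand. The sketch below uses only the two chi-square values and the five degrees of freedom reported above. The helper function is mine and relies on a closed-form expression for the upper-tail probability that holds specifically for 5 degrees of freedom, so it is not a general-purpose implementation.

```python
import math


def chi2_sf_df5(x):
    """Upper-tail probability of a chi-square variate with 5 degrees of freedom.

    Uses the closed form that exists for odd degrees of freedom; it is
    specific to df = 5.
    """
    if x <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(x / 2))
    series = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return tail + series


chi2_without_method = 167.19  # Model 1, no method factor
chi2_with_method = 112.32     # Model 2, method factor added
df_difference = 5

delta_chi2 = chi2_without_method - chi2_with_method
p_value = chi2_sf_df5(delta_chi2)

print(f"chi-square difference = {delta_chi2:.2f} on {df_difference} df")
print(f"p = {p_value:.2e}")
```

A difference of this size on five degrees of freedom is far beyond the conventional .05 critical value of about 11.07, so the improvement in fit is unlikely to be a chance result.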

The main difference between the models is that the evidence is less supportive of a dual-attitude model and that the amount of valid variance in the race IAT decreases from .66^2 = 43% to .47^2 = 22%.

In sum, the 2018 article made strong claims about the race IAT. These claims were based on a model that implied that there is no systematic measurement error in IAT scores. I showed that this assumption is false and that a model with a method factor for IATs and IAT-variants fits the data better than a model without such a factor. It also makes no theoretical sense to postulate that there is no systematic method variance in IATs, when several previous studies have demonstrated that attitudes are only one source of variance in IAT scores (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007).

How is it possible that the race IAT and other IATs are widely used in psychological research and on public websites to provide individuals with false feedback about their hidden attitudes without any evidence of its validity as an individual difference measure of hidden attitudes that influence behaviour outside of awareness?

The answer is that most of these studies assumed that the IAT is valid rather than testing its validity. Another reason is that psychological research is focused on providing evidence that confirms theories rather than subjecting theories to empirical tests that they may fail. Finally, psychologists ignore effect sizes. As a result, the finding that IAT scores have incremental predictive validity of less than 4% variance in a criterion is celebrated as evidence for the validity of IATs, but even this small estimate is based on underpowered studies and may shrink in replication studies (cf. Kurdi et al., 2019).

It is understandable that proponents of the IAT respond with defiant defensiveness to my critique of the IAT. However, I am not the first to question the validity of the IAT; earlier criticisms were simply ignored. At least Banaji and Greenwald recognized in 2013 that they do “not have the luxury of believing that what appears true and valid now will always appear so” (p. xv). It is time to face the facts. It may be painful to accept that the IAT is not what it was promised to be 21 years ago, but that is what the current evidence suggests. There is nothing wrong with my models and their interpretation, and it is time to tell visitors of the Project Implicit website that they should not attach any meaning to their IAT scores. A more productive way to counter my criticism of the IAT would be to conduct a proper validation study with multiple methods and validation criteria that are predicted to be uniquely related to IAT scores in a preregistered study.

References

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences. Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

Klauer, K. C., Voss, A., Schmitz, F., & Teige-Mocigemba, S. (2007). Process components of the Implicit Association Test: A diffusion-model analysis. Journal of Personality and Social Psychology, 93, 353–368.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586.

Brian Nosek explains the IAT

I spent 20 minutes, actually more than 20 minutes because I had to rewind to transcribe, listening to a recent podcast in which Brian Nosek was asked some questions about the IAT and implicit bias training (The Psychology Podcast, August 1, 2019).

Scott Barry Kaufman: How do you see the IAT now and how did you see it when you started work on Project Implicit? How discrepant are these states of mind?

Brian Nosek: I hope I have learned a lot from all the research that we have done on it over the years. In the big picture I have the same view that I have had since we did the first set of studies. It is a great tool for research purposes and we have been able to learn a lot about the tool itself and about human behavior and interaction with the tool and a lot about the psychology of things that are [gap] occur with less control AND less awareness than just asking people how they feel about topics. So that has been and continues to be a very productive research area for trying to understand better how humans work.

And then the main concern that we had at onset and that is actually a lot of the discussion of even creating the website is the same anticipated some of the concerns and overuses that happened with the IAT in the present and that is the natural – I don’t know if natural is the right word – the common desire that people have for simple solutions and thinking well a measure is a direct indicator of something that we care about and it shouldn’t have any error in measurement and it should be applicable to lots and lots of situations. And thus lots of potential of misuse of the IAT despite it being a very productive research tool and education tool. I like the experience of doing it and delivering to an audience and the discussion it provokes; what is it that it means, what does it mean about me, what does it mean about the world; those are really productive intellectual discussions and debates. But the risk part the overapplication of the IAT for selection processes. We should use this. We should [?] use this for deciding who gets a job or not; we should [?] use this who is on a jury or not. Those are the kind of real-world applications of it as a measure that go far beyond its validity. And so this isn’t exact answering your question because even at the very beginning when we launched the website we said explicitly it should not be used for these purposes and I still believe this to be true. What has changed over time is the refinement of where it is we understand the evidence base against some of the major questions. And what is amazing about it is that there has been so much research and we still don’t have a great handle on really big questions relating to the IAT and measures like it. So this is just part of [unclear] how hard it is to actually make progress in the study of human behavior.

Scott Barry Kaufman:  Let’s talk shop for a second [my translation; enough with the BS]. My dissertation at Yale a couple of year after years was looking at the question are there individual differences in implicit cognition.  And the idea was to ask this question because from a trait perspective I felt that was a huge gap in the literature. There was so much research on the reliability and validity of IQ tests for instance, but I wanted to ask the question if we adapt some of these implicit cognition measures from the social psychological experimental literature for an individual differences paradigm you know are they reliable and stable differences. And I have a whole appendix of failed experiments – by the way, you should tell how to publish that some day but we’ll get to that in a second, but so much of my dissertation, I am putting failed in quotes because you know I mean that was useful information … it was virtually impossible to capture reliable individual differences that cohered over time but I did find one that did and I published that as a serial reaction time task, but anyway, before we completely lose my audience which is a general audience I just want to say that I am trying to link this because for me one of the things that I am most wary about with the IAT is like – and this might be more of a feature than a bug – but it may be capturing at this given moment in time when a person is taking the test it is capturing a lot of the societal norms and influences are on that person’s associations but not capturing so much an intrinsic sort of stable individual differences variable. So I just wanted to throw that out and see what your current thoughts on that are.

Brian Nosek: Yeah, it is clear that it is not trait-like in the same way that a measure like the Big Five for personality is trait-like. It does show stability over time, but much more weakly than that. Across a variety of topics you might see a test-retest correlation for the IAT measuring the same construct of about .5. The curiosity for this is; I guess it is a few curiosities. One is does that mean we have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much. And there isn’t yet a real good insight on where the prediction components of the IAT are and how it anticipates behavior, right. If we could separate in a real reliable way the trait part, the state part, and the error part, then we should be able to uniquely predict different type of things between the trait, the state, and the trait components. Another twist which is very interesting that is totally understudied in my view is the variations in which it is state or trait like seems to vary by the topic you are investigating. When you do a Democrat – Republican IAT, to what extent do people favor one over the other, the correlation with self-report is very strong and the stability over time is stronger than when you measure Black-White or some of the other types of topics. So there is also something about the attitude construct itself that you are assessing that is not as much measurement based but that is interacting with the measure that is anticipating the extent to which it is trait or state like. So these are all interesting things that if I had time to study them would be the problems I would be studying, but I had to leave that aside.

Scott Barry Kaufman: You touch on a really interesting point about this. How would you measure the outcome of this two-day or week- training thing? It seems that would not be a very good thing to then go back to the IAT and see a difference between the IAT, IAT pre and IAT-post, doesn’t seem like the best outcome you know you’d want, I mean you ….

Brian Nosek: I mean you could just change the IAT and that would be the end of it. But, of course, if that doesn’t actually shift behavior then what was the point?

Scott Barry Kaufman: To what extent are we making advances in demonstrating that there are these implicit influences on explicit behavior that are outside of our value system? Where are we at right now?

[Uli, coughs, Bargh, elderly priming]

Brian Nosek: Yeah, that is a good question. I cannot really comment on the micro-aggression literature. I don’t follow that as a distinct literature, but on the general point I think it is the big picture story is pretty clear with evidence which is we do things with automaticity, we do things that are counterproductive to our interests all the time, and sometimes we recognize we are doing it, sometimes we don’t, but a lot of time it is not controllable.  But that is a very big picture, very global, very non-specific point.

If you want to find out what 21 years of research on the IAT have shown, you can read my paper (Schimmack, in press, PoPS). In short,

  • Most of the variance in the race IAT (Black-White) is random and systematic measurement error.
  • Up to a quarter of the variance reflects racial attitudes that are also reflected in self-report measures of racial attitudes, most clearly in direct ratings of feelings towards Blacks and Whites.
  • There is little evidence that any of the variance in IAT scores reflects implicit attitudes that are outside of people’s awareness.
  • There is no reliable evidence that IAT scores predict discriminatory behavior in the real world.
  • Visitors to Project Implicit are given invalid feedback that they may hold unconscious biases and are not properly informed about the poor psychometric properties of the test.
  • Founders of Project Implicit have not disclosed how much money they make from speaking engagements related to Project Implicit or from royalties for the book “Blindspot,” and they do not declare conflicts of interest in IAT-related publications.
  • It is not without irony that educators on implicit bias may fail to realize that they have an implicit bias in reading the literature and in dismissing criticism.

The Implicit Association Test: A Measure in Search of a Construct (in press, PoPS)

Here is a link to the manuscript, data, and MPLUS scripts for reproducibility. https://osf.io/mu7e6/

ABSTRACT

Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition.  This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim.  Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit constructs. I examine discriminant validity in several multi-method studies and find no or weak evidence for discriminant validity. I also show that validity of the IAT as a measure of attitudes varies across constructs. Validity of the self-esteem IAT is low, but estimates vary across studies.  About 20% of the variance in the race IAT reflects racial preferences. The highest validity is obtained for measuring political orientation with the IAT (64% valid variance).  Most of this valid variance stems from a distinction between individuals with opposing attitudes, while reaction times contribute less than 10% of variance in the prediction of explicit attitude measures.  In all domains, explicit measures are more valid than the IAT, but the IAT can be used as a measure of sensitive attitudes to reduce measurement error by using a multi-method measurement model.

Keywords:  Personality, Individual Differences, Social Cognition, Measurement, Construct Validity, Convergent Validity, Discriminant Validity, Structural Equation Modeling

HIGHLIGHTS

Despite its popularity, relatively little is known about the construct validity of the IAT.

As Cronbach (1989) pointed out, construct validation is better examined by independent experts than by authors of a test because “colleagues are especially able to refine the interpretation, as they compensate for blind spots and capitalize on their own distinctive experience” (p. 163).

It is of utmost importance to determine how much of the variance in IAT scores is valid variance and how much of the variance is due to measurement error, especially when IAT scores are used to provide individualized feedback.

There is also no consensus in the literature whether the IAT measures something different from explicit measures.

In conclusion, while there is general consensus to make a distinction between explicit measures and implicit measures, it is not clear what the IAT measures.

To complicate matters further, the validity of the IAT may vary across attitude objects. After all, the IAT is a method, just like Likert scales are a method, and it is impossible to say that a method is valid (Cronbach, 1971).

At present, relatively little is known about the contribution of these three parameters to observed correlations in hundreds of mono-method studies.

A Critical Review of Greenwald et al.’s (1998) Original Article

In conclusion, the seminal IAT article introduced the IAT as a measure of implicit constructs that cannot be measured with explicit measures, but it did not really test this dual-attitude model.

Construct Validity in 2007

In conclusion, the 2007 review of construct validity revealed major psychometric challenges for the construct validity of the IAT, which explains why some researchers have concluded that the IAT cannot be used to measure individual differences (Payne et al., 2017). It also revealed that most studies were mono-method studies that could not examine convergent and discriminant validity.

Cunningham, Preacher and Banaji (2001)

Another noteworthy finding is that a single factor accounted for correlations among all measures on the same occasion and across measurement occasions. This finding shows that there were no true changes in racial attitudes over the course of this two-month study.  This finding is important because Cunningham et al.’s (2001) study is often cited as evidence that implicit attitudes are highly unstable and malleable (e.g., Payne et al., 2017). This interpretation is based on the failure to distinguish random measurement error and true change in the construct that is being measured (Anusic & Schimmack, 2016).  While Cunningham et al.’s (2001) results suggest that the IAT is a highly unreliable measure, the results also suggest that the racial attitudes that are measured with the race IAT are highly stable over periods of weeks or months. 

Bar-Anan & Vianello (2018)

This large study of construct validity also provides little evidence for the original claim that the IAT measures a new construct that cannot be measured with explicit measures, and confirms the estimate from Cunningham et al. (2001) that about 20% of the variance in IAT scores reflects variance in racial attitudes.

Greenwald et al. (2009)

“When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p=.001, and when political conservativism was also included in the model, the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (Greenwald et al., 2009, p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. I contacted Anthony Greenwald, who provided the raw data, but I was unable to recreate the sample size of N = 1,057. Instead I obtained a similar sample size of N = 1,035.  Performing the analysis on this sample also produced non-significant results (IAT: b = -.003, se = .044, t = .070, p = .944; AMP: b = -.014, se = .042, t = 0.344, p = .731).  Thus, there is no evidence for incremental predictive validity in this study.

Axt (2018)

With N = 540,723 respondents, sampling error is very small, σ = .002, and parameter estimates can be interpreted as true scores in the population of Project Implicit visitors. A comparison of the factor loadings shows that explicit ratings are more valid than IAT scores. The factor loading of the race IAT on the attitude factor once more suggests that about 20% of the variance in IAT scores reflects racial attitudes.

Falk, Heine, Zhang, and Hsu (2015)

Most important, the self-esteem IAT and the other implicit measures have low and non-significant loadings on the self-esteem factor. 

Bar-Anan & Vianello (2018)

Thus, low validity contributes considerably to low observed correlations between IAT scores and explicit self-esteem measures.

Bar-Anan & Vianello (2018) – Political Orientation

More important, the factor loading of the IAT on the implicit factor is much higher than for self-esteem or racial attitudes, suggesting over 50% of the variance in political orientation IAT scores is valid variance, π = .79, σ = .016. The loading of the self-report on the explicit ratings was also higher, π = .90, σ = .010.
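The step from a factor loading to a share of valid variance can be illustrated with a small sketch. It assumes the reported loadings are standardized, so that the squared loading is the proportion of variance in the measure explained by the factor; the function name is hypothetical and not taken from the manuscript or its MPLUS scripts.

```python
def valid_variance(loading):
    """Share of variance explained by the factor.

    Assumption: `loading` is a standardized factor loading, so the
    squared loading is the proportion of explained variance.
    """
    return loading ** 2


# Loadings reported for the political orientation model.
for label, loading in [("IAT", 0.79), ("self-report", 0.90)]:
    print(f"{label}: loading = {loading:.2f}, valid variance = {valid_variance(loading):.0%}")
```

Under this assumption a loading of .79 corresponds to roughly 62% valid variance, which is in line with the "over 50%" stated in the text.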

Variation of Implicit – Explicit Correlations Across Domains

This suggests that the IAT is good at classifying individuals into opposing groups, but it has low validity as a measure of individual differences in the strength of attitudes.

What Do IATs Measure?

The present results suggest that measurement error alone is often sufficient to explain these low correlations.  Thus, there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures. 

For 21 years the lack of discriminant validity has been overlooked because psychologists often fail to take measurement error into account and do not clearly distinguish between measures and constructs.

In the future, researchers need to be more careful when they make claims about constructs based on a single measure like the IAT because measurement error can produce misleading results.

Researchers should avoid terms like implicit attitude or implicit preferences that make claims about constructs simply because attitudes were measured with an implicit measure.

Recently, Greenwald and Banaji (2017) also expressed concerns about their earlier assumption that IAT scores reflect unconscious processes. “Even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning, they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 862).

How Well Does the IAT Measure What it Measures?

Studies with the IAT can be divided into applied studies (A-studies) and basic studies (B-studies).  B-studies employ the IAT to study basic psychological processes.  In contrast, A-studies use the IAT as a measure of individual differences. Whereas B-studies contribute to the understanding of the IAT, A-studies require that IAT scores have construct validity.  Thus, B-studies should provide quantitative information about the psychometric properties for researchers who are conducting A-studies. Unfortunately, 21 years of B-studies have failed to do so. For example, after an exhaustive review of the IAT literature, de Houwer et al. (2009) conclude that “IAT effects are reliable enough to be used as a measure of individual differences” (p. 363).  This conclusion is not helpful for the use of the IAT in A-studies because (a) no quantitative information about reliability is given, and (b) reliability is necessary but not sufficient for validity.  Height can be measured reliably, but it is not a valid measure of happiness. 

This article provides the first quantitative information about validity of three IATs.  The evidence suggests that the self-esteem IAT has no clear evidence of construct validity (Falk et al., 2015).  The race-IAT has about 20% valid variance and even less valid variance in studies that focus on attitudes of members from a single group.  The political orientation IAT has over 40% valid variance, but most of this variance is explained by group-differences and overlaps with explicit measures of political orientation.  Although validity of the IAT needs to be examined on a case by case basis, the results suggest that the IAT has limited utility as a measurement method in A-studies.  It is either invalid or the construct can be measured more easily with direct ratings.

Implications for the Use of IAT scores in Personality Assessment

I suggest replacing the reliability coefficient with the validity coefficient. For example, if we assume that 20% of the variance in scores on the race IAT is valid variance, the 95%CI for IAT scores from Project Implicit (Axt, 2018), using the D-scoring method, with a mean of .30 and a standard deviation of .46, ranges from -.51 to 1.11. Thus, participants who score at the mean level could have an extreme pro-White bias (Cohen’s d = 1.11/.46 = 2.41), but also an extreme pro-Black bias (Cohen’s d = -.51/.46 = -1.10). Thus, it seems problematic to provide individuals with feedback that their IAT score may reveal something about their attitudes that is more valid than their beliefs.
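The interval above can be reproduced with a short sketch. It assumes the interval is built from the usual standard error of measurement formula, SD × sqrt(1 − coefficient), with the validity coefficient (20% valid variance) in place of reliability; the function name and argument names are hypothetical.

```python
import math


def score_interval(observed, sd, valid_variance, z=1.96):
    """Approximate 95% interval around an observed score.

    Assumption: the standard error of measurement is
    sd * sqrt(1 - valid_variance), i.e. the reliability-based
    formula with the validity coefficient substituted.
    """
    sem = sd * math.sqrt(1.0 - valid_variance)
    return observed - z * sem, observed + z * sem


# Values from the text: mean D-score .30, SD .46, 20% valid variance.
sd = 0.46
low, high = score_interval(observed=0.30, sd=sd, valid_variance=0.20)
print(f"95% CI: {low:.2f} to {high:.2f}")
print(f"In SD units: {low / sd:.2f} to {high / sd:.2f}")
```

With these inputs the bounds round to -.51 and 1.11, and dividing the unrounded bounds by the standard deviation gives -1.10 and 2.41, matching the values in the paragraph.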

Conclusion

Social psychologists have always distrusted self-report, especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes and other constructs with indirect methods. The IAT was a major breakthrough because it has relatively high reliability compared to other methods. Thus, creating the IAT was a major achievement that should not be underestimated just because the IAT lacks construct validity as a measure of implicit constructs. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes (Greenwald & Banaji, 1995). Implicit measures were based on this work and it seemed reasonable to assume that they might provide a window into the unconscious (Banaji & Greenwald, 2013). However, the processes that are involved in the measurement of attitudes with implicit measures are not the personality characteristics that are being measured. There is nothing implicit about being a Republican or Democrat, gay or straight, or having low self-esteem. Conflating implicit processes in the measurement of attitudes with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of attitudes with varying validity. It is not a window into people’s unconscious feelings, cognitions, or attitudes.

The race IAT: A Case Study of The Validity Crisis in Psychology:

Good science requires valid measures. This statement is hardly controversial. Not surprisingly, all authors of psychological measures claim that their measure is valid. However, validation research is expensive and difficult to publish in prestigious journals. As a result, psychological science has a validity crisis. Many measures are used in hundreds of articles without clear definitions of constructs and without quantitative information about their validity (Schimmack, 2010).

The Implicit Association Test (IAT) is no exception. The IAT was introduced in 1998 with strong and highly replicable evidence that average attitudes towards object pairs (e.g., flowers vs. spiders) can be measured with reaction times in a classification task (Greenwald et al., 1998). Although the title of the article promised a measure of individual differences, the main evidence in the article was mean differences between groups. Thus, the original article provided little evidence that the IAT is a valid measure of individual differences.

The use of the IAT as a measure of individual differences in attitudes requires scientific evidence that test scores are linked to variation in attitudes. Key pieces of evidence for the validity of a test are reliability, convergent validity, discriminant validity, and incremental predictive validity (Campbell & Fiske, 1959).

The validity of the IAT as a measure of attitudes has to be examined on a case by case basis because the link between associations and attitudes can vary depending on the attitude object. For attitude objects like pop drinks, Coke vs. Pepsi, associations may be strongly related to attitudes. In fact, the IAT has good predictive validity for choices between two pop drinks (Hofmann, Gawronski, Gschwendner, & Schmitt, 2005). However, it lacks convergent validity when it is used to measure self-esteem (Bosson, Swann, & Pennebaker, 2000).

The IAT is best known as a measure of prejudice, racial bias, or attitudes of White Americans towards African Americans. On the one hand, the inventor of the IAT, Greenwald, argues that the race IAT has predictive validity (Greenwald et al., 2009). Others take issue with the evidence: “Implicit Association Test scores did not permit prediction of individual-level behaviors” (Blanton et al., 2009, p. 567); “the IAT provides little insight into who will discriminate against whom, and provides no more insight than explicit measures of bias” (Oswald et al., 2013).

Nine years later, Greenwald and colleagues present a new meta-analysis of predictive validity of the IAT (Kurdi et al., 2018) based on 217 research reports and a total sample size of N = 36,071 participants. The results of this meta-analysis are reported in the abstract.

We found significant implicit–criterion correlations (ICCs) and explicit–criterion correlations (ECCs), with unique contributions of implicit (beta = .14) and explicit measures (beta = .11) revealed by structural equation modeling.

The problem with meta-analyses is that they aggregate information with diverse methods, measures, and criterion variables, and the meta-analysis showed high variability in predictive validity. Thus, the headline finding does not provide information about the predictive validity of the race IAT. As noted by the authors, “Statistically, the high degree of heterogeneity suggests that any single point estimate of the implicit–criterion relationship would be misleading” (p. 7).

Another problem of meta-analysis is that it is difficult to find reliable moderator variables if original studies have small samples and large sampling error. As a result, a non-significant moderator effect cannot be interpreted as evidence that results are homogeneous. Thus, a better way to examine the predictive validity of the race IAT is to limit the meta-analysis to studies that used the race IAT.

Another problem of small studies is that they introduce a lot of noise because point estimates are biased by sampling error. Stanley, Jarrell, and Doucouliagos (2010) made the ingenious suggestion to limit meta-analysis to the top 10% of studies with the largest sample sizes. As these studies have small sampling error to begin with, aggregating them will produce estimates with even smaller sampling error and inclusion of many small studies with high heterogeneity is not necessary. A smaller number of studies also makes it easier to evaluate the quality of studies and to examine sources of heterogeneity across studies. I used this approach to examine the predictive validity of the race IAT using the studies included in Kurdi et al.’s (2018) meta-analysis (data).

Description of the Data

The datafile contained the variable groupStemCat2 that coded the groups compared in the IAT. Only studies classified as groupStemCat2 == “African American and Africans” were selected, leaving 1328 entries (rows). Next, I selected only studies with an IAT-criterion correlation, leaving 1004 entries. Next, I selected only entries with a minimum sample size of N = 100, leaving 235 entries (more than 10%).

The 235 entries were based on 21 studies, indicating that the meta-analysis coded, on average, more than 10 different effects for each study.

The median IAT-criterion correlation across all 235 entries was r = .070. In comparison, the median r for the 769 entries with N < 100 was r = .044. Thus, selecting for studies with large N did not reduce the effect size estimate.

When I first computed the median for each study and then the median across studies, I obtained a similar median correlation of r = .065. There was no significant correlation between sample size and median IAT-criterion correlation across the 21 studies, r = .12. Thus, there is no evidence of publication bias.
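The selection and aggregation steps described above can be sketched in a few lines. This is only an illustration on made-up toy rows, not the actual analysis script: groupStemCat2 is the variable name given in the text, while the keys study, n, and r and all values are hypothetical.

```python
from collections import defaultdict
from statistics import median

TARGET = "African American and Africans"

# Toy rows standing in for the meta-analysis file (values are invented).
rows = [
    {"study": "S1", "groupStemCat2": TARGET, "n": 303, "r": 0.34},
    {"study": "S1", "groupStemCat2": TARGET, "n": 303, "r": 0.30},
    {"study": "S1", "groupStemCat2": TARGET, "n": 303, "r": 0.20},
    {"study": "S2", "groupStemCat2": TARGET, "n": 150, "r": 0.05},
    {"study": "S2", "groupStemCat2": TARGET, "n": 150, "r": 0.09},
    {"study": "S2", "groupStemCat2": TARGET, "n": 150, "r": 0.07},
    {"study": "S3", "groupStemCat2": TARGET, "n": 120, "r": 0.10},
    {"study": "S4", "groupStemCat2": TARGET, "n": 40, "r": 0.25},    # dropped: N < 100
    {"study": "S5", "groupStemCat2": "Other", "n": 500, "r": 0.15},  # dropped: other group
    {"study": "S6", "groupStemCat2": TARGET, "n": 200, "r": None},   # dropped: no correlation
]


def select_entries(rows, group=TARGET, min_n=100):
    """Keep entries for the target group that have a correlation and N >= min_n."""
    return [
        row for row in rows
        if row["groupStemCat2"] == group and row["r"] is not None and row["n"] >= min_n
    ]


def median_of_study_medians(entries):
    """Median within each study first, then the median across studies."""
    by_study = defaultdict(list)
    for row in entries:
        by_study[row["study"]].append(row["r"])
    return median(median(values) for values in by_study.values())


selected = select_entries(rows)
print(len(selected), "entries from", len({row["study"] for row in selected}), "studies")
print("median across entries:", median(row["r"] for row in selected))
print("median of study medians:", median_of_study_medians(selected))
```

Taking the median within each study first prevents a study that contributes many coded effects from dominating the overall estimate.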

I now review the 21 studies in decreasing order of the median IAT-criterion correlation. I evaluate the quality of the studies with 1 to 5 stars ranging from lowest to highest quality. As some studies were not intended to be validation studies, this evaluation does not reflect the quality of a study per se. The evaluation is based on the ability of a study to validate the IAT as a measure of racial bias.

1. * Ma et al. (Study 2), N = 303, r = .34

Ma et al. (2012) used several IATs to predict voting intentions in the 2012 US presidential election. Importantly, Study 2 did not include the race IAT that was used in Study 1 (#15, median r = .03). Instead, the race IAT was modified to include pictures of the two candidates Obama and Romney. Although it is interesting that an IAT that requires race classifications of candidates predicted voting intentions, this study cannot be used to claim that the race IAT as a measure of racial bias has predictive validity because the IAT measures specific attitudes towards candidates rather than attitudes towards African Americans in general.

2. *** Knowles et al., N = 285, r = .26

This study used the race IAT to predict voting intentions and endorsement of Obama’s health care reforms. The main finding was that the race IAT was a significant predictor of voting intentions (Odds Ratio = .61; r = .20) and that this relationship remained significant after including the Modern Racism scale as predictor (Odds Ratio = .67, effect size r = .15). The correlation is similar to the result obtained in the next study with a larger sample.

3. ***** Greenwald et al. (2009), N = 1,057, r = .17

The most conclusive results come from Greenwald et al.’s (2009) study with the largest sample size of all studies. In a sample of N = 1,057 participants, the race IAT predicted voting intentions in the 2008 US election (Obama vs. McCain), r = .17. However, in a model that included political orientation as predictor of voting intentions, only explicit attitude measures added incremental predictive validity, b = .10, SE = .03, t = 3.98, but the IAT did not, b = .00, SE = .02, t = 0.18.

4. * Cooper et al., N = 178, r = .12

The sample size in the meta-analysis does not match the sample size of the original study. Although 269 patients were involved, the race IAT was administered to 40 primary care clinicians. Thus, predictive validity can only be assessed on a small sample of N = 40 physicians who provided independent IAT scores. Table 3 lists seven dependent variables and shows two significant results (p = .02, p = .02) for Black patients.

5. * Biernat et al. (Study 1), N = 136, r = .10

Study 1 included the race IAT and donations to a Black vs. other student organizations as the criterion variable. The negative relationship was not significant (effect size r = .05). The meta-analysis also included the shifting standard variable (effect size r = .14). Shifting standards refers to the extent to which participants shifted standards in their judgments of Black versus White targets’ academic ability. The main point of the article was that shifting standards rather than implicit attitude measures predict racial bias in actual behavior. “In three studies, the tendency to shift standards was uncorrelated with other measures of prejudice but predicted reduced allocation of funds to a Black student organization.” Thus, it seems debatable to use shifting standards as a validation criterion for the race IAT because the key criterion variable was the donations, while shifting standards were a competing indirect measure of prejudice.

6. ** Zhang et al. (Study 2), N = 196, r = .10

This study examined thought listings after participants watched a crime committed by a Black offender on Law and Order. “Across two programs, no statistically significant relations between the nature of the thoughts and the scores on IAT were found, F(2, 85) = 2.4, p < .11 for program 1, and F(2, 84) = 1.98, p < .53 for program 2.” The main limitation of this study is that thought listings are not a real social behavior. As the effect size for this study is close to the median, excluding it has no notable effect on the final result.

7. * Ashburn et al., N = 300, r = .09

The title of this article is “Race and the psychological health of African Americans.” The sample consists of 300 African American participants. Although it is interesting to examine racial attitudes of African Americans, this study does not address the question whether the race IAT is a valid measure of prejudice against African Americans.

8. *** Eno et al. (Study 1), N = 105, r = .09

This article examines responses to a movie set during the Civil Rights Era; “Remember the Titans.” After watching the movie, participants made several ratings about interpretations of events. Only one event, attributing Emma’s actions to an accident, showed a significant correlation with the IAT, r = .20, but attributions to racism also showed a correlation in the same direction, r = .10. For the other events, attributions had the same non-significant effect size, Girls interests r = .12, Girls race, r = .07; Brick racism, r = -.10, Brick Black coach’s actions, r = -.10.

9. *** Aberson & Haag, N = 153, r = .07

Aberson and Haag administered the race IAT to 153 participants and asked questions about quantity and quality of contact with African Americans. They found non-significant correlations with quantity, r = -.12, and quality, r = -.10, and a significant positive correlation with the interaction, r = .17. The positive interaction effect suggests that individuals with low contact, which implies low quality contact as well, are not different from individuals with frequent high quality contact.

10. * Hagiwara et al., N = 106, r = .07

This is another study of Black patients and non-Black physicians. The main limitation is that there were only 14 physicians and only 2 were White.

11. **** Bar-Anan & Nosek, N = 397, r = .06

This study used contact as a validation criterion. The race IAT showed a correlation of r = -.14 with group contact, with N ranging from 492 to 647. The Brief IAT showed practically the same relationship, r = -.13. The appendix reports that contact was more strongly correlated with the explicit measures; thermometer r = .27, preference r = .31. Using structural equation modeling, as recommended by Greenwald and colleagues, I found no evidence that the IAT has unique predictive validity in the prediction of contact when explicit measures were included as predictors, b = .03, SE = .07, t = 0.37.

12. *** Aberson & Gaffney, N = 386, median r = .05

This study related the race IAT to measures of positive and negative contact, r = .10, r = -.01, respectively. Correlations with an explicit measure were considerably stronger, r = .38, r = -.35, respectively. These results mirror the results presented above.

13. * Orey et al., N = 386, median r = .04

This study examined racial attitudes among Black respondents. Although this is an interesting question, the data cannot be used to examine the predictive validity of the race IAT as a measure of prejudice.

14. * Krieger et al., N = 708, median r = .04

This study used the race IAT with 442 Black participants and criterion measures of perceived discrimination and health. Although this is a worthwhile research topic, the results cannot be used to evaluate the validity of the race IAT as a measure of prejudice.

15. *** Ma et al. (Study 1), N = 335, median r = .03

This study used the race IAT to predict voter intentions in the 2012 presidential election. The study found no significant relationship. “However, neither category-level measures were related to intention to vote for Obama (rs ≤ .06, ps ≥ .26)” (p. 31). The meta-analysis recorded a correlation of r = .045, based on email correspondence with the authors. It is not clear why the race IAT would not predict voting intentions in 2012, when it did predict voting intentions in 2008. One possibility is that Obama was now seen as an individual rather than as a member of a particular group so that general attitudes towards African Americans no longer influenced voting intentions. No matter what the reason is, this study does not provide evidence for the predictive validity of the race IAT.

16. **** Oliver et al., N = 105, median r = .02

This study was an online study of 543 family and internal medicine physicians. They completed the race IAT and gave treatment recommendations for a hypothetical case. Race of the patient was experimentally manipulated. The abstract states that “physicians possessed explicit and implicit racial biases, but those biases did not predict treatment recommendations” (p. 177). The sample size in the meta-analysis is smaller because the total sample was broken down into smaller subgroups.

17. * Nosek & Hansen, N = 207, median r = .01

This study did not include a clear validation criterion. The aim was to examine the relationship between the race IAT and cultural knowledge about stereotypes. “In seven studies (158 samples, N = 107,709), the IAT was reliably and variably related to explicit attitudes, and explicit attitudes accounted for the relationship between the IAT and cultural knowledge.” The cultural knowledge measures were used as criterion variables. A positive relation, r = .10, was obtained for the item “If given the choice, who would most employers choose to hire, a Black American or a White American? (1 definitely White to 7 definitely Black).” A negative relation, r = -.09, was obtained for the item “Who is more likely to be a target of discrimination, a Black American or a White American? (1 definitely White to 7 definitely Black).”

18. * Plant et al., N = 229, median r = .00

This article examined voting intentions in a sample of 229 students. The results are not reported in the article. The meta-analysis reported a positive r = .04 and a negative r = -.04 for two separate entries with different explicit measures, which must be a coding mistake. As voting behavior has been examined in larger and more representative samples (#3, #15), these results can be ignored.

19. * Krieger et al. (2011), N = 503, r = .00

This study recruited 504 African Americans and 501 White Americans. All participants completed the race IAT. However, the study did not include clear validation criteria. The meta-analysis used self-reported experiences of discrimination as the validation criterion. However, the important question is whether the race IAT predicts behaviors of people who discriminate, not the experience of victims of discrimination.

20. * Fiedorowicz, N = 257, r = -.01

This study is a dissertation and the validation criterion was religious fundamentalism.

21. * Heider & Skowronski, N = 140, r = -.02

This study separated the measurement of prejudice with the race IAT and the measurement of the criterion variables by several weeks. The criterion was cooperative behavior in a prisoner’s dilemma game. The results showed that “both the IAT (b = -.21, t = -2.51, p = .013) and the Pro-Black subscore (b = .17, t = 2.10, p = .037) were significant predictors of more cooperation with the Black confederate.” However, these results were false and have been corrected (see Carlsson et al., 2018, for a detailed discussion).

Heider, J. D., & Skowronski, J.J. (2011). Addendum to Heider and Skowronski (2007): Improving the predictive validity of the Implicit Association Test. North American Journal of Psychology, 13, 17-20

Discussion

In summary, a detailed examination of the race IAT studies included in the meta-analysis shows considerable heterogeneity in the quality of the studies and their ability to examine the predictive validity of the race IAT. The best study is Greenwald et al.’s (2009) study with a large sample and voting in the Obama vs. McCain race as the criterion variable. However, another voting study failed to replicate these findings in 2012. The second best study was Bar-Anan and Nosek’s study with intergroup contact as a validation criterion, but it failed to show incremental predictive validity of the IAT.

Studies with physicians show no clear evidence of racial bias. This could be due to the professionalism of physicians and the results should not be generalized to the general population. The remaining studies were considered unsuitable to examine predictive validity. For example, some studies with African American participants did not use the IAT to measure prejudice.

Based on this limited evidence it is impossible to draw strong conclusions about the predictive validity of the race IAT. My assessment of the evidence is rather consistent with the authors of the meta-analysis, who found that “out of the 2,240 ICCs included in this metaanalysis, there were only 24 effect sizes from 13 studies that (a) had the relationship between implicit cognition and behavior as their primary focus” (p. 13).

This confirms my observation in the introduction that psychological science has a validation crisis because researchers rarely conduct validation studies. In fact, despite all the concerns about the lack of replication studies, replication studies are much more numerous than validation studies. The consequence of the validation crisis is that psychologists routinely make theoretical claims based on measures with unknown validity. As shown here, this is also true for the IAT. At present, it is impossible to make evidence-based claims about the validity of the IAT because it is unknown what the IAT measures and how well it measures what it measures.

Theoretical Confusion about Implicit Measures

The lack of theoretical understanding of the IAT is evident in Greenwald and Banaji’s (2017) recent article, where they suggest that “implicit cognition influences explicit cognition that, in turn, drives behavior” (Kurdi et al., p. 13). This model would imply that implicit measures like the IAT do not have a direct link to behavior because conscious processes ultimately determine actions. This speculative model is illustrated with Bar-Anan and Nosek’s (#11) data that showed no incremental predictive validity on contact. The model can be transformed into a causal chain by changing the bidirectional path into an assumed causal relationship between implicit and explicit attitudes.

However, it is also possible to change the model into a single-factor model that considers unique variance in implicit and explicit measures as mere method variance.

Thus, any claims about implicit bias and explicit bias are premature because the existing data are consistent with various theoretical models. To make scientific claims about implicit forms of racial bias, it would be necessary to obtain data that can distinguish empirically between single-construct and dual-construct models.

Conclusion

The race IAT is 20 years old. It has been used in hundreds of articles to make empirical claims about prejudice. The confusion between measures and constructs has created a public discourse about implicit racial bias that may occur outside of awareness. However, this discourse is removed from the empirical facts. The most important finding of the recent meta-analysis is that a careful search of the literature uncovered only a handful of serious validation studies and that the results of these studies are suggestive at best. Even if future studies would provide more conclusive evidence of incremental predictive validity, this finding would be insufficient to claim that the IAT is a valid measure of implicit bias. The IAT could have incremental predictive validity even if it were just a complementary measure of consciously accessible prejudice that does not share method variance with explicit measures. A multi-method approach is needed to examine the construct validity of the IAT as a measure of implicit race bias. Such evidence simply does not exist. Greenwald and colleagues had 20 years and ample funding to conduct such validation studies, but they failed to do so. In contrast, their articles consistently confuse measures and constructs and give the impression that the IAT measures unconscious processes that are hidden from introspection (“conscious experience provides only a small window into how the mind works”, “click here to discover your hidden thoughts”).

Greenwald and Banaji are well aware that their claims matter. “Research on implicit social cognition has witnessed higher levels of attention both from the general public and from governmental and commercial entities, making regular reporting of what is known an added responsibility” (Kurdi et al., 2018, p. 3). I concur. However, I do not believe that their meta-analysis fulfills this promise. An unbiased assessment of the evidence shows no compelling evidence that the race IAT is a valid measure of implicit racial bias; and without a valid measure of implicit racial bias it is impossible to make scientific statements about implicit racial bias. I think the general public deserves to know this. Unfortunately, there is no need for scientific evidence that prejudice and discrimination still exist. Ideally, psychologists will spend more effort in developing valid measures of racism that can provide trustworthy information about variation across individuals, geographic regions, groups, and time. Many people believe that psychologists are already doing it, but this review of the literature shows that this is not the case. It is high time to actually do what the general public expects from us.

No Incremental Predictive Validity of Implicit Attitude Measures

The general public has accepted the idea of implicit bias; that is, individuals may be prejudiced without awareness. For example, in 2018 Starbucks closed their stores for one day to train employees to detect and avoid implicit bias (cf. Schimmack, 2018).

However, among psychological scientists the concept of implicit bias is controversial (Blanton et al., 2009; Schimmack, 2019). The notion of implicit bias is only a scientific construct if it can be observed with scientific methods, and this requires valid measures of implicit bias.

Valid measures of implicit bias require evidence of reliability, convergent validity, discriminant validity, and incremental predictive validity. Proponents of implicit bias claim that measures of implicit bias have demonstrated these properties. Critics are not convinced.

For example, Cunningham, Preacher, and Banaji (2001) conducted a multi-method study and claimed that their results showed convergent validity among implicit measures and that implicit measures correlated more strongly with each other than with explicit measures. However, Schimmack (2019) demonstrated that a model with a single factor fit the data better and that the explicit measures loaded higher on this factor than the evaluative priming measure. This finding challenges the claim that implicit measures possess discriminant validity. That is, they are implicit measures of racial bias, but they are not measures of implicit racial bias.

A forthcoming meta-analysis claims that implicit measures have unique predictive validity (Kurdi et al., 2018). The average effect size for the correlation between an implicit measure and a criterion was r = .14. However, this estimate is based on studies across many different attitude objects and includes implicit measures of stereotypes and identity. Not surprisingly, the predictive validity was heterogeneous. Thus, the average does not provide information about the predictive validity of the race IAT as a measure of implicit bias. The most important observation was that sample sizes of many studies were too small to investigate predictive validity given the small expected effect size. Most studies had sample sizes with fewer than 100 participants (see Figure 1).
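The sample-size concern can be made concrete with a rough power calculation. The following is a minimal sketch, assuming a two-sided test at alpha = .05 and the Fisher z approximation for a correlation; the function names are illustrative and are not taken from the meta-analysis. It suggests that detecting r = .14 with 80% power takes several hundred participants, and that a study with N = 100 has a low chance of detecting an effect of this size.

```python
import math
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r with a two-sided test.

    Uses the Fisher z approximation, in which atanh(r) has a standard
    error of 1 / sqrt(N - 3).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


def approx_power(r, n, alpha=0.05):
    """Approximate power for a given N (the opposite tail is ignored)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = math.atanh(r) * math.sqrt(n - 3)
    return 1 - NormalDist().cdf(z_alpha - noncentrality)


n_needed = required_n(0.14)            # N for 80% power at r = .14
power_at_100 = approx_power(0.14, 100)  # power of a typical small study
print(n_needed, round(power_at_100, 2))
```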

A notable exception is a study of voting intentions in the historic 2008 presidential election, where US voters had a choice to elect the first Black president, Obama, or the Republican candidate McCain. A major question at that time was how much race and prejudice would influence the vote. Greenwald, Tucker Smith, Sriram, Bar-Anan, and Nosek (2009) conducted a study to address this question. They obtained data from N = 1,057 participants who completed online implicit measures and responded to survey questions. The key outcome variable was a simple dichotomous question about voting intentions. The sample was not a nationally representative sample, as indicated by 84.2% declared votes for Obama versus 15.8% declared votes for McCain. The predictor variables were two self-report measures of prejudice (feeling-thermometer, Likert scale), two implicit measures (Brief IAT, AMP), the Symbolic Racism Scale, and a measure of political orientation (Conservative vs. Liberal).

The correlations among all measures were reported in Table 1.

The results for the Brief IAT (BIAT) are highlighted. First, the BIAT does predict voting intentions (r = .17). Second, the BIAT shows convergent validity with the second implicit measure, the Affect Misattribution Procedure (AMP). Third, the BIAT also correlates with the explicit measures of racial bias. Most important, the correlations with the implicit AMP are weaker than the correlations with the explicit measures. This finding confirms Schimmack’s (2019) finding that implicit measures lack discriminant validity.

The correlation table does not address the question whether implicit measures have incremental predictive validity. To examine this question, I fit a structural equation model to the reproduced covariance matrix based on the reported correlations and standard deviations using Mplus 8.2. The model shown in Figure 1 had good overall fit, chi2(9, N = 1057) = 15.40, CFI = .997, RMSEA = .026, 90%CI = .000 to .047.
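The reported RMSEA can be checked directly from the chi-square value, the degrees of freedom, and the sample size. This is a minimal sketch using the common point-estimate formula sqrt(max(chi2 - df, 0) / (df * (N - 1))); the function name is illustrative, and some programs use N rather than N - 1 in the denominator, which gives the same value to three decimals here.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a model chi-square test."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values reported for the model: chi2(9, N = 1057) = 15.40
rmsea_value = rmsea(15.40, 9, 1057)
print(round(rmsea_value, 3))
```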

The model shows that explicit and implicit measures of racial bias load on a common factor (att). Whereas the explicit measures share method variance, the residuals of the two implicit measures are not correlated. This confirms the lack of discriminant validity. That is, there is no unique variance shared only by implicit measures. The strongest predictor of voting intentions is political orientation. Symbolic racism is a mixture of conservatism and racial bias, and it has no unique relationship with voting intentions. Racial bias does make a unique contribution to voting intentions (b = .22, SE = .05, t = 4.4). The blue path shows that the BIAT does have predictive validity above and beyond political orientation, but the effect is indirect. That is, the BIAT is a measure of racial bias and racial bias contributes to voter intentions. The red path shows that the BIAT has no unique relationship with voting intentions. The negative coefficient is not significant. Thus, there is no evidence that the unique variance in the BIAT reflects some form of implicit racial bias that influences voting intentions.
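The test statistic for a path coefficient is simply the estimate divided by its standard error. The sketch below reproduces the ratio for the racial bias path and attaches a two-sided p-value, assuming a standard normal reference distribution (reasonable with N = 1,057); the function name is illustrative.

```python
from statistics import NormalDist


def wald_test(estimate, standard_error):
    """Return the estimate / SE ratio and a two-sided normal p-value."""
    z = estimate / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Unique effect of racial bias on voting intentions: b = .22, SE = .05
z, p = wald_test(0.22, 0.05)
print(round(z, 1), p < 0.001)
```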

In short, these results provide no evidence for the claim that implicit measures tap implicit racial biases. In fact, there is no scientific evidence for the concept of implicit bias, which would require evidence of discriminant validity and incremental validity.

Conclusion

The use of structural equation modeling (SEM) was highly recommended by the authors of the forthcoming meta-analysis (Kurdi et al., 2018). Here I applied SEM to the best data available, with multiple explicit and implicit measures, an important criterion variable, and a large sample size that is sufficient to detect small relationships. Contrary to the meta-analysis, the results do not support the claim that implicit measures have incremental predictive validity. In addition, the results confirmed Schimmack’s (2019) results that implicit measures lack discriminant validity. Thus, the construct of implicit racial bias lacks empirical support. Implicit measures like the IAT are best considered as implicit measures of racial bias that is also reflected in explicit measures.

With regard to the political question whether racial bias influenced voting in the 2008 election, these results suggest that racial bias did indeed matter. Using only explicit measures would have underestimated the effect of racial bias due to the substantial method variance in these measures. Thus, the IAT can make an important contribution to the measurement of racial bias because it doesn’t share method variance with explicit measures.

In the future, users of implicit measures need to be more careful in their claims about the construct validity of implicit measures. Greenwald et al. (2009) constantly conflate implicit measures of racial bias with measures of implicit racial bias. For example, the title claims “Implicit Race Attitudes Predicted Vote.” The term “Implicit race attitude measure” is ambiguous because it could mean implicit measure or implicit attitude, whereas the term “implicit measures of race attitudes” implies that the measures are implicit but the construct is racial bias; otherwise it would be “implicit measures of implicit racial bias.” The confusion arises from a long tradition in psychology to conflate measures and constructs (e.g., intelligence is whatever an IQ test measures) (Campbell & Fiske, 1959). Structural equation modeling makes it clear that measures (boxes) and constructs (circles) are distinct and that measurement theory is needed to relate measures to constructs. At present, there is clear evidence that implicit measures can measure racial bias, but there is no evidence that attitudes have an explicit and an implicit component. Thus, scientific claims about racial bias do not support the idea that racial bias is implicit. This idea is based on the confusion of measures and constructs in the social cognition literature.

Reexamining Cunningham, Preacher, and Banaji’s Multi-Method Model of Racism Measures

Article:
William A. Cunningham, Kristopher J. Preacher, and Mahzarin R. Banaji. (2001).
Implicit Attitude Measures: Consistency, Stability, and Convergent Validity, Psychological Science, 12(2), 163-170.

Abstract:
In recent years, several techniques have been developed to measure implicit social cognition. Despite their increased use, little attention has been devoted to their reliability and validity. This article undertakes a direct assessment of the interitem consistency, stability, and convergent validity of some implicit attitude measures. Attitudes toward blacks and whites were measured on four separate occasions, each 2 weeks apart, using three relatively implicit measures (response window evaluative priming, the Implicit Association Test, and the response-window Implicit Association Test) and one explicit measure (Modern Racism Scale). After correcting for interitem inconsistency with latent variable analyses, we found that (a) stability indices improved and (b) implicit measures were substantially correlated with each other, forming a single latent factor. The psychometric properties of response-latency implicit measures have greater integrity than recently suggested.

Critique of Original Article

This article has been cited 362 times (Web of Science, January 2017).  It still is one of the most rigorous evaluations of the psychometric properties of the race Implicit Association Test (IAT).  As noted in the abstract, the strength of the study is the use of several implicit measures and the repeated measurement of attitudes on four separate occasions.  This design makes it possible to separate several variance components in the race IAT.  First, it is possible to examine how much variance is explained by causal factors that are stable over time and shared by implicit and explicit attitude measures.  Second, it is possible to measure the amount of variance that is unique to the IAT.  As this component is not shared with other implicit measures, this variance can be attributed to systematic measurement error that is stable over time.  A third variance component is variance that is shared only with other implicit measures and that is stable over time. This variance component could reflect stable implicit racial attitudes.  Finally, it is possible to identify occasion specific variance in attitudes.  This component would reveal systematic changes in implicit attitudes.

The original article presents a structural equation model that makes it possible to identify some of these variance components.  However, the model is not ideal for this purpose and the authors do not test some of these variance components.  For example, the model does not include any occasion specific variation in attitudes.  This could be because attitudes do not vary over the one-month interval of the study, or it could mean that the model failed to specify this variance component.

This reanalysis also challenges the claim by the original authors that they provided evidence for a dissociation of implicit and explicit attitudes.  “We found a dissociation between implicit and explicit measures of race attitude: Participants simultaneously self-reported nonprejudiced explicit attitudes toward black Americans while showing an implicit difficulty in associating black with positive attributes” (p. 169). The main problem is that the design does not support this claim because the study included only a single explicit racism measure.  Consequently, it is impossible to determine whether unique variance in the explicit measure reflects systematic measurement error in explicit attitude measures (socially desirable responding, acquiescence response styles) or whether this variance reflects consciously accessible attitudes that are distinct from implicit attitudes.  In this regard, the authors’ claim that “a single-factor solution does not fit the data” (p. 170) is inconsistent with their own structural equation model that shows a single second-order factor that explains the covariance among the three implicit measures and the explicit measure.

The authors caution that a single IAT measure is not very reliable, but their statement about reliability is vague. “Our analyses of implicit attitude measures suggest that the degree of measurement error in response-latency measures can be substantial; estimates of Cronbach’s alpha indicated that, on average, more than 30% of the variance associated with the measurements was random error.” (p. 160).  More than 30% random measurement error leaves a rather large range of reliability estimates ranging from 0% to 70%.   The respective parameter estimates for the IAT in Figure 4 are .53^2 = .28, .65^2 = .42, .74^2 = .55, and .38^2 = .14.  These reliability estimates vary considerably due to the small sample size, but the loading of the first IAT would suggest that only 19% of the variance in a single IAT is reliable. As reliability is the upper limit for validity, it would imply that no more than 20% of the variance in a single IAT captures variation in implicit racial attitudes.
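The conversion from standardized factor loadings to proportions of explained variance is just squaring. A minimal sketch, using the four IAT loadings quoted above (the variable names are illustrative):

```python
# Standardized loadings of the four IAT measurements, as quoted from Figure 4
loadings = [0.53, 0.65, 0.74, 0.38]

# A squared standardized loading is the proportion of variance in the
# measure that is explained by the factor
explained = [round(loading ** 2, 2) for loading in loadings]
print(explained)
```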

The authors caution readers about the use of a single IAT to measure implicit attitudes. “When using latency-based measures as indices of individual differences, it may be essential to employ analytic techniques, such as covariance structure modeling, that can separate measurement error from a measure of individual differences. Without such analyses, estimates of relationships involving implicit measures may produce misleading null results” (p. 169).  However, the authors fail to mention that the low reliability of a single IAT also has important implications for the use of the IAT for the assessment of implicit prejudice.  Given this low estimate of validity, users of the Harvard website that provides information about individuals’ performance on the IAT should be warned that the feedback is neither reliable nor valid by conventional standards for psychological tests.

Reanalysis of Published Correlation Matrix

The Table below reproduces the correlation matrix. The standard deviations in the last row are rescaled to avoid rounding problems. This has no effect on the results.

1
.80   1
.78 .82  1
.76 .77 .86   1
.21 .15 .15 .14   1
.13 .14 .10 .08 .31  1
.16 .26 .23 .20 .42 .50 1
.14 .17 .16 .13 .16 .33 .17 1
.20 .16 .19 .26 .33 .11 .23 .07 1
.26 .29 .18 .19 .20 .27 .36 .29 .26   1
.35 .33 .34 .25 .28 .29 .34 .33 .36 .39   1
.19 .17 .08 .07 .12 .25 .30 .14 .01 .17 .24 1
.00 .11 .07 .04 .27 .18 .19 .02 .03 .01 .02 .07 1
.16 .08 .04 .08 .26 .27 .24 .22 .14 .32 .32 .17 .13 1
.12 .01 .02 .07 .13 .19 .18 .00 .02 .00 .11 .04 .17 .30 1
.33 .18 .26 .31 .14 .24 .31 .15 .22 .20 .27 .04 .01 .48 .42 1

SD 0.84 0.82 0.88 0.86 2.2066 1.2951 1.0130 0.9076 1.2 1.0 1.1 1.0 0.7 0.8 0.8 0.9

1-4 = Modern Racism Scale (1-4); 5-8 Implicit Association Test (1-4);  9-12 = Response Window IAT (1-4);  13-16 Response Window Evaluative Priming (1-4)
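For readers who want to refit the models, the lower-triangular correlations and standard deviations above can be turned into a covariance matrix, where each covariance is the correlation multiplied by the two standard deviations. The following is a minimal sketch in plain Python; the function and variable names are illustrative, and the resulting matrix would still have to be passed to an SEM program of the reader's choice.

```python
LOWER_TRIANGLE = """
1
.80 1
.78 .82 1
.76 .77 .86 1
.21 .15 .15 .14 1
.13 .14 .10 .08 .31 1
.16 .26 .23 .20 .42 .50 1
.14 .17 .16 .13 .16 .33 .17 1
.20 .16 .19 .26 .33 .11 .23 .07 1
.26 .29 .18 .19 .20 .27 .36 .29 .26 1
.35 .33 .34 .25 .28 .29 .34 .33 .36 .39 1
.19 .17 .08 .07 .12 .25 .30 .14 .01 .17 .24 1
.00 .11 .07 .04 .27 .18 .19 .02 .03 .01 .02 .07 1
.16 .08 .04 .08 .26 .27 .24 .22 .14 .32 .32 .17 .13 1
.12 .01 .02 .07 .13 .19 .18 .00 .02 .00 .11 .04 .17 .30 1
.33 .18 .26 .31 .14 .24 .31 .15 .22 .20 .27 .04 .01 .48 .42 1
"""

SDS = [0.84, 0.82, 0.88, 0.86, 2.2066, 1.2951, 1.0130, 0.9076,
       1.2, 1.0, 1.1, 1.0, 0.7, 0.8, 0.8, 0.9]


def parse_lower_triangle(text):
    """Build a full symmetric matrix from lower-triangular rows of text."""
    rows = [[float(x) for x in line.split()]
            for line in text.strip().splitlines()]
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i + 1} has {len(row)} entries")
        for j, value in enumerate(row):
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix


def to_covariance(corr, sds):
    """cov[i][j] = corr[i][j] * sd[i] * sd[j]."""
    n = len(corr)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(n)]
            for i in range(n)]


corr = parse_lower_triangle(LOWER_TRIANGLE)
cov = to_covariance(corr, SDS)
print(len(cov), round(cov[1][0], 4))
```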

[Figure 1: structural equation model with a single attitude factor]

Fitting the data to the original model reproduced the original results.  I then fitted the data to a model with a single attitude factor (see Figure 1).  The model also allowed for measure-specific variances.  An initial model showed no significant measure-specific variances for the two versions of the IAT.  Hence, these method factors were not included in the final model.  To control for variance that is clearly consciously accessible, I modeled the relationship between the explicit factor and the attitude factor as a causal path from the explicit factor to the attitude factor.  This path should not be interpreted as a causal relationship in this case. Rather the path can be used to estimate how much of the variance in the attitude factor is explained by consciously accessible information that influences the explicit measure.  In this model, the residual variance is variation that is shared among implicit measures, but not with the explicit measure.

The model had good fit to the data.  I then imposed constraints on factor loadings.  The constrained model had better fit than the unconstrained model (delta AIC = 4.60, delta BIC = 43.53).  The main finding is that the standard IAT had a loading of .55 on the attitude factor.  The indirect path from the implicit attitude factor to a single IAT measure is only slightly smaller, .55*.92 = .51.  The 95%CI for this parameter ranged from .41 to .60.  The upper bound of the 95%CI would imply that at most 36% of the variance in a single IAT reflects implicit racial attitudes.  However, it is important to note that the model in Figure 1 assumes that the Modern Racism Scale is a perfectly valid measure of consciously accessible attitudes. Any systematic measurement error in the Modern Racism Scale would reduce the amount of variance in the attitude factor that reflects unconscious factors.  Again, the lack of multiple explicit measures makes it impossible to separate systematic measurement error from valid variance in explicit measures.  Thus, the amount of variance in a single IAT that reflects unconscious racial attitudes can range from 0 to 36%.
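The arithmetic behind these percentages can be laid out in a few lines. This is a minimal sketch; the variable names are illustrative, and reading .92 as the path from the residual (implicit) variance to the attitude factor is my interpretation of the model description rather than a label taken from the figure.

```python
loading_iat = 0.55     # loading of a single IAT on the attitude factor
residual_path = 0.92   # assumed: path from implicit residual to attitude factor

# The indirect effect is the product of the paths along the chain
indirect = loading_iat * residual_path

# Squaring a standardized path gives the share of explained variance
ci_low, ci_high = 0.41, 0.60
share = indirect ** 2
share_low, share_high = ci_low ** 2, ci_high ** 2

print(round(indirect, 2), round(share_low, 2), round(share_high, 2))
```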

How Variable are Implicit Racial Attitudes?

The design included repeated measurements of implicit attitudes on four occasions.  If recent experiences influence implicit attitudes, we would expect that implicit measures of attitudes on the same occasion are more highly correlated with each other than implicit measures taken on different occasions.  Given the low validity of implicit attitude measures, I examined this question with constrained parameters. By estimating a single parameter, the model has more power to reveal a consistent relationship between implicit measures that were obtained during the same testing session.  Neither the two IATs, nor the IAT and the evaluative priming task (EP) showed significant occasion-specific variance.  Although this finding may be due to low power to detect occasion-specific variation, this finding suggests that most of the variance in an IAT is due to stable variation and random measurement error.

Conclusion

Cunningham et al. (2001) conducted a rigorous psychometric study of the Implicit Association Test.  The original article reported results that could be reproduced.  The authors correctly interpret their results as evidence that a single IAT has low reliability. However, they falsely imply that their results provide evidence that the IAT and other implicit measures are valid measures of an implicit form of racism that is not consciously accessible.  My new analysis shows that their results are consistent with this hypothesis, if one assumes that the Modern Racism Scale is a perfectly valid measure of consciously accessible racial attitudes.  Under this assumption, about 25% (95%CI 16-36) of the variance in a single IAT would reflect implicit attitudes.  However, it is rather unlikely that the Modern Racism Scale is a perfect measure of explicit racial attitudes, and the amount of variance in performance on the IAT that reflects unconscious racism is likely to be smaller. Another important finding that was implicit, but not explicitly mentioned, in the original model is that there is no evidence for situation-specific variation in implicit attitudes. At least over the one-month period of the study, racial attitudes remained stable and did not vary as a function of naturally occurring events that might influence racial attitudes (e.g., positive or negative intergroup contact).  This finding may explain why experimental manipulations of implicit attitudes also often produce very small effects (Joy-Gaba & Nosek, 2010).

One surprising finding was that the IAT showed no systematic measurement error in this model. This would imply that repeated measures of the IAT could be used to measure racial attitudes with high validity.  Unfortunately, most studies with the IAT rely on a single testing situation and ignore that most of the variance in a single IAT is measurement error.  To improve research on racial attitudes and prejudice, social psychologists should use multiple explicit and implicit measures and use structural equation models to examine which variance components of a measurement model of racial attitudes predict actual behavior.

Validity of the Implicit Association Test as a Measure of Implicit Attitudes

This blog post reports the results of an analysis of correlations among 4 explicit and 3 implicit attitude measures published by Ranganath, Smith, and Nosek (2008).

Original article:
Kate A. Ranganath, Colin Tucker Smith, & Brian A. Nosek (2008). Distinguishing automatic and controlled components of attitudes from direct and indirect measurement methods. Journal of Experimental Social Psychology 44 (2008) 386–396; doi:10.1016/j.jesp.2006.12.008

Abstract
Distinct automatic and controlled processes are presumed to influence social evaluation. Most empirical approaches examine automatic processes using indirect methods, and controlled processes using direct methods. We distinguished processes from measurement methods to test whether a process distinction is more useful than a measurement distinction for taxonomies of attitudes. Results from two studies suggest that automatic components of attitudes can be measured directly. Direct measures of automatic attitudes were reports of gut reactions (Study 1) and behavioral performance in a speeded self-report task (Study 2). Confirmatory factor analyses comparing two factor models revealed better fits when self-reports of gut reactions and speeded self-reports shared a factor with automatic measures versus sharing a factor with controlled self-report measures. Thus, distinguishing attitudes by the processes they are presumed to measure (automatic versus controlled) is more meaningful than distinguishing based on the directness of measurement.

Description of Original Study

Study 1 measured relative attitudes towards heterosexuals and homosexuals with seven measures; four explicit measures and three reaction time tasks. Specifically, the four explicit measures were

Actual = Participants were asked to report their “actual feelings” towards gay and straight people when given enough time for full consideration on a scale ranging from 1=very negative to 8 = very positive.

Gut = Participants were asked to report their “gut reaction” towards gay and straight people when given enough time for full consideration on a scale ranging from 1=very negative to 8 = very positive.

Time0 and Time5: A second explicit rating task assessed an “attitude timeline”. Participants reported their attitudes toward the two groups at multiple time points: (1) instant reaction, (2) reaction a split-second later, (3) reaction after 1 s, (4) reaction after 5 s, and (5) reaction when given enough time to think fully. Only the first (Time0) and the last (Time5) rating were included in the model.

The three reaction time measures were the Implicit Association Test (IAT), the Go-NoGo Association Test (GNAT), and a Four-Category Sorting Paired Features Task (SPF). All three measures use differences in response times to measure attitudes.

Table A1 in the Appendix reported the correlations among the seven tasks.

          IAT    GNAT   SPF    Gut    Actual  Time0  Time5
IAT       1
GNAT      .36    1
SPF       .26    .18    1
Gut       .23    .33    .12    1
Actual    .16    .31    .01    .65    1
Time0     .19    .31    .16    .85    .50     1
Time5     .01    .24    .01    .54    .81     .50    1
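
As a rough descriptive check on the table above, the following Python sketch rebuilds the full symmetric matrix from the lower triangle and averages the correlations within and between the two kinds of measures. The grouping into "implicit" (IAT, GNAT, SPF) and "explicit" (Gut, Actual, Time0, Time5) measures and all helper names are assumptions made for illustration. Simple averages are no substitute for a structural equation model.

```python
from itertools import combinations

labels = ["IAT", "GNAT", "SPF", "Gut", "Actual", "Time0", "Time5"]

# Lower triangle of the published correlation table (diagonal included).
lower = [
    [1.0],
    [.36, 1.0],
    [.26, .18, 1.0],
    [.23, .33, .12, 1.0],
    [.16, .31, .01, .65, 1.0],
    [.19, .31, .16, .85, .50, 1.0],
    [.01, .24, .01, .54, .81, .50, 1.0],
]

# Mirror the lower triangle into a symmetric lookup keyed by label pairs.
corr = {}
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        corr[(labels[i], labels[j])] = r
        corr[(labels[j], labels[i])] = r

# Assumed grouping of the seven measures (illustrative only).
implicit = ["IAT", "GNAT", "SPF"]
explicit = ["Gut", "Actual", "Time0", "Time5"]


def mean_within(group):
    """Average correlation among all distinct pairs in one group."""
    pairs = list(combinations(group, 2))
    return sum(corr[p] for p in pairs) / len(pairs)


def mean_between(group_a, group_b):
    """Average correlation across all pairs spanning two groups."""
    pairs = [(a, b) for a in group_a for b in group_b]
    return sum(corr[p] for p in pairs) / len(pairs)


within_implicit = mean_within(implicit)
within_explicit = mean_within(explicit)
between = mean_between(implicit, explicit)

print(f"mean r among reaction time measures: {within_implicit:.2f}")
print(f"mean r among explicit ratings:       {within_explicit:.2f}")
print(f"mean r across the two kinds:         {between:.2f}")
```

On the published values, the explicit ratings correlate on average much more strongly with each other than the reaction time measures do, and the cross-method correlations are the weakest of the three.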

The authors tested a variety of structural equation models. The best fitting model, preferred by the authors, was a model with three correlated latent factors. “In this three-factor model, self-reported gut feelings (GutFeeling, Instant Feeling) comprised their own attitude factor distinct from a factor comprised of the indirect, automatic measures (IAT, GNAT, SPF) and from a factor comprised of the direct, controlled measures (Actual Feeling, Fully Considered Feeling). The data were an excellent fit (chi^2(12) = 10.8).”

The authors then state “while self-reported gut feelings were more similar to the indirect measures than to the other self-reported attitude measures, there was some unique variance in self-reported gut feelings that was distinct from both” (p. 391), and they go on to speculate that “one possibility is that these reports are a self-theory that has some but not complete correspondence with automatic evaluations” (p. 391). They also consider the possibility that “measures like the IAT, GNAT, and SPF partly assess automatic evaluations that are “experienced” and amenable to introspective report, and partly evaluations that are not” (p. 391). But they favor the hypothesis that “self-report of ‘gut feelings’ is a meaningful account of some components of automatic evaluation” (p. 391). They interpret these results as strong support for their “contention that a taxonomy of attitudes by measurement features is not as effective as one that distinguishes by presumed component processes” (p. 391). The conclusion reiterates this point: “The present studies suggest that attitudes have distinct but related automatic and controlled factors contributing to social evaluation and that parsing attitudes by underlying processes is superior to parsing attitude measures by measurement features” (p. 393).

Surprisingly, the authors do not mention the three-factor model in the Discussion and instead claim support for a two-factor model that distinguishes processes rather than measures (explicit vs. implicit). “In both studies, model comparison using confirmatory factor analysis showed the data were better fit to a two-factor model distinguishing automatic and controlled components of attitudes than to a model distinguishing attitudes by whether they were measured directly or indirectly” (p. 393). The authors then suggest that some explicit measures (ratings of gut reactions) can measure automatic attitudes. “These findings suggest that direct measures can be devised to capture automatic components of attitudes despite suggestions that indirect measures are essential for such assessments” (p. 393).

New Analysis 

The main problem with this article is that the authors never report parameter estimates for the model. Depending on the pattern of correlations among the three factors and the factor loadings, the interpretation of the results can change. I first tried to fit the three-factor model to the published correlation matrix (treated as a covariance matrix with variances set to 1). Mplus 7.1 showed some problems with a negative residual variance for Actual. Also, the model had one fewer degree of freedom than the published model. However, fixing the residual variance of Actual did not solve the problem. I then proceeded to fit my own model. The model is essentially the same as the three-factor model, with the exception that I modeled the correlations among the three latent factors with a single higher-order factor. This factor represents variation in common causes that influence all attitude measures. The problem of negative variance in the Actual measure was solved by allowing for an extra correlation between the Actual and Gut ratings. As seen in the correlation table, these two explicit measures correlated more highly with each other (r = .65) than the corresponding Time0 and Time5 measures (rs = .54, .50). As in the original article, model fit was good (see Figure 1). Figure 1 shows for the first time the parameter estimates of the model.
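
The one-degree-of-freedom discrepancy mentioned above can be made concrete by counting parameters. The sketch below assumes a standard three-factor confirmatory model in which each of the seven measures loads on exactly one factor, factor variances are fixed to 1, and the three factor correlations are free. The original article does not report its parameterization, so this is only one plausible reading, and the function name is made up for illustration.

```python
def cfa_degrees_of_freedom(n_indicators, n_factors, extra_free=0):
    """df for a simple-structure CFA with standardized latent factors."""
    # Unique variances and covariances in the observed covariance matrix.
    n_moments = n_indicators * (n_indicators + 1) // 2
    n_loadings = n_indicators            # one loading per indicator
    n_residuals = n_indicators           # one residual variance per indicator
    n_factor_corrs = n_factors * (n_factors - 1) // 2
    n_free = n_loadings + n_residuals + n_factor_corrs + extra_free
    return n_moments - n_free


# Seven measures, three correlated factors (assumed parameterization).
df_three_factor = cfa_degrees_of_freedom(n_indicators=7, n_factors=3)
print(df_three_factor)  # 28 moments - 17 free parameters = 11
```

Under these assumptions the model has 11 degrees of freedom, one fewer than the 12 reported in the article. A model with one additional constraint would reproduce the published value, but the article does not say which constraint, if any, was imposed.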

[Figure 1: attitude-multi-method]

The loadings of the explicit measures on the primary latent factors are above .80. For single-item measures, this implies that these ratings are essentially measuring the same construct with some random error. Thus, the latent factors can be interpreted as explicit ratings of affective responses immediately or after some reflection. The loadings of these two factors on the higher-order factor show that reflective and immediate responses are strongly influenced by the common factor. This is not surprising. Reflection may alter the immediate response somewhat, but it is unlikely to reverse or dramatically change the response a few seconds later. Interestingly, the immediate response has a higher loading on the attitude factor, although in this small sample the difference in loadings is not significant (chi^2(1) = 0.22). The third primary factor represents the shared variance among the three reaction time measures. It also loads on the general attitude factor, but the loading is weaker than the loadings for the explicit measures. The parameter estimates suggest that about 25% of the variance in this factor is explained by the common attitude (.51^2) and about 75% is unique to the reaction time measures. This variance component can be interpreted as unique variance in implicit measures. The factor loadings of the three reaction time measures are also relevant. The loading of the IAT suggests that only 28% (.53^2) of the observed variance in the IAT reflects the effect of causal factors that influence reaction time measures of attitudes. As some of this variance is also shared with explicit measures, only 21% ((.86*.53)^2) of the variance in the IAT represents the variance in the implicit attitude factor, where .86 is the square root of the unique variance of the reaction time factor (1 - .51^2). This has important implications for the use of the IAT to examine potential effects of implicit attitudes on behavior. Even if implicit attitudes had a strong effect on a behavior (r = .5), the correlation between IAT scores and the behavior would only be r = .86*.53*.5 = .23.
A sample size of N = 146 participants would be needed to have 80% power to provide significant evidence for such a relationship (p < .05, two-tailed). Given a more modest effect of attitudes on behavior, r = .86*.53*.30 = .14, the sample size would need to be larger (N = 398). As many studies of implicit attitudes and behavior used smaller samples, we would expect many non-significant results, unless non-significant results remain unreported and published results report inflated effect sizes. One solution to the problem of low power in studies of implicit attitudes would be the use of multiple implicit attitude measures. This study suggests that a battery of different reaction time tasks can be used to remove random and task-specific measurement error. Such a multi-method approach to the measurement of implicit attitudes is highly recommended for future studies because it would also help to interpret results of studies in which implicit attitudes do not influence behavior. If a set of implicit measures shows convergent validity but does not predict the behavior, this finding would indicate that implicit attitudes did not influence the behavior. In contrast, a null result with a single implicit measure may simply show that the measure failed to measure implicit attitudes.
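
The arithmetic in this paragraph can be retraced with a short Python sketch. It takes the two standardized loadings quoted in the text (.51 and .53) as given and uses the Fisher z approximation for the sample size needed to detect a correlation. That approximation is an assumption of this sketch, not necessarily the method used for the figures above, and the function names are made up for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist

# Standardized estimates quoted in the text.
rt_on_general = 0.51   # reaction time factor on the general attitude factor
iat_on_rt = 0.53       # IAT on the reaction time factor

# Unique (implicit) part of the reaction time factor: sqrt(1 - .51^2).
rt_unique = sqrt(1 - rt_on_general ** 2)

iat_var_rt_factor = iat_on_rt ** 2               # IAT variance due to the RT factor
iat_var_implicit = (rt_unique * iat_on_rt) ** 2  # IAT variance unique to implicit attitudes


def expected_r(true_effect):
    """Expected IAT-behavior correlation if only the unique implicit
    variance predicts behavior with the given true effect size."""
    return rt_unique * iat_on_rt * true_effect


def n_for_power(r, alpha=0.05, power=0.80):
    """Approximate N for a two-tailed test of r (Fisher z approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ((z_alpha + z_power) / atanh(r)) ** 2 + 3


print(f"IAT variance due to RT factor:       {iat_var_rt_factor:.2f}")
print(f"IAT variance unique to implicit part: {iat_var_implicit:.2f}")
for effect in (0.50, 0.30):
    # Round r to two decimals, as the text does, before computing N.
    r = round(expected_r(effect), 2)
    print(f"true effect {effect:.2f}: expected r = {r:.2f}, "
          f"approximate N = {n_for_power(r):.0f}")
```

With the rounded correlations used in the text (.23 and .14), the approximation lands close to the quoted sample sizes of 146 and 398. Exact power routines can differ by a participant or so.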

Conclusion

This article reported some interesting data, but failed to report the actual results. This analysis of the data showed that explicit measures are highly correlated with each other and show discriminant validity from implicit, reaction time measures. The new analysis also made it possible to estimate the amount of variance in the Implicit Association Test that is not shared with explicit measures but is shared with other implicit measures. The estimate of about 20% suggests that most of the variance in the IAT is due to factors other than implicit attitudes and that the test cannot be used to diagnose individuals. Whether the 20% of variance that is uniquely shared with other implicit measures reflects unconscious attitudes or method variance that is common to reaction time tasks remains unclear. The model also implies that the predictive validity of a single IAT for prejudice behaviors is expected to be small to moderate (r < .30), which means large samples are needed to study the effect of implicit attitudes on behavior.