All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model

Zou, C., Schimmack, U., & Gere, J. (2013). The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model. Psychological Assessment, 25(4), 1247–1254.

ABSTRACT

In the subjective indicators tradition, well-being is defined as a match between an individual’s actual life and his or her ideal life. Common well-being indicators are life-satisfaction judgments, domain satisfaction judgments, and measures of positive and negative affect (hedonic balance). These well-being indicators are routinely used to study well-being, but a formal measurement model of well-being is lacking. This article introduces a measurement model of well-being and examines the validity of self-ratings and informant ratings of well-being. Participants were 335 families (1 student with 2 parents, N = 1,005). The main findings were that (a) self-ratings and informant ratings are equally valid, (b) global life-satisfaction judgments and averaged domain satisfaction judgments are about equally valid, and (c) about 1/3 of the variance in a single indicator is valid. The main implication is that researchers should demonstrate convergent validity across multiple indicators by multiple raters.

Keywords: life satisfaction, affect, self-reports, informant-reports, multitrait–multimethod

Well-being is an important goal for many people; thus, social scientists from a variety of disciplines study well-being. A major problem for well-being scientists is that well-being is difficult to define and measure (Diener, Lucas, Schimmack, & Helliwell, 2009). These difficulties may threaten the validity of well-being measures. The aim of the present study is to examine the validity of the most commonly used measures of well-being.

A measure is valid if it measures what it is intended to measure. This definition of validity implies that it is important to define a construct (i.e., what is being measured?) before it is possible to evaluate the validity of a measure (Schimmack, 2010). Unfortunately, there is no agreement about the definition of the term well-being (Diener et al., 2009). It is therefore necessary to explain how we define the term well-being before we can examine the validity of well-being measures. We agree with philosophical arguments that well-being is a subjective concept (Diener, 1984; Sumner, 1996; see Diener, Suh, Lucas, & Smith, 1999, for a detailed discussion). A key criterion of a subjective definition of well-being is that the evaluation has to take the subjective values, motives, and ideals of individuals into account; that is, is his or her life going well for him or her? Accordingly, we define well-being as a match between an individual’s actual life and his or her ideal life. This definition is consistent with the prevalent definition of well-being in the social indicators tradition (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Veenhoven & Jonkers, 1984). This definition of well-being led to the creation of subjective well-being indicators such as life-satisfaction judgments (Diener, 1984). These measures are routinely used to make inferences about the determinants of well-being. These inferences implicitly assume that well-being measures are valid, but the literature on the validity of these measures is sparse and controversial (Schwarz & Strack, 1999; Schimmack & Oishi, 2005; Schneider & Schimmack, 2009). Since there is no gold standard to validate well-being measures, convergent validity between self-ratings and informant ratings of well-being has been used as the primary evidence for the validity of well-being measures (Diener et al., 2009). However, a major limitation of previous studies is that they did not provide quantitative information about the amount of valid variance in different well-being measures (cf. Schneider & Schimmack, 2009). Our study addresses this problem and provides the first quantitative estimates of the amount of valid variance in the most widely used measures of well-being.

One problem in the estimation of effect sizes is that estimates based on small samples are imprecise because sampling error is substantial. To obtain data from a large sample, we used a round-robin design. In this design, participants are both targets and informants, thus increasing the number of targets. To ensure that informants have valid information about targets’ well-being, we used families as units of analysis. Specifically, we recruited university students and their biological parents (see Table 1).

A round-robin design creates two problems for a standard structural equation model. First, observations are not independent because participants are recruited as triads rather than as individuals. Second, the distinction between the three raters (student, mother, and father) does not provide information about the validity of self-ratings because self-ratings are a function of rater and target (i.e., the diagonal in Table 1).

To overcome these problems, we made use of advanced features in the structural equation modeling program Mplus 5.0 (Muthén & Muthén, 2007). First, we used the CLUSTER command to obtain adjusted standard errors and fit indices that take the interdependence among family members into account. Second, we rearranged the data to create variables with self-ratings (see Table 2). This creates missing data in the diagonal of the traditional round-robin design. To analyze these data with missing values, we used the MODEL = COMPLEX function of Mplus (Muthén & Muthén, 2007). Thus, our model included 16 (4 raters × 4 measures) observed variables.
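To make this rearrangement concrete, the following sketch (my own illustration with hypothetical column names and values; the authors worked in Mplus, not Python) reshapes a wide round-robin file so that each target gets a self-rating variable and one informant variable per rater role, with the rater's own slot left structurally missing.

```python
import numpy as np
import pandas as pd

# Hypothetical wide file: one row per family, columns "<rater>_rates_<target>_ls"
# (ls = life satisfaction). The diagonal (rater == target) holds self-ratings.
raw = pd.DataFrame({
    "family_id": [1],
    "student_rates_student_ls": [5.3], "mother_rates_student_ls": [5.0], "father_rates_student_ls": [4.7],
    "student_rates_mother_ls": [5.7],  "mother_rates_mother_ls": [6.0],  "father_rates_mother_ls": [5.7],
    "student_rates_father_ls": [4.3],  "mother_rates_father_ls": [4.7],  "father_rates_father_ls": [4.3],
})

roles = ["student", "mother", "father"]
rows = []
for _, fam in raw.iterrows():
    for target in roles:
        row = {"family_id": fam["family_id"], "target": target,
               "self_ls": fam[f"{target}_rates_{target}_ls"]}
        for rater in roles:
            # Informant columns: the rater's own target stays missing.
            row[f"{rater}_inf_ls"] = (np.nan if rater == target
                                      else fam[f"{rater}_rates_{target}_ls"])
        rows.append(row)

long = pd.DataFrame(rows)
print(long)  # 3 rows per family; each informant column has exactly one missing cell
```

Applied to all four measures, this yields the 16 observed variables described above; the clustering options then account for the fact that the three rows per family are not independent.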

A Measurement Model of Well-Being

Quantitative estimates of validity require a formal measurement model in which variation in well-being (the match between individuals’ actual and ideal lives) is an unobserved cause that produces variation in observed well-being measures (e.g., self-ratings of life-satisfaction; cf. Schimmack, 2010). Our measurement model of well-being (see Figure 1) is similar to Diener et al.’s (1999) theoretical model of well-being. It is also related to the causal systems model of subjective well-being (Busseri & Sadava, 2011). In this model, positive affect and negative affect are distinct affective experiences. For most people, feeling good and not feeling bad is an important part of an ideal life, and the balance of positive versus negative affect serves as an important basis for life-satisfaction judgments (Schimmack, Radhakrishnan, Oishi, Dzokoto, & Ahadi, 2002; Suh, Diener, Oishi, & Triandis, 1998). Consistent with these assumptions, positive affect and negative affect are distinct components of hedonic balance (using a formative measurement model), and hedonic balance influences well-being. The formative measurement model of hedonic balance makes no assumptions about the correlation between its components. As prior research often reveals a moderate negative correlation between positive affect and negative affect, our model allows for the two components to correlate with each other (Diener, Smith, & Fujita, 1995; Gere & Schimmack, 2011). The well-being factor is identified by two satisfaction measures, global life-satisfaction judgments and averaged domain satisfaction judgments. Prior studies often relied exclusively on global life-satisfaction judgments (Lucas, Diener, & Suh, 1996; Walker & Schimmack, 2008). The problem with this approach is that global life-satisfaction judgments can be influenced by focusing illusions (Kahneman, Krueger, Schkade, Schwarz, & Stone, 2006; but see Schimmack & Oishi, 2005). Focusing illusions could produce systematic measurement error in global life-satisfaction judgments that could attenuate the influence of hedonic balance on well-being. To address this concern, our model included averaged domain satisfaction judgments as a second indicator of well-being. As averaged domain satisfaction judgments are not susceptible to focusing illusions, the focusing illusion hypothesis predicts that averaged domain satisfaction judgments have a higher loading on the well-being factor (i.e., are more valid) than global life-satisfaction judgments.

Figure 1 does not show how our model incorporated systematic rater biases. For each rater, we created a single bias factor. This factor represents general evaluative biases in self-ratings and ratings of others that influence personality and well-being ratings (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Kim, Schimmack, & Oishi, 2012; Schimmack, Schupp, & Wagner, 2008).

The Present Study

Model fit was assessed using standard criteria of acceptable model fit such as a comparative fit index (CFI) > .95, root-mean-square error of approximation (RMSEA) < .06, and standardized root-mean-square residual (SRMR) < .08 (Schermelleh-Engel, Moosbrugger, & Muller, 2003). Due to the large sample size of the present data (N = 1,005), tests of model comparison using p-values will often lead to misleading results (cf. Raftery, 1995). Therefore, we used the Bayesian information criterion (BIC) for model comparisons. Models with lower BIC values are preferable because the BIC rewards parsimony. This is especially important in new research areas because small effects are less likely to replicate. Following Raftery’s (1995) standards, a difference in BIC values greater than 10 can be interpreted as very strong evidence to support the model with the lower BIC value.
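As a reminder of how the BIC is defined and compared, here is a minimal sketch (the formula only; Mplus reports BIC values directly, and the numbers below are merely illustrative of the range reported later in the article) applying Raftery's (1995) threshold of 10.

```python
import numpy as np

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """BIC = -2 * logL + k * ln(N); lower values indicate a better
    trade-off between fit and parsimony."""
    return -2.0 * log_likelihood + n_params * np.log(n_obs)

def compare_bic(bic_a: float, bic_b: float, threshold: float = 10.0) -> str:
    """Raftery (1995): a BIC difference greater than 10 counts as very
    strong evidence for the model with the lower value."""
    diff = abs(bic_a - bic_b)
    better = "A" if bic_a < bic_b else "B"
    strength = "very strong" if diff > threshold else "weak"
    return f"Model {better} preferred (delta BIC = {diff:.1f}, {strength} evidence)"

# Unconstrained vs. constrained-loadings model (values as reported below):
print(compare_bic(31102, 30993))
```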

Method

Participants were 335 students at the University of Toronto and their parents (335 triads; N = 1,005). Of the 335 students, 235 were women and 100 were men, and their ages ranged from 17 to 30 years (Mage = 19.56, SD = 2.23). The age of mothers ranged from 37 to 63 years (Mage = 48.25, SD = 5.08). The age of fathers ranged from 38 to 72 years (Mage = 51.67, SD = 5.67). Students were required to be living with both of their biological parents so that the members of each family had good knowledge of one another. Students took part in the study for either $25 or course credit. Their parents each received $25 for participating in the study. Two hundred thirty-five students came to the laboratory with their parents to complete the study. One hundred students and their parents completed the study in their homes.

Participants who came into the laboratory filled out consent forms and were seated in separate rooms to ensure that reports were made independently. They filled out a series of questionnaires about themselves and about the other two members of their families. They were then debriefed and thanked for their participation. Students who took the questionnaires home met with a researcher who gave them detailed instructions and the questionnaire packages. Participants were asked to fill out the questionnaires in separate rooms and to refrain from talking about their responses until all members of the family had completed the questionnaire. Each family member placed his or her own completed questionnaire into an envelope, sealed the envelope, and signed it across the flap. Once the questionnaire packages were completed, participants returned them and were debriefed and thanked for their participation.

Measures

Since well-being is defined as an evaluation of an individual’s actual life, the assessment of well-being has to be retrospective. For this reason, we asked participants to think about the past 6 months when answering the questions. Additionally, since global judgments of life satisfaction can be influenced by temporarily accessible information (Schimmack & Oishi, 2005; Schwarz & Strack, 1999), the global self-ratings of life satisfaction were assessed first.

Global life evaluation. For the global evaluative judgments, the first three items of the Satisfaction With Life Scale were used (SWLS; Diener, Emmons, Larsen, & Griffin, 1985). The items ask participants to evaluate their lives on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). The first three items (“In most ways my life is close to my ideal”; “The conditions of my life are excellent”; “I am satisfied with my life”) were chosen because they have been shown to have better psychometric properties than the last two items of the scale (Oishi, 2006). Consistent with prior studies, the internal consistency of the three-item scale was good, alphas > .80 (α = .83 for students; α = .89 for mothers; α = .89 for fathers). The items for the informant reports were virtually the same, but the wording was changed to an informant report format (e.g., Kim et al., 2012). Informants were instructed to fill out the scale from the target’s perspective. For example, students serving as informants for their father would rate “In most ways my father thinks that his life is close to his ideal.” Ratings were made on 7-point Likert scales. The internal consistency of informant ratings was similar to the internal consistency of self-ratings (ranged from α = .85 to α = .93).

Averaged domain satisfaction. Domain satisfaction was assessed with single-item indicators for six important life domains, using satisfaction judgments (“I am satisfied with . . .”). The life domains were romantic life, work/academic life, health, recreational life, housing, and friendships. Responses were made on 7-point Likert scales ranging from 1 (strongly disagree) to 7 (strongly agree). The domains were chosen based on previous studies showing that these domains are rated as moderately to very important (Schimmack, Diener, & Oishi, 2002). We averaged these items to obtain an alternative measure of life evaluations. The informant version of the questionnaire changed the stem from “I am . . .” to “My son/daughter/mother/father is . . .” and “my” to “his/her.”

Positive and negative affect. Positive and negative affect were assessed using the Hedonic Balance Scale (Schimmack et al., 2002). The scale has three items for positive affect (pleasant, positive, good) and three items for negative affect (unpleasant, negative, bad). The items for positive and negative affect were averaged separately to create composites for positive and negative affect, respectively. All of the self-ratings for positive affect had a reliability of over .80 (α = .82 for students; α = .85 for mothers; α = .85 for fathers). Similarly, all of the self-ratings for negative affect had a reliability of over .75 (α = .80 for students; α = .75 for mothers; α = .78 for fathers). For the informant reports, “. . . how often do you experience the following feelings?” was replaced with “. . . how often does your mother/father/son/daughter experience the following feelings?” All of the informant reports had reliabilities of over .75 (range from α = .75 to α = .89).

Results

Multitrait–Multimethod Matrix

Table 3 shows the correlations among the 16 variables created by crossing the four indicators (life satisfaction, domain satisfaction, positive affect, and negative affect) with the four raters (self, student informant, mother informant, and father informant). Note that because a person cannot serve as his or her own informant, correlations between self-reports and informant reports are based on 66% of all observations. The correlations between the self-report measures were based on 100% of the observations.

Correlations between the same construct assessed with different methods (i.e., convergent validity coefficients) are bolded. All of the convergent validity coefficients were significantly greater than zero and exceeded a minimum value of r = .25. Convergent validity correlations for affective indicators (positive affect and negative affect) were lower than correlations for the evaluative indicators (life satisfaction and domain satisfaction). These findings replicate the results of a meta-analysis (Schneider & Schimmack, 2009).

Table 3 can also be used to examine whether each indicator measures well-being in a slightly different manner. Twenty-two out of 24 cross-indicator–cross-rater correlations were weaker than the convergent validity coefficients, indicating that the different indicators have unique variance. This finding replicates Lucas et al.’s (1996) results. However, Table 3 also shows that all well-being measures are related to each other. This pattern of results is consistent with the assumption that all measures reflect a common construct.

Table 3 also shows stronger same-rater correlations than cross-rater correlations. This pattern is consistent with our assumption that ratings by a single rater are influenced by an evaluative bias (Anusic et al., 2009; Campbell & Fiske, 1959). Most important, Table 3 provides new information about informant–informant agreement. One notable pattern in the data is that the correlations between informant ratings by mothers (mother informant) and fathers (father informant) were stronger than correlations of informant ratings by parents with those by students as informants. There are two possible explanations for this pattern. First, it is possible that students’ informant reports are less valid than parents’ informant ratings. However, this interpretation of the data is inconsistent with the finding that self-ratings were more highly correlated with students’ informant ratings than with parents’ informant ratings. Therefore, we favor the second explanation that parents’ informant ratings share method variance. This interpretation is also consistent with other multirater studies that have demonstrated shared method variance between parents’ ratings of their children’s personality (Funder, Kolar, & Blackman, 1995).

Structural Equation Modeling

We fitted the measurement model in Figure 1 to our data. In the first model, we did not constrain coefficients. This model served as the base model for comparisons with more parsimonious models with constrained coefficients. The first model with unconstrained coefficients had acceptable fit to the data, χ²(df = 78) = 104.41, CFI = 0.995, RMSEA = 0.018, standardized root-mean-square residual (SRMR) = 0.026; BIC = 31,102. Ratings by different raters of the same measure (e.g., life satisfaction) showed very similar factor loadings. We therefore specified a model that constrained factor loadings and residuals for the four raters to be equal. This model implies that ratings by different raters are equally valid. The model with constrained parameters maintained good fit and had a lower (i.e., superior) BIC value, χ²(df = 102) = 148.18, CFI = 0.991, RMSEA = 0.021, SRMR = 0.041; BIC = 30,993. In the next model, we constrained the loadings on the rater-specific bias factors to be equal across raters. Again, model fit remained acceptable, and BIC decreased, indicating that rater bias is similar across raters, χ²(df = 117) = 188.48, CFI = 0.986, RMSEA = 0.025, SRMR = 0.068; BIC = 30,936. We retained this model as the final model. The parameter estimates of the final model and their 95% confidence intervals are listed in Table 4. For ease of interpretation, the main parameter estimates are also included in Figure 1.

The main finding was that the life-satisfaction factor and the average domain satisfaction factor had very high loadings on the well-being factor. Thus, our results provide no support for the hypothesis that focusing illusions undermine the validity of global life-satisfaction judgments. We also found a very strong effect of hedonic balance on the well-being factor. Yet, all three measures of well-being had significant residual variances, indicating that the measures are not redundant. Most important, about 20% of the variance in well-being was not accounted for by hedonic balance. This suggests that affective measures and evaluative judgments can show divergent patterns of correlations with predictor variables.

The factor loadings of the observed variables on the factor representing the shared variance among raters (e.g., self-ratings of life satisfaction [LS] on the LS factor) can be interpreted as validity coefficients for specific constructs (e.g., the validity of a self-rating of life satisfaction as a measure of life satisfaction; cf. Schimmack, 2010). The validity coefficients of the four types of indicators were very similar (see Table 4). The validity coefficients suggest that about one third (29% to 38%) of the variance in a single indicator by a single rater (e.g., self-ratings of life satisfaction) is valid variance.

It is important to keep in mind that these estimates examine the validity of a single rater with regard to a specific measure of well-being rather than the validity of these measures as measures of well-being. To examine the validity of specific measures as measures of the well-being factor in our measurement model, we need to estimate indirect effects of the well-being factor on specific measures. For example, self-ratings of life satisfaction load at .60 on the life satisfaction factor. However, this does not mean that self-ratings of life satisfaction capture 36% (.60 × .60) of valid variance of well-being, because life satisfaction is not a perfect indicator of well-being. Based on our model, the life satisfaction factor loads at .96 on the well-being factor. We also need to take this measurement error into account to examine the validity of self-ratings of life satisfaction in assessing well-being (.96 × .60 = .58; valid variance = 33%).
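Written out, the arithmetic for this example (loadings taken from the final model; the symbols are my shorthand, not the article's notation) is:

```latex
% Validity of self-rated life satisfaction as a measure of the well-being factor
\begin{align*}
\lambda_{\text{self}\rightarrow\text{LS}} &= .60, \qquad
\lambda_{\text{LS}\rightarrow\text{WB}} = .96 \\
v &= \lambda_{\text{LS}\rightarrow\text{WB}} \times \lambda_{\text{self}\rightarrow\text{LS}}
   = .96 \times .60 \approx .58 \\
\text{valid variance} &= v^{2} \approx .58^{2} \approx .33
\end{align*}
```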

Discussion

Our study provides the first quantitative estimates of the validity of various well-being measures using a theoretically grounded model of well-being. Our main findings were that (a) about one third of the variance in a single well-being indicator is valid variance, (b) self-ratings are neither significantly more nor less valid than ratings by a single well-acquainted informant, (c) a large portion of the valid variance in a specific type of indicator is shared across indicators, and (d) hedonic balance and evaluative judgments have some unique variance.

We found no support for the focusing illusion hypothesis. If the distinction between hedonic balance and global life-satisfaction judgments were caused by a focusing illusion, the factor loading of life satisfaction on well-being should have been lower than the factor loading of the average domain satisfaction judgment. However, the actual results showed a slightly reversed pattern. This suggests that unique variance in evaluative judgments reflects valid well-being variance because individuals do not rely exclusively on hedonic balance to evaluate their lives. This finding provides empirical support for philosophical arguments against purely hedonistic definitions of well-being (Sumner, 1996). At the same time, the overlap between evaluative judgments and hedonic balance is substantial, indicating that positive experiences make an important contribution to well-being for most individuals. Another noteworthy finding was that global life-satisfaction judgments and averaged domain satisfaction judgments were approximately equally valid. This finding contradicts previous findings that averaged domain satisfaction judgments were more valid in a study with friends as informants (Schneider & Schimmack, 2010). Future research needs to examine whether the type of informant is a moderator. For example, it is possible that global life-satisfaction judgments are more difficult to make, which gives family members an advantage over friends. In the following sections, we discuss the main implications of our findings for the use of well-being measures in the assessment of individuals’ well-being and for the use of well-being measures in policy decisions.

Validity of Well-Being Indicators

Our results suggest that about one third of the variance in a single well-being indicator by a single rater is valid variance. This finding has important implications for the interpretation of studies that rely on a single well-being indicator as a measure of well-being. For example, many important findings about well-being are based on a single global life-satisfaction rating in the German Socio-Economic Panel (e.g., Lucas & Schimmack, 2009). It is well-known that observed effect sizes in these studies are attenuated by random measurement error and that it would be desirable to correct effect size estimates for unreliability (Schmidt & Hunter, 1996). However, systematic measurement error can further attenuate observed effect sizes. Schimmack (2010) proposed that quantitative estimates of validity could be used to disattenuate observed effect sizes for invalidity. To illustrate the implications of correcting for invalidity in well-being indicators, we use Kahneman et al.’s (2006) finding that household income was a moderate predictor of self-reported life satisfaction (r = .32). Our findings suggest that this observed relationship underestimates the relationship between household income and well-being. To disattenuate the observed relationship, the observed correlation has to be divided by the validity coefficient (i.e., .96 × .60 = .58). Thus, the corrected estimate of the true effect size would increase to r = .56 (.32/.58), which is considered a strong effect size (Cohen, 1992). Researchers may be reluctant to trust adjusted effect sizes because they rely on assumptions about validity. However, the common practice of relying on observed relationships as estimates of effect sizes also relies on an implicit assumption, namely, that the observed measure is perfectly valid. In comparison to an assumption of 100% valid variance in a single global life-satisfaction judgment, our estimate of about one-third valid variance is more realistic and supported by empirical evidence. Nevertheless, our findings should only be treated as a first estimate and a benchmark for future studies. Future research needs to replicate our findings and examine moderating factors of validity in well-being measures.
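In formula form, the correction described above divides the observed correlation by the validity coefficient of the well-being measure, with household income treated as error-free (the notation is mine):

```latex
% Disattenuating the income--life-satisfaction correlation for invalidity
% (the article rounds the denominator .96 x .60 = .576 to .58)
\[
r_{\text{corrected}} \;=\; \frac{r_{\text{observed}}}{v}
  \;=\; \frac{.32}{.96 \times .60} \;\approx\; .56
\]
```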

Self-Reports Versus Informant Reports

Schneider and Schimmack (2009) noted that previous studies failed to compare the validity of self-ratings and informant ratings. Our results suggest that self-ratings and ratings by a single well-acquainted informant are approximately equally valid. While this is a surprising finding given the subjective nature of well-being, it is not uncommon in personality psychology to find evidence of equal or sometimes greater validity in informant ratings than self-ratings. For instance, informant reports of personality often provide better predictive validity than self-reports (e.g., Kolar, Funder, & Colvin, 1996). Since we did not have any outcome measure of well-being (e.g., suicide) in the present study, we could not test for the predictive validity of self- and informant reports. However, this is an important avenue for future research. To our knowledge, no study has compared self-ratings and informant ratings using life events that are known to influence well-being such as marriage, divorce, or unemployment (Diener, Lucas, & Scollon, 2006).

Informant ratings also have an important advantage over self-ratings. Namely, it is possible to obtain ratings from multiple informants, but there is only one self to provide self-ratings. Aggregation of informant ratings can substantially increase the validity of informant ratings. We computed well-being indicators for single raters and multiple raters using the following weights (Well-Being = 1.5 Life Satisfaction + 1.5 Domain Satisfaction + 2 Positive Affect − 1 Negative Affect) and computed the correlation with the well-being factor in Figure 1. The correlations were r = .62 for self-ratings, r = .77 for an aggregate of three informant ratings, and r = .81 for an aggregate of all four ratings. Although the difference between .62 and .77 may not seem impressive, it implies that aggregation across raters can increase the amount of valid variance from one third to two thirds of the observed variance. This finding suggests that clinicians can benefit considerably from obtaining well-being measures from multiple informants to assess individuals’ well-being.
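A small sketch of this aggregation step (made-up ratings and generic rater labels, not the study's data; only the weights are taken from the paragraph above):

```python
import numpy as np

def wellbeing_composite(ls, ds, pa, na):
    """Weighted composite from the article:
    1.5*LS + 1.5*DS + 2*PA - 1*NA (higher = greater well-being)."""
    return 1.5 * ls + 1.5 * ds + 2.0 * pa - 1.0 * na

# Hypothetical ratings of one target (LS, DS, PA, NA on 1-7 scales).
ratings = {
    "self":        (5.3, 5.2, 5.0, 2.7),
    "informant_1": (5.0, 5.0, 4.7, 3.0),
    "informant_2": (5.7, 5.3, 5.3, 2.3),
    "informant_3": (4.7, 5.0, 4.7, 3.0),
}
composites = {rater: wellbeing_composite(*vals) for rater, vals in ratings.items()}

informant_aggregate = np.mean([v for k, v in composites.items() if k != "self"])
all_rater_aggregate = np.mean(list(composites.values()))
print(composites)
# Averaging composites across raters cancels rater-specific error and bias,
# which is why the aggregate correlates more strongly with the latent factor.
print(informant_aggregate, all_rater_aggregate)
```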

Limitations

Our study has numerous limitations. The use of a convenience sample from a specific population means that the generalizability of our findings needs to be examined in samples drawn from other populations. However, our results are broadly consistent with meta-analytic findings (Schneider & Schimmack, 2009). Another limitation was that parents are not independent raters and appear to share rating biases. In the future, it would be desirable to obtain ratings from independent raters (e.g., friends and parents). Finally, our conclusions are limited by the assumptions of our model. While it is possible to fit other models to our data in Table 3 (e.g., Busseri & Sadava, 2011), the alternative models each have their own limitations. Future studies should test these alternative models to examine whether they reveal different or unique findings from the present study. We encourage readers to fit alternative models to the correlation matrix in Table 3 and examine whether these models provide better fit to our data. We consider our model merely a plausible first attempt to create a measurement model of well-being that can underpin empirical studies of well-being.

Conclusions

Although the study of happiness has been of great interest to many researchers and the general public, the validity of well-being measures has not improved over the past 50 years (Schneider & Schimmack, 2009). In order for well-being researchers to provide accurate information about the determinants of well-being, it is crucial to use a valid method to assess well-being. If invalid measures are used, findings that rely on such measures will also lack validity. The current study found that only about one third of the variance in a self-report measure of well-being is valid. To increase the validity of well-being measures, multiple methods of assessing well-being should be used. When better measures are used, researchers can also be more confident that their findings can be trusted.

References

Andrews, F. M., & Withey, S. B. (1976). Social indicators of well-being: America’s perception of life quality. New York, NY: Plenum.

Anusic, I., Schimmack, U., Pinkus, R. T., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97, 1142–1156. doi:10.1037/a0017159

Busseri, M. A., & Sadava, S. W. (2011). A review of the tripartite structure of subjective well-being: Implications for conceptualization, operationalization, analysis, and synthesis. Personality and Social Psychology Review, 15, 290–314. doi:10.1177/1088868310391271

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

Cantril, H. (1965). The pattern of human concerns (Vol. 4). New Brunswick, NJ: Rutgers University Press.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. doi:10.1037/0033-2909.112.1.155

Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95, 542–575. doi:10.1037/0033-2909.95.3.542

Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75. doi:10.1207/s15327752jpa4901_13

Diener, E., Lucas, R. E., Schimmack, U., & Helliwell, J. F. (2009). Well-being for public policy. New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780195334074.001.0001

Diener, E., Lucas, R. E., & Scollon, C. N. (2006). Beyond the hedonic treadmill: Revising the adaptation theory of well-being. American Psychologist, 61, 305–314.

Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130 –141. doi:10.1037/0022-3514.69.1.130

Diener, E., Suh, E. M., Lucas, R. E., & Smith, H. L. (1999). Subjective well-being: Three decades of progress. Psychological Bulletin, 125, 276–302.

Funder, D. C., Kolar, D. C., & Blackman, M. C. (1995). Agreement among judges of personality: Interpersonal relations, similarity, and acquaintanceship. Journal of Personality and Social Psychology, 69, 656–672. doi:10.1037/0022-3514.69.4.656

Gere, J., & Schimmack, U. (2011). A multi-occasion multi-rater model of affective dispositions and affective well-being. Journal of Happiness Studies, 12, 931–945. doi:10.1007/s10902-010-9237-3

Kahneman, D., Krueger, A. B., Schkade, D., Schwarz, N., & Stone, A. A. (2006). Would you be happier if you were richer? A focusing illusion. Science, 312, 1908 –1910. doi:10.1126/science.1129688

Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations of well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102, 856 – 873. doi:10.1037/a0026803

Kolar, D. W., Funder, D. C., & Colvin, C. R. (1996). Comparing the accuracy of personality judgments by the self and knowledgeable others. Journal of Personality, 64, 311–337. doi:10.1111/j.1467-6494.1996.tb00513.x

Lucas, R. E., Diener, E., & Suh, E. (1996). Discriminant validity of well-being measures. Journal of Personality and Social Psychology, 71, 616 – 628. doi:10.1037/0022-3514.71.3.616

Lucas, R. E., & Schimmack, U. (2009). Income and well-being. How big is the gap between the rich and the poor? Journal of Research in Personality, 43, 75–78. doi:10.1016/j.jrp.2008.09.004

Muthén, L. K., & Muthén, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthén & Muthén.

Oishi, S. (2006). The concept of life satisfaction across cultures: An IRT analysis. Journal of Research in Personality, 40, 411–423. doi:10.1016/j.jrp.2005.02.002

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–164. doi:10.2307/271063

Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8, 23–74.

Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257. doi:10.1002/per.771

Schimmack, U., Diener, E., & Oishi, S. (2002). Life-satisfaction is a momentary judgement and a stable personality characteristic: The use of chronically accessible and stable sources. Journal of Personality, 70, 345–384. doi:10.1111/1467-6494.05008

Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible information on life satisfaction judgments. Journal of Personality and Social Psychology, 89, 395–406. doi:10.1037/0022-3514.89.3.395

Schimmack, U., Radhakrishnan, P., Oishi, S., Dzokoto, V., & Ahadi, S. (2002). Culture, personality, and subjective well-being: Integrating process models of life satisfaction. Journal of Personality and Social Psychology, 82, 582–593.

Schimmack, U., Schupp, J., & Wagner, G. G. (2008). The influence of environment and personality on the affective and cognitive component of subjective well-being. Social Indicators Research, 89, 41–60. doi:10.1007/s11205-007-9230-3

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199 –223. doi:10.1037/1082-989X.1.2.199

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376. doi:10.1007/s11205-009-9440-y

Schneider, L., & Schimmack, U. (2010). Examining sources of self-informant agreement in life-satisfaction judgments. Journal of Research in Personality, 44, 207–212. doi:10.1016/j.jrp.2010.01.004

Schwarz, N., & Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). New York, NY: Russell-Sage.

Suh, E., Diener, E., Oishi, S., & Triandis, H. C. (1998). The shifting basis of life satisfaction judgments across cultures: Emotions versus norms. Journal of Personality and Social Psychology, 74, 482–493.

Sumner, L. W. (1996). Welfare, happiness, and ethics. New York, NY: Oxford University Press.

Veenhoven, R., & Jonkers, T. (1984). Conditions of happiness (Vol. 2). Dordrecht, the Netherlands: Reidel.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42, 490–497. doi:10.1016/j.jrp.2007.07.005

The Validation Crisis in Psychology

Most published psychological measures are unvalid.*
*unvalid = the validity of the measure is unknown.

Introduction

Eight years ago, psychologists started to realize that they had a replication crisis. Many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true or false.

The replication crisis is sometimes attributed to the lack of replication studies before 2011. However, this is not the case. Most published results were replicated successfully. However, these successes were entirely predictable from the fact that only successful replications would be published (Sterling, 1959). These sham replication studies provided illusory evidence for theories that have been discredited over the past eight years by credible replication studies.

New initiatives that are called open science are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow.

This blog post addresses another problem in psychological science, which I call the validation crisis. Replicability is only one necessary feature of a healthy science. Another necessary feature is the use of valid measures; this requirement is as obvious as the need for replicability. To test theories that relate theoretical constructs to each other (e.g., construct A influences construct B for individuals drawn from population P under conditions C), it is necessary to have valid measures of those constructs. However, it is unclear which criteria a measure has to fulfill to have construct validity. Thus, even successful and replicable tests of a theory may be false because the measures that were used lacked construct validity.

Construct Validity

The classic article on “Construct Validity” was written by two giants in psychology: Cronbach and Meehl (1955). Every graduate student of psychology, and surely every psychologist who has published a psychological measure, should be familiar with this article.

The article was the result of an APA task force that tried to establish criteria, now called psychometric properties, for tests to be published. The result of this project was the creation of the construct “construct validity.”

The chief innovation in the Committee’s report was the term construct validity. (p. 281).

Cronbach and Meehl provide their own definition of this construct.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” (p. 282).

In modern language, construct validity is the relationship between variation in observed test scores and a latent variable that reflects corresponding variation in a theoretical construct (Schimmack, 2010).

Thinking about construct validity in this way makes it immediately obvious why it is much easier to demonstrate predictive validity, which is the relationship between observed test scores and observed criterion scores, than to establish construct validity, which is the relationship between observed test scores and a latent, unobserved variable. To demonstrate predictive validity, one can simply obtain scores on a measure and a criterion and compute the correlation between the two variables. The correlation coefficient shows the amount of predictive validity of the measure. However, because constructs are not observable, it is impossible to use simple correlations to examine construct validity.

The problem of construct validation can be illustrated with the development of IQ scores. IQ scores can have predictive validity (e.g., performance in graduate school) without making any claims about the construct that is being measured (IQ tests measure whatever they measure, and what they measure predicts important outcomes). However, IQ tests are often treated as measures of intelligence. For IQ tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed IQ scores are related to unobserved variation in intelligence. Thus, construct validation requires clear definitions of constructs that are independent of the measure that is being validated. Without a clear definition of a construct, the meaning of a measure reverts essentially to “whatever the measure is measuring,” as in the old saying “Intelligence is whatever IQ tests are measuring.” This saying shows the problem of research with measures that have no clear construct and no construct validity.

In conclusion, the challenge in construct validation research is to relate a specific measure to a well-defined construct and to establish that variation in test scores is related to variation in the construct.

What Are Constructs?

Construct validation starts with an assumption: individuals are assumed to have an attribute, or in today's language, a personality trait. Personality traits are typically not directly observable (e.g., kindness rather than height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). The first step is to develop a measure of this attribute (e.g., a self-report measure “How kind are you?”). If the test is valid, variation in the observed scores on the measure should be related to the personality trait.

A construct is some postulated attribute of people, assumed to be reflected in test performance (p. 283).

The term “reflected” is consistent with a latent variable model, where unobserved traits are reflected in observable indicators. In fact, Cronbach and Meehl argue that factor analysis (not principal component analysis!) provides very important information for construct validity.

We depart from Anastasi at two points. She writes, “The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration.” We, however, regard such analysis as a most important type of validation. (p. 286).

Factor analysis is useful because factors are unobserved variables, and factor loadings show how strongly an observed measure is related to variation in an unobserved variable, the factor. If multiple measures of a construct are available, they should be positively correlated with each other, and factor analysis will extract a common factor. For example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (Schimmack, 2010).
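To see why loadings can be read as validity coefficients, here is a small simulation (an entirely made-up trait and made-up raters, independent of any study discussed here): standardized ratings are generated from a common latent trait with known validities, and a one-factor analysis approximately recovers those validities as loadings.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 50_000
kindness = rng.normal(size=n)                 # unobserved trait (the construct)

# Three raters whose standardized ratings reflect the trait with known validity.
validities = [0.8, 0.6, 0.5]                  # e.g., self, friend, coworker (hypothetical)
ratings = np.column_stack([
    v * kindness + np.sqrt(1.0 - v**2) * rng.normal(size=n) for v in validities
])

fa = FactorAnalysis(n_components=1, random_state=0).fit(ratings)
loadings = np.abs(fa.components_.ravel())     # sign of a factor is arbitrary
print(loadings.round(2))                      # approximately [0.8, 0.6, 0.5]
```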

In conclusion, factor analysis provides useful information about construct validity of measures because factors represent the construct and factor loadings show how strongly an observed measure is related to the construct.

It is clear that factors here function as constructs (p. 287).

Convergent Validity

The term convergent validity was introduced a few years later in another seminal article on validation research by Campbell and Fiske (1959). However, the basic idea of convergent validity was already specified by Cronbach and Meehl (1955) in the section “Correlation matrices and factor analysis.”

If two tests are presumed to measure the same construct, a correlation between them is predicted (p. 287).

If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated (p. 288)

Cronbach and Meehl realize the problem of using just two observed measures to examine convergent validity. For example, self-informant correlations are often used in personality psychology to demonstrate the validity of self-ratings. However, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. The correlation could reflect very high validity of self-ratings and modest validity of informant ratings, or the opposite could be true.
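Formally, under a single-factor model the expected self-informant correlation is the product of the two validity coefficients, so a single correlation cannot identify them separately (my notation):

```latex
% The same convergent correlation is consistent with very different validities
\[
r_{\text{self,informant}} \;=\; v_{\text{self}} \, v_{\text{informant}} \;=\; .4
\quad\Longrightarrow\quad
(v_{\text{self}},\, v_{\text{informant}}) \in \{(.8,\ .5),\ (.5,\ .8),\ (.63,\ .63),\ \dots\}
\]
```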

If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. (p. 300)

A multi-method approach avoids this problem, and factor loadings on a common factor can be interpreted as validity coefficients. More valid measures should have higher loadings than less valid measures. Factor analysis requires a minimum of three observed variables, but more are better. Thus, construct validation requires a multi-method assessment.

Discriminant Validity

The term discriminant validity was also introduced later by Campbell and Fiske (1959). However, Cronbach and Meehl already point out that both high and low correlations can support construct validity. Crucial for construct validity is that the correlations are consistent with theoretical expectations.

For example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. In contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem solving skills and intelligence is an ability to solve problems faster or better.

Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity (p. 288).

Quantifying Construct Validity

It is rare to see quantitative claims about construct validity. Most articles that claim construct validity of a measure simply state that the measure has demonstrated construct validity, as if a test were either valid or invalid. However, the previous discussion already made it clear that construct validity is a quantitative construct, because construct validity is the relation between variation in a measure and variation in the construct, and this relation can vary. If we use standardized coefficients like factor loadings to assess the construct validity of a measure, construct validity can range from -1 to 1.
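One way to write this down (my notation, in the spirit of Schimmack's, 2010, use of standardized loadings): the validity coefficient of a measure X for a construct T is their correlation, and the proportion of valid variance is its square.

```latex
\[
v_{X} \;=\; \operatorname{corr}(X, T), \qquad -1 \le v_{X} \le 1, \qquad
\text{valid variance} \;=\; v_{X}^{2}
\]
```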

Contrary to current practice, Cronbach and Meehl assumed that most users of measures would be interested in a “construct validity coefficient.”

There is an understandable tendency to seek a “construct validity coefficient.” A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis (p. 289).

Cronbach and Meehl are well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because the factor may not correspond perfectly to the construct.

Rarely will it be possible to estimate definite “construct saturations,” because no factor corresponding closely to the construct will be available (p. 289).

And nobody today seems to remember Cronbach and Meehl’s (1955) warning that rejecting the null hypothesis that the test has zero validity is not the end goal of validation research.

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (p. 290)

The problem is not to conclude that the test “is valid” for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have (p. 290).

One reason why psychologists may not follow this sensible advice is that estimates of construct validity for many tests are likely to be low (Schimmack, 2010).

The Nomological Net – A Structural Equation Model

Some readers may be familiar with the term “nomological net,” which was popularized by Cronbach and Meehl. In modern language, a nomological net is essentially a structural equation model.

The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These “laws” may be statistical or deterministic.

It is probably no accident that at the same time as Cronbach and Meehl started to think about constructs as separate from observed measures, structural equation modeling was developed as a combination of factor analysis, which made it possible to relate observed variables to variation in unobserved constructs, and path analysis, which made it possible to relate variation in constructs to each other. Although laws in a nomological network can take on more complex forms than linear relationships, a structural equation model is a nomological network (but a nomological network is not necessarily a structural equation model).

As proper construct validation requires a multi-method approach and demonstration of convergent and discriminant validity, SEM is ideally suited to examine whether the observed correlations among measures in a multi-trait-multi-method matrix are consistent with theoretical expectations. In this regard, SEM is superior to factor analysis. For example, it is possible to model shared method variance, which is impossible with factor analysis.
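A quick simulation (made-up numbers, not data from the well-being study above) shows the pattern that an explicit method factor is meant to capture: two informants who share an evaluative bias correlate more strongly with each other than with a third informant, even though all three are equally valid.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
trait = rng.normal(size=n)            # target's true standing on the construct
shared_bias = rng.normal(size=n)      # evaluative bias shared by raters B and C

v, b = 0.6, 0.4                       # trait validity and shared-bias loading
rater_a = v * trait + np.sqrt(1 - v**2) * rng.normal(size=n)
rater_b = v * trait + b * shared_bias + np.sqrt(1 - v**2 - b**2) * rng.normal(size=n)
rater_c = v * trait + b * shared_bias + np.sqrt(1 - v**2 - b**2) * rng.normal(size=n)

print(np.corrcoef([rater_a, rater_b, rater_c]).round(2))
# B-C correlation (~.52 = .6^2 + .4^2) exceeds A-B and A-C (~.36 = .6^2),
# although all three raters are equally valid. An SEM with a method factor
# can absorb this shared variance instead of mistaking it for validity.
```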

Cronbach and Meehl also realize that constructs can change as more information becomes available. It may also occur that the data fail to provide evidence for a construct. In this sense, construct validation is an ongoing process of improving the understanding of unobserved constructs and how they are related to observable measures.

Ideally this iterative process would start with a simple structural equation model that is fitted to some data. If the model does not fit, the model can be modified and tested with new data. Over time, the model would become more complex and more stable because core measures of constructs would establish the construct validity, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed.

When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network (p. 290).

Too often psychologists use SEM only to confirm an assumed nomological network, and it is often considered inappropriate to change a nomological network to fit observed data. However, SEM is as much a tool for exploring a new construct as it is for testing an existing one.

The example from the natural sciences was the initial definition of gold as having a golden color. However, later it was discovered that the pure metal gold is actually silver or white and that the typical yellow color comes from copper impurities. In the same way, scientific constructs of intelligence can change depending on the data that are observed. For example, the original theory may assume that intelligence is a unidimensional construct (g), but empirical data could show that intelligence is multi-faceted with specific intelligences for specific domains.

However, given the lack of construct validation research in psychology, there has been little progress in the understanding of such basic constructs as extraversion, self-esteem, or well-being. Often these constructs are still assessed with measures that were originally proposed as measures of these constructs, as if divine intervention led to the creation of the best measure of each construct and future research only confirmed its superiority.

Instead, many claims about construct validity are based more on conjecture than on empirical support by means of nomological networks. This was true in 1955. Unfortunately, it is still true over 50 years later.

For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences (p. 291).

Given the difficulty of defining constructs and finding measures of them, even measures that show promise in the beginning might fail to demonstrate construct validity later, and new measures should show higher construct validity than the early measures. However, psychology shows no development in measures of the same construct. The most widely used measure of self-esteem is still Rosenberg's scale from 1965, and the most widely used measure of well-being is still Diener et al.'s scale from 1985. It is not clear how psychology can make progress if it doesn't make progress in the development of nomological networks that provide information about constructs and about the construct validity of measures.

Cronbach and Meehl are clear that nomological networks are needed to claim construct validity.

To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist (p. 291).

However, there are few attempts to examine construct validity with structural equation models (Connelly & Ones, 2010; Zou, Schimmack, & Gere, 2013). [please share more if you know some]

One possible reason is that construct validation research may reveal that authors' initial constructs need to be modified or that their measures have modest validity. For example, McCrae, Zonderman, Costa, Bond, and Paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of Big Five measures because it failed to support their conception of the Big Five as orthogonal dimensions with simple structure.

Recommendations for Users of Psychological Measures

The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test (p. 296).

It is very unlikely that all hunches by psychologists lead to the discovery of useful constructs and development of valid tests of these constructs. Given the lack of knowledge about the mind, it is rather more likely that many constructs turn out to be non-existent and that measures have low construct validity.

However, the history of psychological measurement has only seen the development of more and more constructs and more and more measures to measure this increasing universe of constructs. Since the 1990s, the number of constructs has doubled because every construct has been split into an explicit and an implicit version. Presumably, there is even an implicit political orientation and an implicit gender identity.

The proliferation of constructs and measures is not a sign of a healthy science. Rather, it shows the inability of empirical studies to demonstrate that a measure is not valid or that a construct may not exist. This is mostly due to self-serving biases and motivated reasoning of test developers. The gains from a measure that is widely used are immense. Thus, weak evidence is used to claim that a measure is valid, and consumers are complicit because they can use these measures to make new discoveries. Even when evidence shows that a measure may not work as intended (e.g., Bosson et al., 2000), it is often ignored (Greenwald & Farnham, 2001).

Conclusion

Just like psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures. Some of the measures that are currently used in psychology are likely to have insufficient construct validity. If the 2010s were the decade of replication, the 2020s may become the decade of validation, and the 2030s may produce the first replicable studies with valid measures. Maybe this is overly optimistic, given the lack of improvement in validation research since Cronbach and Meehl (1955) outlined a program of construct validation research. Ample citations show that they were successful in introducing the term, but they failed to establish rigorous criteria of construct validity. The time to change this is now.

The Implicit Association Test at Age 21: No Evidence for Construct Validity

PREPRINT (UNDER REVIEW)

Abstract

The Implicit Association Test (IAT) is 21 years old. Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim. Most importantly, I show that few studies were able to test the discriminant validity of the IAT as a measure of implicit personality characteristics and that a single-construct model fits multi-method data as well as, or better than, a dual-construct model. Thus, the IAT appears to be a measure of the same personality characteristics that are measured with explicit measures. I also show that the validity of the IAT varies across personality characteristics. It has low validity as a measure of self-esteem, moderate validity as a measure of racial bias, and high validity as a measure of political orientation. The existing evidence also suggests that the IAT measures stable characteristics rather than states and has low predictive validity for single behaviors. Based on these findings, it is important that users of the IAT clearly distinguish between implicit measures and implicit constructs. The IAT is an implicit measure, but there is no evidence that it measures implicit constructs.

Keywords:  Personality, Individual Differences, Social Cognition, Measurement, Construct Validity, Convergent Validity, Discriminant Validity, Structural Equation Modeling

The Implicit Association Test at Age 21: No Evidence for Construct Validity

Twenty-one years ago, Greenwald, McGhee, and Schwartz (1998) published one of the most influential articles in personality and social psychology. It is already the 4th most cited article (4,582 citations in Web of Science) in the Journal of Personality and Social Psychology and will be number 3 this year. As the title "Measuring Individual Differences in Implicit Cognition" suggests, the article introduced a new individual difference measure that has been used in hundreds of studies to measure attitudes, stereotypes, self-concepts, well-being, and personality traits. Henceforth, I will refer to these constructs as personality characteristics.

A Critical Evaluation of Greenwald’s (1998) Evidence for Discriminant Validity

The Implicit Association Test (IAT) uses reaction times in classification tasks to measure individual differences in the strength of associations (Nosek et al., 2007).  However, the main purpose of the IAT is not to measure associations or to provide an indirect measure of personality characteristics.  The key constructs that the IAT was designed to measure are individual differences in implicit personality characteristics as suggested in the title of Greenwald et al.’s (1998) seminal article “Measuring Individual Differences in Implicit Cognition.” 

The notion of implicit cognition is based on a conception of human information processing that largely takes place outside of consciousness, and the IAT was supposed to provide a window into the unconscious. “There has been an increased interest in measuring aspects of thinking and feeling that may not be easily accessed or available to consciousness. Innovations in measurement have been undertaken with the purpose of bringing under scrutiny new forms of cognition and emotion that were previously undiscovered” (Nosek, Greenwald, & Banaji, 2007, p. 265). 

Thus, the IAT was not just a new way of measuring the same individual differences that were already measured with self-report measures.  It was designed to measure information that is “simply unreachable, in the same way that memories are sometimes unreachable [by introspection]” (Nosek et al., 2007, p. 266).

The promise to measure individual differences that were not accessible to introspection explains the appeal of the IAT, and many articles used the IAT to make claims about individual differences in implicit forms of self-esteem, prejudice, or craving for drugs. Thus, the hypothesis that the IAT measures something different from self-report measures is a fundamental feature of the construct validity of the IAT. In psychometrics, the science of test validation, this property of a measure is known as discriminant validity (Campbell & Fiske, 1959).  If the IAT is a measure of implicit individual differences that are different from explicit individual differences, the IAT should demonstrate discriminant validity from self-report measures.  Given the popularity of the IAT, one might expect ample evidence for the discriminant validity of the IAT.  However, due to methodological limitations this is actually not the case.

Confusion about Convergent and Discriminant Validity

Greenwald et al.’s seminal article promised a measure of individual differences, but failed to provide evidence for the convergent or discriminant validity of the IAT.  Study 1 with N = 32 participants showed that, on average, participants preferred flowers to insects and musical instruments to weapons. These average tendencies cannot be used to validate the IAT as a measure of individual differences. However, Greenwald et al. (1998) also reported correlations across N = 32 participants between the IAT and explicit measures.  These correlations were low.  Greenwald et al. (1998) suggest that this finding provides evidence of discriminant validity. “This conceptual divergence between the implicit and explicit measures is of course expected from theorization about implicit social cognition” (p. 1470).  However, these low correlations are uninformative because discriminant validity requires a multi-method approach.  As the IAT was the only implicit measure, low correlations with explicit measures may simply show that the IAT has low validity as a measure of individual differences.  

Experiment 2 used the IAT with 17 Korean and 15 Japanese American students to assess their attitudes towards Koreans vs. Japanese.  In this study, Greenwald et al. found “unexpectedly the feeling thermometer explicit rating was more highly correlated with the IAT measure (average r = .59) than it was with another explicit attitude measure, the semantic differential (r = .43)” (p. 1473). This finding actually contradicts the hypothesis that the IAT measures some construct that is not measured with self-ratings because discriminant validity implies higher same-method than cross-method correlations (Campbell & Fiske, 1959).

Study 3 introduced the race IAT to measure prejudice in a sample of 26 participants. In this small sample, IAT scores were only weakly and not significantly correlated with explicit measures. The authors realize that this finding is open to multiple interpretations. "Although these correlations provide no evidence for convergent validity of the IAT, nevertheless because of the expectation that implicit and explicit measures of attitude are not necessarily correlated-neither do they damage the case for construct validity of the IAT" (p. 1476). In other words, the low correlations might reflect discriminant validity, but they could also reflect low convergent validity if the IAT and explicit measures measure the same construct.

The discussion has a section on “Discriminant Validity of IAT Attitude Measures,” although the design of the studies makes it impossible to provide evidence for discriminant validity. Nevertheless, Greenwald et al. (1998) claimed that they provided evidence for the discriminant validity of the IAT as a measure of implicit cognitions. “It is clear that these implicit-explicit correlations should be taken not as evidence for convergence among different methods of measuring attitudes but as evidence for divergence of the constructs represented by implicit versus explicit attitude measures” (p. 1477).   The scientific interpretation of these correlations is that they provide no empirical evidence about the validity of the IAT because multiple measures of a single construct are needed to examine construct validity (Campbell & Fiske, 1959). Thus, unlike most articles that introduce a new measure of individual differences, Greenwald et al. (1998) did not examine the psychometric properties of the IAT.  In this article, I examine whether evidence gathered over the past 21 years has provided evidence of construct validity of the IAT as a measure of implicit personality characteristics.

First Problems for the Construct Validity of the IAT

The IAT was not the first implicit measure in social psychology. Several different measures had been developed to measure self-esteem with implicit measures. A team of personality psychologists conducted the first multi-method validation study of the IAT as a measure of implicit self-esteem (Bosson, Swann, & Pennebaker, 2000). The main finding of this study was that several implicit measures, including the IAT, had low convergent validity. However, this finding has been largely ignored, and researchers started using the self-esteem IAT as a measure of some implicit form of self-esteem that operates outside of conscious awareness (Greenwald & Farnham, 2000).

At the same time, attitude researchers also found weak correlations between the race IAT and other implicit measures of prejudice. However, this lack of convergent validity was also ignored. An influential review article by Fazio and Olson (2003) suggested that low correlations might be due to different mechanisms. While it is entirely possible that evaluative priming and the IAT rely on different mechanisms, this is not relevant to the question whether either measure is a valid measure of personality characteristics. Explicit ratings probably also rely on a different mechanism than the IAT. The mechanics of measurement have to be separated from the constructs that the measures aim to measure.

Continued Confusion about Discriminant Validity

Nosek et al. (2007) examined evidence for the construct validity of the IAT at age 7. The section on convergent and discriminant validity lists a few studies as evidence for discriminant validity. However, closer inspection of these studies shows that they suffer from the same methodological limitation as Greenwald et al.'s (1998) seminal study. That is, constructs were assessed with a single implicit method: the IAT. Thus, it was impossible to examine the construct validity of the IAT as a measure of implicit personality characteristics.

Take Nosek and Smyth’s (2007) “A Multi-trait-multi-method validation of the Implicit Association Test” as an example. The title clearly alludes to Campbell and Fiske’s approach to construct validation.  The data were 7 explicit ratings and 7 IATs of 7 attitude pairs (e.g., flower vs. insect).  The authors fitted several structural equation models to the data and claimed that a model with separate, yet correlated, explicit and implicit factors fitted the data better than a model with a single factor for each attitude pair.  This claim is invalid because each attitude pair was assessed with a single IAT and parcels were used to correct for unreliability.  This measurement model assumes that all of the reliable variance in an IAT that is not shared with explicit ratings or with IATs of other attitudes reflects implicit individual differences. However, it is also possible that this variance reflects systematic measurement error that is unique to a specific IAT.  A proper multi-method approach requires multiple independent measures of the same construct.   As demonstrated with real multi-method data below, there is consistent evidence that the IAT has systematic method variance that is unique to a specific IAT. 

Nevertheless, Nosek and Smyth’s (2007) multi-attitude study provided some interesting information. The correlation of the 7 means of the IAT and the 7 means of the explicit ratings was r = .86. For example, implicit and explicit measures showed a preference for flowers over insects and a dislike of evolution versus creation.  If implicit measures reflect distinct, unconscious processes, it is not clear why the means correspond to those based on self-reports. However, this finding is easily explained by a single-attitude model, where the mean structure depends on the mean structure of the latent attitude variable.

In sum, Nosek et al.’s claim that the IAT has demonstrated discriminant validity is based on a misunderstanding of Campbell and Fiske’s (1959) approach to construct validation. A proper assessment of construct validity requires demonstration of convergent validity before it is possible to demonstrate discriminant validity, and to demonstrate convergent validity it is necessary to use multiple independent measures of the same construct.  Thus, to demonstrate construct validity of the IAT as a measure of implicit personality characteristics requires multiple independent implicit measures.

First Evidence of Discriminant Validity in a Multi-Method Study

Cunningham, Preacher, and Banaji (2001) reported the results of the first multi-method study of prejudice. Participants were 93 students with complete data. Each student completed a single explicit measure of prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit measures: (a) the standard race IAT (Greenwald et al., 1998), (b) a response-window IAT (Cunningham et al., 2001), and (c) a response-window evaluative priming task (Fazio et al., 1986). The assessment was repeated on four occasions two weeks apart.

I used the published correlation matrix to reexamine the claim that a single-factor model does not fit the data. First, I was able to reproduce the model fit of the published dual-attitude model with MPLUS 8.2 (original fit: chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041, 90% confidence interval: 0.00, 0.071; reproduced fit: chi2(100, N = 93) = 112, CFI = .977, RMSEA = 0.036, 90%CI = .000 to .067). Thus, the model fit of the reproduced model serves as a comparison standard for the alternative models that I examined next.

The original model is a hierarchical model with an implicit attitude factor as a second-order factor and method-specific first-order factors. Each first-order factor has four indicators for the four repeated measurements with the same method. This model imposes constraints on the first-order loadings because they contribute both to the relations among indicators of the same method and to the second-order relations of the different implicit methods to each other.

An alternative way to model multi-method data is with bifactor models (Chen, West, & Sousa, 2006). A bifactor model allows all measures to be directly related to the general trait factor that corresponds to the second-order factor in a hierarchical model. However, bifactor models may not be identified if there are no method factors. Thus, a first step is to allow for method-specific correlated residuals and to examine whether these correlations are positive.
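Such a single-trait model with method-specific correlated residuals can be sketched in a few lines. The sketch below is not the original MPLUS code; it uses the Python package semopy with lavaan-style syntax, and the variable names (mrs_t1, iat_t1, ep_t1, ...) are hypothetical placeholders for the Modern Racism Scale, the standard IAT, and evaluative priming measured on two occasions.

```python
# Minimal sketch: a single trait factor for all measures, plus correlated
# residuals among indicators that share a method (hypothetical variable names).
import pandas as pd
import semopy

# "=~" defines the trait factor; the "~~" lines add method-specific residual
# correlations (same method, different occasions).
model_desc = """
bias =~ mrs_t1 + mrs_t2 + iat_t1 + iat_t2 + ep_t1 + ep_t2
mrs_t1 ~~ mrs_t2
iat_t1 ~~ iat_t2
ep_t1 ~~ ep_t2
"""

data = pd.read_csv("multimethod_data.csv")  # hypothetical data file
model = semopy.Model(model_desc)
model.fit(data)

print(model.inspect())           # factor loadings and residual covariances
print(semopy.calc_stats(model))  # chi-square, CFI, RMSEA, and other fit indices
```

If the residual correlations for a method are positive and substantial, they can be replaced by an explicit method factor, which moves the model toward a bifactor structure.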

The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065. Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures. The response-window IAT had no significant residual correlations. This explains the high factor loading of the response-window IAT in the hierarchical model. It does not suggest that this is the most valid measure; rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062. I then tried to create method factors for the remaining methods. For the IAT, a method factor could only be created for the first three occasions. However, model fit decreased unless occasion 2 was allowed to correlate with occasion 4. This unexpected finding is unlikely to reflect a real relationship. Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064. I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065, and the first occasion did not load on the method factor. Model fit could be improved by fixing this loading to zero and by allowing for an additional correlation between occasions 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062. However, there is no rationale for this relationship, and I retained the more parsimonious model. Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068. This was the final model (Figure 1).

The most important results are the factor loadings of the measures on the trait factor. Factor loadings for the Modern Racism Scale ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged from .43 to .54 (M = .47). Factor loadings for the response-window IAT ranged from .41 to .69 (M = .51). The evaluative priming measures had the lowest factor loadings, ranging from .13 to .47 (M = .29). Thus, there is no evidence that implicit measures are more strongly related to each other than to explicit measures, as stated in the original article.

In terms of absolute validity, all of these validity coefficients are low, suggesting that a single standard IAT measure on a single occasion has .47^2 = 22% valid variance.  Most important, these results suggest that the Modern Racism Scale and the IAT measure a single construct and that the low correlation between implicit and explicit measures reflects low convergent validity rather than high discriminant validity. 
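The arithmetic linking factor loadings, valid variance, and expected observed correlations can be made explicit. The loadings below are the average loadings reported above, and the attenuation example assumes uncorrelated unique variances.

```python
# Proportion of valid variance in a single measure = squared standardized
# loading on the trait factor (average loadings reported above).
loadings = {
    "Modern Racism Scale": 0.40,
    "standard IAT": 0.47,
    "response-window IAT": 0.51,
    "evaluative priming": 0.29,
}
for measure, loading in loadings.items():
    print(f"{measure}: {loading ** 2:.0%} valid variance")

# Attenuation: if two measures tap the same construct and share no method
# variance, their expected observed correlation is the product of their
# validities, e.g., the standard IAT and the Modern Racism Scale:
print(f"expected IAT-MRS correlation: {0.47 * 0.40:.2f}")  # about .19
```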

In conclusion, a reexamination of Cunningham et al.’s data shows that the data do not provide evidence of discriminant validity and that the IAT may simply be an alternative measure of the same construct that is being measured with explicit measures like the Modern Racism Scale. Thus, the study provides no evidence for the construct validity of the IAT as a measure of implicit individual differences in race attitudes.

Meta-Analysis of Implicit – Explicit Correlations

Hofmann, Gawronski, Gschwendner, and Le (2005) conducted a meta-analysis of 126 studies that had reported correlations between an IAT and an explicit measure of the same construct. Notably, over one hundred studies had been conducted without using multiple implicit measures. The mono-method approach taken in these studies suggests that the authors took the construct validity of the IAT for granted and used the IAT as a measure of implicit constructs. As a result, these studies provide no test of the construct validity of the IAT.

Nevertheless, the meta-analysis produced an interesting result. Correlations between implicit and explicit measures varied across personality characteristics. Correlations were lowest for self-esteem, which is consistent with Bosson et al.'s (2000) finding, and highest for simple attitude objects like consumer products (e.g., Pepsi vs. Coke). Any theory of implicit attitude measures has to explain this finding. One explanation could be that explicit measures of self-esteem are less valid than explicit measures of preferences for consumer goods. However, it is also possible that the validity of the IAT varies. Once more, a comparison of different personality characteristics with multiple methods is needed to test these competing theories.

Problems with Predictive Validity

Ten years after the IAT was published, another problem emerged. Some critics voiced concerns that the IAT, especially the race IAT, lacks predictive validity (Blanton, Jaccard, Klick, Mellers, Mitchell, & Tetlock, 2009). To examine the predictive validity of the IAT, Greenwald and colleagues (2009) published a meta-analysis of IAT-criterion correlations. The key finding was that "for 32 samples with criterion measures involving Black–White interracial behavior, predictive validity of IAT measures significantly exceeded that of self-report measures" (p. 17). Specifically, the authors reported a correlation of r = .24 between the IAT and a criterion and a correlation of r = .12 between an explicit measure and a criterion, and these correlations were significantly different from each other. A few years later, Oswald, Mitchell, Blanton, Jaccard, and Tetlock (2013) published a critical reexamination of the literature and reported different results. "IATs were poor predictors of every criterion category other than brain activity, and the IATs performed no better than simple explicit measures" (p. 171). The only exceptions were fMRI studies with extremely small samples that produced extremely large correlations, often exceeding the reliability of the IAT. It is well known that these correlations are inflated and difficult to replicate (Vul, Harris, Winkielman, & Pashler, 2009). Moreover, correlations with neural activity are not evidence that IAT scores predict behavior.

More recently, Greenwald and colleagues published a new meta-analysis (Kurdi et al., 2018). This meta-analysis produced weaker criterion correlations than the previous one. The median IAT-criterion correlation was r = .050. This is also true if the analysis is limited to studies with the race IAT. After correcting for random measurement error, the authors report an average correlation of r = .14. However, correcting for unreliability yields hypothetical correlations that could be obtained if the IAT were perfectly reliable, which it is not. Thus, for the practical evaluation of the IAT as a measure of individual differences, it is more important how well actual IAT scores predict a validation criterion. With small IAT-criterion correlations around r = .1, large samples would be required to have sufficient power to detect effects, especially incremental effects above and beyond explicit measures. Given that most studies had sample sizes of less than 100 participants, "most studies were vastly underpowered" (Kurdi et al., 2018, p. 1). Thus, it is now clear that IAT scores have low predictive validity, but it is not clear whether IAT scores have any predictive validity at all, when they have predictive validity, and whether they have incremental predictive validity after controlling for explicit predictors of behavior.
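The correction for unreliability is, in its simplest form, Spearman's disattenuation formula. The reliabilities in the sketch below are hypothetical values, chosen only to illustrate how a small observed correlation turns into a larger latent estimate; the observed correlation is what determines how much actual IAT scores can predict.

```python
import math

def disattenuate(r_observed, rel_x, rel_y):
    """Spearman's correction: estimated correlation between the latent
    (error-free) variables underlying two unreliable measures."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical example: an observed IAT-criterion correlation of .05 combined
# with modest reliabilities yields a noticeably larger corrected estimate.
print(round(disattenuate(0.05, 0.80, 0.30), 2))  # about 0.10
```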

Greenwald et al.’s (2009) 2008 US Election Study

In 2008, a historic event occurred in the United States. US voters had the opportunity to elect the first Black president. Although the outcome is now a historic fact, it was uncertain before the election how much Barack Obama's racial background would influence White voters. There was also considerable concern that voters might not reveal their true feelings. This provided a great opportunity to test the validity of implicit measures of racial bias. If White voters are influenced by racial bias, IAT scores should predict voting intentions above and beyond explicit measures. According to the abstract of the article, the results confirm this prediction. "The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures" (p. 242).

These claims were based on the results of multiple regression analyses: "When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p = .001," and when political conservatism was also included in the model, "the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05" (p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. A multiple regression analysis with explicit measures, implicit measures, and political orientation as predictors showed non-significant effects for the IAT, b = .002, se = .024, t = .087, p = .930, and the AMP, b = .033, se = .023, t = 1.470, p = .142. I also obtained the raw data from Anthony Greenwald, but I was unable to recreate the sample size of N = 1,057. Instead, I obtained a similar sample size of N = 1,035. Performing the analysis on this sample also produced non-significant results: IAT, b = -.003, se = .044, t = .070, p = .944; AMP, b = -.014, se = .042, t = 0.344, p = .731.
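For readers who want to check such results themselves, standardized regression coefficients can be computed directly from a published correlation matrix by solving R_xx b = r_xy. The correlations in the sketch below are made-up illustrative values, not the values from Greenwald et al. (2009).

```python
import numpy as np

# Hypothetical correlations among four predictors (explicit attitude, IAT,
# AMP, political conservatism) ...
R_xx = np.array([
    [1.00, 0.30, 0.25, 0.40],
    [0.30, 1.00, 0.20, 0.15],
    [0.25, 0.20, 1.00, 0.10],
    [0.40, 0.15, 0.10, 1.00],
])
# ... and hypothetical correlations of each predictor with vote intention.
r_xy = np.array([0.45, 0.20, 0.18, 0.55])

betas = np.linalg.solve(R_xx, r_xy)   # standardized partial coefficients
r_squared = betas @ r_xy              # variance explained by all predictors
# Standard errors of standardized coefficients follow from
# (1 - R^2) / (N - k - 1) times the diagonal of inv(R_xx).
print(betas, r_squared)
```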

To fully explore the relationship among the variables in this valuable dataset, I fitted a structural equation model to the raw data (N = 1,035). The model had good fit, chi2(9) = 18.27, CFI = .995, RMSEA = .032, 90%CI = .009 to .052. As shown in Figure 2, the IAT did not have incremental predictive validity, as its residual variance was unrelated to voting. There is also no evidence of discriminant validity because the residuals of the two implicit measures are not correlated. However, the model does show that a pro-White bias predicts voting above and beyond political orientation. Thus, the results do support the hypothesis that racial bias influenced voting in the 2008 election. This bias is reflected in explicit and implicit measures. Interestingly, the validity coefficients in this study differ from those in Cunningham et al.'s study with undergraduate students. The factor loadings suggest that the IAT is the most valid measure of racial bias in this dataset, with .59^2 = 36% of its variance reflecting the same attitude factor that is captured by explicit measures. This makes the IAT as valid as the feeling thermometer, which is more valid than the Modern Racism Scale in Cunningham's study. This finding has been replicated in subsequent studies (Axt, 2018).

In conclusion, a reexamination of the 2008 election study shows that the data are entirely consistent with a single-attitude model and that there is no evidence for incremental predictive validity or discriminant validity in these data. However, the study does show some predictive validity of the IAT and convergent validity with explicit measures. Thus, the results provide no evidence for the construct validity of the IAT as a measure of implicit individual differences, but they can be interpreted as evidence for validity as a measure of the same construct that is measured with explicit measures. This shows that claims about validity depend on the construct that is being measured; a scale is a good measure of weight, but not of intelligence. The results here suggest that the race IAT is a moderately valid measure of racial bias, but an invalid measure of implicit bias, which may not even exist because scientific claims about implicit bias require valid measures of implicit bias.

Reexamining a Multi-Trait Multi-Method Study

The most recent and extensive multi-trait multi-method validation study of the IAT was published last year (Bar-Anan & Vianello, 2018).  The abstract claims that the results provide clear support for the validity of the IAT as a measure of implicit cognitions, including implicit self-esteem. “The evidence supports the dual-attitude perspective, bolsters the validation of 6 indirect measures, and clears doubts from countless previous studies that used only one indirect measure to draw conclusions about implicit attitudes” (p. 1264). 

Below I show that these claims are not supported by the data, and that single-attitude models fit the data as well as dual-attitude models. I also show that dual-attitude models show low convergent validity across implicit measures, while IAT variants share method variance because they rely on the same mechanisms to measure attitudes.

Bar-Anan and Vianello (2018) fitted a single model to measures of self-esteem, racial bias, and political orientation. This makes the model extremely complex and produced some questionable results (e.g., the implicit and explicit method factors were highly correlated, and some measures had negative loadings on the method factors). In structural equation modeling, it is good practice to fit smaller models before creating a larger model. Thus, I first examined construct validity for each domain separately before combining the domain-specific models into a single unified model.

Race IAT

I first fitted a dual-attitude model to measures of racial attitudes and included contact as the criterion variable. I did not specify a causal relationship between contact and attitudes because attitudes can influence contact and vice versa.  The dual-attitude model had good fit, chi2(48) = 109.41; CFI = .975; RMSEA = 0.010 (90% confidence interval: 0.007, 0.012).  The best indicator of the explicit factor was the preference rating (Figure 3).  The best indicator of the implicit factor was the BIAT.  However, all IAT-variants had moderate to high loadings on the implicit factor. In contrast, the evaluative priming measure had a low loading on the implicit factor and the AMP had a moderate loading on the explicit factor and no significant loading on the implicit factor.  These results show that Bar-Anan and Vianello’s model failed to distinguish between IAT-specific method variance and method variance for implicit measures in general. The present results show that IAT-variants share little valid variance or method variance with conceptually distinct implicit measures.

Not surprisingly, a single-attitude model with an IAT method factor (Figure 4) also fit the data well, chi2(46) = 112.04; CFI = .973; RMSEA = 0.010 (90% confidence interval: 0.008, 0.013). Importantly, the model has no shared method variance between conceptually different explicit measures like preference ratings and the Modern Racism Scale (MRS). The AMP and the EP are both valid measures of attitudes, but with relatively modest validity. The BIAT has a validity of .46, with 22% explained variance. This result is more consistent with Cunningham et al.'s (2001) data than with Greenwald et al.'s (2009) data. The model also shows a clear relationship between contact and less pro-White bias. Finally, the model shows that the IAT method factor is unrelated to contact. Thus, any relationship between IAT scores and contact is explained by the variance shared with explicit measures.

These results show that Bar-Anan and Vianello's (2018) conclusions are not supported by the data. Although a dual-attitude model can be fitted to the data, it shows low convergent validity across different implicit measures, and a single-attitude model fits the data as well as a dual-attitude model.

Political Orientation

Figure 5 shows the dual-attitude model for political orientation.  The explicit factor is defined by a simple rating of preference for republicans versus democrats, the modern racism scale, the right-wing-authoritarianism scale, and ratings of Hillary Clinton.  The implicit factor is defined by the IAT, the brief IAT, the Go-NoGo Task, and single category IATs.  The remaining two implicit measures, the Affect Misattribution Task, and Evaluative Priming are allowed to load on both factors.  Voting in the previous election is predicted by explicit attitudes.  The model has good fit to the data, chi2(48) = 99.34; CFI = .991; RMSEA = 0.009 (90% confidence interval: 0.006, 0.011).  The loading pattern shows that the AMP and EP load on the implicit factor.  This supports the hypothesis that all implicit measures have convergent validity. However, the loadings for the IATs are much higher. In the dual-attitude framework this would imply that the IAT is a much more valid measure of implicit attitudes than the AMP or EP.  Evidence for discriminant validity is weak. The correlation between the explicit and the implicit factor is r = .89.  The correlation in the original article was r = .91.  Nevertheless, the authors concluded that the data favor the two-factor model because constraining the correlation to 1 reduced model fit.

However, it is possible to fit a single-construct model by allowing for an IAT-variant method factor, chi2(50) = 86.25; CFI = .993; RMSEA = 0.007 (90% confidence interval: 0.005, 0.010). This model (Figure 6) shows that voting is predicted by a single latent factor that represents political orientation and that simple self-report measures of political orientation are the most valid measures of political orientation. The IAT shows stronger correlations with explicit measures in this domain because it is a more valid measure of political orientation (.74^2 = 55% valid variance) than the race IAT is of racial bias (22% valid variance).

Self-Esteem

Figure 7 shows the results for a dual-attitude model of self-esteem. Model fit was good, although the CFI was lower than in the previous models due to weaker factor loadings, chi2(16) = 28.62; CFI = .950; RMSEA = 0.008 (90% confidence interval: 0.003, 0.013). The model showed a moderate correlation between the explicit and implicit factors, r = .46, which is stronger than in the original article, r = .29, but clearly suggestive of two distinct factors. However, the nature of these two factors is less clear. The implicit factor is defined by the three IAT measures, whereas the AMP and EP have very low loadings on this factor. This is also true in the original article, with loadings of .24 for the AMP and .13 for the EP. Thus, the results confirm Bosson et al.'s (2000) seminal finding that different implicit measures have low convergent validity.


As the implicit factor was mostly defined by the IAT measures, it was also possible to fit a single-factor model with an IAT method factor (Figure 8), chi2(16) = 31.50; CFI = .938; RMSEA = 0.009 (90% confidence interval: 0.004, 0.013). However, some of the results of this model are surprising.

According to this model, the validity coefficient of the widely used Rosenberg self-esteem scale is only r = .35, suggesting that only 12% of the variance in the Rosenberg self-esteem scale is valid variance. In addition, the IAT and the BIAT would be equally valid measures of self-esteem. Thus, previous findings of low implicit-explicit correlations for self-esteem (Bosson et al., 2000; Hofmann et al., 2005) would imply low validity of implicit and explicit measures. This finding would have dramatic implications for the interpretation of low self-esteem-criterion correlations. A true self-esteem-criterion correlation of r = .3 would produce an observed correlation of only r = .30*.35 = .11 with the Rosenberg self-esteem scale or the IAT. Correlations of this magnitude require large samples (N = 782) to have an 80% probability of obtaining a significant result with alpha = .05, or N = 1,325 with alpha = .005. Thus, most studies that tried to predict performance criteria from self-esteem were underpowered. However, the results of this study are limited by the use of an online sample and the lack of proper criterion variables to examine predictive validity. The main conclusion from this analysis is that a single-factor model with an IAT method factor fit the data well and that the dual-attitude model failed to demonstrate convergent validity across different implicit measures; a finding that replicates Bosson et al. (2000), which Bar-Anan and Vianello do not cite.
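These sample-size figures can be checked with the Fisher z approximation. The sketch below is a quick check, not the original calculation; with the attenuated correlation rounded to .10, it comes out close to the values reported above.

```python
from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r in a
    two-sided test, based on the Fisher z transformation."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)

print(n_for_correlation(0.10, alpha=0.05))   # 783, close to N = 782 above
print(n_for_correlation(0.10, alpha=0.005))  # 1326, close to N = 1,325 above
```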

A Unified Model

After establishing well-fitting models for each personality characteristic, it is possible to fit a unified model. Importantly, no changes to the individual models should be made because a decrease in fit can be attributed to the new relationships across different personality characteristics.  Without any additional modifications, the overall model in Figure 9 had good fit,  XX.  Correlations among the IAT method factors showed significant positive correlations of the method factor for race with the method factor for self-esteem (r = .4) and political orientation (r = .2), but a negative correlation for the method factors for self-esteem and political orientation (r = -.3).  This pattern of correlations is inconsistent with a simple method factor that is expected to produce positive correlations. Thus, it is impossible to fit a general method factor to different IATs. This finding replicates Nosek and Smyth’s (2007) findings.

Correlations among the personality characteristics replicate the finding with Greenwald et al.’s (2009) data that Republicans are more likely to have a pro-white bias, r = .4.  Political orientation is unrelated to self-esteem, r = .0, but Pro-White bias tends to be positively related to self-esteem, r = .2.  

In conclusion, the present results show that Bar-Anan and Vianello's claims are not supported by the data. Their data do not provide clear evidence for discriminant validity of implicit and explicit constructs. The data are fully consistent with the alternative hypothesis that the IAT and other implicit measures measure the same construct that is being measured with explicit measures. Thus, the data provide no support for the construct validity of the IAT as a measure of implicit personality characteristics.

Validity of the Self-Esteem IAT

Bosson et al.'s (2000) seminal article raised first concerns about the construct validity of the self-esteem IAT. Since then, other critical articles have been published, none of which are cited in Kurdi et al. (2018). Gawronski, LeBel, and Peters (2007) wrote a Perspectives on Psychological Science article on the construct validity of implicit self-esteem measures. They found no conclusive evidence that (a) the self-esteem IAT measures unconscious self-esteem or that (b) low correlations are due to self-report biases in explicit measures of self-esteem. Walker and Schimmack (2008) used informant ratings to examine the predictive validity of the self-esteem IAT. Informant ratings are the most widely used validation criterion in personality research, but they have not been used by social psychologists. One advantage of informant ratings is that they measure general personality characteristics rather than specific behaviors, which ensures higher construct-criterion correlations due to the power of aggregation (Epstein, 1980). Walker and Schimmack (2008) found that informant ratings of well-being were more strongly correlated with explicit self-ratings of well-being than with a happiness or a self-esteem IAT.

The most recent and extensive review was conducted by Falk and Heine (2014) who found that “the validity evidence for the IAT in measuring ISE [implicit self-esteem] is strikingly weak” (p. 6).  They also point out that implicit measures of self-esteem “show a remarkably consistent lack of predictive validity” (p. 6).  Thus, an unbiased assessment of the evidence is fully consistent with the analyses of Bar-Anan and Vianello’s data that also found low validity of the self-esteem IAT as a measure of self-esteem.

Currently, a study by Falk, Heine, Takemura, Zhang, and Hsu (2013) provides the most comprehensive examination of the convergent and discriminant validity of self-esteem measures. I therefore used structural equation modeling of their data to see how consistent the data are with a dual-attitude model or a single-attitude model. The biggest advantage of the study is the inclusion of informant ratings of self-esteem, which makes it possible to model method variance in self-ratings (Anusic et al., 2009). Previous research showed that self-ratings of self-esteem have convergent validity with informant ratings of self-esteem (Simms, Zelazny, Yam, & Gros, 2010; Walker & Schimmack, 2008). I also included the self-report measures of positive affect and negative affect to examine criterion validity.

It was possible to fit a single-factor model to the data (Figure 10), chi2(67) = 115.85; CFI = .964; RMSEA = 0.050 (90% confidence interval: 0.034, 0.065). Factor loadings were highest for self-ratings on the self-competence scale and the Rosenberg self-esteem scale. However, informant ratings also had significant loadings on the self-esteem factor, as did self-ratings on the Narcissistic Personality Inventory. A measure of halo bias in self-ratings of personality (SEL) also had moderate loadings, which confirms previous findings that self-esteem is related to evaluative biases in personality ratings (Anusic et al., 2009). The false uniqueness measure (FU; Falk et al., 2015) had modest validity. In contrast, the implicit measures had no significant loadings on this factor. In addition, the residual correlations among the implicit measures were weak and not significant. Given the lack of positive relations among implicit measures, it was impossible to fit a dual-attitude model to these data.

It is not clear why Bar-Anan and Vianello’s data failed to show higher validity of explicit measures, but the current results are consistent with moderate validity of explicit self-ratings in the personality literature (Simms et al., 2010). Thus, there is consistent evidence that implicit self-esteem measures have low validity as measures of self-esteem and there is no evidence that they are measures of implicit self-esteem.

Explaining Variability in Explicit-Implicit Correlations

One well-established phenomenon in the literature is that correlations between IAT scores and explicit measures vary across domains (Bar-Anan & Vianello, 2018; Hofmann et al., 2005).  As shown earlier, correlations for political orientation are strong, correlations for racial attitudes are moderate, and correlations for self-esteem are weak.  Greenwald and Banaji (2017) offer a dual-attitude explanation for this finding. “The plausible interpretations of the more common pattern of weak implicit– explicit correlations are that (a) implicit and explicit measures tap distinct constructs or (b) they might be affected differently by situational influences in the research situation (cf. Fazio & Towles-Schwen, 1999; Greenwald et al., 2002) or (c) at least one of the measures, plausibly the self-report measure in many of these cases, lacks validity” (p. 868). 

The evidence presented here offers a different explanation. IAT-explicit correlations and IAT-criterion correlations increase with the validity of the IAT as a measure of the same personality characteristics that are measured with explicit measures. Thus, low correlations of the self-esteem IAT with explicit measures of self-esteem show low validity of the self-esteem IAT. High correlations of the political orientation IAT with explicit measures of political orientation show high validity of the IAT as a measure of political orientation, not of implicit political orientation. Finally, the moderate correlation between the race IAT and explicit measures of racial bias shows moderate validity of the race IAT as a measure of racial bias. However, the validity of the race IAT as a measure of racial bias (not implicit racial bias!) varies considerably across studies. This variation may be due to differences in the variability of racial bias across samples, which may be lower in student samples. Thus, contrary to Greenwald and Banaji's claims, the problem is not with the explicit measures, but with the IAT.

An important question is why the self-esteem IAT is less valid than the political orientation IAT. I propose that one cause of variation in the validity of the IAT is the proportion of respondents on the two ends of a personality characteristic. To test this hypothesis, I used Bar-Anan and Vianello's data. To determine the direction of an IAT score, I used a value of 0 as the neutral point. As predicted, 90% of participants associated self with good, 78% associated White with good, and 69% associated Democrat with good. Thus, validity decreases with the proportion of participants who are on one side of the bipolar dimension.

Next, I regressed the preference measure on a simple dichotomous predictor that coded the direction of the IAT.  I standardized the preference measure and report standardized and unstandardized regression coefficients.  Standardized regression coefficients are influenced by the distribution of the predictor variable and should show the expected pattern. In contrast, unstandardized coefficients are not sensitive to the proportions and should not show the pattern. I also added the IAT scores as predictors in a second step to examine the incremental predictive validity that is provided by the reaction times.
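This two-step regression can be sketched as follows. The file name and column names are hypothetical, and the snippet assumes the explicit preference rating is standardized within the script.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bar_anan_vianello.csv")        # hypothetical file and columns

# Standardize the explicit preference rating.
y = (df["preference"] - df["preference"].mean()) / df["preference"].std()

# Step 1: dichotomous predictor coding the direction of the IAT score
# (0 = neutral point, as described above).
direction = (df["iat"] > 0).astype(int).rename("direction")
step1 = sm.OLS(y, sm.add_constant(direction)).fit()

# Step 2: add the continuous IAT score to estimate the incremental
# contribution of the reaction-time information.
X2 = sm.add_constant(pd.DataFrame({"direction": direction, "iat": df["iat"]}))
step2 = sm.OLS(y, X2).fit()

print(step1.params, step1.rsquared)              # B for direction, step-1 R^2
print(step2.rsquared - step1.rsquared)           # incremental R^2 from IAT scores
```

Because the standardized coefficient for a dichotomous predictor equals the unstandardized coefficient multiplied by sqrt(p(1 - p)), where p is the proportion of respondents on one side of the neutral point, highly skewed distributions (such as 90% of respondents associating the self with good) necessarily shrink the standardized coefficient and the explained variance.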

The standardized coefficients are consistent with predictions (Table 1). However, the unstandardized coefficients also show the same pattern. Thus, other factors also play a role. The amount of incremental explained variance by reaction times shows no differences between the race and the political orientation task.  Most of the differences in validity are due to the direction of the attitude (4% explained variance for race bias vs. 38% explained variance for political orientation).

Table 1
Regressions of the standardized preference measure on IAT direction (Step 1) and IAT scores (Step 2)

Domain                     | B (SE)       | b (se)      | R²   | ΔR²  | z
Self-esteem (SE)           | .310 (.142)  | .093 (.043) | .009 | .002 | 1.09
Race                       | .467 (.010)  | .193 (.041) | .041 | .060 | 5.79
Political orientation (PO) | 1.380 (.080) | .637 (.037) | .380 | .070 | 7.83

Note. B = unstandardized coefficient; b = standardized coefficient; R² = variance explained by IAT direction in Step 1; ΔR² = incremental variance explained by IAT scores in Step 2.

The results show the importance of taking the proportion of respondents with opposing personality characteristics into account. The IAT is least valid when most participants are high or low on a personality characteristic, and it is most valid when participants are split into two equally large groups. 

In conclusion, I provided an alternative explanation of variation in explicit-implicit correlations that is consistent with the data.  Implicit-explicit correlations vary at least partially as a function of the validity of the IAT as a measure of the same construct that is measured with explicit measures, and the validity of the IAT varies as a function of the proportion of respondents who are high versus low on a personality characteristic. As most respondents associate the self with good, and reaction times contribute little to the validity of the IAT, the IAT has particularly low validity as a measure of self-esteem.

The Elusive Malleability of Implicit Attitude Measures

Numerous experimental studies have tried to manipulate situational factors in order to change scores on implicit attitude measures (Lai, Hoffman, & Nosek, 2013).  Many of these studies focused on implicit measures of prejudice in order to develop interventions that could reduce prejudice. However, most studies were limited to brief manipulations with immediate assessment of attitudes (Lai et al., 2013).  The results of these studies are mixed.  In a seminal study, Dasgupta and Greenwald (2001) exposed participants to images of admired Black exemplars and disliked White exemplars. They reported that this manipulation had a large effect on IAT scores. However, these days the results of this study are less convincing because it has become apparent that large effect sizes from small samples often do not replicate (Open Science Collaboration, 2015). Consistent with this skepticism, Joy-Gaba and Nosek (2010) had difficulties replicating this effect with much larger samples and found only an average effect size of d = .08.  With effect sizes of this magnitude, other reports of successful experimental manipulations were extremely underpowered.   Another study with large samples found stronger effects (Lai et al., 2016).  The strongest effect was observed for instruction to fake the IAT.  However, Lai et al. also found that none of these manipulations had lasting effects in a follow-up assessment. This finding suggests that even when changes are observed, they reflect context-specific method variance rather than actual changes in the construct that is being measured. 

This conclusion is also supported by one of the few longitudinal IAT studies. Cunningham et al.’s (2001) multi-method study repeated the measurement of racial bias on four separate occasions.  The model shown in Figure 1 shows no systematic relationships between measures taken on the same occasion, and adding these relationships shows non-significant correlated residuals. Thus, in this sample naturally occurring factors did not change race bias. This finding suggests that the IAT and explicit measures measure stable personality characteristics rather than context-specific states.

Only a few serious intervention studies with the IAT have been conducted (Lai et al., 2013).  The most valuable evidence so far comes from studies that examined the influence of living with an African American roommate on White students’ racial attitudes (Shook & Fazio, 2008; Shook, Hopkins, & Koech, 2016).  One study found effects on an implicit measure, F(1,236) = 4.33, p = .04 (Shook & Fazio, 2008), but not on an explicit measure (Shook, 2007).  The other study found effects on explicit attitudes, F(1,107) = 7.34, p = .008 but no results for implicit measures were reported (Shook, Hopkins, & Koech, 2016). Given the small sample sizes of these studies, inconsistent results are to be expected. 

In conclusion, the existing evidence shows that implicit and explicit attitude measures are highly stable over time (Cunningham et al., 2001). I also concur with Joy-Gaba and Nosek that moving scores on implicit bias measures "may not be as easy as implied by the existing experimental demonstrations" (p. 145), and a multi-method assessment is needed to distinguish effects on specific measures from effects on personality characteristics (Olson & Fazio, 2003).

Future studies of attitude change need a multi-method approach, powerful interventions, adequate statistical power, and multiple repeated measurements of attitudes to distinguish mere occasion-specific variability (malleability) from real attitude change (Anusic & Schimmack, 2016). Ideally, the study would also include informant ratings. For example, intervention studies with roommates could use African Americans as informants to rate their White roommates’ racial attitudes and behaviors.  The single-attitude model predicts that implicit and explicit measures will show consistent results and that variation in effect sizes is explained by the validity of each measure. 

Discussion

Does the IAT Measure Implicit Constructs?

Construct validation is a difficult and iterative process because scientific evidence can alter the understanding of constructs. However, construct validation research has to start with a working definition of a construct. The IAT was introduced as a measure of individual differences in implicit social cognition, and implicit social cognitions were defined as aspects of thinking and feeling that may not be easily accessed or available to consciousness (Nosek, Greenwald, & Banaji, 2007, p. 265). This definition is vague, but it makes a clear prediction: the IAT should measure personality characteristics that cannot be measured with self-reports. This leads to the prediction that explicit measures and the IAT have discriminant validity. To demonstrate discriminant validity, unique variance in the IAT has to be related to other indicators of implicit personality characteristics. This can be demonstrated with incremental predictive validity or with convergent validity with other measures of implicit personality characteristics. Consistent with this line of reasoning, numerous articles have claimed that the IAT has construct validity as a measure of implicit personality characteristics because it shows incremental predictive validity (Greenwald et al., 2009; Kurdi et al., 2018) or because the IAT shows convergent validity with other implicit measures and discriminant validity with explicit measures (Bar-Anan & Vianello, 2018). I demonstrated that all of these claims were false and that the existing evidence provides no support for the construct validity of the IAT as a measure of implicit personality characteristics. The main problem is that most studies that used the IAT assumed construct validity rather than testing it. Hundreds of studies used the IAT as a single measure of implicit personality characteristics and made claims about implicit personality traits based on variation in IAT scores. Thus, hundreds of studies made claims that are not supported by empirical evidence, simply because it has not been demonstrated that the IAT measures implicit personality constructs.

In this regard, the IAT is not alone. Aside from the replication crisis in psychology (OSC, 2015), psychological science suffers from an even more serious validation crisis. All empirical claims rest on the validity of the measures that are used to test theoretical claims. However, many measures in psychology are used without proper validation evidence. Personality research is a notable exception. In response to criticism of low predictive validity (Mischel, 1968), personality psychologists embarked on a program of research that demonstrated predictive validity and convergent validity with informant ratings (Funder, $$$). Another problem is that psychologists treat validity as a qualitative construct, so that any evidence of validity is taken to support the claim that a measure is valid, as if it were 100% valid. However, most measures in psychology have only moderate validity (Schimmack, 2010). Thus, it is important to quantify validity and to use a multi-method approach to increase validity.

The popularity of the IAT reveals the problems of using measures without proper validation evidence. Social psychologists have influenced public discourse, if not public policy, about implicit racial bias. Most of these claims are based on findings with the IAT, assuming that IAT scores reflect implicit bias. As demonstrated here, these claims are not valid because the IAT lacks construct validity as a measure of implicit bias.
In the future, psychologists need to be more careful when they make claims based on new measures with limited knowledge about their validity. Maybe psychological organizations should provide clear guidelines about minimal standards that need to be met before a measure can be used, just as there are guidelines for validity evidence in personality assessment. In conclusion, psychology suffers as much from a validation crisis as it suffers from a replication crisis. Fixing the replication crisis will not improve psychology if replicable results are obtained with invalid measures.

The Silver Lining

Psychologists are often divided into opposing camps (e.g., nature vs. nurture; person vs. situation; the IAT is valid vs. invalid). Many fans of implicit measures are likely to dislike what I had to say about the IAT. However, my position is different from previous criticisms of the IAT as being entirely invalid (Oswald et al., 2013). I have demonstrated with several multi-method studies that the IAT has convergent validity with other measures of some personality characteristics. In some domains this validity is too low to be meaningful. In other domains, the validity of explicit measures is so high that using the IAT is not necessary. However, for sensitive attitudes like racial attitudes, the IAT offers a promising complement to explicit measures. Estimates of valid variance ranged from 20% to 40%. As the IAT does not appear to share method variance with explicit measures, it is possible to improve the measurement of racial bias by using explicit and implicit measures and aggregating scores to obtain a more valid measure of racial bias than either an explicit or an implicit measure can provide on its own. The IAT may also offer benefits in situations where socially desirable responding is a concern. Thus, the IAT might complement other measures of personality characteristics. This changes the interpretation of explicit-IAT correlations: rather than (mis)interpreting low correlations as evidence of discriminant validity, high correlations can reveal convergent validity. Similarly, improvements in implicit measures should produce higher correlations with explicit measures. How useful the IAT and other implicit measures are for the measurement of other personality characteristics has to be examined on a case-by-case basis. Just as it is impossible to make generalized statements about the validity of self-reports, the validity of the IAT can vary across personality characteristics.
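The gain from aggregating an explicit and an implicit measure can be quantified under classical test theory. The sketch below assumes standardized measures with uncorrelated unique variances; the validity coefficients are illustrative values in the range estimated above, not results from any specific dataset.

```python
from math import sqrt

def composite_validity(v1, v2):
    """Correlation of a unit-weighted sum of two standardized measures with the
    trait, given their loadings v1 and v2 and uncorrelated unique variances."""
    return (v1 + v2) / sqrt(2 + 2 * v1 * v2)

explicit, iat = 0.60, 0.50      # illustrative validity coefficients
print(round(composite_validity(explicit, iat), 2))  # 0.68, higher than either alone
```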

Conclusion

Social psychologists have always distrusted self-reports, especially for the measurement of sensitive topics like prejudice.  Many attempts were made to measure attitudes and other constructs with indirect methods.  The IAT was a major breakthrough because it has relatively high reliability compared to other indirect methods.  Creating the IAT was therefore a major achievement that should not be diminished by the finding that the IAT lacks construct validity as a measure of implicit personality characteristics. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes. Implicit measures were based on this work, and it seemed reasonable to assume that they might provide a window into the unconscious. However, the processes that are involved in the measurement of personality characteristics with implicit measures are not the personality characteristics that are being measured.  There is nothing implicit about being a Republican or a Democrat, being gay or straight, or having low self-esteem.  Conflating implicit processes in the measurement of personality constructs with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of personality characteristics with varying validity.  It is not a window into people’s unconscious feelings, attitudes, or personalities.

References

Axt, J. R. (2018). The Best Way to Measure Explicit Racial Attitudes Is to Ask About Them. Social Psychological and Personality Science, 9, 896-906. https://doi.org/10.1177/1948550617728995

Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766-781. http://dx.doi.org/10.1037/pspp0000066

Anusic, I., Schimmack, U., Pinkus, R., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97(6), 1142-1156.

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264-1272. http://dx.doi.org/10.1037/xge0000383

Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567-582. http://dx.doi.org/10.1037/a0014665

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79(4), 631-643. http://dx.doi.org/10.1037/0022-3514.79.4.631

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. http://dx.doi.org/10.1037/h0046016

Chen, F., West, S.G., & Sousa, K.H. (2006). A Comparison of Bifactor and Second-Order Models of Quality of Life. Multivariate Behavioral Research, 41(2), 189-225. DOI: 10.1207/s15327906mbr4102_5

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163-170. http://dx.doi.org/10.1111/1467-9280.00328

Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 81, 800–814. doi:10.1037/0022-3514.81.5.800

Epstein, S. (1980). The stability of behavior: II. Implications for psychological research. American Psychologist, 35(9), 790-806. http://dx.doi.org/10.1037/0003-066X.35.9.790

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56-68. doi:10.1111/jopy.12082

Falk, C., & Heine, S.J. (2015). What is implicit self-esteem, and does it vary across cultures? Personality and Social Psychology Review, 19, 177-98.

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79(6), 1022-1038. http://dx.doi.org/10.1037/0022-3514.79.6.1022

Fazio, R. H., & Olson, M. A. (2003). Implicit measures in social cognition research: Their meaning and use. Annual Review of Psychology, 54, 297–327. http://dx.doi.org/10.1146/annurev.psych.54.101601.145225

Fazio, R.H., Sanbonmatsu, D.M., Powell, M.C., & Kardes, F.R. (1986). On the automatic activation of attitudes. Journal of Personality and Social Psychology, 50, 229–238.

Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability of implicit racial evaluations. Social Psychology, 41, 137–146. doi:10.1027/1864-9335/a000020

Gawronski, B., LeBel, E. P., & Peters, K. R. (2007). What do implicit measures tell us?: Scrutinizing the validity of three common assumptions. Perspectives on Psychological Science, 2(2), 181-193. http://dx.doi.org/10.1111/j.1745-6916.2007.00036.x

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17–41. http://dx.doi.org/10.1037/a0015575

Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Race attitude measures predicted vote in the 2008 U. S. Presidential Election. Analyses of Social Issues and Public Policy, 9, 241–253.

Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369 –1385. http://dx.doi.org/10.1177/0146167205275613

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist. Advance online publication. http://dx.doi.org/10.1037/amp0000364

Lai, C.K., Hoffman, K.M., & Nosek, B.A. (2013). Reducing Implicit Prejudice. XX

McConahay, J.B. (1986). Modern racism, ambivalence, and the modern racism scale. In J.F. Dovidio & S.L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91–125). Orlando, FL: Academic Press

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1-8.

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171-192. http://dx.doi.org/10.1037/a0032734

Pelham, B. W., & Swann, W. B. (1989). From self-conceptions to self-worth: On the sources and structure of global self-esteem. Journal of Personality and Social Psychology, 57, 672– 680

Rosenberg, M. (1965). Society and the Adolescent Self-image. Princeton, NJ: Princeton University Press.

Schneider, D. J. (1973). Implicit personality theory: A review. Psychological Bulletin, 79(5), 294-309. http://dx.doi.org/10.1037/h0034496

Simms, L.J., Zelazny, K., Yam, W.H., & Gros, D.F. (2010). Self-informant agreement for personality and evaluative person descriptors: Comparing methods for creating informant measures. European Journal of Personality, 24(3), 207-221.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290. doi: 10.1111/j.1745-6924.2009.01125.x

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490-497. http://dx.doi.org/10.1016/j.jrp.2007.07.005

The Race IAT: A Case Study of the Validity Crisis in Psychology

Good science requires valid measures. This statement is hardly controversial. Not surprisingly, authors of psychological measures routinely claim that their measures are valid. However, validation research is expensive and difficult to publish in prestigious journals. As a result, psychological science has a validity crisis. Many measures are used in hundreds of articles without clear definitions of constructs and without quantitative information about their validity (Schimmack, 2010).

The Implicit Association Test (IAT) is no exception. The IAT was introduced in 1998 with strong and highly replicable evidence that average attitudes towards object pairs (e.g., flowers vs. spiders) can be measured with reaction times in a classification task (Greenwald et al., 1998). Although the title of the article promised a measure of individual differences, the main evidence in the article was mean differences between groups. Thus, the original article provided little evidence that the IAT is a valid measure of individual differences.

The use of the IAT as a measure of individual differences in attitudes requires scientific evidence that test scores are linked to variation in attitudes. Key evidence for the validity of a test includes reliability, convergent validity, discriminant validity, and incremental predictive validity (Campbell & Fiske, 1959).

The validity of the IAT as a measure of attitudes has to be examined on a case-by-case basis because the link between associations and attitudes can vary depending on the attitude object. For attitude objects like soft drinks (e.g., Coke vs. Pepsi), associations may be strongly related to attitudes. In fact, the IAT has good predictive validity for choices between two soft drinks (Hofmann, Gawronski, Gschwendner, & Schmitt, 2005). However, it lacks convergent validity when it is used to measure self-esteem (Bosson, Swann, & Pennebaker, 2000).

The IAT is best known as a measure of prejudice, racial bias, or attitudes of White Americans towards African Americans. On the one hand, the inventor of the IAT, Greenwald, argues that the race IAT has predictive validity (Greenwald et al., 2009). Others take issue with the evidence: “Implicit Association Test scores did not permit prediction of individual-level behaviors” (Blanton et al., 2009, p. 567); “the IAT provides little insight into who will discriminate against whom, and provides no more insight than explicit measures of bias” (Oswald et al., 2013).

Nine years later, Greenwald and colleagues present a new meta-analysis of predictive validity of the IAT (Kurdi et al., 2018) based on 217 research reports and a total sample size of N = 36,071 participants. The results of this meta-analysis are reported in the abstract.

We found significant implicit–criterion correlations (ICCs) and explicit–criterion correlations (ECCs), with unique contributions of implicit (beta = .14) and explicit measures (beta = .11) revealed by structural equation modeling.

The problem with meta-analyses is that they aggregate information across diverse methods, measures, and criterion variables, and this meta-analysis showed high variability in predictive validity. Thus, the headline finding does not provide information about the predictive validity of the race IAT. As noted by the authors, “Statistically, the high degree of heterogeneity suggests that any single point estimate of the implicit–criterion relationship would be misleading” (p. 7).

Another problem of meta-analysis is that it is difficult to find reliable moderator variables if original studies have small samples and large sampling error. As a result, a non-significant moderator effect cannot be interpreted as evidence that results are homogeneous. Thus, a better way to examine the predictive validity of the race IAT is to limit the meta-analysis to studies that used the race IAT.

Another problem of small studies is that they introduce a lot of noise because point estimates are biased by sampling error. Stanley, Jarrell, and Doucouliagos (2010) made the ingenious suggestion to limit meta-analysis to the top 10% of studies with the largest sample sizes. As these studies have small sampling error to begin with, aggregating them will produce estimates with even smaller sampling error and inclusion of many small studies with high heterogeneity is not necessary. A smaller number of studies also makes it easier to evaluate the quality of studies and to examine sources of heterogeneity across studies. I used this approach to examine the predictive validity of the race IAT using the studies included in Kurdi et al.’s (2018) meta-analysis (data).

Description of the Data

The datafile contained the variable groupStemCat2 that coded the groups compared in the IAT. Only studies classified as groupStemCat2 == “African American and Africans” were selected, leaving 1328 entries (rows). Next, I selected only studies with an IAT-criterion correlation, leaving 1004 entries. Next, I selected only entries with a minimum sample size of N = 100, leaving 235 entries (more than 10%).

The 235 entries were based on 21 studies, indicating that the meta-analysis coded, on average, more than 10 different effects for each study.

The median IAT-criterion correlation across all 235 entries was r = .070. In comparison, the median r for the 769 entries with N < 100 was r = .044. Thus, selecting for studies with large N did not reduce the effect size estimate.

When I first computed the median for each study and then the median across studies, I obtained a similar median correlation of r = .065. There was no significant correlation between sample size and the median IAT-criterion correlation across the 21 studies, r = .12. Thus, there is no evidence of publication bias.
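
For readers who want to verify these numbers, the selection steps can be reproduced with a few lines of pandas code. This is only a sketch: apart from groupStemCat2, the column names ("ICC", "N", "study") and the file name are hypothetical stand-ins for whatever the meta-analytic data file actually uses.

import pandas as pd

# Load the meta-analytic data file (hypothetical file name).
df = pd.read_csv("kurdi_2018_meta_analysis.csv")

# Keep only race IAT entries (groupStemCat2 is the variable named in the text above).
race = df[df["groupStemCat2"] == "African American and Africans"]        # 1328 entries

# Keep entries with an IAT-criterion correlation and a sample size of at least 100.
race = race[race["ICC"].notna()]                                          # 1004 entries
large = race[race["N"] >= 100]                                            # 235 entries from 21 studies

# Median correlation across entries, and the median of study-level medians.
print(large["ICC"].median())                                              # ~ .07
print(large.groupby("study")["ICC"].median().median())                    # ~ .065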

I now review the 21 studies in decreasing order of the median IAT-criterion correlation. I evaluate the quality of the studies with 1 to 5 stars ranging from lowest to highest quality. As some studies were not intended to be validation studies, this evaluation does not reflect the quality of a study per se. The evaluation is based on the ability of a study to validate the IAT as a measure of racial bias.

1. * Ma et al. (Study 2), N = 303, r = .34

Ma et al. (2012) used several IATs to predict voting intentions in the 2012 US presidential election. Importantly, Study 2 did not include the race IAT that was used in Study 1 (#15, median r = .03). Instead, the race IAT was modified to include pictures of the two candidates Obama and Romney. Although it is interesting that an IAT that requires race classifications of candidates predicted voting intentions, this study cannot be used to claim that the race IAT as a measure of racial bias has predictive validity because the IAT measures specific attitudes towards candidates rather than attitudes towards African Americans in general.

2. *** Knowles et al., N = 285, r = .26

This study used the race IAT to predict voting intentions and endorsement of Obama’s health care reforms. The main finding was that the race IAT was a significant predictor of voting intentions (Odds Ratio = .61; r = .20) and that this relationship remained significant after including the Modern Racism scale as predictor (Odds Ratio = .67, effect size r = .15). The correlation is similar to the result obtained in the next study with a larger sample.

3. ***** Greenwald et al. (2009), N = 1,057, r = .17

The most conclusive results come from Greenwald et al.’s (2009) study with the largest sample size of all studies. In a sample of N = 1,057 participants, the race IAT predicted voting intentions in the 2008 US election (Obama vs. McCain), r = .17. However, in a model that included political orientation as predictor of voting intentions, only explicit attitude measures added incremental predictive validity, b = .10, SE = .03, t = 3.98, but the IAT did not, b = .00, SE = .02, t = 0.18.

4. * Cooper et al., N = 178, r = .12

The sample size in the meta-analysis does not match the sample size of the original study. Although 269 patients were involved, the race IAT was administered to 40 primary care clinicians. Thus, predictive validity can only be assessed on a small sample of N = 40 physicians who provided independent IAT scores. Table 3 lists seven dependent variables and shows two significant results (p = .02, p = .02) for Black patients.

5. * Biernat et al. (Study 1), N = 136, r = .10

Study 1 included the race IAT and donations to a Black vs. other student organizations as the criterion variable. The negative relationship was not significant (effect size r = .05). The meta-analysis also included the shifting standard variable (effect size r = .14). Shifting standards refers to the extent to which participants shifted standards in their judgments of Black versus White targets’ academic ability. The main point of the article was that shifting standards rather than implicit attitude measures predict racial bias in actual behavior. “In three studies, the tendency to shift standards was uncorrelated with other measures of prejudice but predicted reduced allocation of funds to a Black student organization.” Thus, it seems debatable to use shifting standards as a validation criterion for the race IAT because the key criterion variable were the donations, while shifting standards were a competing indirect measure of prejudice.

6. ** Zhang et al. (Study 2), N = 196, r = .10

This study examined thought listings after participants watched a crime committed by a Black offender on Law and Order. “Across two programs, no statistically significant relations between the nature of the thoughts and the scores on IAT were found, F(2, 85) = 2.4, p < .11 for program 1, and F(2, 84) = 1.98, p < .53 for program 2.” The main limitation of this study is that thought listings are not a real social behavior. As the effect size for this study is close to the median, excluding it has no notable effect on the final result.

7. * Ashburn et al., N = 300, r = .09

The title of this article is “Race and the psychological health of African Americans.” The sample consists of 300 African American participants. Although it is interesting to examine racial attitudes of African Americans, this study does not address the question whether the race IAT is a valid measure of prejudice against African Americans.

8. *** Eno et al. (Study 1), N = 105, r = .09

This article examines responses to a movie set during the Civil Rights Era, “Remember the Titans.” After watching the movie, participants made several ratings about interpretations of events. Only one event, attributing Emma’s actions to an accident, showed a significant correlation with the IAT, r = .20, but attributions to racism also showed a correlation in the same direction, r = .10. For the other events, attributions showed similar, non-significant effect sizes (Girls’ interests, r = .12; Girls’ race, r = .07; Brick racism, r = -.10; Brick Black coach’s actions, r = -.10).

9. *** Aberson & Haag, N = 153, r = .07

Aberson and Haag administered the race IAT to 153 participants and asked questions about the quantity and quality of contact with African Americans. They found non-significant correlations with quantity, r = -.12, and quality, r = -.10, and a significant positive correlation with the interaction, r = .17. The positive interaction effect suggests that individuals with low contact, which implies low-quality contact as well, are not different from individuals with frequent, high-quality contact.

10. *Hagiwara et al., N = 106, r = .07

This is another study of Black patients and non-Black physicians. The main limitation is that there were only 14 physicians, and only 2 were White.

11. **** Bar-Anan & Nosek, N = 397, r = .06

This study used contact as a validation criterion. The race IAT showed a correlation of r = -.14 with group contact (N ranging from 492 to 647). The Brief IAT showed practically the same relationship, r = -.13. The appendix reports that contact was more strongly correlated with the explicit measures: thermometer, r = .27; preference, r = .31. Using structural equation modeling, as recommended by Greenwald and colleagues, I found no evidence that the IAT has unique predictive validity in the prediction of contact when explicit measures were included as predictors, b = .03, SE = .07, t = 0.37.

12. *** Aberson & Gaffney, N = 386, median r = .05

This study related the race IAT to measures of positive and negative contact, r = .10, r = -.01, respectively. Correlations with an explicit measure were considerably stronger, r = .38, r = -.35, respectively. These results mirror the results presented above.

13. * Orey et al., N = 386, median r = .04

This study examined racial attitudes among Black respondents. Although this is an interesting question, the data cannot be used to examine the predictive validity of the race IAT as a measure of prejudice.

14. * Krieger et al., N = 708, median r = .04

This study used the race IAT with 442 Black participants and criterion measures of perceived discrimination and health. Although this is a worthwhile research topic, the results cannot be used to evaluate the validity of the race IAT as a measure of prejudice.

15. *** Ma et al. (Study 1), N = 335, median r = .03

This study used the race IAT to predict voter intentions in the 2012 presidential election. The study found no significant relationship: “However, neither category-level measures were related to intention to vote for Obama (rs ≤ .06, ps ≥ .26)” (p. 31). The meta-analysis recorded a correlation of r = .045, based on email correspondence with the authors. It is not clear why the race IAT would not predict voting intentions in 2012, when it did predict voting intentions in 2008. One possibility is that Obama was now seen as an individual rather than as a member of a particular group, so that general attitudes towards African Americans no longer influenced voting intentions. Whatever the reason, this study does not provide evidence for the predictive validity of the race IAT.

16. **** Oliver et al., N = 105, median r = .02

This was an online study of 543 family and internal medicine physicians. They completed the race IAT and gave treatment recommendations for a hypothetical case. Race of the patient was experimentally manipulated. The abstract states that “physicians possessed explicit and implicit racial biases, but those biases did not predict treatment recommendations” (p. 177). The sample size in the meta-analysis is smaller because the total sample was broken down into smaller subgroups.

17. * Nosek & Hansen, N = 207, median r = .01

This study did not include a clear validation criterion. The aim was to examine the relationship between the race IAT and cultural knowledge about stereotypes: “In seven studies (158 samples, N = 107,709), the IAT was reliably and variably related to explicit attitudes, and explicit attitudes accounted for the relationship between the IAT and cultural knowledge.” The cultural knowledge measures were used as criterion variables. A positive relation, r = .10, was obtained for the item “If given the choice, who would most employers choose to hire, a Black American or a White American? (1 definitely White to 7 definitely Black).” A negative relation, r = -.09, was obtained for the item “Who is more likely to be a target of discrimination, a Black American or a White American? (1 definitely White to 7 definitely Black).”

18. *Plant et al., N = 229, median r = .00

This article examined voting intentions in a sample of 229 students. The results are not reported in the article. The meta-analysis reported a positive r = .04 and a negative r = -.04 for two separate entries with different explicit measures, which must be a coding mistake. As voting behavior has been examined in larger and more representative samples (#3, #15), these results can be ignored.

19. *Krieger et al. (2011), N = 503, r = .00

This study recruited 504 African Americans and 501 White Americans. All participants completed the race IAT. However, the study did not include clear validation criteria. The meta-analysis used self-reported experiences of discrimination as validation criterion. However, the important question is whether the race IAT predicts behaviors of people who discriminate, not the experience of victims of discrimination.

20. *Fiedorowicz, N = 257, r = -.01

This study is a dissertation and the validation criterion was religious fundamentalism.

21. *Heider & Skowronski, N = 140, r = -.02

This study separated the measurement of prejudice with the race IAT and the measurement of the criterion variables by several weeks. The criterion was cooperative behavior in a prisoner’s dilemma game. The results showed that “both the IAT (b = -.21, t = -2.51, p = .013) and the Pro-Black subscore (b = .17, t = 2.10, p = .037) were significant predictors of more cooperation with the Black confederate.” However, these results were false and have since been corrected (see Carlsson et al., 2018, for a detailed discussion).

Heider, J. D., & Skowronski, J. J. (2011). Addendum to Heider and Skowronski (2007): Improving the predictive validity of the Implicit Association Test. North American Journal of Psychology, 13, 17-20.

Discussion

In summary, a detailed examination of the race IAT studies included in the meta-analysis shows considerable heterogeneity in the quality of the studies and their ability to examine the predictive validity of the race IAT. The best study is Greenwald et al.’s (2009) study with a large sample and voting in the Obama vs. McCain race as the criterion variable. However, another voting study failed to replicate these findings in 2012. The second-best study was Bar-Anan and Nosek’s study with intergroup contact as a validation criterion, but it failed to show incremental predictive validity of the IAT.

Studies with physicians show no clear evidence of racial bias. This could be due to the professionalism of physicians, and the results should not be generalized to the general population. The remaining studies were considered unsuitable for examining predictive validity. For example, studies with African American participants did not use the IAT to measure prejudice against African Americans.

Based on this limited evidence, it is impossible to draw strong conclusions about the predictive validity of the race IAT. My assessment of the evidence is rather consistent with the authors of the meta-analysis, who found that “out of the 2,240 ICCs included in this meta-analysis, there were only 24 effect sizes from 13 studies that (a) had the relationship between implicit cognition and behavior as their primary focus” (p. 13).

This confirms my observation in the introduction that psychological science has a validation crisis because researchers rarely conduct validation studies. In fact, despite all the concerns about replicability, replication studies are much more numerous than validation studies. The consequence of the validation crisis is that psychologists routinely make theoretical claims based on measures with unknown validity. As shown here, this is also true for the IAT. At present, it is impossible to make evidence-based claims about the validity of the IAT because it is unknown what the IAT measures and how well it measures it.

Theoretical Confusion about Implicit Measures

The lack of theoretical understanding of the IAT is evident in Greenwald and Banaji’s (2017) recent article, in which they suggest that “implicit cognition influences explicit cognition that, in turn, drives behavior” (Kurdi et al., p. 13). This model would imply that implicit measures like the IAT do not have a direct link to behavior because conscious processes ultimately determine actions. This speculative model is illustrated with Bar-Anan and Nosek’s (#11) data that showed no incremental predictive validity for contact. The model can be transformed into a causal chain by changing the bidirectional path into an assumed causal relationship between implicit and explicit attitudes.

However, it is also possible to change the model into a single-factor model that treats unique variance in implicit and explicit measures as mere method variance.

Thus, any claims about implicit bias and explicit bias are premature because the existing data are consistent with various theoretical models. To make scientific claims about implicit forms of racial bias, it would be necessary to obtain data that can distinguish empirically between single-construct and dual-construct models.

Conclusion

The race IAT is 20 years old. It has been used in hundreds of articles to make empirical claims about prejudice. The confusion between measures and constructs has created a public discourse about implicit racial bias that may occur outside of awareness. However, this discourse is removed from the empirical facts. The most important finding of the recent meta-analysis is that a careful search of the literature uncovered only a handful of serious validation studies and that the results of these studies are suggestive at best. Even if future studies were to provide more conclusive evidence of incremental predictive validity, this finding would be insufficient to claim that the IAT is a valid measure of implicit bias. The IAT could have incremental predictive validity even if it were just a complementary measure of consciously accessible prejudice that does not share method variance with explicit measures. A multi-method approach is needed to examine the construct validity of the IAT as a measure of implicit racial bias. Such evidence simply does not exist. Greenwald and colleagues had 20 years and ample funding to conduct such validation studies, but they failed to do so. Instead, their articles consistently confuse measures and constructs and give the impression that the IAT measures unconscious processes that are hidden from introspection (“conscious experience provides only a small window into how the mind works”; “click here to discover your hidden thoughts”).

Greenwald and Banaji are well aware that their claims matter: “Research on implicit social cognition has witnessed higher levels of attention both from the general public and from governmental and commercial entities, making regular reporting of what is known an added responsibility” (Kurdi et al., 2018, p. 3). I concur. However, I do not believe that their meta-analysis fulfills this promise. An unbiased assessment of the evidence shows no compelling evidence that the race IAT is a valid measure of implicit racial bias; and without a valid measure of implicit racial bias, it is impossible to make scientific statements about implicit racial bias. I think the general public deserves to know this. Unfortunately, no scientific evidence is needed to know that prejudice and discrimination still exist. Ideally, psychologists will spend more effort on developing valid measures of racism that can provide trustworthy information about variation across individuals, geographic regions, groups, and time. Many people believe that psychologists are already doing this, but this review of the literature shows that this is not the case. It is high time to actually do what the general public expects from us.

No Incremental Predictive Validity of Implicit Attitude Measures

The general public has accepted the idea of implicit bias; that is, individuals may be prejudiced without being aware of it. For example, in 2018 Starbucks closed their stores for one day to train employees to detect and avoid implicit bias (cf. Schimmack, 2018).

However, among psychological scientists the concept of implicit bias is controversial (Blanton et al., 2009; Schimmack, 2019). The notion of implicit bias is only a scientific construct if it can be observed with scientific methods, and this requires valid measures of implicit bias.

Valid measures of implicit bias require evidence of reliability, convergent validity, discriminant validity, and incremental predictive validity. Proponents of implicit bias claim that measures of implicit bias have demonstrated these properties. Critics are not convinced.

For example, Cunningham, Preacher, and Banaji (2001) conducted a multi-method study and claimed that their results showed convergent validity among implicit measures and that implicit measures correlated more strongly with each other than with explicit measures. However, Schimmack (2019) demonstrated that a model with a single factor fit the data better and that the explicit measures loaded higher on this factor than the evaluative priming measure. This finding challenges the claim that implicit measures possess discriminant validity. That is, they are implicit measures of racial bias, but they are not measures of implicit racial bias.

A forthcoming meta-analysis claims that implicit measures have unique predictive validity (Kurdi et al., 2018). The average effect size for the correlation between an implicit measure and a criterion was r = .14. However, this estimate is based on studies across many different attitude objects and includes implicit measures of stereotypes and identity. Not surprisingly, the predictive validity was heterogeneous. Thus, the average does not provide information about the predictive validity of implicit measures of implicit bias. The most important observation was that sample sizes of many studies were too small to investigate predictive validity given the small expected effect size. Most studies had sample sizes with fewer than 100 participants (see Figure 1).

A notable exception is a study of voting intentions in the historic 2008 presidential election, where US voters had a choice to elect the first Black president, Obama, or the Republican candidate McCain. A major question at that time was how much race and prejudice would influence the vote.

Greenwald, Tucker Smith, Sriram, Bar-Anan, and Nosek (2009) conducted a study to address this question.

They obtained data from N = 1,057 participants who completed online implicit measures and responded to survey questions.

The key outcome variable was a simple dichotomous question about voting intentions. The sample was not a nationally representative sample, as indicated by 84.2% declared votes for Obama versus 15.8% declared votes for McCain.

The predictor variables were two self-report measures of prejudice (feeling-thermometer, Likert scale), two implicit measures (Brief IAT, AMP), the Symbolic Racism Scale, and a measure of political orientation (Conservative vs. Liberal).

The correlations among all measures were reported in Table 1.

The results for the Brief IAT (BIAT) are highlighted. First, the BIAT does predict voting intentions (r = .17). Second, the BIAT shows convergent validity with the second implicit measure, the Affect Misattribution Procedure (AMP). Third, the BIAT also correlates with the explicit measures of racial bias. Most important, the correlations with the implicit AMP are weaker than the correlations with the explicit measures. This finding confirms Schimmack’s (2019) finding that implicit measures lack discriminant validity.

The correlation table does not address the question whether implicit measures have incremental predictive validity. To examine this question, I fit a structural equation model to the reproduced covariance matrix based on the reported correlations and standard deviations using MPLUS8.2. The model shown in Figure 1 had good overall fit, chi2(9, N = 1057) = 15.40, CFI = .997, RMSEA = .026, 90%CI = .000 to .047.
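
For readers who want to reproduce this kind of analysis, the covariance matrix can be rebuilt from the published correlations and standard deviations and then passed to any SEM program. The sketch below uses numpy with placeholder values; the actual numbers come from Greenwald et al.’s Table 1 and are not reproduced here.

import numpy as np

# Correlation matrix and standard deviations (illustrative placeholder values, not Table 1).
R = np.array([[1.00, 0.30, 0.17],
              [0.30, 1.00, 0.20],
              [0.17, 0.20, 1.00]])
sds = np.array([1.2, 0.8, 0.5])

# Covariance = D * R * D, where D is a diagonal matrix of standard deviations.
Sigma = np.outer(sds, sds) * R

# Sanity check: the reproduced matrix should be positive definite (Cholesky fails otherwise).
np.linalg.cholesky(Sigma)
print(Sigma)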

The model shows that explicit and implicit measures of racial bias load on a common factor (att). Whereas the explicit measures share method variance, the residuals of the two implicit measures are not correlated. This confirms the lack of discriminant validity; that is, there is no unique variance shared only by the implicit measures. The strongest predictor of voting intentions is political orientation. Symbolic racism is a mixture of conservatism and racial bias, and it has no unique relationship with voting intentions. Racial bias does make a unique contribution to voting intentions (b = .22, SE = .05, t = 4.4). The blue path shows that the BIAT does have predictive validity above and beyond political orientation, but the effect is indirect. That is, the BIAT is a measure of racial bias, and racial bias contributes to voting intentions. The red path shows that the BIAT has no unique relationship with voting intentions. The negative coefficient is not significant. Thus, there is no evidence that the unique variance in the BIAT reflects some form of implicit racial bias that influences voting intentions.

In short, these results provide no evidence for the claim that implicit measures tap implicit racial biases. In fact, there is no scientific evidence for the concept of implicit bias, which would require evidence of discriminant validity and incremental validity.

Conclusion

The use of structural equation modeling (SEM) was highly recommended by the authors of the forthcoming meta-analysis (Kurdi et al., 2018). Here I applied SEM to the best available data, with multiple explicit and implicit measures, an important criterion variable, and a large sample size that is sufficient to detect small relationships. Contrary to the meta-analysis, the results do not support the claim that implicit measures have incremental predictive validity. In addition, the results confirmed Schimmack’s (2019) finding that implicit measures lack discriminant validity. Thus, the construct of implicit racial bias lacks empirical support. Implicit measures like the IAT are best considered implicit measures of racial bias that is also reflected in explicit measures.

With regard to the political question whether racial bias influenced voting in the 2008 election, these results suggest that racial bias did indeed matter. Using only explicit measures would have underestimated the effect of racial bias due to the substantial method variance in these measures. Thus, the IAT can make an important contribution to the measurement of racial bias because it doesn’t share method variance with explicit measures.

In the future, users of implicit measures need to be more careful in their claims about the construct validity of implicit measures. Greenwald et al. (2009) repeatedly conflate implicit measures of racial bias with measures of implicit racial bias. For example, the title claims that “Implicit Race Attitudes Predicted Vote”; the term “implicit race attitude measure” is ambiguous because it could mean an implicit measure or an implicit attitude, whereas the term “implicit measures of race attitudes” implies that the measures are implicit but the construct is racial bias; otherwise it would be “implicit measures of implicit racial bias.” The confusion arises from a long tradition in psychology of conflating measures and constructs (e.g., intelligence is whatever an IQ test measures) (Campbell & Fiske, 1959). Structural equation modeling makes it clear that measures (boxes) and constructs (circles) are distinct and that measurement theory is needed to relate measures to constructs. At present, there is clear evidence that implicit measures can measure racial bias, but there is no evidence that attitudes have an explicit and an implicit component. Thus, scientific claims about racial bias do not support the idea that racial bias is implicit. This idea is based on the confusion of measures and constructs in the social cognition literature.

No Discriminant Validity of Implicit and Explicit Prejudice Measures

Abstract

I reexamine Cunningham, Preacher, and Banaji’s claim that explicit and implicit attitude measures have discriminant validity. Contrary to their claim, a single factor model fits the data better than their hierarchical model with an explicit and an implicit attitude factor. I also show that attitudes over the two-month period were stable and not influenced by contextual factors. There is also no evidence that different implicit measures tap different types of unconscious bias. All measures have low validity as measures of prejudice. I conclude that the concept of unconscious or implicit prejudice lacks empirical support because implicit measures do not show discriminant validity from explicit measures.

Keywords:  Prejudice, Attitudes, Multi-Method, Discriminant Validity, Structural Equation Modeling

No Discriminant Validity of Implicit and Explicit Prejudice Measures

An article in Psychological Science (Cunningham, Preacher, & Banaji, 2001) reported the results of a longitudinal multi-method study of prejudice, that is, attitudes towards African Americans.  The article is frequently cited (446 citations in total, 30 of them in 2018; Web of Science, January 31) as evidence that explicit and implicit measures of prejudice measure two different constructs.  Explicit measures are assumed to assess consciously accessible and controllable attitudes, whereas implicit measures are assumed to assess uncontrollable aspects of attitudes that may exist outside of conscious awareness. Although the article was published nearly 20 years ago, it remains “the most sophisticated examination of measurement error and the interrelations among various implicit measures” (Fazio & Olson, 2003).  Thus, it provides the single most important empirical evidence for the construct validity of implicit measures of prejudice. Without evidence for discriminant validity, implicit measures might simply be implicit measures of the same construct that is measured by means of self-report measures. Although implicit measures have many advantages over self-report measures, this view implies that there is no need for a theoretical distinction between explicit and implicit forms of racial bias.

In this article, I reexamine Cunningham et al.’s structural equation model that was used to support the claim that “the two kinds of attitude measures also tap unique sources of variance (Cunningham et al., 2001); a single-factor solution does not fit the data” (p. 170).  To be blunt, I will show that this claim is false.  A single-factor model actually does fit the data better than the model reported in the original article. Second, I use the data to examine the contribution of stable traits and situational factors to measures of racial bias.  These results shed new light on the controversial question of the context sensitivity of implicit attitude measures.  Some experimental studies suggest that implicit measures are sensitive to situational factors (Dasgupta & Greenwald, 2001). However, effect sizes in these small studies tend to be inflated. A large replication study with thousands of participants found an effect size of only d = .08, suggesting that implicit measures mostly reflect stable individual differences in prejudice and measurement error (Joy-Gaba & Nosek, 2010).

Description of the Design and Measures

Participants were 93 students with complete data. Each student completed a single explicit measure of prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit measures: (a) the standard race IAT (Greenwald, McGhee, & Schwartz, 1998), (b) a response-window IAT (Cunningham et al., 2001), and (c) a response-window evaluative priming task (Fazio, Sanbonmatsu, Powell, & Kardes, 1986). The assessment was repeated on four occasions two weeks apart.

Reproducing the Original Model

Although it was not common to publish original data in 2001, structural equation modeling does not require access to the original data.  It is possible to reproduce or test alternative models simply based on the correlations and standard deviations.  Fortunately, Cunningham et al. (2001) published this information, and I was able to reproduce their model using MPLUS 8.2. Figure 1 shows the parameter estimates. They closely correspond to the original results.  The original article reported good model fit, “chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041 (90% confidence interval: 0.00, 0.071)” (p. 168).  The model fit for the reproduced model was very similar, chi2(100, N = 93) = 112, CFI = .977, RMSEA = 0.036, 90%CI = .000 to .067.  Thus, the model fit of the reproduced model serves as a comparison standard for the alternative models that I examined next.

Figure 1. Original Model with reproduced parameter estimates based on the published correlations and standard deviations.
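
As a plausibility check of the reported fit statistics, RMSEA can be computed directly from a model's chi-square value, its degrees of freedom, and the sample size. The sketch below uses the common formula; some programs divide by N - 1 instead of N, which makes no practical difference here.

import math

def rmsea(chi2, df, n):
    # RMSEA = sqrt(max(chi2 - df, 0) / (df * N))
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))

# Reproduced Cunningham et al. model: chi2(100, N = 93) = 112
print(round(rmsea(112, 100, 93), 3))   # ~ 0.036, consistent with the value reported above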

Bi-Factor Model

The original model is a hierarchical model with an implicit attitude factor as a second-order factor and method-specific first-order factors. Each first-order factor has four indicators for the four measurement occasions. A hierarchical model imposes constraints on the first-order loadings because they contribute both to the first-order relations among indicators of the same method and to the second-order relations of the different implicit methods to each other. An alternative way to model multi-method data is the bi-factor model (Chen, West, & Sousa, 2006).

A bi-factor model allows all measures to be directly related to the general trait factor that corresponds to the second-order factor in a hierarchical model.  However, bi-factor models may not be identified if there are no method factors. Thus, a first step is to allow for method-specific correlated residuals and to examine whether these correlations are positive.

The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065.  Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures.  The response-window IAT had no significant residual correlations.  This explains the high factor loading of the response-window IAT in the hierarchical model.  It does not suggest that this is the most valid measure; rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062. I then tried to create method factors for the remaining methods. For the IAT, a method factor could be created using only the first three occasions because the fourth occasion did not load on the method factor.  However, model fit decreased unless occasion 2 was allowed to correlate with occasion 4.  This unexpected finding is unlikely to reflect a real relationship.  Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064.  I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065. The first occasion did not load on the method factor. Model fit could be improved by fixing this loading to zero and by allowing for an additional correlation between occasions 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062.  However, there is no rationale for this relationship, and I retained the more parsimonious model.  Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068.  This was the final model (Figure 2).

Figure 2. Bi-Factor Model


The most important results are the factor loadings of the measures on the trait factor. Factor loadings for the Modern Racism Scale ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged from .43 to .54 (M = .47). Factor loadings for the response-window IAT ranged from .41 to .69 (M = .51).  The evaluative priming measures had the lowest factor loadings, ranging from .13 to .47 (M = .29).  In terms of absolute validity, all of these validity coefficients are low; for example, a single standard IAT administered on a single occasion has .47^2 = 22% valid variance.  Most important, these results suggest that the Modern Racism Scale and the IAT measure a single construct and that the low correlation between implicit and explicit measures reflects low convergent validity rather than high discriminant validity.

Context Sensitivity

The model in Figure 2 assumes that prejudice is stable over the two-month period of the study and that there are no systematic changes in prejudice levels. To test this assumption, I fitted a model with correlated residuals among measures taken on the same occasion.  Model fit improved, chi2(70, N = 93) = 75, CFI = .991, RMSEA = 0.027, 90%CI = .000 to .066.  However, the pattern of residual correlations did not reveal evidence for state variance.  At time 1, the IAT was correlated with the response-window IAT and evaluative priming, but the latter two were not correlated. In addition, evaluative priming was negatively related to modern racism.  At time 2, none of the correlations were significant, and fixing them to zero improved model fit, chi2(76, N = 93) = 78, CFI = .996, RMSEA = 0.016, 90%CI = .000 to .060.   At time 3, the two IAT measures were negatively correlated, but they correlated positively with the Modern Racism Scale.  Fixing the remaining four correlations to zero improved model fit, chi2(74, N = 93) = 78, CFI = .993, RMSEA = 0.023, 90%CI = .000 to .060.  At time 4, there were no significant correlations, and constraining the correlations to zero did not alter fit, chi2(76, N = 93) = 81, CFI = .991, RMSEA = 0.026, 90%CI = .000 to .064.  These analyses show that there were no systematic changes in prejudice over the course of the study.

Conclusion

A reexamination of Cunningham et al.’s (2001) multi-measure study of racial attitudes challenges the original conclusion that a single-factor model does not fit the data.  In fact, a single-factor model fits the data better than the original, hierarchical model.  Moreover, the new model shows that the original article falsely suggested that each measure has stable method variance. A careful analysis of residual correlations showed that only the Modern Racism Scale has substantial and stable method variance across all four occasions. Another finding was that implicit measures on the same occasion did not share variance with each other. This finding suggests that prejudice is a stable disposition, at least over a two-month period, and not a malleable state. This is consistent with the weak effects of experimental manipulations on IAT scores (Joy-Gaba & Nosek, 2010).

Factor loadings of the two IAT measures on the prejudice factor were slightly higher than those for the Modern Racism Scale.  This might suggest that implicit measures have slightly higher validity than explicit measures. However, this conclusion is limited to the Modern Racism Scale, which tends to show lower convergent validity with the IAT than more direct prejudice measures (Axt, 2018). In addition, the evaluative priming task had lower validity. Thus, validity has to be evaluated for each measure and it is impossible to make general statements about higher or lower validity of implicit versus explicit measures.

The main practical implication of this new look at old data is that the claim that implicit racial bias is a distinct form of prejudice is not supported by scientific evidence. Although implicit measures are less susceptible to socially desirable responding, they do not necessarily assess some unconscious form of prejudice.  This is not a criticism of implicit measures like the Implicit Association Test.  The ability to measure prejudice without self-reports is extremely valuable for prejudice researchers. Given the low validity of a single IAT, it should not be used for the assessment of individuals. However, measurement error is reduced in comparisons of groups of participants, and the IAT can reveal important group differences in prejudice levels. Proponents of the IAT have nevertheless argued that the IAT also measures some hidden form of prejudice that is not accessible to introspection (Kurdi et al., 2018). This claim requires demonstration of discriminant validity (Campbell & Fiske, 1959), and evidence of discriminant validity is lacking. Evidence for the unique predictive validity of the IAT is also controversial (Kurdi et al., 2018). A meta-analysis suggests that about 1% of the variance in criterion variables is explained by IAT scores. However, the authors also note that most studies were severely underpowered to detect such small effect sizes. Moreover, even unique predictive variance in mono-method studies does not demonstrate that the IAT measures a different construct. I therefore urge prejudice researchers to conduct high-powered multi-method studies to examine the discriminant and predictive validity of implicit prejudice measures.

References

Axt, J. R. (2018). The Best Way to Measure Explicit Racial Attitudes Is to Ask About Them. Social Psychological and Personality Science, 9, 896-906. https://doi.org/10.1177/1948550617728995

Chen, F., West, S.G., & Sousa, K.H. (2006). A Comparison of Bifactor and Second-Order Models of Quality of Life. Multivariate Behavioral Research, 41(2), 189-225. DOI: 10.1207/s15327906mbr4102_5

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163-170. http://dx.doi.org/10.1111/1467-9280.00328

Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 81, 800–814. doi:10.1037/0022-3514.81.5.800

Fazio, R.H., Sanbonmatsu, D.M., Powell, M.C., & Kardes, F.R. (1986). On the automatic activation of attitudes. Journal of Personality and Social Psychology, 50, 229–238.

Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability of implicit racial evaluations. Social Psychology, 41, 137–146. doi:10.1027/1864-9335/a000020

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist. Advance online publication. http://dx.doi.org/10.1037/amp0000364

McConahay, J.B. (1986). Modern racism, ambivalence, and the modern racism scale. In J.F. Dovidio & S.L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91–125). Orlando, FL: Academic Press

A New Look at False Discoveries in the Open Science Collaboration Reproducibility Project

In a groundbreaking article, a team of psychologists replicated 97 published studies with a significant result. The key finding was that only 36% of the 97 significant results could be replicated; that is, only 36% of the replication studies reproduced a significant result.

One conclusion that can be drawn from this result is that the average success rate in psychological research is around 40%, yet journals publish over 90% significant results, which shows that the published record is biased in favor of supporting evidence.

However, the result does not tell us how many of the published results were false positives. In this post, I use the replication studies to estimate the false discovery risk, that is, the maximum false discovery rate (Soric, 1989).

Soric demonstrated that the maximum false discovery rate is determined by the discovery rate, that is, the percentage of significant results among all statistical tests that were conducted. The problem is that we typically see only a biased sample of mostly significant results, so the discovery rate is unknown.

Brunner and Schimmack (2018) developed a method, z-curve, that makes it possible to estimate the discovery rate based on the power of the significant results. For example, if a significant result was obtained with 20% power, an average of 5 studies is needed to produce one significant result; thus, the expected value is 5. For false positives, the probability of a significant result is alpha, which is typically 5%, so 20 studies are needed to get one significant result.
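
Stated as a formula (a restatement of the reasoning above, not a new result), the expected number of studies needed for one significant result is the reciprocal of the probability of obtaining a significant result:

E[\text{studies per significant result}] = \frac{1}{\Pr(\text{significant})}, \qquad \frac{1}{.20} = 5 \ \text{(20\% power)}, \qquad \frac{1}{.05} = 20 \ \text{(false positives, } \alpha = .05\text{)}.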

Previously I used z-curve for sets of studies published in journals that were selected for significance. Here I use the results of the replication studies from the reproducibility project to estimate the false discovery risk in psychological science; or at least for the three journals that were used for the project (JPSP, JEP-LMC, Psych Science).

The dataset consists of 88 studies; 9 studies were excluded because the replication study was less than ideal (e.g., a smaller sample size than the original study). Because there is no selection for significance, z-curve used all studies to estimate the weights for the different levels of power that could reproduce the observed distribution of z-scores. The first finding is that the proportion of significant results in the reproducibility project, the observed discovery rate, was 38%. This is consistent with the discovery rate of 40% estimated from the power of the studies and confirms that the replication results are an unbiased sample.

The other statistics in the figure are less interesting because they focus on the studies that produced a significant result again. For example, the replication rate estimate of 74% suggests that the success rate would increase to 74% if only the 35 studies with significant replication results were replicated once more (re-replicated). Soric's FDR tells us that no more than 9% of these 35 studies are false discoveries. However, the more interesting question is how many of all 88 replicated studies could be false discoveries. This would be an estimate of the false discovery rate in psychology.

Obtaining this estimate is straightforward. We can simply use the weights of the model, which do not distinguish between significant and non-significant results; they apply to the whole distribution, and this does not change anything about the number of studies that would be needed to produce a significant result. We can therefore divide the weights by power and sum them to get the average number of studies that would be required to obtain one significant result for each of the 88 studies. The estimate is 4.18 studies for each significant result, which translates into a discovery rate of 24% (1/4.18). This suggests that experimental psychologists conduct, on average, about 4 studies for every significant result that gets published.
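
The arithmetic can be sketched in a few lines of Python. The weights and power levels below are hypothetical placeholders, chosen only so that the numbers land near the values reported in this post; they are not the actual z-curve estimates.

```python
# Hypothetical z-curve weights over power levels; placeholders for illustration,
# chosen so the arithmetic lands near the values reported in this post.
weights = [0.08, 0.25, 0.27, 0.20, 0.20]   # proportions, sum to 1
powers  = [0.05, 0.20, 0.40, 0.60, 0.80]   # power of each mixture component

# Dividing the weights by power and summing gives the expected number of
# studies needed to obtain one significant result.
studies_per_discovery = sum(w / p for w, p in zip(weights, powers))

# The discovery rate is the reciprocal of that number.
discovery_rate = 1.0 / studies_per_discovery

print(round(studies_per_discovery, 2), round(discovery_rate, 2))  # ~4.11, ~0.24
```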

We can then use Soric’s formula and find that a discovery rate of 24% yields a false discovery risk of 17%.
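
Plugging the two discovery rates into the soric_max_fdr helper sketched earlier reproduces these numbers:

```python
print(round(soric_max_fdr(0.38), 2))  # 0.09 -> at most ~9% of the significant results
print(round(soric_max_fdr(0.24), 2))  # 0.17 -> a false discovery risk of at most ~17%
```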

This estimate is somewhat larger than the estimate based on z-curve analysis of the original studies, which was only 10% (see Figure 2).

The reason could be that it is difficult to adjust for the use of questionable research practices in the original studies. It is also possible that problems with some replication studies produced false negatives that inflate the FDR estimate based on the replication studies. Either way, both estimates show that most published results in psychology journals are not false positives.

Although this is good news, it is important to realize that Soric's FDR focuses on the nil-hypothesis that the population effect size is exactly zero or even in the opposite direction. A bigger concern is that many published results have dramatically inflated effect sizes, and the true effects may be too small to be theoretically or practically relevant. Z-curve provides a way to estimate an FDR that treats studies with very low power as false positives. The model is fitted to the data with varying, fixed amounts of false positives; if model fit is not much worse than the fit of the free model, the data are consistent with the specified amount of false positives. This value is reported in Figure 1 and shows that up to 35% of published results could be false positives if studies with less than 17% power are counted as false positives. This estimate changes with the definition of a false positive.
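
The fit-comparison procedure itself is part of the z-curve method and is not reproduced here. As a much simpler, purely illustrative substitute, the sketch below reuses hypothetical mixture weights and counts the share of studies that fall into components at or below a chosen power threshold, showing how the number of "false positives" depends on where that threshold is drawn.

```python
# Hypothetical mixture weights over power levels; not the actual z-curve
# estimates from the reproducibility project.
weights = [0.15, 0.20, 0.25, 0.20, 0.20]
powers  = [0.05, 0.17, 0.40, 0.60, 0.80]

for threshold in (0.05, 0.17, 0.40):
    share = sum(w for w, p in zip(weights, powers) if p <= threshold)
    print(f"power <= {threshold:.2f}: {share:.0%} counted as false positives")
```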

Conclusion

In conclusion, this post showed how z-curve can be used to estimate the false discovery risk in psychological science based on an unbiased set of replication studies. As more replication studies are conducted, z-curve can provide increasingly valuable information about this risk.