“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with random error variance, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication of that study using the same sample size and significance criterion.
This post presents the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal. The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
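The formula in parentheses can be sketched in a few lines of Python. This is a minimal illustration, not the blog's own code. It assumes that inflation is computed as the success rate (share of significant results) minus the median observed power, which the summary above does not spell out, and the function names and example z-scores are hypothetical.

```python
from statistics import NormalDist, median

def observed_power(z, alpha=0.05):
    """Two-tailed observed power for a test with absolute z-score z."""
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)          # 1.96 for alpha = .05
    return (1 - nd.cdf(crit - z)) + nd.cdf(-crit - z)

def r_index(z_scores, alpha=0.05):
    """R-Index = median observed power - inflation.

    Assumption: inflation = success rate - median observed power.
    """
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    median_power = median(observed_power(abs(z), alpha) for z in z_scores)
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power - inflation

# Hypothetical z-scores from five just-significant results.
example = [2.1, 2.3, 2.0, 2.5, 2.2]
print(round(r_index(example), 2))
```

When every reported result is significant but observed power is only moderate, the inflation term is large and the index drops well below the median observed power.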
The Test of Insufficient Variance (TIVA) is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
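The variance test described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the blog's own implementation. The helper names are hypothetical, the chi-square CDF is computed with a textbook series expansion, and the z-scores are made-up values rather than Bem's data.

```python
import math
from statistics import variance

def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, h = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= h / (a + n)
        total += term
        n += 1
    return total * math.exp(-h + a * math.log(h) - math.lgamma(a))

def tiva(z_scores):
    """Return the observed variance of the z-scores and the left-tailed p-value."""
    k = len(z_scores)
    var = variance(z_scores)          # sample variance, n - 1 denominator
    stat = (k - 1) * var / 1.0        # expected variance under no bias is 1
    p_left = chi2_cdf(stat, k - 1)    # small p means the variance is too small
    return var, p_left

# Hypothetical z-scores that cluster just above 1.96.
var, p = tiva([2.0, 2.1, 2.2, 2.05, 2.15])
print(var, p)
```

A small left-tailed p-value indicates that the z-scores vary less than sampling error alone would produce.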
5. MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.” The results suggest that many of the cited findings are difficult to replicate.
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
9. Hidden figures: Replication failures in the stereotype threat literature. A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.
10. My journey towards estimation of replicability. In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.
Zou, C., Schimmack, U., & Gere, J. (2013). The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model. Psychological Assessment, 25(4), 1247–1254.
In the subjective indicators tradition, well-being is defined as a match between an individual’s actual life and his or her ideal life. Common well-being indicators are life-satisfaction judgments, domain satisfaction judgments, and measures of positive and negative affect (hedonic balance). These well-being indicators are routinely used to study well-being, but a formal measurement model of well-being is lacking. This article introduces a measurement model of well-being and examines the validity of self-ratings and informant ratings of well-being. Participants were 335 families (1 student with 2 parents, N = 1,005). The main findings were that (a) self-ratings and informant ratings are equally valid, (b) global life-satisfaction judgments and averaged domain satisfaction judgments are about equally valid, and (c) about 1/3 of the variance in a single indicator is valid. The main implication is that researchers should demonstrate convergent validity across multiple indicators by multiple raters.
Keywords: life satisfaction, affect, self-reports, informant-reports, multitrait–multimethod
Well-being is an important goal for many people; thus, social scientists from a variety of disciplines study well-being. A major problem for well-being scientists is that well-being is difficult to define and measure (Diener, Lucas, Schimmack, & Helliwell, 2009). These difficulties may threaten the validity of well-being measures. The aim of the present study is to examine the validity of the most commonly used measures of well-being.
A measure is valid if it measures what it is intended to measure. This definition of validity implies that it is important to define a construct (i.e., what is being measured?) before it is possible to evaluate the validity of a measure (Schimmack, 2010). Unfortunately, there is no agreement about the definition of the term well-being (Diener et al., 2009). It is therefore necessary to explain how we define the term well-being before we can examine the validity of well-being measures. We agree with philosophical arguments that well-being is a subjective concept (Diener, 1984; Sumner, 1996; see Diener, Suh, Lucas, & Smith, 1999, for a detailed discussion). A key criterion of a subjective definition of well-being is that the evaluation has to take the subjective values, motives, and ideals of individuals into account; that is, is his or her life going well for him or her? Accordingly, we define well-being as a match between an individual’s actual life and his or her ideal life. This definition is consistent with the prevalent definition of well-being in the social indicators tradition (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Veenhoven & Jonkers, 1984). This definition of well-being led to the creation of subjective well-being indicators such as life-satisfaction judgments (Diener, 1984). These measures are routinely used to make inferences about the determinants of well-being. These inferences implicitly assume that well-being measures are valid, but the literature on the validity of these measures is sparse and controversial (Schwarz & Strack, 1999; Schimmack & Oishi, 2005; Schneider & Schimmack, 2009). Since there is no gold standard to validate well-being measures, convergent validity between self-ratings and informant ratings of well-being has been used as the primary evidence for the validity of well-being measures (Diener et al., 2009).
However, a major limitation of previous studies is that they did not provide quantitative information about the amount of valid variance in different well-being measures (cf. Schneider & Schimmack, 2009). Our study addresses this problem and provides the first quantitative estimates of the amount of valid variance in the most widely used measures of well-being.
One problem in the estimation of effect sizes is that estimates based on small samples are imprecise because sampling error is substantial. To obtain data from a large sample, we used a round-robin design. In this design, participants are both targets and informants, thus increasing the number of targets. To ensure that informants have valid information about targets’ well-being, we used families as units of analysis. Specifically, we recruited university students and their biological parents (see Table 1).
A round-robin design creates two problems for a standard structural equation model. First, observations are not independent because participants are recruited as triads rather than as individuals. Second, the distinction between the three raters (student, mother, and father) does not provide information about the validity of self-ratings because self-ratings are a function of rater and target (i.e., the diagonal in Table 1).
To overcome these problems, we made use of advanced features in the structural equation modeling program Mplus 5.0 (Muthén & Muthén, 2007). First, we used the CLUSTER command to obtain adjusted standard errors and fit indices that take the interdependence among family members into account. Second, we rearranged the data to create variables with self-ratings (see Table 2). This creates missing data in the diagonal of the traditional round-robin design. To analyze these data with missing values, we used the MODEL = COMPLEX function of Mplus (Muthén & Muthén, 2007). Thus, our model included 16 (4 raters × 4 measures) observed variables.
A Measurement Model
Quantitative estimates of validity require a formal measurement model in which variation in well-being (the match between individuals’ actual and ideal lives) is an unobserved cause that produces variation in observed well-being measures (e.g., self-ratings of life-satisfaction; cf. Schimmack, 2010). Our measurement model of well-being (see Figure 1) is similar to Diener et al.’s (1999) theoretical model of well-being. It is also related to the causal systems model of subjective well-being (Busseri & Sadava, 2011). In this model, positive affect and negative affect are distinct affective experiences. For most people, feeling good and not feeling bad is an important part of an ideal life, and the balance of positive versus negative affect serves as an important basis for life-satisfaction judgments (Schimmack, Radhakrishnan, Oishi, Dzokoto, & Ahadi, 2002; Suh, Diener, Oishi, & Triandis, 1998). Consistent with these assumptions, positive affect and negative affect are distinct components of hedonic balance (using a formative measurement model), and hedonic balance influences well-being. The formative measurement model of hedonic balance makes no assumptions about the correlation between its components. As prior research often reveals a moderate negative correlation between positive affect and negative affect, our model allows for the two components to correlate with each other (Diener, Smith, & Fujita, 1995; Gere & Schimmack, 2011). The well-being factor is identified by two satisfaction measures, global life-satisfaction judgments and averaged domain satisfaction judgments. Prior studies often relied exclusively on global life-satisfaction judgments (Lucas, Diener, & Suh, 1996; Walker & Schimmack, 2008). The problem with this approach is that global life-satisfaction judgments can be influenced by focusing illusions (Kahneman, Krueger, Schkade, Schwarz, & Stone, 2006; but see Schimmack & Oishi, 2005).
Focusing illusions could produce systematic measurement error in global life-satisfaction judgments that could attenuate the influence of hedonic balance on well-being. To address this concern, our model included averaged domain satisfaction judgments as a second indicator of well-being. As averaged domain satisfaction judgments are not susceptible to focusing illusions, the focusing illusion hypothesis predicts that averaged domain satisfaction judgments have a higher loading on the well-being factor (i.e., are more valid) than global life-satisfaction judgments.
Model fit was assessed using standard criteria of acceptable model fit such as a comparative fit index (CFI) > .95, root-mean-square error of approximation (RMSEA) < .06, and standardized root-mean-square residual (SRMR) < .08 (Schermelleh-Engel, Moosbrugger, & Muller, 2003). Due to the large sample size of the present data (N = 1,005), tests of model comparison using p-values will often lead to misleading results (cf. Raftery, 1995). Therefore, we used the Bayesian information criterion (BIC) for model comparisons. Models with lower BIC values are preferable because they are more parsimonious. This is especially important in new research areas because small effects are less likely to replicate. Following Raftery’s (1995) standards, a difference in BIC values greater than 10 can be interpreted as very strong evidence to support the model with the lower BIC value.
Participants were 335 students at the University of Toronto and their parents (335 triads; N = 1,005). Of the 335 students, 235 were women and 100 were men, and the age ranged from 17 to 30 years (Mage = 19.56, SD = 2.23). The age of mothers ranged from 37 to 63 years (Mage = 48.25, SD = 5.08). The age of fathers ranged from 38 to 72 years (Mage = 51.67, SD = 5.67). Students were required to be living with both of their biological parents so that each member of the family had good knowledge of one another. Students from the university took part in the study for either $25 or course credit. Their parents each received $25 for participating in the study. Two hundred thirty-five students came to the laboratory with their parents to complete the study. One hundred students and their parents completed the study in their homes.
Participants who came into the laboratory filled out consent forms and were seated in separate rooms to ensure that reports were made independently. They filled out a series of questionnaires about themselves and about the other two members of their families. They were then debriefed and thanked for their participation. Students who took the questionnaires home met with a researcher who gave them detailed instructions and the questionnaire packages. Participants were asked to fill out the questionnaires in separate rooms and refrain from talking about their responses until all members of the family had completed the questionnaire. Each family member received an envelope, placed his or her own completed questionnaire in it, sealed the envelope, and signed it across the flap. Once the questionnaire packages were completed, participants returned them and were debriefed and thanked for their participation.
Since well-being is defined as an evaluation of an individual’s actual life, the assessment of well-being has to be retrospective. For this reason, we asked participants to think about the past 6 months when answering the questions. Additionally, since global judgments of life satisfaction can be influenced by temporarily accessible information (Schimmack & Oishi, 2005; Schwarz & Strack, 1999), the global self-ratings of life satisfaction were assessed first.
Global life evaluation. For the global evaluative judgments, the first three items of the Satisfaction With Life Scale were used (SWLS; Diener, Emmons, Larsen, & Griffin, 1985). The items ask participants to evaluate their lives on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). The first three items (“In most ways my life is close to my ideal”; “The conditions of my life are excellent”; “I am satisfied with my life”) were chosen because they have been shown to have better psychometric properties than the last two items of the scale (Oishi, 2006). Consistent with prior studies, the internal consistency of the three-item scale was good, alphas > .80 (α = .83 for students; α = .89 for mothers; α = .89 for fathers). The items for the informant reports were virtually the same, but the wording was changed to an informant report format (e.g., Kim et al., 2012). Informants were instructed to fill out the scale from the target’s perspective. For example, students serving as informants for their father would rate “In most ways my father thinks that his life is close to his ideal.” Ratings were made on 7-point Likert scales. The internal consistency of informant ratings was similar to the internal consistency of self-ratings (ranged from α = .85 to α = .93).
Averaged domain satisfaction. Domain satisfaction was assessed with single-item indicators for six important life domains, using satisfaction judgments (“I am satisfied with . . .”). The life domains were romantic life, work/academic life, health, recreational life, housing, and friendships. Responses were made on 7-point Likert scales ranging from 1 (strongly disagree) to 7 (strongly agree). The domains were chosen based on previous studies showing that these domains are rated as moderately to very important (Schimmack, Diener, & Oishi, 2002). We averaged these items to obtain an alternative measure of life evaluations. The informant version of the questionnaire changed the stem from “I am . . . ” to “My son/daughter/mother/father is . . . ” and “my” to
Positive and negative affect. Positive and negative affect were assessed using the Hedonic Balance Scale (Schimmack et al., 2002). The scale has three items for positive affect (pleasant, positive, good) and three items for negative affect (unpleasant, negative, bad). The items for positive and negative affect were averaged separately to create composites for positive and negative affect, respectively. All of the self-ratings for positive affect had a reliability of over .80 (α = .82 for students; α = .85 for mothers; α = .85 for fathers). Similarly, all of the self-ratings for negative affect had a reliability of over .75 (α = .80 for students; α = .75 for mothers; α = .78 for fathers). For the informant reports, “. . . how often do you experience the following feelings?” was replaced with “. . . how often does your mother/father/son/daughter experience the following feelings?” All of the informant reports had reliabilities of over .75 (range from α = .75 to α = .89).
Table 3 shows the correlations among the 16 variables created by crossing the four indicators (life satisfaction, domain satisfaction, positive affect, and negative affect) with the four raters (self, student informant, mother informant, and father informant). Note that since the self cannot also serve as the informant for the self, correlations between self-reports and informant reports are based on 66% of all observations. The correlations between the self-report measures were based on 100% of the observations.
Correlations between the same construct assessed with different methods (i.e., convergent validity coefficients) are bolded. All of the convergent validity coefficients were significantly greater than zero and exceeded a minimum value of r = .25. Convergent validity correlations for affective indicators (positive affect and negative affect) were lower than correlations for the evaluative indicators (life satisfaction and domain satisfaction). These findings replicate the results of a meta-analysis (Schneider &
Table 3 can also be used to examine whether each indicator measures well-being in a slightly different manner. Twenty-two out of 24 cross-indicator–cross-rater correlations were weaker than the convergent validity coefficients, indicating that the different indicators have unique variance. This finding replicates Lucas et al. (1996). However, Table 3 also shows that all well-being measures are related to each other. This pattern of results is consistent with the assumption that all measures reflect a common construct. Table 3 also shows stronger same-rater correlations than cross-rater correlations. This pattern is consistent with our assumption that ratings by a single rater are influenced by an evaluative bias (Anusic et al., 2009; Campbell & Fiske, 1959). Most important, Table 3 provides new information about informant–informant agreement. One notable pattern in the data is that the correlations between informant ratings by mothers (mother informant) and fathers (father informant) were stronger than correlations of informant ratings by parents with those by students as informants. There are two possible explanations for this pattern. First, it is possible that students’ informant reports are less valid than parents’ informant ratings. However, this interpretation of the data is inconsistent with the finding that self-ratings were more highly correlated with students’ informant ratings than with parents’ informant ratings. Therefore, we favor the second explanation that parents’ informant ratings share method variance. This interpretation is also consistent with other multirater studies that have demonstrated shared method variance between parents’ ratings of their children’s personality (Funder, Kolar, & Blackman, 1995).
Structural Equation Modeling
We fitted the measurement model in Figure 1 to our data. In the first model, we did not constrain coefficients. This model served as the base model for model comparisons to more parsimonious models with constrained coefficients. The first model with unconstrained coefficients had acceptable fit to the data, χ²(df = 78) = 104.41, CFI = 0.995, RMSEA = 0.018, standardized root-mean-square residual (SRMR) = 0.026; BIC = 31,102. Factor loadings of ratings by different raters of the same measure (e.g., life-satisfaction) showed very similar loadings. We therefore specified a model that constrained factor loadings and residuals for the four raters to be equal. This model implies that ratings by different raters are equally valid. The model with constrained parameters maintained good fit and had a lower (i.e., superior) BIC value, χ²(df = 102) = 148.18, CFI = 0.991, RMSEA = 0.021, SRMR = 0.041; BIC = 30,993. In the next model, we constrained the loadings on the rater-specific bias factors to be equal across raters. Again, model fit remained acceptable, and BIC decreased, indicating that rater bias is similar across raters, χ²(df = 117) = 188.48, CFI = 0.986, RMSEA = 0.025, SRMR = 0.068; BIC = 30,936. We retained this model as the final model. The parameter estimates of the final model and their 95% confidence intervals are listed in Table 4. For ease of interpretation, the main parameter estimates are also included in Figure 1.
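The sequence of model comparisons can be made explicit with a small calculation. The sketch below only restates the BIC values reported in this paragraph and applies the difference-greater-than-10 rule attributed to Raftery (1995) earlier in the article; the function name and model labels are illustrative.

```python
# BIC values as reported in the text, ordered from least to most constrained.
models = [
    ("unconstrained", 31102),
    ("equal loadings and residuals across raters", 30993),
    ("equal rater-bias loadings", 30936),
]

def bic_differences(models):
    """Drop in BIC when moving from each model to the next, more constrained one."""
    return [prev[1] - nxt[1] for prev, nxt in zip(models, models[1:])]

for (name, _), drop in zip(models[1:], bic_differences(models)):
    verdict = "very strong support" if drop > 10 else "not decisive"
    print(f"{name}: BIC lower by {drop} ({verdict})")
```

Both steps reduce BIC by far more than 10, which is why each more constrained model was preferred over its predecessor.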
The main finding was that the life-satisfaction factor and the average domain satisfaction factor had very high loadings on the well-being factor. Thus, our results provide no support for the hypothesis that focusing illusions undermine the validity of global life-satisfaction judgments. We also found a very strong effect of hedonic balance on the well-being factor. Yet, all three measures of well-being had significant residual variances, indicating that the measures are not redundant. Most important, about 20% of the variance in well-being was not accounted for by hedonic balance. This suggests that affective measures and evaluative judgments can show divergent patterns of correlations with predictor variables.
The factor loadings of the observed variables on the factor representing the shared variance among raters (e.g., self-ratings of life satisfaction [LS] on LS factor) can be interpreted as validity coefficients for specific constructs (e.g., the validity of a self-rating of life-satisfaction as a measure of life-satisfaction; cf. Schimmack, 2010). The validity coefficients of the four types of indicators were very similar (see Table 4). The validity coefficients suggest that about one third (29% to 38%) of the variance in a single indicator by a single rater (e.g., self-ratings of life-satisfaction) is valid variance.
It is important to keep in mind that these estimates examine the validity of a single rater with regard to a specific measure of well-being rather than the validity of these measures as measures of well-being. To examine the validity of specific measures as measures of the well-being factor in our measurement model, we need to estimate indirect effects of the well-being factor on specific measures. For example, self-ratings of life satisfaction load at .60 on the life satisfaction factor. However, this does not mean that self-ratings of life satisfaction capture 36% (.60 × .60) of valid variance of well-being, because life satisfaction is not a perfect indicator of well-being. Based on our model, the life satisfaction factor loads at .96 on the well-being factor. We also need to take this measurement error into account to examine the validity of self-ratings of life satisfaction in assessing well-being (.96 × .60 = .58, valid variance = 33%).
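The arithmetic in this paragraph follows the usual path-tracing rule: loadings along an indirect path are multiplied, and the squared product is the share of valid variance. A minimal sketch, with a hypothetical helper name and the two loadings quoted above:

```python
import math

def valid_variance(*loadings):
    """Multiply loadings along a path; return (validity coefficient, valid variance)."""
    path = math.prod(loadings)
    return path, path ** 2

# Well-being -> life-satisfaction factor (.96) -> self-rating (.60)
validity, share = valid_variance(0.96, 0.60)
print(round(validity, 2), round(share, 2))
```

Using the single loading of .60 alone would give 36%, which overstates the validity of the self-rating as a measure of well-being.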
Our study provides the first quantitative estimates of the validity of various well-being measures using a theoretically grounded model of well-being. Our main findings were that (a) about one third of the variance in a single well-being indicator is valid variance, (b) self-ratings are neither significantly more nor less valid than ratings by a single well-acquainted informant, (c) a large portion of the valid variance in a specific type of indicator is shared across indicators, and (d) hedonic balance and evaluative judgments have some unique variance.
We found no support for the focusing illusion hypothesis. If the distinction between hedonic balance and global life-satisfaction judgments were caused by a focusing illusion, the factor loading of life satisfaction on well-being should have been lower than the factor loading of the average domain satisfaction judgment. However, the actual results showed a slightly reversed pattern. This suggests that unique variance in evaluative judgments reflects valid well-being variance because individuals do not rely exclusively on hedonic balance to evaluate their lives. This finding provides empirical support for philosophical arguments against purely hedonistic definitions of well-being (Sumner, 1996). At the same time, the overlap between evaluative judgments and hedonic balance is substantial, indicating that positive experiences make an important contribution to well-being for most individuals. Another noteworthy finding was that global life-satisfaction judgments and averaged domain satisfaction judgments were approximately equally valid. This finding contradicts previous findings that averaged domain satisfaction judgments were more valid in a study with friends as informants (Schneider & Schimmack, 2010). Future research needs to examine whether the type of informant is a moderator. For example, it is possible that global life-satisfaction judgments are more difficult to make, which gives family members an advantage over friends. Subsequently, we discuss the main implications of our findings for the use of well-being measures in the assessment of individuals’ well-being and for the use of well-being measures in policy decisions.
Validity of Well-Being Indicators
Our results suggest that about one third of the variance in a single well-being indicator by a single rater is valid variance. This finding has important implications for the interpretation of studies that rely on a single well-being indicator as a measure of well-being. For example, many important findings about well-being are based on a single global life-satisfaction rating in the German Socio-Economic Panel (e.g., Lucas & Schimmack, 2009). It is well-known that observed effect sizes in these studies are attenuated by random measurement error and that it would be desirable to correct effect size estimates for unreliability (Schmidt & Hunter, 1996). However, systematic measurement error can further attenuate observed effect sizes. Schimmack (2010) proposed that quantitative estimates of validity could be used to disattenuate observed effect sizes for invalidity. To illustrate the implications of correcting for invalidity in well-being indicators, we use Kahneman et al.’s (2006) finding that household income was a moderate predictor of self-reported life-satisfaction (r = .32). Our findings suggest that this observed relationship underestimates the relationship between household income and well-being. To disattenuate the observed relationship, the observed correlation has to be divided by the validity coefficient (i.e., .96 × .60 = .58). Thus, the corrected estimate of the true effect size would increase to r = .56 (.32/.58), which is considered a strong effect size (Cohen, 1992). Researchers may be reluctant to trust adjusted effect sizes because they rely on assumptions about validity. However, the common practice of relying on observed relationships as estimates of effect sizes also relies on an implicit assumption, namely, that the observed measure is perfectly valid. In comparison to an assumption of 100% valid variance in a single global life-satisfaction judgment, our estimate of about one-third valid variance is more realistic and supported by empirical evidence.
Nevertheless, our findings should only be treated as a first estimate and a benchmark for future studies. Future research needs to replicate our findings and examine moderating factors of validity in well-being measures.
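The correction for invalidity described in this section amounts to a single division. The sketch below uses a hypothetical function name and the values quoted in the text; it shows that the reported corrected estimate of .56 is obtained when the unrounded validity coefficient (.96 × .60 = .576) is used.

```python
def disattenuate(r_observed, validity):
    """Correct an observed correlation for invalidity in one of the measures."""
    return r_observed / validity

r_income = 0.32              # observed income / life-satisfaction correlation
validity = 0.96 * 0.60       # validity of a self-rating as a measure of well-being

print(round(disattenuate(r_income, validity), 2))
```

The correction is only as trustworthy as the validity estimate, so the result is best read as a plausible upper adjustment rather than a precise effect size.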
Self-Reports Versus Informant Reports
Schimmack (2009) noted that previous studies failed to compare the validity of self-ratings and informant ratings. Our results suggest that self-ratings and ratings by a single well-acquainted informant are approximately equally valid. While this is a surprising finding given the subjective nature of well-being, it is not uncommon in personality psychology to find evidence of equal or sometimes greater validity in informant ratings than self-ratings. For instance, informant reports of personality often provide better predictive validity than self-reports (e.g., Kolar, Funder, & Colvin, 1996). Since we did not have any outcome measure of well-being (e.g., suicide) in the present study, we could not test for the predictive validity of self- and informant reports. However, this is an important avenue for future research. To our knowledge, no study has compared self-ratings and informant ratings using life-events that are known to influence well-being such as marriage, divorce, or unemployment (Diener, Lucas, & Scollon, 2006).
Informant ratings also have an important advantage over self-ratings. Namely, it is possible to obtain ratings from multiple informants, but there is only one self to provide self-ratings. Aggregation of informant ratings can substantially increase the validity of informant ratings. We computed well-being indicators for single raters and multiple raters using the following weights (Well-Being = 1.5 Life Satisfaction + 1.5 Domain Satisfaction + 2 Positive Affect – 1 Negative Affect) and computed the correlation with the well-being factor in Figure 1. The correlations were r = .62 for self-ratings, r = .77 for an aggregate of three informant ratings, and r = .81 for an aggregate of all four ratings. Although the difference between .62 and .77 may not seem impressive, it implies that aggregation across raters can increase the amount of valid variance from one third to two thirds of the observed variance. This finding suggests that clinicians can benefit considerably from obtaining well-being measures from multiple informants to assess
Our study has numerous limitations. The use of a convenience sample from a specific population means that the generalizability of our findings needs to be examined in samples drawn from other populations. However, our results are broadly consistent with meta-analytic findings (Schneider & Schimmack, 2009). Another limitation was that parents are not independent raters and appear to share rating biases. In the future, it would be desirable to obtain ratings from independent raters (e.g., friends and parents). Finally, our conclusions are limited by the assumptions of our model. While it is possible to fit other models to our data in Table 3 (e.g., Busseri & Sadava, 2011), the alternative models each have their own limitations. Future studies should test these alternative models to examine whether they reveal findings that differ from those of the present study. We encourage readers to fit alternative models to the correlation matrix in Table 3 and examine whether these models provide better fit to our data. We consider our model merely as a plausible first attempt to create a measurement model of well-being that can underpin empirical studies of well-being.
Although the study of happiness has been of great interest to many researchers and the general public, the validity of well-being measures has not improved for the past 50 years (Schneider & Schimmack, 2009). In order for well-being researchers to provide accurate information about the determinants of well-being, it is crucial to use a valid method to assess well-being. If invalid measures are used, findings that rely on such measures will also lack validity. From the current study, we found that only about one third of the variance in a self-report measure of well-being is valid. In order to increase the validity of well-being measures, multiple methods of well-being should be used. When better measures are used, researchers can also be more confident that their findings can be trusted.
Andrews, F. M., & Withey, S. B. (1976). Social indicators of well-being: America’s perception of life quality. New York, NY: Plenum.
Anusic, I., Schimmack, U., Pinkus, R. T., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97, 1142–1156. doi:10.1037/a0017159
Busseri, M. A., & Sadava, S. W. (2011). A review of the tripartite structure of subjective well-being: Implications for conceptualization, operationalization, analysis, and synthesis. Personality and Social Psychology Review, 15, 290–314. doi:10.1177/1088868310391271
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016
Cantril, H. (1965). The pattern of human concerns (Vol. 4). New Brunswick, NJ: Rutgers University Press.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75. doi:10.1207/s15327752jpa4901_13
Diener, E., Lucas, R. E., Schimmack, U., & Helliwell, J. F. (2009). Well-being for public policy. New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780195334074.001.0001
Diener, E., Lucas, R. E., & Scollon, C. N. (2006). Beyond the hedonic treadmill: Revising the adaptation theory of well-being. American Psychologist, 61, 305–314.
Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130–141.
Diener, E., Suh, E. M., Lucas, R. E., & Smith, H. L. (1999). Subjective well-being: Three decades of progress. Psychological Bulletin, 125, 276–302.
Funder, D. C., Kolar, D. C., & Blackman, M. C. (1995). Agreement among judges of personality: Interpersonal relations, similarity, and acquaintanceship. Journal of Personality and Social Psychology, 69, 656–672.
Gere, J., & Schimmack, U. (2011). A multi-occasion multi-rater model of affective dispositions and affective well-being. Journal of Happiness Studies, 12, 931–945. doi:10.1007/s10902-010-9237-3
Kahneman, D., Krueger, A. B., Schkade, D., Schwarz, N., & Stone, A. A. (2006). Would you be happier if you were richer? A focusing illusion. Science, 312, 1908–1910. doi:10.1126/science.1129688
Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations of well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102, 856–873. doi:10.1037/a0026803
Kolar, D. W., Funder, D. C., & Colvin, C. R. (1996). Comparing the accuracy of personality judgments by the self and knowledgeable others. Journal of Personality, 64, 311–337. doi:10.1111/j.1467-6494.1996
Lucas, R. E., Diener, E., & Suh, E. (1996). Discriminant validity of well-being measures. Journal of Personality and Social Psychology, 71, 616–628.
Lucas, R. E., & Schimmack, U. (2009). Income and well-being: How big is the gap between the rich and the poor? Journal of Research in Personality, 43, 75–78. doi:10.1016/j.jrp.2008.09.004
Muthén, L. K., & Muthén, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthén &
Oishi, S. (2006). The concept of life satisfaction across cultures: An IRT analysis. Journal of Research in Personality, 40, 411–423. doi:10.1016/j.jrp.2005.02.002
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–164. doi:10.2307/271063
Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8, 23–74.
Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257.
Schimmack, U., Diener, E., & Oishi, S. (2002). Life-satisfaction is a momentary judgement and a stable personality characteristic: The use of chronically accessible and stable sources. Journal of Personality, 70, 345–384. doi:10.1111/1467-6494.05008
Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible information on life satisfaction judgments. Journal of Personality and Social Psychology, 89, 395–406.
Schimmack, U., Radhakrishnan, P., Oishi, S., Dzokoto, V., & Ahadi, S. (2002). Culture, personality, and subjective well-being: Integrating process models of life satisfaction. Journal of Personality and Social Psychology, 82, 582–593.
Schimmack, U., Schupp, J., & Wagner, G. G. (2008). The influence of environment and personality on the affective and cognitive component of subjective well-being. Social Indicators Research, 89, 41–60. doi:10.1007/s11205-007-9230-3
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223. doi:10.1037/1082-989X.1.2.199
Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376.
Schneider, L., & Schimmack, U. (2010). Examining sources of self-informant agreement in life-satisfaction judgments. Journal of Research in Personality, 44, 207–212.
Schwarz, N., & Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). New York, NY: Russell-Sage.
Suh, E., Diener, E., Oishi, S., & Triandis, H. C. (1998). The shifting basis of life satisfaction judgments across cultures: Emotions versus norms. Journal of Personality and Social Psychology, 74, 482–493.
Sumner, L. W. (1996). Welfare, happiness, and ethics. New York, NY: Oxford University
Veenhoven, R., & Jonkers, T. (1984). Conditions of happiness (Vol. 2). Dordrecht, the Netherlands: Reidel.
Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42, 490–497. doi:10.1016/j.jrp.2007.07.005
Most published psychological measures are unvalid. (subtitle) *unvalid = the validity of the measure is unknown.
Eight years ago, psychologists started to realize that they have a replication crisis. Many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true or false.
The replication crisis is sometimes attributed to the lack of replication studies before 2011. This is not the case: most published results were replicated successfully. However, these successes were entirely predictable from the fact that only successful replications would be published (Sterling, 1959). These sham replication studies provided illusory evidence for theories that have been discredited over the past eight years by credible replication studies.
New initiatives that are called open science are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow.
This blog post addresses another problem in psychological science. I call it the validation crisis. Replicability is only one necessary feature of a healthy science. Another necessary feature of a healthy science is the use of valid measures. This feature of a healthy science is as obvious as the need for replicability. To test theories that relate theoretical constructs to each other (e.g., construct A influences construct B for individuals drawn from population P under conditions C), it is necessary to have valid measures of constructs. However, it is unclear which criteria a measure has to fulfill to have construct validity. Thus, even successful and replicable tests of a theory may be false because the measures that were used lacked construct validity.
The classic article on “Construct Validity” was written by two giants in psychology: Cronbach and Meehl (1955). Every graduate student of psychology and surely every psychologist who has published a psychological measure should be familiar with this article.
The article was the result of an APA task force that tried to establish criteria, now called psychometric properties, for tests to be published. The result of this project was the creation of the construct “Construct validity.”
The chief innovation in the Committee’s report was the term construct validity. (p. 281).
Cronbach and Meehl provide their own definition of this construct.
Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” (p. 282).
In modern language, construct validity is the relationship between variation in observed test scores and a latent variable that reflects corresponding variation in a theoretical construct (Schimmack, 2010).
Thinking about construct validity in this way makes it immediately obvious why it is much easier to demonstrate predictive validity, which is the relationship between observed test scores and observed criterion scores, than to establish construct validity, which is the relationship between observed test scores and a latent, unobserved variable. To demonstrate predictive validity, one can simply obtain scores on a measure and a criterion and compute the correlation between the two variables. The correlation coefficient shows the amount of predictive validity of the measure. However, because constructs are not observable, it is impossible to use simple correlations to examine construct validity.
The problem of construct validation can be illustrated with the development of IQ scores. IQ scores can have predictive validity (e.g., performance in graduate school) without making any claims about the construct that is being measured (IQ tests measure whatever they measure and what they measure predicts important outcomes). However, IQ tests are often treated as measures of intelligence. For IQ tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed IQ scores are related to unobserved variation in intelligence. Thus, construct validation requires clear definitions of constructs that are independent of the measure that is being validated. Without clear definitions of constructs, the meaning of a measure reverts essentially to “whatever the measure is measuring,” as in the old saying “Intelligence is whatever IQ tests are measuring.” This saying shows the problem of research with measures that have no clear construct and no construct validity.
In conclusion, the challenge in construct validation research is to relate a specific measure to a well-defined construct and to establish that variation in test scores is related to variation in the construct.
What are Constructs
Construct validation starts with an assumption. Individuals are assumed to have an attribute; today we may say a personality trait. Personality traits are typically not directly observable (e.g., kindness rather than height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). The first step is to develop a measure of this attribute (e.g., a self-report measure “How kind are you?”). If the test is valid, variation in the observed scores on the measure should be related to the personality trait.
A construct is some postulated attribute of people, assumed to be reflected in test performance (p. 283).
The term “reflected” is consistent with a latent variable model, where unobserved traits are reflected in observable indicators. In fact, Cronbach and Meehl argue that factor analysis (not principal component analysis!) provides very important information for construct validity.
We depart from Anastasi at two points. She writes, “The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration.” We, however, regard such analysis as a most important type of validation. (p. 286).
Factor analysis is useful because factors are unobserved variables and factor loadings show how strongly an observed measure is related to variation in an unobserved variable, the factor. If multiple measures of a construct are available, they should be positively correlated with each other and factor analysis will extract a common factor. For example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (Schimmack, 2010).
In conclusion, factor analysis provides useful information about construct validity of measures because factors represent the construct and factor loadings show how strongly an observed measure is related to the construct.
It is clear that factors here function as constructs (p. 287).
The term convergent validity was introduced a few years later in another seminal article on validation research by Campbell and Fiske (1959). However, the basic idea of convergent validity was specified by Cronbach and Meehl (1955) in the section “Correlation matrices and factor analysis.”
If two tests are presumed to measure the same construct, a correlation between them is predicted (p. 287).
If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated (p. 288).
Cronbach and Meehl realize the problem of using just two observed measures to examine convergent validity. For example, self-informant correlations are often used in personality psychology to demonstrate validity of self-ratings. However, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. The correlation could reflect very high validity of self-ratings and modest validity of informant ratings or the opposite could be true.
If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. (p. 300)
A multi-method approach avoids this problem and factor loadings on a common factor can be interpreted as validity coefficients. More valid measures should have higher loadings than less valid measures. Factor analysis requires a minimum of three observed variables, but more is better. Thus, construct validation requires a multi-method assessment.
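The difference between two and three measures can be made concrete with a small Python sketch. It assumes a single common factor, standardized measures, and uncorrelated residuals (that is, no shared method variance). Under these assumptions each correlation is the product of two loadings, so three correlations are just enough to solve for three loadings. The function name and the correlation values are hypothetical and chosen only for illustration.

```python
import math


def loadings_from_three(r12: float, r13: float, r23: float) -> tuple[float, float, float]:
    """Solve r_ij = loading_i * loading_j for three measures of one factor."""
    if min(r12, r13, r23) <= 0:
        raise ValueError("all three correlations must be positive")
    loading_1 = math.sqrt(r12 * r13 / r23)
    loading_2 = math.sqrt(r12 * r23 / r13)
    loading_3 = math.sqrt(r13 * r23 / r12)
    return loading_1, loading_2, loading_3


# With only two measures, r = .4 cannot separate the two validities:
# both pairs below imply exactly the same observed correlation.
for validity_self, validity_informant in [(0.8, 0.5), (0.5, 0.8)]:
    print(validity_self, validity_informant, round(validity_self * validity_informant, 2))

# Hypothetical correlations among a self-rating and two informant ratings.
r_self_inf_a, r_self_inf_b, r_inf_a_inf_b = 0.40, 0.40, 0.25
self_loading, inf_a_loading, inf_b_loading = loadings_from_three(
    r_self_inf_a, r_self_inf_b, r_inf_a_inf_b
)
print(round(self_loading, 2), round(inf_a_loading, 2), round(inf_b_loading, 2))
```

With these made-up correlations the third measure settles the ambiguity: the self-rating has a loading of about .8 and each informant rating a loading of about .5. Squaring a loading gives the proportion of variance in that measure that is attributable to the common factor.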
The term discriminant validity was also introduced later by Campbell and Fiske (1959). However, Cronbach and Meehl already point out that high or low correlations can support construct validity. Crucial for construct validity is that the correlations are consistent with theoretical expectations.
For example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. In contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem solving skills and intelligence is an ability to solve problems faster or better.
Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity (p. 288).
Quantifying Construct Validity
It is rare to see quantitative claims about construct validity. Most articles that claim construct validity of a measure simply state that the measure has demonstrated construct validity as if a test is either valid or invalid. However, the previous discussion already made it clear that construct validity is a quantitative construct because construct validity is the relation between variation in a measure and variation in the construct and this relation can vary. If we use standardized coefficients like factor loadings to assess the construct validity of a measure, construct validity can range from -1 to 1.
Contrary to the current practices, Cronbach and Meehl assumed that most users of measures would be interested in a “construct validity coefficient.”
There is an understandable tendency to seek a “construct validity coefficient.” A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis (p. 289).
Cronbach and Meehl are well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because the factor may not correspond perfectly with the construct.
Rarely will it be possible to estimate definite “construct saturations,” because no factor corresponding closely to the construct will be available (p. 289).
And nobody today seems to remember Cronbach and Meehl’s (1955) warning that rejection of the null hypothesis (that the test has zero validity) is not the end goal of validation research.
It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (p. 290).
The problem is not to conclude that the test “is valid” for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have (p. 290).
One reason why psychologists may not follow this sensible advice is that estimates of construct validity for many tests are likely to be low (Schimmack, 2010).
The Nomological Net – A Structural Equation Model
Some readers may be familiar with the term “nomological net” that was popularized by Cronbach and Meehl. In modern language a nomological net is essentially a structural equation model.
The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These “laws” may be statistical or deterministic.
It is probably no accident that at the same time as Cronbach and Meehl started to think about constructs as separate from observed measures, structural equation modeling was developed as a combination of factor analysis, which made it possible to relate observed variables to variation in unobserved constructs, and path analysis, which made it possible to relate variation in constructs to each other. Although laws in a nomological network can take on more complex forms than linear relationships, a structural equation model is a nomological network (but a nomological network is not necessarily a structural equation model).
As proper construct validation requires a multi-method approach and demonstration of convergent and discriminant validity, SEM is ideally suited to examine whether the observed correlations among measures in a multi-trait-multi-method matrix are consistent with theoretical expectations. In this regard, SEM is superior to factor analysis. For example, it is possible to model shared method variance, which is impossible with factor analysis.
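To see why shared method variance matters, consider a minimal Python sketch of the correlations implied by a model with one trait factor and one method factor. It assumes standardized measures and uncorrelated trait and method factors. All loadings are hypothetical values chosen for illustration, and the sketch only computes model-implied correlations; it does not fit a structural equation model.

```python
import math


def implied_r(trait_i: float, trait_j: float,
              method_i: float = 0.0, method_j: float = 0.0) -> float:
    """Model-implied correlation between two standardized measures.

    Each measure loads on a common trait factor. Measures that share a
    method also load on a method factor that is uncorrelated with the trait.
    """
    return trait_i * trait_j + method_i * method_j


# Hypothetical loadings (illustration only).
SELF_TRAIT = 0.8        # validity of the self-rating
INFORMANT_TRAIT = 0.5   # validity of each informant rating
INFORMANT_METHOD = 0.4  # rating bias shared by the two informants

r_self_informant = implied_r(SELF_TRAIT, INFORMANT_TRAIT)
r_informants = implied_r(INFORMANT_TRAIT, INFORMANT_TRAIT,
                         INFORMANT_METHOD, INFORMANT_METHOD)

# A single-factor solution treats the inflated informant-informant
# correlation as trait variance and misjudges the self-rating's validity.
naive_self_loading = math.sqrt(r_self_informant * r_self_informant / r_informants)

print(f"self-informant r:      {r_self_informant:.2f}")
print(f"informant-informant r: {r_informants:.2f}")
print(f"naive self loading:    {naive_self_loading:.2f} (true value {SELF_TRAIT})")
```

In this made-up case the shared bias raises the correlation between the two informants from .25 to about .41, and a model that ignores it puts the loading of the self-rating near .62 instead of .8. A model with an explicit method factor can separate the two sources of covariance, provided enough measures are available to identify it.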
Cronbach and Meehl also realize that constructs can change as more information becomes available. It may also occur that the data fail to provide evidence for a construct. In this sense, construct validation is an ongoing process of improved understanding of unobserved constructs and how they are related to observable measures.
Ideally this iterative process would start with a simple structural equation model that is fitted to some data. If the model does not fit, the model can be modified and tested with new data. Over time, the model would become more complex and more stable because core measures of constructs would establish the construct validity, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed.
When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network (p. 290).
Too often psychologists use SEM only to confirm an assumed nomological network and it is often considered inappropriate to change a nomological network to fit observed data. However, SEM is as much testing of an existing construct as exploration of a new construct.
The example from the natural sciences was the initial definition of gold as having a golden color. However, later it was discovered that the pure metal gold is actually silver or white and that the typical yellow color comes from copper impurities. In the same way, scientific constructs of intelligence can change depending on the data that are observed. For example, the original theory may assume that intelligence is a unidimensional construct (g), but empirical data could show that intelligence is multi-faceted with specific intelligences for specific domains.
However, given the lack of construct validation research in psychology, psychology has seen little progress in the understanding of basic constructs such as extraversion, self-esteem, or well-being. Often these constructs are still assessed with measures that were originally proposed as measures of these constructs, as if divine intervention led to the creation of the best measure of these constructs and future research only confirmed their superiority.
Instead, many claims about construct validity are based more on conjecture than on empirical support by means of nomological networks. This was true in 1955. Unfortunately, it is still true over 50 years later.
For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences (p. 291).
Given the difficulty of defining constructs and finding measures for them, even measures that show promise in the beginning might fail to demonstrate construct validity later, and new measures should show higher construct validity than the early measures. However, psychology shows no development in measures of the same construct. The most widely used measure of self-esteem is still Rosenberg’s scale from 1965 and the most widely used measure of well-being is still Diener et al.’s scale from 1985. It is not clear how psychology can make progress if it doesn’t make progress in the development of nomological networks that provide information about constructs and about the construct validity of measures.
Cronbach and Meehl are clear that nomological networks are needed to claim construct validity.
To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist (p. 291).
However, there are few attempts to examine construct validity with structural equation models (Connelly & Ones, 2010; Zou, Schimmack, & Gere, 2013). [please share more if you know some]
One possible reason is that construct validation research may reveal that authors’ initial constructs need to be modified or that their measures have modest validity. For example, McCrae, Zonderman, Costa, Bond, and Paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of Big Five measures because it failed to support their conception of the Big Five as orthogonal dimensions with simple structure.
Recommendations for Users of Psychological Measures
The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test (p. 296).
It is very unlikely that all hunches by psychologists lead to the discovery of useful constructs and development of valid tests of these constructs. Given the lack of knowledge about the mind, it is rather more likely that many constructs turn out to be non-existent and that measures have low construct validity.
However, the history of psychological measurement has only seen development of more and more constructs and more and more measures to measure this increasing universe of constructs. Since the 1990s, constructs have doubled because every construct has been split into an explicit and an implicit version of the construct. Presumably, there is even implicit political orientation or gender identity.
The proliferation of constructs and measures is not a sign of a healthy science. Rather it shows the inability of empirical studies to demonstrate that a measure is not valid or that a construct may not exist. This is mostly due to self-serving biases and motivated reasoning of test developers. The gains from a measure that is widely used are immense. Thus, weak evidence is used to claim that a measure is valid and consumers are complicit because they can use these measures to make new discoveries. Even when evidence shows that a measure may not work as intended (e.g., Bosson et al., 2000), it is often ignored (Greenwald & Farnham, 2001).
Just as psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures. Some of the measures that are currently used in psychology are likely to have insufficient construct validity. If this was the decade of replication, the 2020s may become the decade of validation, and maybe the 2030s may produce the first replicable studies with valid measures. Maybe this is overly optimistic, given the lack of improvement in validation research since Cronbach and Meehl (1955) outlined a program of construct validation research. Ample citations show that they were successful in introducing the term, but they failed in establishing rigorous criteria of construct validity. The time to change this is now.
The Implicit Association Test (IAT) is 21 years old. Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim. Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit personality characteristics and that a single-construct model fits multi-method data as well or better than a dual-construct model. Thus, the IAT appears to be a measure of the same personality characteristics that are measured with explicit measures. I also show that the validity of the IAT varies across personality characteristics. It has low validity as a measure of self-esteem, moderate validity as a measure of racial bias, and high validity as a measure of political orientation. The existing evidence also suggests that the IAT measures stable characteristics rather than states and has low predictive validity of single behaviors. Based on these findings, it is important that users of the IAT clearly distinguish between implicit measures and implicit constructs. The IAT is an implicit measure, but there is no evidence that it measures implicit constructs.
The Implicit Association Test at Age 21: No Evidence for Construct Validity
Twenty-one years ago, Greenwald, McGhee, and Schwartz (1998) published one of the most influential articles in personality and social psychology. It is already the 4th most cited article (4582 citations in Web of Science) in the Journal of Personality and Social Psychology and will be number 3 this year. As the title “Measuring Individual Differences in Implicit Cognition” suggests, the article introduced a new individual difference measure that has been used in hundreds of studies to measure attitudes, stereotypes, self-concepts, well-being, and personality traits. Henceforth, I will refer to these constructs as personality characteristics.
A Critical Evaluation of Greenwald et al.’s (1998) Evidence for Discriminant Validity
The Implicit Association Test (IAT) uses reaction times in classification tasks to measure individual differences in the strength of associations (Nosek et al., 2007). However, the main purpose of the IAT is not to measure associations or to provide an indirect measure of personality characteristics. The key constructs that the IAT was designed to measure are individual differences in implicit personality characteristics as suggested in the title of Greenwald et al.’s (1998) seminal article “Measuring Individual Differences in Implicit Cognition.”
The notion of implicit cognition is based on a conception of human information processing that largely takes place outside of consciousness, and the IAT was supposed to provide a window into the unconscious. “There has been an increased interest in measuring aspects of thinking and feeling that may not be easily accessed or available to consciousness. Innovations in measurement have been undertaken with the purpose of bringing under scrutiny new forms of cognition and emotion that were previously undiscovered” (Nosek, Greenwald, & Banaji, 2007, p. 265).
Thus, the IAT was not just a new way of measuring the same individual differences that were already measured with self-report measures. It was designed to measure information that is “simply unreachable, in the same way that memories are sometimes unreachable [by introspection]” (Nosek et al., 2007, p. 266).
The promise to measure individual differences that were not accessible to introspection explains the appeal of the IAT, and many articles used the IAT to make claims about individual differences in implicit forms of self-esteem, prejudice, or craving for drugs. Thus, the hypothesis that the IAT measures something different from self-report measures is a fundamental feature of the construct validity of the IAT. In psychometrics, the science of test validation, this property of a measure is known as discriminant validity (Campbell & Fiske, 1959). If the IAT is a measure of implicit individual differences that are different from explicit individual differences, the IAT should demonstrate discriminant validity from self-report measures. Given the popularity of the IAT, one might expect ample evidence for the discriminant validity of the IAT. However, due to methodological limitations this is actually not the case.
Confusion about Convergent and Discriminant Validity
Greenwald et al.’s seminal article promised a measure of
individual differences, but failed to provide evidence for the convergent or
discriminant validity of the IAT. Study
1 with N = 32 participants showed that, on average, participants preferred
flowers to insects and musical instruments to weapons. These average tendencies
cannot be used to validate the IAT as a measure of individual differences. However,
Greenwald et al. (1998) also reported correlations across N = 32 participants
between the IAT and explicit measures. These correlations were low. Greenwald et al. (1998) suggest that this
finding provides evidence of discriminant validity. “This conceptual divergence
between the implicit and explicit measures is of course expected from
theorization about implicit social cognition” (p. 1470). However, these low correlations are uninformative
because discriminant validity requires a multi-method approach. As the IAT was the only implicit measure, low
correlations with explicit measures may simply show that the IAT has low
validity as a measure of individual differences.
Study 2 used the IAT with 17 Korean and 15 Japanese
American students to assess their attitudes towards Koreans vs. Japanese. In this study, Greenwald et al. found “unexpectedly
the feeling thermometer explicit rating was more highly correlated with the IAT
measure (average r = .59) than it was with another explicit attitude measure,
the semantic differential (r = .43)” (p. 1473). This finding actually
contradicts the hypothesis that the IAT measures some construct that is not
measured with self-ratings because discriminant validity implies higher
same-method than cross-method correlations (Campbell & Fiske, 1959).
Study 3 introduced the race IAT to measure prejudice with
the IAT with a sample of 26 participants.
In this small sample, IAT scores were only weakly and not significantly
correlated with explicit measures. The
authors realize that this finding is open to multiple interpretations.
“Although these correlations provide no evidence for convergent validity of the
IAT, nevertheless because of the expectation that implicit and explicit
measures of attitude are not necessarily correlated-neither do they damage the
case for construct validity of the IAT” (p. 1476). In other words, the low correlations might
reflect discriminant validity, but they could also reflect low convergent validity
if the IAT and explicit measures measure the same construct.
The discussion has a section on “Discriminant Validity of
IAT Attitude Measures,” although the design of the studies makes it impossible
to provide evidence for discriminant validity. Nevertheless, Greenwald et al.
(1998) claimed that they provided evidence for the discriminant validity of the
IAT as a measure of implicit cognitions. “It is clear that these implicit-explicit
correlations should be taken not as evidence for convergence among different
methods of measuring attitudes but as evidence for divergence of the constructs
represented by implicit versus explicit attitude measures” (p. 1477). The scientific
interpretation of these correlations is that they provide no empirical evidence
about the validity of the IAT because multiple measures of a single construct
are needed to examine construct validity (Campbell & Fiske, 1959). Thus, unlike
most articles that introduce a new measure of individual differences, Greenwald
et al. (1998) did not examine the psychometric properties of the IAT. In this article, I examine whether research
over the past 21 years has provided evidence of construct validity of the IAT
as a measure of implicit personality characteristics.
First Problems for the Construct Validity of the IAT
The IAT was not the first implicit measure in social
psychology. Several implicit measures of self-esteem had already been
developed. A team of personality psychologists
conducted the first multi-method validation study of the IAT as a measure of
implicit self-esteem (Bosson, Swann, & Pennebaker, 2000). The main finding in this study was that
several implicit measures, including the IAT, had low convergent validity. However, this finding has been largely
ignored and researchers started using the self-esteem IAT as a measure of some
implicit form of self-esteem that operates outside of conscious awareness (Greenwald
& Farnham, 2000).
At the same time, attitude researchers also found weak
correlations between the race IAT and other implicit measures of prejudice.
However, this lack of convergent validity was also ignored. An influential review article by Fazio and
Olson (2003) suggested that low correlations might be due to different
mechanisms. While it is entirely possible that evaluative priming and the IAT
have different mechanisms, it is not relevant for the ability of either measure
to be a valid measure of personality characteristics. Explicit ratings probably
also rely on a different mechanism than the IAT.
The mechanics of measurement have to be separated from the constructs
that the measures aim to measure.
Continued Confusion about Discriminant Validity
Nosek et al. (2007) examined evidence for the construct
validity of the IAT at age 7. The
section on convergent and discriminant validity lists a few studies as evidence
for discriminant validity. However, closer
inspection of these studies shows that they suffer from the same methodological
limitation as Greenwald et al.’s (1998) seminal study. That is, constructs were assessed with a
single implicit method, the IAT. Thus,
it was impossible to examine construct validity of the IAT as a measure of
implicit personality characteristics.
Take Nosek and Smyth’s (2007) “A Multi-trait-multi-method validation
of the Implicit Association Test” as an example. The title clearly alludes to
Campbell and Fiske’s approach to construct validation. The data were 7 explicit ratings and 7 IATs
of 7 attitude pairs (e.g., flower vs. insect). The authors fitted several structural equation
models to the data and claimed that a model with separate, yet correlated, explicit
and implicit factors fitted the data better than a model with a single factor
for each attitude pair. This claim is invalid
because each attitude pair was assessed with a single IAT and parcels were used
to correct for unreliability. This
measurement model assumes that all of the reliable variance in an IAT that is
not shared with explicit ratings or with IATs of other attitudes reflects
implicit individual differences. However, it is also possible that this
variance reflects systematic measurement error that is unique to a specific
IAT. A proper multi-method approach
requires multiple independent measures of the same construct. As
demonstrated with real multi-method data below, there is consistent evidence
that the IAT has systematic method variance that is unique to a specific IAT.
Nevertheless, Nosek and Smyth’s (2007) multi-attitude study
provided some interesting information. The correlation of the 7 means of the
IAT and the 7 means of the explicit ratings was r = .86. For example, implicit
and explicit measures showed a preference for flowers over insects and a
dislike of evolution versus creation. If
implicit measures reflect distinct, unconscious processes, it is not clear why
the means correspond to those based on self-reports. However, this finding is
easily explained by a single-attitude model, where the mean structure depends
on the mean structure of the latent attitude variable.
In sum, Nosek et al.’s claim that the IAT has demonstrated discriminant
validity is based on a misunderstanding of Campbell and Fiske’s (1959) approach
to construct validation. A proper assessment of construct validity requires demonstration
of convergent validity before it is possible to demonstrate discriminant
validity, and to demonstrate convergent validity it is necessary to use
multiple independent measures of the same construct. Thus, to demonstrate construct validity of
the IAT as a measure of implicit personality characteristics requires multiple
independent implicit measures.
First Evidence of Discriminant Validity in a Multi-Method Study
Cunningham, Preacher, and Banaji (2001) reported the results
of the first multi-method study of prejudice. Participants were 93 students
with complete data. Each student completed a single explicit measure of
prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit
measures: (a) the standard race IAT (Greenwald et al., 1998), (b) a response-window
IAT (Cunningham et al., 2001), and (c) a response-window evaluative priming task (Fazio
et al., 1986). The assessment was repeated on four occasions two weeks apart.
I used the published correlation matrix to reexamine the
claim that a single-factor model does not fit the data. First, I was able to reproduce
the model fit of the published dual-attitude model with Mplus 8.2 (original fit:
chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041 (90%
confidence interval: 0.00, 0.071); reproduced fit: chi2(100, N = 93) = 112, CFI
= .977, RMSEA = 0.036, 90%CI = .000 to .067).
Thus, the model fit of the reproduced model serves as a comparison
standard for the alternative models that I examined next.
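As a rough check on fit statistics like these, the RMSEA point estimate can be recomputed from the chi-square value, the degrees of freedom, and the sample size. The sketch below assumes one common formula, sqrt(max(chi2 - df, 0) / (df * (N - 1))). Some programs divide by N rather than N - 1, and the chi-square values in the text are rounded, so small discrepancies from published values are to be expected. The function name is illustrative and is not part of Mplus.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA from a model chi-square test.

    Assumes the formula sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    Some software divides by n instead of n - 1, which can change
    the third decimal in small samples.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Reproduced dual-attitude model from the text: chi2(100, N = 93) = 112
print(round(rmsea(112, 100, 93), 3))
```

A model whose chi-square is smaller than its degrees of freedom gets an RMSEA of zero under this formula, which is why the lower bounds of the confidence intervals reported here are .000.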
The original model is a hierarchical model with an implicit
attitude factor as a second-order factor, and method-specific first-order
factors. Each first-order factor has four indicators for four repeated
measurements with the same method. This model
imposes constraints on the first-order loadings because they contribute to the
first-order relations among indicators of the same method and to the second-order
relations of different implicit methods to each other.
An alternative way to model multi-method data is a bi-factor
model (Chen, West, & Sousa, 2006). A bi-factor model allows for all measures
to be directly related to the general trait factor that corresponds to the second-order
factor in a hierarchical model. However,
bi-factor models may not be identified if there are no method factors. Thus, a
first step is to allow for method-specific correlated residuals and to examine whether
these correlations are positive.
The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065. Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures. The response-window IAT had no significant residual correlations. This explains the high factor loading of the response-window IAT in the hierarchical model. It does not suggest that this is the most valid measure. Rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062. I then tried to create method factors for the remaining methods. For the IAT, a method factor could only be created for the first three occasions. However, model fit for this model decreased unless occasion 2 was allowed to correlate with occasion 4. This unexpected finding is unlikely to reflect a real relationship. Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064. I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065, and the first occasion did not load on the method factor. Model fit could be improved by fixing the loading to zero and by allowing for an additional correlation between occasions 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062. However, there is no rationale for this relationship and I retained the more parsimonious model. Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068.
This was the final model (Figure 1).
The most important results are the factor loadings of the
measures on the trait factor. Factor loadings for the Modern Racism Scale
ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged
from .43 to .54 (M = .47). Factor loadings for the response window IAT ranged
from .41 to .69 (M = .51). The
evaluative priming measures had the lowest factor loadings ranging from .13 to
.47 (M = .29). Thus, there is no
evidence that implicit measures are more strongly related to each other than to
explicit measures, as stated in the original article.
In terms of absolute validity, all of these validity
coefficients are low, suggesting that a single standard IAT measure on a single
occasion has .47^2 = 22% valid variance.
Most important, these results suggest that the Modern Racism Scale and
the IAT measure a single construct and that the low correlation between
implicit and explicit measures reflects low convergent validity rather than high discriminant validity.
In conclusion, a reexamination of Cunningham et al.’s data
shows that the data do not provide evidence of discriminant validity and that
the IAT may simply be an alternative measure of the same construct that is
being measured with explicit measures like the Modern Racism Scale. Thus, the
study provides no evidence for the construct validity of the IAT as a measure
of implicit individual differences in race attitudes.
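The arithmetic behind these validity estimates can be sketched in a few lines. The example squares the mean loadings reported above to obtain the share of valid variance. It also multiplies two loadings to obtain the correlation that a single-factor model would imply between two measures, under the simplifying assumption that the measures share nothing but the trait factor (no correlated residuals or shared method variance). The function names are illustrative.

```python
# Mean standardized loadings on the single trait factor, as reported in the text.
MEAN_LOADINGS = {
    "Modern Racism Scale": 0.40,
    "standard IAT": 0.47,
    "response-window IAT": 0.51,
    "evaluative priming": 0.29,
}


def valid_variance(loading: float) -> float:
    """Proportion of observed variance explained by the trait factor."""
    return loading ** 2


def implied_correlation(loading_a: float, loading_b: float) -> float:
    """Correlation implied between two measures that share only the trait factor."""
    return loading_a * loading_b


for name, loading in MEAN_LOADINGS.items():
    print(f"{name}: {valid_variance(loading):.0%} valid variance")

# Correlation implied between a single Modern Racism Scale score and a single IAT score
print(round(implied_correlation(0.40, 0.47), 2))
```

Under these assumptions, two measures of the same construct with loadings around .4 to .5 are expected to correlate only about .2 with each other, so a low observed implicit-explicit correlation does not by itself indicate two distinct constructs.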
Meta-Analysis of Implicit – Explicit Correlations
Hofmann, Gawronski, Gschwendner, Le, and Schmitt (2005) conducted a
meta-analysis of 126 studies that had reported correlations between an IAT and
an explicit measure of the same construct. Notably, over one hundred studies had
been conducted without using multiple implicit measures. The mono-method
approach taken in these studies suggests that authors took construct validity
of the IAT for granted, and used the IAT as a measure of implicit constructs. As a result, these studies provide no test of
the construct validity of the IAT.
Nevertheless, the meta-analysis produced an interesting
result. Correlations between implicit
and explicit measures varied across personality characteristics. Correlations were lowest for self-esteem,
which is consistent with Bosson et al.’s (2000) finding, and highest for simple
attitude objects like consumer products (e.g. Pepsi vs. Coke). Any theory of implicit attitude measures has
to explain this finding. One explanation
could be that explicit measures of self-esteem are less valid than explicit measures
of preferences for consumer goods. However, it is also possible that the validity
of the IAT varies. Once more, a
comparison of different personality characteristics with multiple methods is
needed to test these competing explanations.
Problems with Predictive Validity
Ten years after the IAT was published, another problem
emerged. Some critics voiced concerns
that the IAT, especially the race IAT, lacks predictive validity (Blanton,
Jaccard, Klick, Mellers, Mitchell, & Tetlock, 2009). To examine the predictive validity of the
IAT, Greenwald and colleagues (2009) published a meta-analysis of IAT-criterion
correlations. The key finding was that “for 32 samples with criterion measures
involving Black–White interracial behavior, predictive validity of IAT measures
significantly exceeded that of self-report measures” (p. 17). Specifically, the authors reported a
correlation of r = .24 for the IAT
and a criterion and a correlation of r
= .12 for an explicit measure and a criterion, and that these correlations were
significantly different from each other.
A few years later, Oswald, Mitchell, Blanton, Jaccard, and Tetlock
(2013) published a critical reexamination of the literature and reported
different results. “IATs were poor predictors of every criterion category other
than brain activity, and the IATs performed no better than simple explicit
measures” (p. 171). The only exceptions
were fMRI studies with extremely small samples that produced extremely large
correlations, often exceeding the reliability of the IAT. It is well known that these correlations are
inflated and difficult to replicate (Vul, Harris, Winkielman, & Pashler,
2009). Moreover, correlations with
neural activity are not evidence that IAT scores predict behavior.
More recently, Greenwald and colleagues published a new
meta-analysis (Kurdi et al., 2018). This meta-analysis produced weaker
criterion correlations than the previous meta-analysis. The median IAT-criterion correlation was r = .050. This is also true if the analysis is limited
to studies with the race IAT. After correcting
for random measurement error, the authors report an average correlation of r = .14.
However, correction for unreliability yields hypothetical correlations
that could be obtained if the IAT were perfectly reliable, which it is not. Thus,
for the practical evaluation of the IAT as a measure of individual differences,
it is more important how much the actual IAT scores can predict some validation
criterion. With small IAT-criterion
correlations around r = .1, large
samples would be required to have sufficient power to detect effects,
especially incremental effects above and beyond explicit measures. Given that
most studies had sample sizes of less than 100 participants, “most studies were
vastly underpowered” (Kurdi et al., 2018, p. 1). Thus, it is now clear that IAT
scores have low predictive validity, but it is not clear whether IAT scores
have any predictive validity, when they have predictive validity, and whether
they have predictive validity after controlling for explicit predictors of behavior.
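Two points in this argument can be made concrete with a short sketch. The first function applies the classical correction for attenuation, which shows why corrected correlations exceed the observed ones. The second approximates the power of a two-sided test of a correlation using the Fisher z transformation. The reliabilities in the example are hypothetical values chosen for illustration and are not taken from Kurdi et al. (2018). The Fisher z approach is an approximation, not the exact test.

```python
import math
from statistics import NormalDist


def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Classical correction for attenuation: r / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)


def power_correlation(r: float, n: int, alpha: float = 0.05) -> float:
    """Approximate two-sided power to detect a population correlation r
    with n participants, based on the Fisher z transformation."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    noncentrality = math.atanh(r) * math.sqrt(n - 3)
    return (1 - norm.cdf(z_crit - noncentrality)) + norm.cdf(-z_crit - noncentrality)


# Hypothetical reliabilities (.5 for the predictor, .8 for the criterion)
print(round(disattenuate(0.10, 0.5, 0.8), 3))

# Power to detect an observed correlation of r = .1 with N = 100
print(round(power_correlation(0.10, 100), 2))
```

The corrected value describes what would be observed with error-free measures, whereas the power calculation applies to the scores researchers actually have. With r = .1 and N = 100, power under this approximation falls well below the conventional 80% target.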
Greenwald et al.’s (2009) 2008 US Election Study
In 2008, a historic event occurred in the United States. US
voters had the opportunity to elect the first Black president. Although the
outcome is now a historic fact, it was uncertain before the election how much
Barack Obama’s racial background would influence White voters. There was also considerable concern that
voters might not reveal their true feelings. This provided a great opportunity
to test the validity of implicit measures of racial bias. If White voters are influenced by racial
bias, IAT scores should predict voting intentions above and beyond explicit
measures. According to the abstract of the article, the results confirm this
prediction. “The implicit race attitude measures (Implicit Association Test and
Affect Misattribution Procedure) predicted vote choice independently of the
self-report race attitude measures, and also independently of political
conservatism and symbolic racism. These findings support construct validity of
the implicit measures” (p. 242).
These claims were based on results of multiple regression
analyses. “When entered after the self-report measures, the two implicit
measures incrementally explained 2.1% of vote intention variance, p=.001,” and
when political conservatism was also included in the model, “the pair of
implicit measures incrementally predicted only 0.6% of voting intention
variance, p = .05” (p. 247).
I tried to reproduce these results with the published
correlation matrix and failed to do so.
A multiple regression analysis with explicit measures, implicit
measures, and political orientation as predictors showed non-significant
effects for the IAT, b = .002, se = .024, t = .087, p = .930 and the AMP, b =
.033, se = .023, t = 1.470, p = .142. I also obtained the raw data from Anthony
Greenwald, but I was unable to recreate the sample size of N = 1,057. Instead I
obtained a similar sample size of N = 1,035. Performing the analysis on this sample also
produced non-significant results; IAT, b = -.003, se = .044, t = .070, p = .944
and the AMP, b = -.014, se = .042, t = 0.344, p = .731.
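A reanalysis from a published correlation matrix works because standardized regression weights and R-squared depend only on the correlations. The sketch below shows the general technique: it solves Rxx * beta = rxy and computes the incremental R-squared of one predictor. The correlation values are made up for illustration and are not the values from Greenwald et al. (2009). The helper names are hypothetical, and significance tests, which would also require the sample size, are omitted.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the row with the largest entry into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def standardized_betas(rxx, rxy, idx):
    """Standardized regression weights for the predictors listed in idx."""
    sub = [[rxx[i][j] for j in idx] for i in idx]
    return solve(sub, [rxy[i] for i in idx])


def r_squared(rxx, rxy, idx):
    """R-squared of the criterion regressed on the predictors in idx."""
    betas = standardized_betas(rxx, rxy, idx)
    return sum(b * rxy[i] for b, i in zip(betas, idx))


# Hypothetical correlations. Predictors: 0 = explicit measure,
# 1 = implicit measure, 2 = political conservatism.
RXX = [[1.0, 0.3, 0.5],
       [0.3, 1.0, 0.2],
       [0.5, 0.2, 1.0]]
RXY = [0.5, 0.2, 0.7]  # correlations of each predictor with vote intention

full = r_squared(RXX, RXY, [0, 1, 2])
reduced = r_squared(RXX, RXY, [0, 2])
print(f"incremental R^2 of the implicit measure: {full - reduced:.4f}")
```

In this made-up example the implicit measure correlates .2 with the criterion on its own, yet adds almost nothing once the other predictors are in the model. This is the kind of pattern an incremental validity analysis is meant to detect.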
To fully explore the relationship among the variables in this valuable dataset, I fitted a structural equation model to the raw data (N = 1,035). The model had good fit, chi2(9) = 18.27, CFI = .995, RMSEA = .032, 90%CI = .009 to .052. As shown in Figure 2, the IAT did not have incremental predictive validity, as its residual variance was unrelated to voting. There is also no evidence of discriminant validity because the residuals of the two measures are not correlated. However, the model does show that a pro-White bias predicts voting above and beyond political orientation. Thus, the results do support the hypothesis that racial bias influenced voting in the 2008 election. This bias is reflected in explicit and implicit measures. Interestingly, the validity coefficients in this study differ from those in Cunningham et al.’s study with undergraduate students. The factor loadings suggest that the IAT is the most valid measure of racial bias, with .59^2 = 35% valid variance as a measure of explicit attitudes. This makes the IAT as valid as the feeling thermometer, which is more valid than the Modern Racism Scale in Cunningham et al.’s study. This finding has been replicated in subsequent studies (Axt, 2018).
In conclusion, a reexamination of the 2008 election study shows
that the data are entirely consistent with a single-attitude model and that
there is no evidence for incremental predictive validity or discriminant
validity in these data. However, the study does show some predictive validity
of the IAT and convergent validity with explicit measures. Thus, the results
provide no evidence of construct validity for the IAT as a measure of implicit individual differences,
but the results can also be interpreted as evidence for validity as a measure
of the same construct that is measured with explicit measures. This shows that claims about validity vary as
a function of the construct that is being measured. A scale is a good measure of weight, but not
of intelligence. The results here
suggest that the race IAT is a moderately valid measure of racial bias, but an
invalid measure of implicit bias. Whether implicit bias exists at all remains an open
question, because scientific claims about implicit bias require valid measures of implicit bias.
Reexamining a Multi-Trait Multi-Method Study
The most recent and extensive multi-trait multi-method
validation study of the IAT was published last year (Bar-Anan & Vianello,
2018). The abstract claims that the
results provide clear support for the validity of the IAT as a measure of
implicit cognitions, including implicit self-esteem. “The evidence supports the
dual-attitude perspective, bolsters the validation of 6 indirect measures, and
clears doubts from countless previous studies that used only one indirect
measure to draw conclusions about implicit attitudes” (p. 1264).
Below I show that these claims are not supported by the
data, and that single-attitude models fit the data as well as dual-attitude
models. I also show that dual-attitude models show low convergent validity
across implicit measures, while IAT variants share method variance because they
rely on the same mechanisms to measure attitudes.
Bar-Anan and Vianello (2018) fitted a single model to measures
of self-esteem, racial bias, and political orientation. This makes the model
extremely complex and produced some questionable results (e.g., the implicit
and explicit method factors were highly correlated; some measures had negative
loadings on the method factors). In
structural equation modeling, it is good practice to fit smaller models before
creating a larger model. Thus, I first examined
construct validity for each domain separately before I integrated the
domain-specific models into a single unified model.
I first fitted a dual-attitude model to measures of racial attitudes and included contact as the criterion variable. I did not specify a causal relationship between contact and attitudes because attitudes can influence contact and vice versa. The dual-attitude model had good fit, chi2(48) = 109.41; CFI = .975; RMSEA = 0.010 (90% confidence interval: 0.007, 0.012). The best indicator of the explicit factor was the preference rating (Figure 3). The best indicator of the implicit factor was the BIAT. However, all IAT-variants had moderate to high loadings on the implicit factor. In contrast, the evaluative priming measure had a low loading on the implicit factor and the AMP had a moderate loading on the explicit factor and no significant loading on the implicit factor. These results show that Bar-Anan and Vianello’s model failed to distinguish between IAT-specific method variance and method variance for implicit measures in general. The present results show that IAT-variants share little valid variance or method variance with conceptually distinct implicit measures.
Not surprisingly, a single-attitude model with an IAT method factor (Figure 4) also fit the data well, chi2(46) = 112.04; CFI = .973; RMSEA = 0.010 (90% confidence interval: 0.008, 0.013). Importantly, the model has no shared method variance between conceptually different explicit measures like preference ratings and the Modern Racism Scale (MRS). The AMP and the EP are both valid measures of attitudes, but with relatively modest validity. The BIAT has a validity of .46, with 22% explained variance. This result is more consistent with Cunningham et al.’s (2001) data than with Greenwald et al.’s (2009) data. The model also shows a clear relationship between contact and less pro-White bias. Finally, the model shows that the IAT method factor is unrelated to contact. Thus, any relationship between IAT scores and contact is explained by the shared variance with explicit measures.
These results show that Bar-Anan and Vianello’s (2018)
conclusions are not supported by the data. Although a dual-attitude model can be
fitted to the data, it shows low convergent validity across different implicit
measures, and a single-attitude model fits the data as well as a dual-attitude model.
Figure 5 shows the dual-attitude model for political orientation. The explicit factor is defined by a simple rating of preference for Republicans versus Democrats, the Modern Racism Scale, the Right-Wing Authoritarianism scale, and ratings of Hillary Clinton. The implicit factor is defined by the IAT, the brief IAT, the Go-NoGo Task, and single-category IATs. The remaining two implicit measures, the Affect Misattribution Procedure and Evaluative Priming, are allowed to load on both factors. Voting in the previous election is predicted by explicit attitudes. The model has good fit to the data, chi2(48) = 99.34; CFI = .991; RMSEA = 0.009 (90% confidence interval: 0.006, 0.011). The loading pattern shows that the AMP and EP load on the implicit factor. This supports the hypothesis that all implicit measures have convergent validity. However, the loadings for the IATs are much higher. In the dual-attitude framework this would imply that the IAT is a much more valid measure of implicit attitudes than the AMP or EP. Evidence for discriminant validity is weak. The correlation between the explicit and the implicit factor is r = .89. The correlation in the original article was r = .91. Nevertheless, the authors concluded that the data favor the two-factor model because constraining the correlation to 1 reduced model fit.
However, it is possible to fit a single-construct model by allowing for an IAT-variant method factor, chi2(50) = 86.25; CFI = .993; RMSEA = 0.007 (90% confidence interval: 0.005, 0.010). This model (Figure 6) shows that voting is predicted by a single latent factor that represents political orientation and that simple self-report measures of political orientation are the most valid measures of political orientation. The IAT shows stronger correlations with explicit measures because it is a more valid measure of political orientation, .74^2 = 55% valid variance, than the race IAT (22% valid variance).
Figure 7 shows the results for a dual-attitude model of
self-esteem. Model fit was good,
although CFI was lower than in the previous model due to weaker factor
loadings, chi2(16) = 28.62; CFI = .950; RMSEA = 0.008 (90% confidence interval:
0.003, 0.013). The model showed a
moderate correlation between the explicit and implicit factors, r = .46, which
is stronger than in the original article, r = .29, but clearly suggestive of
two distinct factors. However, the nature of these two factors is less clear.
The implicit factor is defined by the three IAT measures, whereas the AMP and
EP have very low loadings on this factor.
This is also true in the original article with loadings of .24 for AMP
and .13 for EP. Thus, the results
confirm Bosson et al.’s (2000) seminal finding that different implicit measures have low convergent validity.
As the implicit factor was mostly defined by the IAT measures, it was also possible to fit a single-factor model with an IAT method factor (Figure 8), chi2(16) = 31.50; CFI = .938; RMSEA = 0.009 (90% confidence interval: 0.004, 0.013). However, some of the results of this model are surprising.
According to this model, the validity coefficient of the widely used Rosenberg self-esteem scale is only r = .35, suggesting that only 12% of the variance in the Rosenberg self-esteem scale is valid variance. In addition, the IAT and the BIAT would be equally valid measures of self-esteem. Thus, previous results of low implicit-explicit correlations for self-esteem (Bosson et al., 2000; Hofmann et al., 2005) would imply low validity of implicit and explicit measures. This finding would have dramatic implications for the interpretation of low self-esteem-criterion correlations. A valid self-esteem-criterion correlation of r = .3 would produce an observed correlation of only r = .30*.35 = .11 with the Rosenberg self-esteem scale or the IAT. Correlations of this magnitude require large samples (N = 782) to have an 80% probability of obtaining a significant result with alpha = .05, or N = 1,325 with alpha = .005. Thus, most studies that tried to predict performance criteria from self-esteem were underpowered. However, the results of this study are limited by the use of an online sample and the lack of proper criterion variables to examine predictive validity. The main conclusion from this analysis is that a single-factor model with an IAT method factor fit the data well and that the dual-attitude model failed to demonstrate convergent validity across different implicit measures, a finding that replicates Bosson et al. (2000), which Bar-Anan and Vianello do not cite.
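The sample sizes quoted in this paragraph can be approximated with the Fisher z formula for the power of a correlation test. The sketch below is an approximation, not the exact procedure used by power software, so its results may differ from the figures in the text by a participant or so. Note that the quoted sample sizes correspond to a correlation of about r = .10, and the function name is illustrative.

```python
import math
from statistics import NormalDist


def required_n(r: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate sample size for a two-sided test of a correlation r,
    using the Fisher z transformation: ((z_alpha + z_power) / atanh(r))^2 + 3."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_power = norm.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / math.atanh(r)) ** 2 + 3)


# Observed correlation implied by a true correlation of .30 and a validity of .35
observed_r = 0.30 * 0.35
print(round(observed_r, 3))

# Sample sizes for r = .10 at the two significance criteria discussed in the text
print(required_n(0.10, alpha=0.05))
print(required_n(0.10, alpha=0.005))
```

Because the required sample size grows with the inverse square of the effect size, small changes in the assumed correlation matter a great deal: using the unrounded value of .105 instead of .10 lowers the required N noticeably.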
A Unified Model
After establishing well-fitting models for each personality
characteristic, it is possible to fit a unified model. Importantly, no changes
to the individual models should be made because a decrease in fit can be attributed
to the new relationships across different personality characteristics. Without any additional modifications, the
overall model in Figure 9 had good fit, XX. The IAT method factor for race
showed significant positive correlations with the method factors for
self-esteem (r = .4) and political orientation (r = .2), but the method
factors for self-esteem and political orientation were negatively
correlated (r = -.3). This pattern of
correlations is inconsistent with a simple method factor that is expected to
produce positive correlations. Thus, it is impossible to fit a general method
factor to different IATs. This finding replicates Nosek and Smyth’s (2007)
Correlations among the personality characteristics replicate
the finding with Greenwald et al.’s (2009) data that Republicans are more
likely to have a pro-white bias, r = .4.
Political orientation is unrelated to self-esteem, r = .0, but Pro-White
bias tends to be positively related to self-esteem, r = .2.
In conclusion, the present results show that Bar-Anan and Vianello’s
claims are not supported by the data.
Their data do not provide clear evidence for discriminant validity of
implicit and explicit constructs. The
data are fully consistent with the alternative hypothesis that the IAT and
other implicit measures measure the same construct that is being measured with
explicit measures. Thus, the data provide no support for the construct validity
of the IAT as a measure of implicit personality characteristics.
Validity of the Self-Esteem IAT
Bosson et al.’s (2000) seminal article raised the first concerns
about the construct validity of the self-esteem IAT. Since then, other critical
articles have been published; none of which are cited in Kurdi et al. (2018).
Gawronski, LeBel, and Peters (2007) wrote a PoPS article on the construct
validity of implicit self-esteem. They found no conclusive evidence that (a) the
self-esteem IAT measures unconscious self-esteem or that (b) low correlations
are due to self-report biases in explicit measures of self-esteem. Walker and
Schimmack (2008) used informant ratings to examine predictive validity of the
self-esteem IAT. Informant ratings are the most widely used validation
criterion in personality research, but have not been used by social psychologists.
One advantage of informant ratings is that they also measure general
personality characteristics rather than specific behaviors, which ensures
higher construct-criterion correlations due to the power of aggregation (Epstein,
1980). Walker and Schimmack (2008) found
that informant ratings of well-being were more strongly correlated with
explicit self-ratings of well-being than with a happiness or a self-esteem IAT.
The most recent and extensive review was conducted by Falk
and Heine (2014) who found that “the validity evidence for the IAT in measuring
ISE [implicit self-esteem] is strikingly weak” (p. 6). They also point out that implicit measures of
self-esteem “show a remarkably consistent lack of predictive validity” (p.
6). Thus, an unbiased assessment of the
evidence is fully consistent with the analyses of Bar-Anan and Vianello’s data
that also found low validity of the self-esteem IAT as a measure of self-esteem.
Currently, a study by Falk, Heine, Takemura, Zhang, and Hsu
(2013) provides the most comprehensive examination of convergent and
discriminant validity of self-esteem measures. I therefore used structural
equation modeling of their data to see how consistent the data are with a
dual-attitude model or a single-attitude model.
The biggest advantage of the study was the inclusion of informant ratings
of self-esteem, which makes it possible to model method-variance in
self-ratings (Anusic et al., 2009). Previous
research showed that self-ratings of self-esteem have convergent validity with informant
ratings of self-esteem (Simms, Zelazny, Yam, & Gros, 2010; Walker &
Schimmack, 2008). I also included the self-report
measures of positive affect and negative affect to examine criterion validity.
It was possible to fit a single-factor model to the data (Figure 10), chi2(67) = 115.85; CFI = .964; RMSEA = 0.050 (90% confidence interval: 0.034, 0.065). Factor loadings show the highest loadings for self-ratings on the self-competence scale and the Rosenberg self-esteem scale. However, informant ratings also had significant loadings on the self-esteem factor, as did self-ratings on the Narcissistic Personality Inventory. A measure of halo bias in self-ratings of personality (SEL) also had moderate loadings, which confirms previous findings that self-esteem is related to evaluative biases in personality ratings (Anusic et al., 2009). The false uniqueness measure (FU; Falk et al., 2015) had modest validity. In contrast, the implicit measures had no significant loadings on this factor. In addition, the residual correlations among the implicit measures were weak and not significant. Given the lack of positive relations among implicit measures, it was impossible to fit a dual-attitude model to these data.
It is not clear why Bar-Anan and Vianello’s data failed to
show higher validity of explicit measures, but the current results are
consistent with moderate validity of explicit self-ratings in the personality
literature (Simms et al., 2010). Thus, there is consistent evidence that implicit
self-esteem measures have low validity as measures of self-esteem and there is
no evidence that they are measures of implicit self-esteem.
Explaining Variability in Explicit-Implicit Correlations
One well-established phenomenon in the literature is that
correlations between IAT scores and explicit measures vary across domains
(Bar-Anan & Vianello, 2018; Hofmann et al., 2005). As shown earlier, correlations for political
orientation are strong, correlations for racial attitudes are moderate, and
correlations for self-esteem are weak. Greenwald
and Banaji (2017) offer a dual-attitude explanation for this finding. “The
plausible interpretations of the more common pattern of weak implicit–explicit
correlations are that (a) implicit and explicit measures tap distinct
constructs or (b) they might be affected differently by situational influences in
the research situation (cf. Fazio & Towles-Schwen, 1999; Greenwald et al.,
2002) or (c) at least one of the measures, plausibly the self-report measure in
many of these cases, lacks validity” (p. 868).
The evidence presented here offers a different explanation. IAT-explicit correlations and IAT-criterion
correlations increase with the validity of the IAT as a measure of the same
personality characteristics that are measured with explicit measures. Thus, low correlations of the self-esteem IAT
with explicit measures of self-esteem show low validity of the self-esteem IAT.
High correlations of the political
orientation IAT with explicit measures of political orientation show high
validity of the IAT as a measure of political orientation; not implicit
political orientation. Finally, modest
correlations between the race IAT and explicit measures of racial bias show
moderate validity of the race IAT as a measure of racial bias. However, the
validity of the race IAT as a measure of racial bias (not implicit racial
bias!) varies considerably across studies. This variation may be due to the variability
of racial bias in samples, which may be lower in student samples. Thus, contrary to Greenwald and Banaji’s
claims, the problem is not with the explicit measures, but with the IAT.
An important question is why the self-esteem IAT is less
valid than the political orientation IAT.
I propose that one cause of variation in the validity of the IAT is
related to the proportion of respondents on the two ends of a personality
characteristic. To test this hypothesis, I used Bar-Anan and Vianello’s
data. To determine the direction of the
IAT score, I used a value of 0 as the neutral point. As predicted, 90% of participants associated
self with good, 78% associated White with good, and 69% associated Democrat with
good. Thus, validity decreases with the proportion
of participants who are on one side of the bipolar dimension.
Next, I regressed the preference measure on a simple
dichotomous predictor that coded the direction of the IAT. I standardized the preference measure and
report standardized and unstandardized regression coefficients. Standardized regression coefficients are
influenced by the distribution of the predictor variable and should show the
expected pattern. In contrast, unstandardized coefficients are not sensitive to
the proportions and should not show the pattern. I also added the IAT scores as
predictors in a second step to examine the incremental predictive validity that
is provided by the reaction times.
The standardized coefficients are consistent with
predictions (Table 1). However, the unstandardized coefficients also show the
same pattern. Thus, other factors also play a role. The amount of incremental
explained variance by reaction times shows no differences between the race and
the political orientation task. Most of
the differences in validity are due to the direction of the attitude (4% explained
variance for race bias vs. 38% explained variance for political orientation).
SE: B = .310, SE = .142; b = .093, se = .043; r2 = .009, Δr2 = .002, z = 1.09
Race: B = .467, SE = .010; b = .193, se = .041; r2 = .041, Δr2 = .060, z = 5.79
PO: B = 1.380, SE = .080; b = .637, se = .037; r2 = .380, Δr2 = .070, z = 7.83
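The link between the two sets of coefficients in Table 1 can be checked with a short sketch. It assumes that B is the unstandardized coefficient for the 0/1 direction code, that b is the standardized coefficient, and that the outcome was standardized, so that b equals B times the standard deviation of the dichotomous predictor, sqrt(p(1 - p)). The reading of the columns is an inference from the numbers, and the function name is hypothetical.

```python
import math


def standardized_coefficient(unstandardized_b: float, proportion: float) -> float:
    """Standardized slope for a 0/1 predictor when the outcome has SD = 1.

    The SD of a dichotomous predictor with proportion p coded 1 is sqrt(p * (1 - p)).
    """
    sd_predictor = math.sqrt(proportion * (1 - proportion))
    return unstandardized_b * sd_predictor


# (unstandardized B from Table 1, proportion on the majority side from the text)
rows = {
    "SE": (0.310, 0.90),    # reported b = .093
    "Race": (0.467, 0.78),  # reported b = .193
    "PO": (1.380, 0.69),    # reported b = .637
}

for label, (b_unstd, p) in rows.items():
    print(label, round(standardized_coefficient(b_unstd, p), 3))
```

Because sqrt(p(1 - p)) is largest at p = .50 and shrinks as the split becomes more lopsided, the same unstandardized difference between the two groups translates into a smaller standardized coefficient when most respondents fall on one side.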
The results show the importance of taking the proportion of
respondents with opposing personality characteristics into account. The IAT is
least valid when most participants are high or low on a personality
characteristic, and it is most valid when participants are split into two
equally large groups.
In conclusion, I provided an alternative explanation of
variation in explicit-implicit correlations that is consistent with the
data. Implicit-explicit correlations vary
at least partially as a function of the validity of the IAT as a measure of the
same construct that is measured with explicit measures, and the validity of the
IAT varies as a function of the proportion of respondents who are high versus
low on a personality characteristic. As most respondents associate the self
with good, and reaction times contribute little to the validity of the IAT, the
IAT has particularly low validity as a measure of self-esteem.
The Elusive Malleability of Implicit Attitude Measures
Numerous experimental studies have tried to manipulate
situational factors in order to change scores on implicit attitude measures
(Lai, Hoffman, & Nosek, 2013). Many
of these studies focused on implicit measures of prejudice in order to develop
interventions that could reduce prejudice. However, most studies were limited
to brief manipulations with immediate assessment of attitudes (Lai et al.,
2013). The results of these studies are
mixed. In a seminal study, Dasgupta and
Greenwald (2001) exposed participants to images of admired Black exemplars and
disliked White exemplars. They reported that this manipulation had a large
effect on IAT scores. However, these days the results of this study are less
convincing because it has become apparent that large effect sizes from small
samples often do not replicate (Open Science Collaboration, 2015). Consistent
with this skepticism, Joy-Gaba and Nosek (2010) had difficulties replicating
this effect with much larger samples and found only an average effect size of d
= .08. With effect sizes of this
magnitude, other reports of successful experimental manipulations were
extremely underpowered. Another study with large samples found
stronger effects (Lai et al., 2016). The
strongest effect was observed for instructions to fake the IAT. However, Lai et al. also found that none of
these manipulations had lasting effects in a follow-up assessment. This finding
suggests that even when changes are observed, they reflect context-specific
method variance rather than actual changes in the construct that is being measured.
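To make the point about statistical power concrete, the following sketch approximates the sample size needed to detect an effect of d = .08 in a two-group comparison. It uses the normal approximation to the two-sample t-test, and the per-group sample size of 25 in the second call is a hypothetical example of a small study, not a value taken from any of the cited articles.

```python
import math
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group n for a two-tailed, two-group comparison."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d: float, n_group: int, alpha: float = 0.05) -> float:
    """Approximate power for per-group sample size n_group (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    noncentrality = d * math.sqrt(n_group / 2)
    # Probability of exceeding the critical value in either tail
    upper = 1 - NormalDist().cdf(z_alpha - noncentrality)
    lower = NormalDist().cdf(-z_alpha - noncentrality)
    return upper + lower


print(n_per_group(0.08))                 # per-group n for 80% power
print(round(approx_power(0.08, 25), 3))  # power of a hypothetical study with n = 25 per group
```

Under these assumptions, roughly 2,500 participants per group are needed for 80% power, and a study with a few dozen participants per condition has power barely above the alpha level.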
This conclusion is also supported by one of the few
longitudinal IAT studies. Cunningham et al.’s (2001) multi-method study repeated
the measurement of racial bias on four separate occasions. The model in Figure 1 shows no systematic
relationships between measures taken on the same occasion, and adding these
relationships yields non-significant correlated residuals. Thus, in this sample
naturally occurring factors did not change race bias. This finding suggests
that the IAT and explicit measures measure stable personality characteristics rather
than context-specific states.
Only a few serious intervention studies with the IAT have
been conducted (Lai et al., 2013). The
most valuable evidence so far comes from studies that examined the influence of
living with an African American roommate on White students’ racial attitudes
(Shook & Fazio, 2008; Shook, Hopkins, & Koech, 2016). One study found effects on an implicit measure,
F(1,236) = 4.33, p = .04 (Shook & Fazio, 2008), but not on an explicit measure
(Shook, 2007). The other study found
effects on explicit attitudes, F(1,107) = 7.34, p = .008 but no results for
implicit measures were reported (Shook, Hopkins, & Koech, 2016). Given the
small sample sizes of these studies, inconsistent results are to be expected.
In conclusion, the existing evidence shows that implicit and
explicit attitude measures are highly stable over time (Cunningham et al.,
2001). I also concur with Joy-Gaba and Nosek that moving scores on implicit
bias measures “may not be as easy as implied by the existing experimental demonstrations”
(p. 145), and a multi-method assessment is needed to distinguish effects on
specific measures from effects on personality characteristics (Olsen &
Future studies of attitude change need a multi-method
approach, powerful interventions, adequate statistical power, and multiple
repeated measurements of attitudes to distinguish mere occasion-specific variability
(malleability) from real attitude change (Anusic & Schimmack, 2016). Ideally,
the study would also include informant ratings. For example, intervention
studies with roommates could use African Americans as informants to rate their
White roommates’ racial attitudes and behaviors. The single-attitude model predicts that
implicit and explicit measures will show consistent results and that variation
in effect sizes is explained by the validity of each measure.
Does the IAT Measure Implicit Constructs?
Construct validation is a difficult and iterative process
because scientific evidence can alter the understanding of constructs. However, construct validation research has to
start with a working definition of a construct.
The IAT was introduced as a measure of individual differences in
implicit social cognition, and implicit social cognitions were defined as aspects
of thinking and feeling that may not be easily accessed or available to
consciousness (Nosek, Greenwald, & Banaji, 2007, p. 265). This definition is vague, but it makes a clear
prediction that the IAT should measure personality characteristics that cannot
be measured with self-reports. This
leads to the prediction that explicit measures and the IAT have discriminant validity. To demonstrate discriminant validity, unique
variance in the IAT has to be related to other indicators of implicit
personality characteristics. This can be
demonstrated with incremental predictive validity or convergent validity with
other measures of implicit personality characteristics. Consistent with this line of reasoning,
numerous articles have claimed that the IAT has construct validity as a measure
of implicit personality characteristics because it shows incremental predictive
validity (Greenwald et al., 2009; Kurdi et al., 2018) or because the IAT shows convergent
validity with other implicit measures and discriminant validity with explicit
measures (Bar-Anan & Vianello, 2018).
I demonstrated that all of these claims were false and that the existing
research provides no evidence for the construct validity of the IAT as a
measure of implicit personality characteristics. The main problem is that most studies that
used the IAT assumed construct validity rather than testing it. Hundreds of studies used the IAT as a single
measure of implicit personality characteristics and made claims about implicit
personality traits based on variation in IAT scores. Thus, hundreds of studies made claims that
are not supported by empirical evidence simply because it has not been
demonstrated that the IAT measures implicit personality constructs. In this regard the IAT is not alone. Aside from the replication crisis in
psychology (OSC, 2015), psychological science suffers from an even more serious
validation crisis. All empirical claims rest on the validity of measures that
are used to test theoretical claims. However, many measures in psychology are
used without proper validation evidence.
Personality research is a notable exception. In response to criticism of low predictive
validity (Mischel, 1968), personality psychologists embarked on a program of
research that demonstrated predictive validity and convergent validity with
informant ratings (Funder, $$$). Another
problem is that psychologists treat validity as a qualitative construct,
so that any evidence of validity is taken to support claims that a measure is valid,
as if it were 100% valid. However, most measures in psychology have only
moderate validity (Schimmack, 2010). Thus, it is important to quantify validity
and to use a multi-method approach to increase validity. The popularity of the IAT reveals the problems
with using measures without proper validation evidence. Social psychologists have influenced public discourse,
if not public policy, about implicit racial bias. Most of these claims are based on findings
with the IAT, assuming that IAT scores reflect implicit bias. As demonstrated
here, these claims are not valid because the IAT lacks construct validity as a
measure of implicit bias. In the future,
psychologists need to be more careful when they make claims based on new
measures with limited knowledge about their validity. Maybe psychological organizations should
provide clear guidelines about minimal standards that need to be met before a
measure can be used, just like there are guidelines for validity evidence for
personality assessment. In conclusion, psychology
suffers as much from a validation crisis as it suffers from a replication crisis. Fixing the replication crisis will not
improve psychology if replicable results are obtained with invalid measures.
The Silver Lining
Psychologists are often divided into opposing camps (e.g.
nature vs. nurture; person vs. situation; the IAT is valid vs. invalid). Many fans of implicit measures are likely to
dislike what I had to say about the IAT.
However, my position is different from previous criticisms of the IAT as
being entirely invalid (Oswald et al., 2013).
I have demonstrated with several multi-method studies that the IAT has convergent
validity with other measures of some personality characteristics. In some domains
this validity is too low to be meaningful.
In other domains, the validity of explicit measures is so high that
using the IAT is not necessary. However, for sensitive attitudes like racial
attitudes, the IAT offers a promising complementary measure to explicit
measures of racial attitudes. Validity
coefficients ranged from 20% to 40%. As
the IAT does not appear to share method variance with explicit measures, it is
possible to improve the measurement of racial bias by using explicit and
implicit measures and to aggregate scores to obtain a more valid measure of
racial bias than either an explicit or an implicit measure can provide. The IAT may also offer benefits in situations
where socially desirable responding is a concern. Thus, the IAT might complement other measures
of personality characteristics. This changes the interpretation of explicit-IAT
correlations. Rather than (mis)interpreting low correlations as evidence of discriminant
validity, high correlations can reveal convergent validity. Similarly,
improvements in implicit measures should produce higher correlations with explicit
measures. How useful the IAT and other
implicit measures are for the measurement of other personality characteristics
has to be examined on a case by case basis. Just like it is impossible to make
generalized statements about the validity of self-reports, the validity of the
IAT can vary across personality characteristics.
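The suggestion that aggregating an explicit and an implicit measure can yield a more valid score can be illustrated with a small sketch. It assumes two standardized measures that each correlate with the construct (validity coefficients v1 and v2), that share no method variance, and that are combined with unit weights; under these assumptions the validity of the sum is (v1 + v2) / sqrt(2 + 2 * v1 * v2). The input values are hypothetical and the function name is made up for illustration.

```python
import math


def composite_validity(v1: float, v2: float) -> float:
    """Validity of a unit-weighted sum of two standardized measures.

    Assumes both measures reflect the same construct and have uncorrelated
    errors, so their intercorrelation is v1 * v2.
    """
    covariance_with_construct = v1 + v2
    composite_sd = math.sqrt(2 + 2 * v1 * v2)
    return covariance_with_construct / composite_sd


print(round(composite_validity(0.5, 0.5), 3))  # two equally valid measures
print(round(composite_validity(0.6, 0.2), 3))  # one much weaker measure
```

With two equally valid measures the composite is clearly more valid than either measure alone. When one measure is much weaker, a unit-weighted sum can be less valid than the stronger measure by itself, so differential weighting (for example, in a latent variable model) would be needed to benefit from the weaker measure.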
Social psychologists have always distrusted self-report,
especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes
and other constructs with indirect methods.
The IAT was a major breakthrough because it has relatively high
reliability compared to other methods. Thus, creating the IAT was a major achievement
that should not be underestimated just because the IAT lacks construct validity as a
measure of implicit personality characteristics. Even creating an indirect
measure of attitudes is a formidable feat. However, in the early 1990s, social
psychologists were enthralled by work in cognitive psychology that demonstrated
unconscious or uncontrollable processes. Implicit measures were based on this
work and it seemed reasonable to assume that they might provide a window into
the unconscious. However, the processes that are involved in the measurement of
personality characteristics with implicit measures are not the personality
characteristics that are being measured.
There is nothing implicit about being a Republican or Democrat, gay or
straight, or having low self-esteem. Conflating
implicit processes in the measurement of personality constructs with implicit personality
constructs has created a lot of confusion. It is time to end this confusion.
The IAT is an implicit measure of personality with varying validity. It is not a window into people’s unconscious
feelings, attitudes or personalities.
References
Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5).
Anusic, I., Schimmack, U., Pinkus, R., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97(6), 1142-1156.
Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264-1272.
Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567-582.
Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79(4), 631-643.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105.
Chen, F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189-225.
Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163-170.
Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 81, 800-814. doi:10.1037/0022-3514.81.5.800
Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56-68. doi:10.1111/jopy.12082
Falk, C., & Heine, S. J. (2015). What is implicit self-esteem, and does it vary across cultures? Personality and Social Psychology Review, 19, 177-198.
Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79(6), 1022-1038. http://dx.doi.org/10.1037/0022-3514.79.6.1022
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464-1480.
Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17-41.
Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Race attitude measures predicted vote in the 2008 U.S. Presidential Election. Analyses of Social Issues and Public Policy, 9, 241-253.
Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369-1385. http://dx.doi.org/10.1177/0146167205275613
Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist. Advance online publication.
McConahay, J. B. (1986). Modern racism, ambivalence, and the modern racism scale. In J. F. Dovidio & S. L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91-125). Orlando, FL: Academic Press.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1-8.
Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171-192.
Pelham, B. W., & Swann, W. B. (1989). From self-conceptions to self-worth: On the sources and structure of global self-esteem. Journal of Personality and Social Psychology, 57, 672-680.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Simms, L. J., Zelazny, K., Yam, W. H., & Gros, D. F. (2010). Self-informant agreement for personality and evaluative person descriptors: Comparing methods for creating informant measures. European Journal of Personality, 24(3), 207-221.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290.
Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490-497. http://dx.doi.org/10.1016/j.jrp.2007.07.005
Good science requires valid measures. This statement is hardly controversial. Not surprisingly, all authors of some psychological measure claim that their measure is valid. However, validation research is expensive and difficult to publish in prestigious journals. As a result, psychological science has a validity crisis. Many measures are used in hundreds of articles without clear definitions of constructs and without quantitative information about their validity (Schimmack, 2010).
The Implicit Association Test (IAT) is no exception. The IAT was introduced in 1998 with strong and highly replicable evidence that average attitudes towards object pairs (e.g., flowers vs. spiders) can be measured with reaction times in a classification task (Greenwald et al., 1998). Although the title of the article promised a measure of individual differences, the main evidence in the article was mean differences between groups. Thus, the original article provided little evidence that the IAT is a valid measure of individual differences.
The use of the IAT as a measure of individual differences in attitudes requires scientific evidence that test scores are linked to variation in attitudes. Key evidence for the validity of a test includes reliability, convergent validity, discriminant validity, and incremental predictive validity (Campbell & Fiske, 1959).
The validity of the IAT as a measure of attitudes has to be examined on a case by case basis because the link between associations and attitudes can vary depending on the attitude object. For attitude objects like pop drinks, Coke vs. Pepsi, associations may be strongly related to attitudes. In fact, the IAT has good predictive validity for choices between two pop drinks (Hofmann, Gawronski, Gschwendner, & Schmitt, 2005). However, it lacks convergent validity when it is used to measure self-esteem (Bosson, Swann, & Pennebaker, 2000).
The IAT is best known as a measure of prejudice, racial bias, or attitudes of White Americans towards African Americans. On the one hand, the inventor of the IAT, Greenwald, argues that the race IAT has predictive validity (Greenwald et al., 2009). Others take issue with the evidence: “Implicit Association Test scores did not permit prediction of individual-level behaviors” (Blanton et al., 2009, p. 567); “the IAT provides little insight into who will discriminate against whom, and provides no more insight than explicit measures of bias” (Oswald et al., 2013).
Nine years later, Greenwald and colleagues present a new meta-analysis of predictive validity of the IAT (Kurdi et al., 2018) based on 217 research reports and a total sample size of N = 36,071 participants. The results of this meta-analysis are reported in the abstract.
We found significant implicit–criterion correlations (ICCs) and explicit–criterion correlations (ECCs), with unique contributions of implicit (beta = .14) and explicit measures (beta = .11) revealed by structural equation modeling.
The problem with meta-analyses is that they aggregate information with diverse methods, measures, and criterion variables, and the meta-analysis showed high variability in predictive validity. Thus, the headline finding does not provide information about the predictive validity of the race IAT. As noted by the authors, “Statistically, the high degree of heterogeneity suggests that any single point estimate of the implicit–criterion relationship would be misleading” (p. 7).
Another problem of meta-analysis is that it is difficult to find reliable moderator variables if original studies have small samples and large sampling error. As a result, a non-significant moderator effect cannot be interpreted as evidence that results are homogeneous. Thus, a better way to examine the predictive validity of the race IAT is to limit the meta-analysis to studies that used the race IAT.
Another problem of small studies is that they introduce a lot of noise because point estimates are biased by sampling error. Stanley, Jarrell, and Doucouliagos (2010) made the ingenious suggestion to limit meta-analysis to the top 10% of studies with the largest sample sizes. As these studies have small sampling error to begin with, aggregating them will produce estimates with even smaller sampling error and inclusion of many small studies with high heterogeneity is not necessary. A smaller number of studies also makes it easier to evaluate the quality of studies and to examine sources of heterogeneity across studies. I used this approach to examine the predictive validity of the race IAT using the studies included in Kurdi et al.’s (2018) meta-analysis (data).
Description of the Data
The datafile contained the variable groupStemCat2 that coded the groups compared in the IAT. Only studies classified as groupStemCat2 == “African American and Africans” were selected, leaving 1328 entries (rows). Next, I selected only studies with an IAT-criterion correlation, leaving 1004 entries. Next, I selected only entries with a minimum sample size of N = 100, leaving 235 entries (more than 10%).
The 235 entries were based on 21 studies, indicating that the meta-analysis coded, on average, more than 10 different effects for each study.
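The selection steps described above amount to a chain of three filters. The following is a minimal sketch on toy data, not the actual analysis script. Only the field name groupStemCat2 and its value are taken from the description of the datafile. The field names "study", "r", and "n" and all values in the toy rows are assumptions for illustration.

```python
# Toy rows standing in for the meta-analysis datafile (hypothetical values).
# Only "groupStemCat2" is taken from the description of the datafile;
# "study", "r", and "n" are assumed field names.
rows = [
    {"study": "A", "groupStemCat2": "African American and Africans", "r": 0.10, "n": 250},
    {"study": "A", "groupStemCat2": "African American and Africans", "r": None, "n": 250},
    {"study": "B", "groupStemCat2": "African American and Africans", "r": 0.05, "n": 60},
    {"study": "C", "groupStemCat2": "Other", "r": 0.20, "n": 400},
]


def select_entries(rows, min_n=100):
    """Apply the three selection steps in order and return the surviving rows."""
    # Step 1: keep only entries that used the race IAT.
    race_iat = [x for x in rows if x["groupStemCat2"] == "African American and Africans"]
    # Step 2: keep only entries with an IAT-criterion correlation.
    with_r = [x for x in race_iat if x["r"] is not None]
    # Step 3: keep only entries with a minimum sample size.
    return [x for x in with_r if x["n"] >= min_n]


selected = select_entries(rows)
print(len(selected), "entries from", len({x["study"] for x in selected}), "studies")
```

Counting distinct study labels after the last filter is how one would get from the number of entries to the number of studies.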
The median IAT-criterion correlation across all 235 entries was r = .070. In comparison, the median r for the 769 entries with N < 100 was r = .044. Thus, selecting for studies with large N did not reduce the effect size estimate.
When I first computed the median for each study and then the median across studies, I obtained a similar median correlation of r = .065. There was no significant correlation between sample size and median IAT-criterion correlation across the 21 studies, r = .12. Thus, there is no evidence of publication bias.
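The two-step median and the check for a relationship between sample size and effect size can be sketched as follows. This is an illustration under assumptions, not the code used for the results above. The (study, n, r) tuples are made-up values, and the function names are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def study_medians(entries):
    """Return {study: (median n, median r)} for (study, n, r) tuples."""
    by_study = defaultdict(list)
    for study, n, r in entries:
        by_study[study].append((n, r))
    return {
        s: (median(n for n, _ in v), median(r for _, r in v))
        for s, v in by_study.items()
    }


def pearson(x, y):
    """Plain Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical entries: several coded effects per study.
entries = [
    ("A", 100, 0.10), ("A", 100, 0.20), ("A", 100, 0.00),
    ("B", 300, 0.05), ("B", 300, 0.07),
    ("C", 200, 0.02),
]

per_study = study_medians(entries)
ns = [n for n, _ in per_study.values()]
rs = [r for _, r in per_study.values()]

overall_median = median(rs)      # median of the per-study medians
size_effect = pearson(ns, rs)    # sample size vs. per-study median r
print(overall_median, size_effect)
```

Taking the median within studies first keeps a study with many coded effects from dominating the overall estimate.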
I now review the 21 studies in decreasing order of the median IAT-criterion correlation. I evaluate the quality of the studies with 1 to 5 stars ranging from lowest to highest quality. As some studies were not intended to be validation studies, this evaluation does not reflect the quality of a study per se. The evaluation is based on the ability of a study to validate the IAT as a measure of racial bias.
1. * Ma et al. (Study 2), N = 303, r = .34
Ma et al. (2012) used several IATs to predict voting intentions in the 2012 US presidential election. Importantly, Study 2 did not include the race IAT that was used in Study 1 (#15, median r = .03). Instead, the race IAT was modified to include pictures of the two candidates Obama and Romney. Although it is interesting that an IAT that requires race classifications of candidates predicted voting intentions, this study cannot be used to claim that the race IAT as a measure of racial bias has predictive validity because the IAT measures specific attitudes towards candidates rather than attitudes towards African Americans in general.
2. *** Knowles et al., N = 285, r = .26
This study used the race IAT to predict voting intentions and endorsement of Obama’s health care reforms. The main finding was that the race IAT was a significant predictor of voting intentions (Odds Ratio = .61; r = .20) and that this relationship remained significant after including the Modern Racism scale as predictor (Odds Ratio = .67, effect size r = .15). The correlation is similar to the result obtained in the next study with a larger sample.
3. ***** Greenwald et al. (2009), N = 1,057, r = .17
The most conclusive results come from Greenwald et al.’s (2009) study with the largest sample size of all studies. In a sample of N = 1,057 participants, the race IAT predicted voting intentions in the 2008 US election (Obama vs. McCain), r = .17. However, in a model that included political orientation as predictor of voting intentions, only explicit attitude measures added incremental predictive validity, b = .10, SE = .03, t = 3.98, but the IAT did not, b = .00, SE = .02, t = 0.18.
4. * Cooper et al., N = 178, r = .12
The sample size in the meta-analysis does not match the sample size of the original study. Although 269 patients were involved, the race IAT was administered to 40 primary care clinicians. Thus, predictive validity can only be assessed on a small sample of N = 40 physicians who provided independent IAT scores. Table 3 lists seven dependent variables and shows two significant results (p = .02, p = .02) for Black patients.
5. * Biernat et al. (Study 1), N = 136, r = .10
Study 1 included the race IAT and donations to a Black vs. other student organizations as the criterion variable. The negative relationship was not significant (effect size r = .05). The meta-analysis also included the shifting standard variable (effect size r = .14). Shifting standards refers to the extent to which participants shifted standards in their judgments of Black versus White targets’ academic ability. The main point of the article was that shifting standards rather than implicit attitude measures predict racial bias in actual behavior. “In three studies, the tendency to shift standards was uncorrelated with other measures of prejudice but predicted reduced allocation of funds to a Black student organization.” Thus, it seems debatable to use shifting standards as a validation criterion for the race IAT because the key criterion variable was the donations, while shifting standards were a competing indirect measure of prejudice.
6. ** Zhang et al. (Study 2), N = 196, r = .10
This study examined thought listings after participants watched a crime committed by a Black offender on Law and Order. “Across two programs, no statistically significant relations between the nature of the thoughts and the scores on IAT were found, F(2, 85) = 2.4, p < .11 for program 1, and F(2, 84) = 1.98, p < .53 for program 2.” The main limitation of this study is that thought listings are not a real social behavior. As the effect size for this study is close to the median, excluding it has no notable effect on the final result.
7. * Ashburn et al., N = 300, r = .09
The title of this article is “Race and the psychological health of African Americans.” The sample consists of 300 African American participants. Although it is interesting to examine racial attitudes of African Americans, this study does not address the question whether the race IAT is a valid measure of prejudice against African Americans.
8. *** Eno et al. (Study 1), N = 105, r = .09
This article examines responses to a movie set during the Civil Rights Era, “Remember the Titans.” After watching the movie, participants made several ratings about interpretations of events. Only one event, attributing Emma’s actions to an accident, showed a significant correlation with the IAT, r = .20, but attributions to racism also showed a correlation in the same direction, r = .10. For the other events, attributions had similarly small, non-significant effect sizes: Girls interests, r = .12; Girls race, r = .07; Brick racism, r = -.10; Brick Black coach’s actions, r = -.10.
9. *** Aberson & Haag, N = 153, r = .07
Aberson and Haag administered the race IAT to 153 participants and asked questions about quantity and quality of contact with African Americans. They found non-significant correlations with quantity, r = -.12 and quality, r = -.10, and a significant positive correlation with the interaction, r = .17. The positive interaction effect suggests that individuals with low contact, which implies low quality contact as well, are not different from individuals with frequent high quality contact.
10. * Hagiwara et al., N = 106, r = .07
This study is another study of Black patients and non-Black physicians. The main limitation is that there were only 14 physicians and only 2 were White.
11. **** Bar-Anan & Nosek, N = 397, r = .06
This study used contact as a validation criterion. The race IAT showed a correlation of r = -.14 with group contact, with N ranging from 492 to 647. The Brief IAT showed practically the same relationship, r = -.13. The appendix reports that contact was more strongly correlated with the explicit measures; thermometer r = .27, preference r = .31. Using structural equation modeling, as recommended by Greenwald and colleagues, I found no evidence that the IAT has unique predictive validity in the prediction of contact when explicit measures were included as predictors, b = .03, SE = .07, t = 0.37.
12. *** Aberson & Gaffney, N = 386, median r = .05
This study related the race IAT to measures of positive and negative contact, r = .10, r = -.01, respectively. Correlations with an explicit measure were considerably stronger, r = .38, r = -.35, respectively. These results mirror the results presented above.
13. * Orey et al., N = 386, median r = .04
This study examined racial attitudes among Black respondents. Although this is an interesting question, the data cannot be used to examine the predictive validity of the race IAT as a measure of prejudice.
14. * Krieger et al., N = 708, median r = .04
This study used the race IAT with 442 Black participants and criterion measures of perceived discrimination and health. Although this is a worthwhile research topic, the results cannot be used to evaluate the validity of the race IAT as a measure of prejudice.
15. *** Ma et al. (Study 1), N = 335, median r = .03
This study used the race IAT to predict voter intentions in the 2012 presidential election. The study found no significant relationship. “However, neither category-level measures were related to intention to vote for Obama (rs ≤ .06, ps ≥ .26)” (p. 31). The meta-analysis recorded a correlation of r = .045, based on email correspondence with the authors. It is not clear why the race IAT would not predict voting intentions in 2012, when it did predict voting intentions in 2008. One possibility is that Obama was now seen as an individual rather than as a member of a particular group so that general attitudes towards African Americans no longer influenced voting intentions. No matter what the reason is, this study does not provide evidence for the predictive validity of the race IAT.
16. **** Oliver et al., N = 105, median r = .02
This study was an online study of 543 family and internal medicine physicians. They completed the race IAT and gave treatment recommendations for a hypothetical case. Race of the patient was experimentally manipulated. The abstract states that “physicians possessed explicit and implicit racial biases, but those biases did not predict treatment recommendations” (p. 177). The sample size in the meta-analysis is smaller because the total sample was broken down into smaller subgroups.
17. * Nosek & Hansen, N = 207, median r = .01
This study did not include a clear validation criterion. The aim was to examine the relationship between the race IAT and cultural knowledge about stereotypes. “In seven studies (158 samples, N = 107,709), the IAT was reliably and variably related to explicit attitudes, and explicit attitudes accounted for the relationship between the IAT and cultural knowledge.” The cultural knowledge measures were used as criterion variables. A positive relation, r = .10, was obtained for the item “If given the choice, who would most employers choose to hire, a Black American or a White American? (1 definitely White to 7 definitely Black).” A negative relation, r = -.09, was obtained for the item “Who is more likely to be a target of discrimination, a Black American or a White American? (1 definitely White to 7 definitely Black).”
18. * Plant et al., N = 229, median r = .00
This article examined voting intentions in a sample of 229 students. The results are not reported in the article. The meta-analysis reported a positive r = .04 and a negative r = -.04 for two separate entries with different explicit measures, which must be a coding mistake. As voting behavior has been examined in larger and more representative samples (#3, #15), these results can be ignored.
19. * Krieger et al. (2011), N = 503, r = .00
This study recruited 504 African Americans and 501 White Americans. All participants completed the race IAT. However, the study did not include clear validation criteria. The meta-analysis used self-reported experiences of discrimination as validation criterion. However, the important question is whether the race IAT predicts behaviors of people who discriminate, not the experience of victims of discrimination.
20. * Fiedorowicz, N = 257, r = -.01
This study is a dissertation and the validation criterion was religious fundamentalism.
21. * Heider & Skowronski, N = 140, r = -.02
This study separated the measurement of prejudice with the race IAT and the measurement of the criterion variables by several weeks. The criterion was cooperative behavior in a prisoner’s dilemma game. The results showed that “both the IAT (b = -.21, t = -2.51, p = .013) and the Pro-Black subscore (b = .17, t = 2.10, p = .037) were significant predictors of more cooperation with the Black confederate.” However, these results were false and have been corrected (see Carlsson et al., 2018, for a detailed discussion).
Heider, J. D., & Skowronski, J. J. (2011). Addendum to Heider and Skowronski (2007): Improving the predictive validity of the Implicit Association Test. North American Journal of Psychology, 13, 17-20.
In summary, a detailed examination of the race IAT studies included in the meta-analysis shows considerable heterogeneity in the quality of the studies and their ability to examine the predictive validity of the race IAT. The best study is Greenwald et al.’s (2009) study with a large sample and voting in the Obama vs. McCain race as the criterion variable. However, another voting study failed to replicate these findings in 2012. The second best study was Bar-Anan and Nosek’s study with intergroup contact as a validation criterion, but it failed to show incremental predictive validity of the IAT.
Studies with physicians show no clear evidence of racial bias. This could be due to the professionalism of physicians and the results should not be generalized to the general population. The remaining studies were considered unsuitable to examine predictive validity. For example, some studies with African American participants did not use the IAT to measure prejudice.
Based on this limited evidence it is impossible to draw strong conclusions about the predictive validity of the race IAT. My assessment of the evidence is rather consistent with the authors of the meta-analysis, who found that “out of the 2,240 ICCs included in this meta-analysis, there were only 24 effect sizes from 13 studies that (a) had the relationship between implicit cognition and behavior as their primary focus” (p. 13).
This confirms my observation in the introduction that psychological science has a validation crisis because researchers rarely conduct validation studies. In fact, despite all the concerns about replicability, replication studies are much more numerous than validation studies. The consequence of the validation crisis is that psychologists routinely make theoretical claims based on measures with unknown validity. As shown here, this is also true for the IAT. At present, it is impossible to make evidence-based claims about the validity of the IAT because it is unknown what the IAT measures and how well it measures what it measures.
Theoretical Confusion about Implicit Measures
The lack of theoretical understanding of the IAT is evident in Greenwald and Banaji’s (2017) recent article, where they suggest that “implicit cognition influences explicit cognition that, in turn, drives behavior” (Kurdi et al., p. 13). This model would imply that implicit measures like the IAT do not have a direct link to behavior because conscious processes ultimately determine actions. This speculative model is illustrated with Bar-Anan and Nosek’s (#11) data that showed no incremental predictive validity on contact. The model can be transformed into a causal chain by changing the bidirectional path into an assumed causal relationship between implicit and explicit attitudes.
However, it is also possible to change the model into a single factor model that considers unique variance in implicit and explicit measures as mere method variance.
Thus, any claims about implicit bias and explicit bias are premature because the existing data are consistent with various theoretical models. To make scientific claims about implicit forms of racial bias, it would be necessary to obtain data that can distinguish empirically between single-construct and dual-construct models.
The race IAT is 20 years old. It has been used in hundreds of articles to make empirical claims about prejudice. The confusion between measures and constructs has created a public discourse about implicit racial bias that may occur outside of awareness. However, this discourse is removed from the empirical facts. The most important finding of the recent meta-analysis is that a careful search of the literature uncovered only a handful of serious validation studies and that the results of these studies are suggestive at best. Even if future studies would provide more conclusive evidence of incremental predictive validity, this finding would be insufficient to claim that the IAT is a valid measure of implicit bias. The IAT could have incremental predictive validity even if it were just a complementary measure of consciously accessible prejudice that does not share method variance with explicit measures. A multi-method approach is needed to examine the construct validity of the IAT as a measure of implicit race bias. Such evidence simply does not exist. Greenwald and colleagues had 20 years and ample funding to conduct such validation studies, but they failed to do so. In contrast, their articles consistently confuse measures and constructs and give the impression that the IAT measures unconscious processes that are hidden from introspection (“conscious experience provides only a small window into how the mind works”, “click here to discover your hidden thoughts”).
Greenwald and Banaji are well aware that their claims matter. “Research on implicit social cognition has witnessed higher levels of attention both from the general public and from governmental and commercial entities, making regular reporting of what is known an added responsibility” (Kurdi et al., 2018, p. 3). I concur. However, I do not believe that their meta-analysis fulfills this promise. An unbiased assessment of the evidence shows no compelling evidence that the race IAT is a valid measure of implicit racial bias; and without a valid measure of implicit racial bias it is impossible to make scientific statements about implicit racial bias. I think the general public deserves to know this. Unfortunately, there is no need for scientific evidence that prejudice and discrimination still exist. Ideally, psychologists will spend more effort in developing valid measures of racism that can provide trustworthy information about variation across individuals, geographic regions, groups, and time. Many people believe that psychologists are already doing it, but this review of the literature shows that this is not the case. It is high time to actually do what the general public expects from us.
The general public has accepted the idea of implicit bias; that is, individuals may be prejudiced without awareness. For example, in 2018 Starbucks closed their stores for one day to train employees to detect and avoid implicit bias (cf. Schimmack, 2018).
However, among psychological scientists the concept of implicit bias is controversial (Blanton et al., 2009; Schimmack, 2019). The notion of implicit bias is only a scientific construct if it can be observed with scientific methods, and this requires valid measures of implicit bias.
Valid measures of implicit bias require evidence of reliability, convergent validity, discriminant validity, and incremental predictive validity. Proponents of implicit bias claim that measures of implicit bias have demonstrated these properties. Critics are not convinced.
For example, Cunningham, Preacher, and Banaji (2001) conducted a multi-method study and claimed that their results showed convergent validity among implicit measures and that implicit measures correlated more strongly with each other than with explicit measures. However, Schimmack (2019) demonstrated that a model with a single factor fit the data better and that the explicit measures loaded higher on this factor than the evaluative priming measure. This finding challenges the claim that implicit measures possess discriminant validity. That is, they are implicit measures of racial bias, but they are not measures of implicit racial bias.
A forthcoming meta-analysis claims that implicit measures have unique predictive validity (Kurdi et al., 2018). The average effect size for the correlation between an implicit measure and a criterion was r = .14. However, this estimate is based on studies across many different attitude objects and includes implicit measures of stereotypes and identity. Not surprisingly, the predictive validity was heterogeneous. Thus, the average does not provide information about the predictive validity of implicit measures of implicit bias. The most important observation was that sample sizes of many studies were too small to investigate predictive validity given the small expected effect size. Most studies had sample sizes with fewer than 100 participants (see Figure 1).
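To illustrate why samples of fewer than 100 participants are too small for an effect of this size, the sketch below approximates the power of a two-sided test of a correlation using the Fisher z transformation. It is a rough approximation for illustration only. The function name and the choice of sample sizes are my own, and the true effect size of r = .14 is simply the meta-analytic average taken at face value.

```python
from math import atanh, sqrt
from statistics import NormalDist


def power_correlation(r, n, alpha=0.05):
    """Approximate two-sided power to detect a population correlation r
    with n participants, using the Fisher z approximation."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    # Expected z-score of the test: Fisher-transformed r divided by its
    # standard error, 1 / sqrt(n - 3).
    ncp = atanh(r) * sqrt(n - 3)
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)


for n in (50, 100, 400, 1057):
    print(n, round(power_correlation(0.14, n), 2))
```

Under these assumptions, a study with N = 100 has well below the conventional 80% power, whereas a sample as large as the N = 1,057 study discussed next has power close to 1.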
A notable exception is a study of voting intentions in the historic 2008 presidential election, where US voters had a choice to elect the first Black president, Obama, or the Republican candidate McCain. A major question at that time was how much race and prejudice would influence the vote.
Greenwald, Tucker Smith, Sriram, Bar-Anan, and Nosek (2009) conducted a study to address this question.
They obtained data from N = 1,057 participants who completed online implicit measures and responded to survey questions.
The key outcome variable was a simple dichotomous question about voting intentions. The sample was not a nationally representative sample, as indicated by 84.2% declared votes for Obama versus 15.8% declared votes for McCain.
The predictor variables were two self-report measures of prejudice (feeling-thermometer, Likert scale), two implicit measures (Brief IAT, AMP), the Symbolic Racism Scale, and a measure of political orientation (Conservative vs. Liberal).
The correlations among all measures are reported in Table 1.
The results for the Brief IAT (BIAT) are highlighted. First, the BIAT does predict voting intentions (r = .17). Second, the BIAT shows convergent validity with the second implicit measure, the Affect Misattribution Procedure (AMP). Third, the BIAT also correlates with the explicit measures of racial bias. Most important, the correlations with the implicit AMP are weaker than the correlations with the explicit measures. This finding confirms Schimmack’s (2019) finding that implicit measures lack discriminant validity.
The correlation table does not address the question whether implicit measures have incremental predictive validity. To examine this question, I fit a structural equation model to the reproduced covariance matrix based on the reported correlations and standard deviations using MPLUS8.2. The model shown in Figure 1 had good overall fit, chi2(9, N = 1057) = 15.40, CFI = .997, RMSEA = .026, 90%CI = .000 to .047.
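As a check on the reported fit, the RMSEA point estimate can be recomputed from the chi-square value, the degrees of freedom, and the sample size. The sketch below uses the common textbook formula. Whether a program divides by N or by N - 1 differs across software, so both variants are shown; this is an assumption about conventions, not a description of the MPLUS internals.

```python
from math import sqrt


def rmsea(chi2, df, n, use_n_minus_1=True):
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * N)),
    with N or N - 1 in the denominator depending on the convention."""
    denom = df * (n - 1 if use_n_minus_1 else n)
    return sqrt(max(chi2 - df, 0.0) / denom)


# Values reported above: chi2(9, N = 1057) = 15.40
print(round(rmsea(15.40, 9, 1057), 3))
print(round(rmsea(15.40, 9, 1057, use_n_minus_1=False), 3))
```

With a sample this large, both conventions round to the reported value of .026.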
The model shows that explicit and implicit measures of racial bias load on a common factor (att). Whereas the explicit measures share method variance, the residuals of the two implicit measures are not correlated. This confirms the lack of discriminant validity. That is, there is no unique variance shared only by implicit measures. The strongest predictor of voting intentions is political orientation. Symbolic racism is a mixture of conservatism and racial bias, and it has no unique relationship with voting intentions. Racial bias does make a unique contribution to voting intentions, (b = .22, SE = .05, t = 4.4). The blue path shows that the BIAT does have predictive validity above and beyond political orientation, but the effect is indirect. That is, the IAT is a measure of racial bias and racial bias contributes to voter intentions. The red path shows that the BIAT has no unique relationship with voting intentions. The negative coefficient is not significant. Thus, there is no evidence that the unique variance in the BIAT reflects some form of implicit racial bias that influences voting intentions.
In short, these results provide no evidence for the claim that implicit measures tap implicit racial biases. In fact, there is no scientific evidence for the concept of implicit bias, which would require evidence of discriminant validity and incremental validity.
The use of structural equation modeling (SEM) was highly recommended by the authors of the forthcoming meta-analysis (Kurdi et al., 2018). Here I applied SEM to the best data with multiple explicit and implicit measures, an important criterion variable, and a large sample size that is sufficient to detect small relationships. Contrary to the meta-analysis, the results do not support the claim that implicit measures have incremental predictive validity. In addition, the results confirmed Schimmack’s (2019) results that implicit measures lack discriminant validity. Thus, the construct of implicit racial bias lacks empirical support. Implicit measures like the IAT are best considered as implicit measures of racial bias that is also reflected in explicit measures.
With regard to the political question whether racial bias influenced voting in the 2008 election, these results suggest that racial bias did indeed matter. Using only explicit measures would have underestimated the effect of racial bias due to the substantial method variance in these measures. Thus, the IAT can make an important contribution to the measurement of racial bias because it doesn’t share method variance with explicit measures.
In the future, users of implicit measures need to be more careful in their claims about the construct validity of implicit measures. Greenwald et al. (2009) constantly conflate implicit measures of racial bias with measures of implicit racial bias. For example, the title claims “Implicit Race Attitudes Predicted Vote.” The term “implicit race attitude measure” is ambiguous because it could mean implicit measure or implicit attitude, whereas the term “implicit measures of race attitudes” implies that the measures are implicit but the construct is racial bias; otherwise it would be “implicit measures of implicit racial bias.” The confusion arises from a long tradition in psychology to conflate measures and constructs (e.g., intelligence is whatever an IQ test measures) (Campbell & Fiske, 1959). Structural equation modeling makes it clear that measures (boxes) and constructs (circles) are distinct and that measurement theory is needed to relate measures to constructs. At present, there is clear evidence that implicit measures can measure racial bias, but there is no evidence that attitudes have an explicit and an implicit component. Thus, scientific claims about racial bias do not support the idea that racial bias is implicit. This idea is based on the confusion of measures and constructs in the social cognition literature.
I reexamine Cunningham, Preacher, and Banaji’s claim that
explicit and implicit attitude measures have discriminant validity. Contrary to
their claim, a single factor model fits the data better than their hierarchical
model with an explicit and an implicit attitude factor. I also show that
attitudes over the two-month period were stable and not influenced by
contextual factors. There is also no evidence that different implicit measures
tap different types of unconscious bias. All measures have low validity as
measures of prejudice. I conclude that the concept of unconscious or implicit prejudice
lacks empirical support because implicit measures do not show discriminant
validity from explicit measures.
No Discriminant Validity of Implicit and Explicit Prejudice Measures
An article in Psychological Science (Cunningham, Preacher, &
Banaji, 2001) reported the results of a longitudinal multi-method study of
prejudice; that is, attitudes towards African Americans. The article is frequently cited (446
citations in total, 30 citations in 2018 on January 31 in WebofScience) as
evidence that explicit and implicit measures of prejudice measure two different
constructs. Explicit measures are assumed
to assess consciously accessible and controllable attitudes, whereas implicit
measures are assumed to assess uncontrollable aspects of attitudes that may
exist outside of conscious awareness. Although the article was published nearly
20 years ago, it remains “the most sophisticated examination of measurement
error and the interrelations among various implicit measures” (Fazio &
Olson, 2003). Thus, it provides the
single most important empirical evidence for the construct validity of implicit
measures of prejudice. Without evidence for discriminant validity, implicit
measures might simply be implicit measures of the same construct that is measured
by means of self-report measures. Although implicit measures have many
advantages over self-report measures, this view suggests that there is no need
for a theoretical distinction between explicit and implicit forms of racial bias.
In this article, I reexamine Cunningham’s structural
equation model that was used to support the claim that “the two kinds of
attitude measures also tap unique sources of variance (Cunningham et al.,
2001); a single-factor solution does not fit the data” (p. 170). To be blunt, I will show that this claim is
false. A single factor model actually
does fit the data better than the model reported in the original article.
Second, I use the data to examine the contribution of stable traits and
situational factors to measures of racial bias.
These results shed new light on the controversial question about the
context sensitivity of implicit attitude measures. Some experimental studies suggest that
implicit measures are sensitive to situational factors (Dasgupta). However, effect
sizes in these small studies tend to be inflated. A large replication study
with thousands of participants found only an effect size of d = .08, suggesting
that implicit measures reflect mostly stable individual differences in
prejudice and measurement error (Joy-Gaba & Nosek, 2011).
Description of the Design and Measures
Participants were 93 students with complete data. Each student completed a single explicit measure of prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit measures: (a) the standard race IAT (Greenwald, McGhee, & Schwartz, 1998), (b) a response window IAT (Cunningham et al., 2001), and (c) a response window evaluative priming task (Fazio, Sanbonmatsu, Powell, & Kardes, 1986). The assessment was repeated on four occasions two weeks apart.
Reproducing the Original Model
Although it was not common to publish original data in 2001, structural equation modeling does not require access to the original data. It is possible to reproduce or test alternative models simply based on the correlations and standard deviations. Fortunately, Cunningham et al. (2001) published this information and I was able to reproduce their model, using MPLUS8.2. Figure 1 shows the parameter estimates. They closely correspond to the original results. The original article reported good model fit, “chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041 (90% confidence interval: 0.00, 0.071)” (p. 168). The model fit for the reproduced model was very similar, chi2(100, N = 93) = 112, CFI = .977, RMSEA = 0.036, 90%CI = .000 to .067. Thus, the model fit of the reproduced model serves as a comparison standard for the alternative models that I examined next.
Figure 1. Original Model with reproduced parameter estimates based on the published correlations and standard deviations.
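Reproducing a model from published summary statistics works because the covariance matrix, which is all the model needs as input, can be rebuilt from the correlations and standard deviations: each covariance is the correlation multiplied by the two standard deviations. The sketch below shows this step with made-up numbers for two variables. The values and the function name are assumptions for illustration, not the published statistics.

```python
def cov_from_corr(corr, sds):
    """Rebuild a covariance matrix from a correlation matrix and a list of
    standard deviations: cov[i][j] = corr[i][j] * sd[i] * sd[j]."""
    k = len(sds)
    return [[corr[i][j] * sds[i] * sds[j] for j in range(k)] for i in range(k)]


# Hypothetical two-variable example (not the published values).
corr = [[1.0, 0.5],
        [0.5, 1.0]]
sds = [2.0, 3.0]

cov = cov_from_corr(corr, sds)
for row in cov:
    print(row)
```

The diagonal of the result holds the variances, and the resulting matrix, together with the sample size, is what an SEM program takes as summary-data input.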
The original model is a hierarchical model with an implicit attitude
factor as a second-order factor, and method-specific first order factors. Each
first-order factor has four indicators for the four measurement occasions. A hierarchical
model imposes constraints on the first-order loadings because they contribute to
the first-order relations among indicators of the same method and to the second-order
relations of different implicit methods to each other. An alternative way
to model multi-method data is to use bi-factor models (Chen, West, & Sousa,
A bifactor model allows for all measures to be directly
related to the general trait factor that corresponds to the second-order factor
in a hierarchical model. However,
bi-factor models may not be identified if there are no method factors. Thus, a
first step is to allow for method-specific correlated residuals and to examine whether
these correlations are positive.
The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065. Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures. The response window IAT had no significant residual correlations. This explains the high factor loading of the response window IAT in the hierarchical model. It does not suggest that this is the most valid measure. Rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062. I then tried to create method factors for the remaining methods. For the IAT, a method factor could be created using only the first three occasions because the fourth occasion did not load on the method factor. However, model fit decreased unless occasion 2 was allowed to correlate with occasion 4. This unexpected finding is unlikely to reflect a real relationship. Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064. I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065. The first occasion did not load on the method factor. Model fit could be improved by fixing the loading to zero and by allowing for an additional correlation between occasion 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062. However, there is no rationale for this relationship and I retained the more parsimonious model.
Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068. This was the final model (Figure 2).
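As a rough check on the fit statistics reported above, RMSEA can be recomputed from the chi-square value, the degrees of freedom, and the sample size. The sketch below assumes the common single-group formula sqrt(max(chi2 - df, 0) / (df * (N - 1))). The function name is hypothetical, and the software used for the analyses may apply a slightly different variant (for example, N instead of N - 1). Because the chi-square values are reported as rounded integers, the recomputed values come out close to the reported ones rather than identical.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of RMSEA (assumed formula, see lead-in)."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    # Misfit per degree of freedom, floored at zero when chi2 < df.
    excess = max(chi2 - df, 0.0)
    return math.sqrt(excess / (df * (n - 1)))


# Final model reported in the text: chi2(94, N = 93) = 107.
final_model = rmsea(107, 94, 93)
print(f"RMSEA = {final_model:.3f}")
```

Whenever chi2 is smaller than df, this estimate is exactly zero; the lower bounds of .000 for the confidence intervals reported above come from the interval computation, which this sketch does not reproduce.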
The most important results are the factor loadings of the measures on the trait factor. Factor loadings for the Modern Racism Scale ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged from .43 to .54 (M = .47). Factor loadings for the response window IAT ranged from .41 to .69 (M = .51). The evaluative priming measures had the lowest factor loadings, ranging from .13 to .47 (M = .29). In terms of absolute validity, all of these validity coefficients are low, suggesting that a single standard IAT measure on a single occasion has .47^2 = 22% valid variance. Most importantly, these results suggest that the Modern Racism Scale and the IAT measure a single construct and that the low correlation between implicit and explicit measures reflects low convergent validity rather than high discriminant validity.
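The step from a factor loading to the percentage of valid variance is simply squaring the standardized loading. A minimal sketch follows, using the mean loadings reported above; the function and variable names are hypothetical. Squaring a mean loading is only an approximation of the mean of the squared loadings.

```python
def valid_variance(loading: float) -> float:
    """Share of variance in a measure explained by the trait factor."""
    return loading ** 2


# Mean standardized loadings on the trait factor, as reported in the text.
mean_loadings = {
    "Modern Racism Scale": 0.40,
    "standard IAT": 0.47,
    "response window IAT": 0.51,
    "evaluative priming": 0.29,
}

for measure, loading in mean_loadings.items():
    print(f"{measure}: {valid_variance(loading):.0%} valid variance")
```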
The model in Figure 2 assumes that prejudice is stable over
the two-month period of the study and that there are no systematic changes in
prejudice levels. To test this assumption, I tested a model with correlated
residuals among measures taken at the same occasion. Model fit improved, chi2(70, N = 93) = 75,
CFI = .991, RMSEA = 0.027, 90%CI = .000 to .066. However, the pattern of residual correlations
did not reveal evidence for state variance.
For time 1, the IAT was correlated with the RW-IAT and evaluative
priming, but the latter two were not correlated. In addition, evaluative
priming was negatively related to modern racism. At time 2, none of the correlations were
significant, and fixing them to zero improved model fit, chi2(76, N = 93) = 78,
CFI = .996, RMSEA = 0.016, 90%CI = .000 to .060. At time 3, the two IAT measures were
negatively correlated, but they correlated positively with the modern racism
scale. Fixing the remaining four correlations
to zero improved model fit, chi2(74, N = 93) = 78, CFI = .993, RMSEA = 0.023,
90%CI = .000 to .060. At time 4, there
were no significant correlations and constraining the correlations to zero did
not alter fit, chi2(76, N = 93) = 81, CFI = .991, RMSEA = 0.026, 90%CI = .000
to .064. These analyses show that there
are no systematic changes in prejudice over the course of the study.
A reexamination of Cunningham et al.’s (2001) multi-measure
study of racial attitudes challenges the original conclusion that a single
factor model does not fit the data. In
fact, a single factor model fits the data better than the original,
hierarchical model. Moreover, the new
model shows that the original article falsely suggested that each measure has
stable method variance. A careful analysis of residual correlations showed that
only the modern racism scale has substantial and stable method variance on all
four occasions. Another finding was that implicit measures on the same occasion
did not share variance with each other. This finding suggests that prejudice is
a stable disposition, at least over a two-month period, and not a malleable
state. This is consistent with weak effects of experimental manipulations on
IAT scores (Joy-Gaba & Nosek, 2010).
Factor loadings of the two IAT measures on the prejudice
factor were slightly higher than those for the Modern Racism Scale. This might suggest that implicit measures have
slightly higher validity than explicit measures. However, this conclusion is
limited to the Modern Racism Scale, which tends to show lower convergent
validity with the IAT than more direct prejudice measures (Axt, 2018). In
addition, the evaluative priming task had lower validity. Thus, validity has to
be evaluated for each measure and it is impossible to make general statements
about higher or lower validity of implicit versus explicit measures.
The main practical implication of this new look at old data is that claims about implicit racial bias as a distinct form of prejudice are not supported by scientific evidence. Although implicit measures are less susceptible to socially desirable responding, they do not necessarily assess some unconscious form of prejudice. This is not a criticism of implicit measures like the Implicit Association Test. The ability to measure prejudice without self-reports is extremely valuable for prejudice researchers. Given the low validity of a single IAT, it should not be used for the assessment of individuals. However, measurement error is reduced in comparisons of groups of participants, and the IAT can reveal important group differences in prejudice levels. Proponents of the IAT have also argued that the IAT measures some hidden form of prejudice that is not accessible to introspection (Kurdi et al., 2018). This claim requires demonstration of discriminant validity (Campbell & Fiske, 1959), and evidence of discriminant validity is lacking. Evidence for the unique predictive validity of the IAT is also controversial (Kurdi et al., 2018). A meta-analysis suggests that about 1% of the variance in criterion variables is explained by IAT scores. However, the authors also note that most studies were severely underpowered to detect such small effect sizes. Moreover, even unique predictive variance in mono-method studies does not demonstrate that the IAT measures a different construct. I therefore urge prejudice researchers to conduct high-powered multi-method studies to examine the discriminant and predictive validity of implicit prejudice measures.
Chen, F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189–225.
Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163–170. http://dx.doi.org/10.1111/1467-9280.00328
Dasgupta, N., & Greenwald, A. G. (2001). On
the malleability of automatic attitudes: Combating automatic prejudice with
images of admired and disliked individuals. Journal
of Personality and Social Psychology, 81, 800–814.
Fazio, R.H., Sanbonmatsu, D.M., Powell, M.C.,
& Kardes, F.R. (1986). On the automatic activation of attitudes. Journal of Personality and Social
Psychology, 50, 229–238.
Joy-Gaba, J. A.,
& Nosek, B. A. (2010). The
surprisingly limited malleability of implicit racial evaluations. Social Psychology, 41, 137–146. doi:10.1027/1864-9335/a000020
Greenwald, A.G., McGhee, D.E., & Schwartz,
J.L.K. (1998). Measuring individual differences in implicit cognition:
The Implicit Association Test. Journal of
Personality and Social Psychology, 74, 1464–1480.
Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll,
T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship
between the Implicit Association Test and intergroup behavior: A meta-analysis.
American Psychologist. Advance
online publication. http://dx.doi.org/10.1037/amp0000364
McConahay, J. B. (1986). Modern racism, ambivalence, and the modern racism scale. In J. F. Dovidio & S. L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91–125). Orlando, FL: Academic Press.
In a groundbreaking article, a team of psychologists replicated 97 published studies with a significant result. The key finding was that only 36% of the 97 significant results could be replicated; that is, the replication study reproduced a significant result.
One conclusion that can be drawn from this result is that the average success rate in psychology research is around 40%, but journals publish over 90% significant results, which shows that the published record is biased in favor of supporting evidence.
However, the result does not tell us how many of the published results were false positives. In this post, I use the replication studies to estimate the false discovery risk; that is, the maximum false discovery rate (Soric, 1989).
Soric demonstrated that the maximum false discovery rate is determined by the discovery rate; that is, the percentage of significant results among all statistical tests. The problem is that we typically see only a biased sample of mostly significant results, so the discovery rate is unknown.
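Soric's bound can be written as: maximum FDR = (1 / discovery rate - 1) * alpha / (1 - alpha). The following is a minimal sketch of that formula, assuming alpha = .05 and assuming that all true hypotheses are tested with 100% power, which is what makes the result an upper limit. The function name is hypothetical.

```python
def soric_max_fdr(discovery_rate: float, alpha: float = 0.05) -> float:
    """Upper bound on the false discovery rate for a given discovery rate."""
    if not 0 < discovery_rate <= 1:
        raise ValueError("discovery_rate must be in (0, 1]")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be in (0, 1)")
    bound = (1 / discovery_rate - 1) * alpha / (1 - alpha)
    # A rate cannot exceed 100%, so cap the bound at 1.
    return min(bound, 1.0)


# The lower the discovery rate, the higher the false discovery risk.
for rate in (0.05, 0.24, 0.50, 0.90):
    print(f"discovery rate {rate:.0%}: max FDR {soric_max_fdr(rate):.0%}")
```

With a discovery rate equal to alpha, every significant result could be a false positive; with a discovery rate of 24%, the bound is about 17%.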
Brunner and Schimmack (2018) developed a method, z-curve, that makes it possible to estimate the discovery rate based on the power of the significant results. For example, if a significant result was obtained with 20% power, on average 5 studies are needed to produce a significant result. Thus, the expected value is 5. For false positives, the probability of a significant result is alpha, which is typically 5%. So, on average 20 studies are needed to get one significant result.
Previously I used z-curve for sets of studies published in journals that were selected for significance. Here I use the results of the replication studies from the reproducibility project to estimate the false discovery risk in psychological science; or at least for the three journals that were used for the project (JPSP, JEP-LMC, Psych Science).
The dataset consists of 88 studies. Nine studies were excluded because the replication study was less than ideal (e.g., smaller sample size than the original study). Because there is no selection for significance, z-curve used all studies to estimate the weights for different levels of power that could reproduce the observed distribution of z-scores. The first finding is that the proportion of significant results in the reproducibility project, the discovery rate, was 38%. This is consistent with the estimated discovery rate of 40% based on the power estimates. This confirms that the replication results are an unbiased sample. The other statistics in the figure are less interesting because they focus on the studies that produced a significant result again. For example, the 74% replication rate estimate suggests that the success rate would increase to 74% if only the 35 studies with significant results were replicated again (re-replicated). Soric’s FDR tells us that no more than 9% of the 35 studies with significant results are false discoveries. However, the more interesting question is how many of the 88 studies that were replicated could be false discoveries. This would be an estimate of the false discovery rate in psychology.
Obtaining this estimate is straightforward. We can simply use the weights of the model, which do not distinguish between significant and non-significant results. They apply to the whole distribution. This does not change anything about the number of studies that would be needed to produce a significant result. So, we can divide the weights by power and sum them to get the average number of studies that would be required to get 1 significant result for each of the 88 studies. The estimate is 4.18 studies for each significant result, which translates into a discovery rate of 24%. This suggests that experimental psychologists conduct on average 4 studies for every significant result that gets published.
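The computation described here can be sketched as follows. The weights and power levels in the example are made up for illustration and are not the z-curve estimates for the replication studies. The function name is hypothetical. The only value taken from the analysis is the final estimate of 4.18 studies per significant result.

```python
def expected_studies_per_discovery(weights, powers):
    """Average number of studies needed to obtain one significant result.

    Each component with power p needs 1 / p studies on average, so the
    overall expectation is the weighted sum of weight / power.
    """
    if len(weights) != len(powers):
        raise ValueError("weights and powers must have the same length")
    if any(p <= 0 or p > 1 for p in powers):
        raise ValueError("power values must be in (0, 1]")
    total = sum(weights)
    # Normalize in case the weights do not sum exactly to 1.
    return sum(w / p for w, p in zip(weights, powers)) / total


# Illustrative (made-up) mixture: half false positives (power = alpha = .05),
# half studies with 20% power.
n_studies = expected_studies_per_discovery([0.5, 0.5], [0.05, 0.20])
print(f"studies per significant result: {n_studies:.2f}")
print(f"discovery rate: {1 / n_studies:.0%}")

# Estimate reported in the text: 4.18 studies per significant result.
print(f"reported discovery rate: {1 / 4.18:.0%}")
```

The discovery rate is the reciprocal of the expected number of studies, which is how 4.18 studies per significant result translates into a discovery rate of about 24%.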
We can then use Soric’s formula and find that a discovery rate of 24% yields a false discovery risk of 17%.
This estimate is somewhat larger than the estimate based on z-curve analysis of the original studies, which was only 10% (see Figure 2).
The reason could be that it is difficult to adjust for the use of questionable research practices. It is also possible that problems with some replication studies produced false positives that inflate the FDR estimate based on the replication studies. Nevertheless, both estimates show that most published results in psychology journals are not false positives.
Although this is good news, it is important to realize that Soric’s FDR focuses on the nil-hypothesis that the population effect size is zero or even in the opposite direction. A bigger concern is that many published results have dramatically inflated effect sizes and that the true effects may be theoretically or practically irrelevant. Z-curve provides a way to estimate the FDR that treats studies with very low power as false positives. Z-curve is fitted to the data with varying amounts of false positives. If model fit is not much different from the free model, the data are consistent with the specified number of false positives. This value is reported in Figure 1 and shows that up to 35% of published results could be false positives if studies with less than 17% power are considered false positives. This estimate changes with the definition of false positives.
In conclusion, this post showed how z-curve can be used to estimate the false discovery risk in psychological science based on a set of unbiased replication studies. As more replication studies are being conducted, z-curve can provide valuable information about the false discovery risk in psychological science.