Information about the replicability of published results is important because empirical results can only be used as evidence if they can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of a psychologist to estimate the replicability of their studies.
John A. Bargh
Bargh is an eminent social psychologist (H-Index in WebofScience = 61). He is best known for his claim that unconscious processes have a strong influence on behavior. Some of his most cited articles used subliminal or unobtrusive priming to provide evidence for this claim.
Bargh also played a significant role in the replication crisis in psychology. In 2012, a group of researchers failed to replicate his famous “elderly priming” study (Doyen et al., 2012). He responded with a personal attack that was covered in various news reports (Bartlett, 2013). It also triggered a response by psychologist and Nobel Laureate Daniel Kahneman, who wrote an open letter to Bargh (Young, 2012).
“As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.”
Kahneman also asked Bargh and other social priming researchers to conduct credible replication studies to demonstrate that the effects are real. However, seven years later neither Bargh nor other prominent social priming researchers have presented new evidence that their old findings can be replicated.
Instead, other researchers have conducted replication studies and produced further replication failures. As a result, confidence in social priming is decreasing, as reflected in Bargh’s citation counts (Figure 1).
Figure 1. John A. Bargh’s citation counts in Web of Science (3/17/19)
In this blog post, I examine the replicability and credibility of John A. Bargh’s published results using a statistical approach, z-curve (Brunner & Schimmack, 2018). It is well known that psychology journals only publish confirmatory evidence with statistically significant results, p < .05 (Sterling, 1959). This selection for significance is the main cause of the replication crisis in psychology: because all published results are significant successes, success rates cannot distinguish results that can be replicated from results that cannot (we never see replication failures).
While selection for significance makes success rates uninformative, the strength of evidence against the null-hypothesis (signal/noise or effect size / sampling error) does provide information about replicability. Studies with higher signal to noise ratios are more likely to replicate. Z-curve uses z-scores as the common metric of signal-to-noise ratio for studies that used different test statistics. The distribution of observed z-scores provides valuable information about the replicability of a set of studies. If most z-scores are close to the criterion for statistical significance (z = 1.96), replicability is low.
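As a minimal illustration of this common metric (my sketch, not the z-curve estimation itself), a two-tailed p-value can be converted to the z-score it implies under a standard normal sampling distribution:

```python
from statistics import NormalDist

def p_to_z(p_two_tailed):
    """Map a two-tailed p-value to the absolute z-score it implies
    under a standard normal sampling distribution."""
    return NormalDist().inv_cdf(1 - p_two_tailed / 2)
```

For example, p = .05 maps to z ≈ 1.96 and p = .0001 maps to z ≈ 3.89, which is why smaller p-values indicate a stronger signal relative to noise.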
Given the requirement to publish significant results, researchers have two options to meet this goal. One option is to obtain large samples that reduce sampling error and thereby increase the signal-to-noise ratio. The other is to conduct studies with small samples and carry out multiple statistical tests. Multiple testing increases the probability of obtaining a significant result with the help of chance. This strategy is more efficient at producing significant results, but these results are less replicable because a replication study will not be able to capitalize on chance again. The latter strategy is called a questionable research practice (John et al., 2012), and it produces questionable results because it is unknown how much chance contributed to the observed significant result. Z-curve reveals how much a researcher relied on questionable research practices to produce significant results.
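A toy calculation (my illustration, assuming independent tests of true null hypotheses) shows how quickly multiple testing inflates the chance of obtaining a significant result:

```python
def chance_of_significance(k_tests, alpha=0.05):
    """Probability that at least one of k independent tests of true
    null hypotheses reaches significance by chance alone."""
    return 1 - (1 - alpha) ** k_tests
```

With 10 such tests, the chance of at least one significant result is already about 40%, even when every null hypothesis is true.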
I used WebofScience to identify the most cited articles by John A. Bargh (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 43 empirical articles (H-Index = 41). The 43 articles reported 111 studies (average 2.6 studies per article). The total number of participants was 7,810, with a median of 56 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value, and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 100 results that were significant at p < .05 (two-tailed). Of the remaining 11 results, 7 were interpreted as evidence using lower standards of significance, and 4 did not produce a significant result. Thus, the success rate for the 111 reported hypothesis tests was 96%. This is a typical finding in psychology journals (Sterling, 1959).
The z-curve estimate of replicability is 29%, with a 95%CI ranging from 15% to 38%. Even the upper end of the 95% confidence interval is a low estimate. This average replicability is lower than the estimate for social psychology articles in general (44%; Schimmack, 2018) and lower than the estimates for most other social psychologists. At present, only one audit has produced an even lower estimate (Replicability Audits, 2019).
The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve is an estimate of the file drawer of studies that need to be conducted to achieve 100% successes if hiding replication failures were the only questionable research practice that is used. The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio. Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 7.53 suggests that for every published significant result about 7 studies with non-significant results remained unpublished. Moreover, often the null-hypothesis may be false, but the effect size is very small and the result is still difficult to replicate. When the definition of a false positive includes studies with very low power, the false positive estimate increases to 50%. Thus, about half of the published studies are expected to produce replication failures.
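To make the file-drawer arithmetic concrete, here is a toy geometric model (my simplification; z-curve’s actual projection is based on the fitted density, not this formula): if every attempt has the same power, the expected number of non-significant attempts per published success, and the power implied by a given file-drawer ratio, follow directly.

```python
def failures_per_success(power):
    """Expected non-significant attempts per significant result when
    every independent attempt has the same probability (power) of
    producing a significant result."""
    return (1 - power) / power

def power_implied_by_file_drawer(ratio):
    """Average power before selection implied by a file-drawer ratio
    under the same toy geometric model."""
    return 1 / (1 + ratio)
```

Under this toy model, a file-drawer ratio of 7.53 corresponds to an average power before selection of roughly 1/(1 + 7.53) ≈ 12%.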
Finally, z-curve examines heterogeneity in replicability. Studies with p-values close to .05 are less likely to replicate than studies with p-values below .0001. This fact is reflected in the replicability estimates for segments of studies that are provided below the x-axis. Without selection for significance, a z-score of 1.96 corresponds to 50% replicability. However, selection for significance lowers this value to just 14% replicability. Thus, we would not expect published results with just-significant p-values to replicate in actual replication studies. Even z-scores in the range from 3 to 3.5 average only 32% replicability. Thus, only studies with z-scores greater than 3.5 can be considered to provide credible empirical evidence for their claims.
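The 50% figure for z = 1.96 follows from a standard power calculation. Here is a minimal sketch (my illustration, assuming the replication test statistic is normally distributed around the true noncentrality; it ignores selection for significance, which is what lowers the estimate for just-significant observed results to 14%):

```python
from statistics import NormalDist

def exact_replication_power(z_true, z_crit=1.959964):
    """Power of an exact replication when the replication test statistic
    is normally distributed around the true noncentrality z_true.
    The negligible probability of significance in the wrong direction
    is ignored."""
    return 1 - NormalDist().cdf(z_crit - z_true)
```

A true noncentrality of 1.96 yields 50% power; about 2.8 yields roughly 80%, the conventional target for well-powered studies.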
Inspection of the datafile shows that z-scores greater than 3.5 were consistently obtained in 2 out of the 43 articles. Both articles used a more powerful within-subject design.
The automatic evaluation effect: Unconditional automatic attitude activation with a pronunciation task (JPSP, 1996)
Subjective aspects of cognitive control at different stages of processing (Attention, Perception, & Psychophysics, 2009).
John A. Bargh’s work on unconscious processes with unobtrusive priming tasks is at the center of the replication crisis in psychology. This replicability audit suggests that this is not an accident. The low replicability estimate and the large file-drawer estimate suggest that replication failures are to be expected. As a result, published results cannot be interpreted as evidence for these effects.
So far, John Bargh has ignored criticism of his work. In 2017, he published a popular book about his work on unconscious processes. The book did not mention doubts about the reported evidence, while a z-curve analysis showed low replicability of the cited studies (Schimmack, 2017).
Recently, another study by John Bargh failed to replicate (Chabris et al., in press). Jesse Singal wrote a blog post about this replication failure (Research Digest), and John Bargh wrote a lengthy comment in response.
In the commentary, Bargh lists several studies that successfully replicated the effect. However, listing studies with significant results does not provide evidence for an effect unless we know how many studies failed to demonstrate it, and we usually do not know this because those studies are not published. Thus, Bargh continues to ignore the pervasive influence of publication bias.
Bargh then suggests that the replication failure was caused by a hidden moderator which invalidates the results of the replication study.
One potentially important difference in procedure is the temperature of the hot cup of coffee that participants held: was the coffee piping hot (so that it was somewhat uncomfortable to hold) or warm (so that it was pleasant to hold)? If the coffee was piping hot, then, according to the theory that motivated W&B, it should not activate the concept of social warmth – a positively valenced, pleasant concept. (“Hot” is not the same as just more “warm”, and actually participates in a quite different metaphor – hot vs. cool – having to do with emotionality.) If anything, an uncomfortably hot cup of coffee might be expected to activate the concept of anger (“hot-headedness”), which is antithetical to social warmth. With this in mind, there are good reasons to suspect that in C&S, the coffee was, for many participants, uncomfortably hot. Indeed, C&S purchased a hot or cold coffee at a coffee shop and then immediately handed that coffee to passersby who volunteered to take the study. Thus, the first few people to hold a hot coffee likely held a piping hot coffee (in contrast, W&B’s coffee shop was several blocks away from the site of the experiment, and they used a microwave for subsequent participants to keep the coffee at a pleasantly warm temperature). Importantly, C&S handed the same cup of coffee to as many as 7 participants before purchasing a new cup. Because of that feature of their procedure, we can check if the physical-to-social warmth effect emerged after the cups were held by the first few participants, at which point the hot coffee (presumably) had gone from piping hot to warm.
He overlooks that his original study produced only weak evidence for the effect, with a p-value of .0503 that is technically not below the .05 criterion for significance. As shown in the z-curve plot, results with a p-value of .0503 have an average replicability of only 13%. Moreover, the 95%CI for the effect size touches 0. Thus, the original study did not rule out that the effect size is extremely small and of no practical significance. Any claim that the effect of holding a warm cup on affection is theoretically relevant for our understanding of affection would require studies with larger samples and more convincing evidence.
At the end of his commentary, John A. Bargh assures readers that he is purely motivated by a search for the truth.
Let me close by affirming that I share your goal of presenting the public with accurate information as to the state of the scientific evidence on any finding I discuss publicly. I also in good faith seek to give my best advice to the public at all times, again based on the present state of evidence. Your and my assessments of that evidence might differ, but our motivations are the same.
Let me be crystal clear: I have no reason to doubt that John A. Bargh believes what he says. His conscious mind sees himself as a scientist who employs the scientific method to provide objective evidence. However, Bargh himself would be the first to acknowledge that our conscious mind is not fully aware of the actual causes of human behavior. I submit that his response to criticism of his work shows that he is less capable of being objective than he thinks he is. I would be happy to be proven wrong in a response by John A. Bargh to my scientific criticism of his work. So far, eminent social psychologists have preferred to remain silent about the results of their replicability audits.
It is nearly certain that I made some mistakes in the coding of John A. Bargh’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. Still, trust is good, but control is better, and everybody can audit this audit. The data are openly available, and they can be submitted to a z-curve analysis using a Shiny app. Thus, this replicability audit is fully transparent and open to revision.
Many psychologists do not take this work seriously because it has not been peer-reviewed. However, nothing is stopping them from conducting a peer review of this work and publishing the results of their review as a commentary here or elsewhere. Thus, the lack of peer review is not a reflection of the quality of this work, but rather of the unwillingness of social psychologists to take criticism of their work seriously.
If you found this audit interesting, you might also be interested in other replicability audits of eminent social psychologists.
It is easy to say that science is self-correcting. The notion of a self-correcting science is based on the naive model of science as an objective process that incorporates new information and updates beliefs about the world depending on the available evidence. When new information suggests that old beliefs are false, the old beliefs are replaced by new beliefs.
It has been a while since I read Kuhn’s book on paradigm shifts, but I do remember that a main point of the book was that science doesn’t work this way for a number of reasons.
Thus, self-correction cannot be taken for granted. Rather, it is an attribute that needs to be demonstrated for a discipline to be an actual science. If psychological science wants to be a science, there should be empirical evidence that it is self-correcting.
One piece of evidence for self-correction is that theories that are in doubt receive fewer citations. Fortunately, modern software like the database WebofScience makes it very easy to count citations by year of publication.
In recent years, research on social priming has come under attack. Several replication studies failed to replicate key findings in this literature. In 2012, Nobel Laureate Daniel Kahneman wrote an open letter to John A. Bargh calling social priming “the poster child for doubts about the integrity of psychological research” (cf. Train Wreck blog post). I have demonstrated with statistical methods that many of the published results in this literature were obtained with questionable research practices that inflate the risk of false positive results (Before You Know It).
If science is self-correcting, we should see a decrease in citations of social priming articles.
John A. Bargh
The graph below shows the citations of John A. Bargh’s articles by year. 2019 is excluded because the year has just begun. Citations for 2018 are still being added, but at a very low rate, so the 2018 data can be interpreted.
The graph shows that John A. Bargh’s citation counts still increased after 2012, when Kahneman published the open letter. However, publishing is a slow process and many articles published in 2013 and 2014 had been written before 2012. Starting with 2015, we see a decrease in citations and this decrease continues to 2018. The decrease seems to be accelerating with a drop by 200 citations from 2017 to 2018.
In conclusion, there is some evidence of self-correction in psychology. However, Bargh may be an exception because an open letter by a Nobel Laureate is a rare and powerful impetus for self-correction.
Ap Dijksterhuis
Dijksterhuis is also known for work on unconscious processes and social priming. Importantly, a large replication study failed to replicate his professor-priming results in 2018 (Registered Replication Report).
The increase in citation counts stalled in 2011, even before the citation counts of John A. Bargh started to decrease. However, there was no clear decrease in the years from 2012 to 2017, while citation counts decreased by over 100 citations in 2018. Thus, there are some signs of self-correction here as well.
Fritz Strack
The work by Fritz Strack was also featured in Kahneman’s book. There have been two registered replication reports of work by Fritz Strack, and both failed to replicate the original results (facial feedback, item-order effects).
Strack’s citation counts increased dramatically after 2012. However, in 2018 they decreased by 150 counts. We need the 2019 data to see whether this is a blip or the beginning of a downward trend.
Susan T. Fiske
To make sure that the trends for social priming researchers are not just general trends, we need a control condition. I picked Susan T. Fiske because she is an eminent social psychologist, but her work is different from social priming experiments. Her work is also more replicable than the work of social priming researchers (social psychologists’ replicability rankings).
Fiske’s graph shows no decrease in 2018. Thus, the decreases seen for social priming researchers do not reflect a general trend in social psychology.
This blog post shows how citation counts can be used to examine whether psychological science is self-correcting, which is an essential feature of a science. There are some positive signs that the recent replication crisis in social psychology has triggered a process of self-correction. I suggest that further investigation of changes in citation counts is a fruitful area of research for meta-psychologists.
This blog post is based on a commentary that was published in the European Journal of Personality in 2012. Republishing it as a blog post makes it openly accessible.
The Utility of Network Analysis for Personality Psychology ULRICH SCHIMMACK and JUDITH GERE European Journal of Personality, 26: 446–447 (2012) DOI: 10.1002/per.1876
We note that network analysis provides some new opportunities but also has some limitations: (i) network analysis relies on observed measures such as single items or scale scores; (ii) it is a descriptive method and, as such, cannot test causal hypotheses; and (iii) it does not test the influence of outside forces on the network, such as dispositional influences on behaviour. We recommend structural equation modelling as a superior method that overcomes limitations of exploratory factor analysis and network analysis.
Cramer et al. (2012) introduce network analysis (NA) as a new statistical tool for the study of personality that addresses some limitations of exploratory factor analysis (EFA). We concur with the authors that NA provides valuable new opportunities but feel forced by the situational pressure of a 1000 word limit to focus on some potential limitations of NA.
We also compare NA to structural equation modelling (SEM) because we agree with the authors that SEM is currently the most powerful statistical method for the testing of competing (causal) theories of personality.
One limitation of EFA and NA is that these methods rely on observed measures to examine relationships between personality constructs. For example, Cramer et al. (2012) apply NA to correlations among ratings of single items. The authors recognize this limitation but do not present an alternative to this suboptimal approach.
A major advantage of SEM is that it allows researchers to create measurement models that can remove random and systematic measurement error from observed measures of personality constructs. Measurement models of multimethod data are particularly helpful to separate perception and rater biases from actual personality traits (e.g. Gere & Schimmack, 2011; Schimmack, 2010).
Our second concern is that NA is presented as a statistical tool that can test dynamic process models of personality. Yet, NA is a descriptive method that provides graphical representations of patterns in correlation matrices. Thus, NA is akin to other descriptive methods (e.g. multidimensional scaling, cluster analysis and principal component analysis) that reveal patterns in complex data. These descriptive methods make no assumptions about causality. In contrast, SEM forces researchers to make a priori assumptions about causal processes and provides information about the ability of a causal theory to explain the observed pattern of correlations. Thus, we recommend SEM for theory testing and do not think it is appropriate to use NA for this purpose.
Specifically, we think it is questionable to make inferences about the Big Five model based on network graphs. Cramer et al. (2012) highlight the ability to visualize the centrality of items in a network as a major strength of NA. However, factor loading patterns and communalities in EFA provide similar information. In our opinion, the authors go beyond the statistical method of NA when they propose that activation of central components will increase the chances that neighbouring components will also become more activated. This assumption is problematic for several reasons.
First, it is not clear what the authors mean by the notion of activation of personality components. Second, the connections in a network graph are not causal paths. An item could be central because it is influenced by many personality components (e.g. life satisfaction is influenced by neuroticism, extraversion, agreeableness and conscientiousness) or because it is the cause of neighbouring items (life satisfaction influences neuroticism, extraversion, agreeableness and conscientiousness). Researchers interested in testing causal relationships should collect data that are informative about causality (e.g. twin data) and use SEM to test whether the data favour one causal theory over another.
We are also concerned about the suggestion of Cramer et al. (2012) that NA provides an alternative account of classic personality constructs such as extraversion and neuroticism. It is important to make clear that this alternative view challenges the core assumption of many personality theories that behaviour is influenced by personality dispositions.
That is, whereas the conception of neuroticism as a personality trait assumes that neuroticism has causal force (Funder, 1991), the conceptualization of neuroticism as a personality component implies that it does not have causal force. The authors compare personality constructs such as neuroticism with the concept of a flock. The term flock in the expression a flock of birds does not refer to an independent entity that exists apart from the individual birds, and it makes no sense to attribute the gathering of birds to the causal effect of flocking (the birds are gathered in the same place because they are a flock of birds). We prefer to compare neuroticism with the causal force of seasonal changes that make individual birds flock together.
Since we published this commentary, network models have become even more popular for making claims about important constructs like depression. So far, we have only seen pretty pictures of item clusters, but no evidence that network models provide new insights into the causes of depression or dynamic developments over time. The reason is that the statistical tool is merely descriptive, whereas the articles talk a lot about things that go well beyond the empirical contribution of plotting correlations or partial correlations. In this regard, network articles remind me of the old days in personality psychology, when researchers told stories about their principal components. Instead, researchers interested in individual differences should learn how to use structural equation modeling to test causality and to study the stability and change of personality traits and states. Unfortunately, learning structural equation modeling is a bit more difficult than network analysis, which requires no theory and does not test model fit. Maybe that is the reason for the popularity of network models: easy to do, with pretty pictures. Who can resist?
One of the most famous experiments in psychology is Schachter and Singer’s experiment that was used to support the two-factor theory of emotions: emotion is sympathetic arousal plus a cognition about the cause of the arousal (see Dror, 2017, and Reisenzein, 2017, for historical reviews).
The classic article “Cognitive, social, and physiological determinants of emotional state” has been cited 2,799 times in WebofScience, and is a textbook classic.
Schachter and Wheeler (1962) summarize the “take-home message” of Schachter and Singer (1962).
“In their study of cognitive and physiological determinants of emotional states, Schachter and Singer (1962) have demonstrated that cognitive processes play a major role in the development of emotional states” (p. 121).
The “demonstration” was an experiment in which participants were injected with epinephrine to create a state of arousal, or with a placebo. This manipulation was crossed with a confederate who displayed either euphoric or angry behavior.
Schachter and Wheeler summarize the key findings.
In experimental situations designed to make subjects euphoric, those subjects who received injections of epinephrine were, on a variety of indices, somewhat more euphoric than subjects who received a placebo injection.
Similarly, in situations designed to make subjects angry and irritated, those who received epinephrine were somewhat angrier than subjects who received placebo.
[Note the discrepancy between the claim “play a major role” and “somewhat more”]
They proceed to make clear that this pattern, although expected, could also have been produced by chance alone.
In both sets of conditions, however, these differences between epinephrine and placebo subjects were significant, at best, at borderline levels of statistical significance.
[Note the discrepancy between “demonstrated” and “borderline significance”]
Schachter and Wheeler conducted another test of the two-factor theory. The study was essentially a conceptual replication and an extension of Schachter and Singer. The replication part of the study was that participants were again injected with a placebo or epinephrine. It is a conceptual replication because the target emotion was amusement, rather than anger or euphoria. Finally, the extension was a third condition in which participants were injected with Chlorpromazine; a sedative. This should suppress activation of sympathetic arousal and dampen amusement.
One dependent variable was observer ratings of amusement. As shown in Table 3, the means were in the predicted direction, but the difference between the placebo and epinephrine conditions was not significant.
Ratings of the film were additional dependent variables. Means are again in the same direction, but p-values are not reported and the text mentions that some differences were significant only at borderline levels. The pattern makes clear that this would be the case for the contrasts of the Chlorpromazine condition with the other conditions, but not for the epinephrine – placebo contrast.
Based on these underwhelming and non-significant results, the authors concluded
The overall pattern of experimental results of this study and the Schachter and Singer (1962) experiment gives consistent support to a general formulation of emotion as a function of a state of physiological arousal and of an appropriate cognition (p. 127).
This claim is false. The replication study actually confirmed that an epinephrine injection seems to have no statistically reliable influence on the intensity of emotions.
Dror (2017) made an interesting historical observation: Schachter was angry (presumably without an injection of epinephrine) that editors added “non-significant” to some of the results in the Schachter and Singer (1962) article.
“Since the paper has appeared students have tittered at me, my colleagues look down at their plates.” The most serious issue, among several, was that Tables 6–9 were totally misleading. The “notation ‘ns’ in the p column,” as Schachter explained, “is meaningless. Nothing was tested” (Schachter, S., 1962, Schachter to R. Solomon, May 3, 1962) (Dror, 2017).
Nothing was tested and nothing was proven, but a theory was born and it lives on in the imagination of hundreds of contemporary psychologists. The failure to provide evidence for it in Schachter and Wheeler was largely ignored. The article has been cited only 145 times compared to 2,799 for Schachter and Singer.
One reason for the impact of Schachter and Singer is that it was published in Psychological Review, while Schachter and Wheeler was published in the Journal of Abnormal and Social Psychology, which later became the Journal of Personality and Social Psychology.
Psychological Review is the journal where a select few psychologists can make sweeping claims with very little evidence, in the hope that other researchers will provide evidence for them. Given that psychology only publishes confirmatory evidence, every Psychological Review article is a self-fulfilling prophecy: every proposed theory will receive empirical support (even if only with marginal significance) and will live forever.
So, what are the take-home messages from this blog post?
The two-factor theory of emotions was never empirically supported.
Just because it was published in Psych Review, doesn’t mean it is true.
Psychology is not an evidence-based science, until it stops worshiping historically important articles as evidence for some eternal truth.
It is not bullying if the target of scientific criticism is deceased.
Zou, C., Schimmack, U., & Gere J. (2013). The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model. Psychological Assessment, 25(4), 1247–1254.
In the subjective indicators tradition, well-being is defined as a match between an individual’s actual life and his or her ideal life. Common well-being indicators are life-satisfaction judgments, domain satisfaction judgments, and measures of positive and negative affect (hedonic balance). These well-being indicators are routinely used to study well-being, but a formal measurement model of well-being is lacking. This article introduces a measurement model of well-being and examines the validity of self-ratings and informant ratings of well-being. Participants were 335 families (1 student with 2 parents, N = 1,005). The main findings were that (a) self-ratings and informant ratings are equally valid, (b) global life-satisfaction judgments and averaged domain satisfaction judgments are about equally valid, and (c) about 1/3 of the variance in a single indicator is valid. The main implication is that researchers should demonstrate convergent validity across multiple indicators by multiple raters.
Keywords: life satisfaction, affect, self-reports, informant-reports, multitrait–multimethod

Well-being is an important goal for many people; thus, social scientists from a variety of disciplines study well-being. A major problem for well-being scientists is that well-being is difficult to define and measure (Diener, Lucas, Schimmack, & Helliwell, 2009). These difficulties may threaten the validity of well-being measures. The aim of the present study is to examine the validity of the most commonly used measures of well-being.
A measure is valid if it measures what it is intended to measure. This definition of validity implies that it is important to define a construct (i.e., what is being measured?) before it is possible to evaluate the validity of a measure (Schimmack, 2010). Unfortunately, there is no agreement about the definition of the term well-being (Diener et al., 2009). It is therefore necessary to explain how we define the term well-being before we can examine the validity of well-being measures. We agree with philosophical arguments that well-being is a subjective concept (Diener, 1984; Sumner, 1996; see Diener, Suh, Lucas, & Smith, 1999, for a detailed discussion). A key criterion of a subjective definition of well-being is that the evaluation has to take the subjective values, motives, and ideals of individuals into account; that is, is his or her life going well for him or her? Accordingly, we define well-being as a match between an individual’s actual life and his or her ideal life. This definition is consistent with the prevalent definition of well-being in the social indicators tradition (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Veenhoven & Jonkers, 1984). This definition of well-being led to the creation of subjective well-being indicators such as life-satisfaction judgments (Diener, 1984). These measures are routinely used to make inferences about the determinants of well-being. These inferences implicitly assume that well-being measures are valid, but the literature on the validity of these measures is sparse and controversial (Schwarz & Strack, 1999; Schimmack & Oishi, 2005; Schneider & Schimmack, 2009). Since there is no gold standard to validate well-being measures, convergent validity between self-ratings and informant ratings of well-being has been used as the primary evidence for the validity of well-being measures (Diener et al., 2009).
However, a major limitation of previous studies is that they did not provide quantitative information about the amount of valid variance in different well-being measures (cf. Schneider & Schimmack, 2009). Our study addresses this problem and provides the first quantitative estimates of the amount of valid variance in the most widely used measures of well-being.
One problem in the estimation of effect sizes is that estimates based on small samples are imprecise because sampling error is substantial. To obtain data from a large sample, we used a round-robin design. In this design, participants are both targets and informants, thus increasing the number of targets. To ensure that informants have valid information about targets’ well-being, we used families as units of analysis. Specifically, we recruited university students and their biological parents (see Table 1).
A round-robin design creates two problems for a standard structural equation model. First, observations are not independent because participants are recruited as triads rather than as individuals. Second, the distinction between the three raters (student, mother, and father) does not provide information about the validity of self-ratings because self-ratings are a function of rater and target (i.e., the diagonal in Table 1).
To overcome these problems, we made use of advanced features in the structural equation modeling program Mplus 5.0 (Muthén & Muthén, 2007). First, we used the CLUSTER command to obtain adjusted standard errors and fit indices that take the interdependence among family members into account. Second, we rearranged the data to create variables with self-ratings (see Table 2). This creates missing data in the diagonal of the traditional round-robin design. To analyze these data with missing values, we used the TYPE = COMPLEX option of Mplus (Muthén & Muthén, 2007). Thus, our model included 16 (4 raters × 4 measures) observed variables.
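The rearrangement described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' Mplus setup; the function and the family's ratings are hypothetical and only show how one triad's round-robin ratings for a single measure map onto target-wise variables, with a self-rating column and informant-rating columns.

```python
# Illustrative sketch (not the authors' Mplus code): rearranging one family
# triad's round-robin ratings so that each target gets a self-rating variable
# plus informant-rating variables. Ratings are ratings[rater][target] for one
# measure (e.g., life satisfaction) on a 1-7 scale; data are hypothetical.

def rearrange_triad(ratings, members=("student", "mother", "father")):
    """Return one row per target: self-rating plus informant ratings."""
    rows = []
    for target in members:
        row = {"target": target, "self": ratings[target][target]}
        for rater in members:
            if rater != target:
                row[f"informant_{rater}"] = ratings[rater][target]
        rows.append(row)
    return rows

# One hypothetical family triad's ratings of each other (and themselves)
ratings = {
    "student": {"student": 5, "mother": 6, "father": 4},
    "mother":  {"student": 5, "mother": 6, "father": 5},
    "father":  {"student": 4, "mother": 6, "father": 4},
}
rows = rearrange_triad(ratings)
print(rows[0])  # the student as target: self-rating plus two informant ratings
```

In this layout the "self" column is what the paper treats as the diagonal of the round-robin matrix; across the full design each rater is missing as an informant for exactly one target (themselves), which is the missing-data pattern the model has to accommodate.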
A Measurement Model
Quantitative estimates of validity require a formal measurement model in which variation in well-being (the match between individuals’ actual and ideal lives) is an unobserved cause that produces variation in observed well-being measures (e.g., self-ratings of life satisfaction; cf. Schimmack, 2010). Our measurement model of well-being (see Figure 1) is similar to Diener et al.’s (1999) theoretical model of well-being. It is also related to the causal systems model of subjective well-being (Busseri & Sadava, 2011). In this model, positive affect and negative affect are distinct affective experiences. For most people, feeling good and not feeling bad is an important part of an ideal life, and the balance of positive versus negative affect serves as an important basis for life-satisfaction judgments (Schimmack, Radhakrishnan, Oishi, Dzokoto, & Ahadi, 2002; Suh, Diener, Oishi, & Triandis, 1998). Consistent with these assumptions, positive affect and negative affect are distinct components of hedonic balance (using a formative measurement model), and hedonic balance influences well-being. The formative measurement model of hedonic balance makes no assumptions about the correlation between its components. As prior research often reveals a moderate negative correlation between positive affect and negative affect, our model allows the two components to correlate with each other (Diener, Smith, & Fujita, 1995; Gere & Schimmack, 2011). The well-being factor is identified by two satisfaction measures: global life-satisfaction judgments and averaged domain satisfaction judgments. Prior studies often relied exclusively on global life-satisfaction judgments (Lucas, Diener, & Suh, 1996; Walker & Schimmack, 2008). The problem with this approach is that global life-satisfaction judgments can be influenced by focusing illusions (Kahneman, Krueger, Schkade, Schwarz, & Stone, 2006; but see Schimmack & Oishi, 2005).
Focusing illusions could produce systematic measurement error in global life-satisfaction judgments that could attenuate the influence of hedonic balance on well-being. To address this concern, our model included averaged domain satisfaction judgments as a second indicator of well-being. As averaged domain satisfaction judgments are not susceptible to focusing illusions, the focusing illusion hypothesis predicts that averaged domain satisfaction judgments have a higher loading on the well-being factor (i.e., are more valid) than global life-satisfaction judgments.
Model fit was assessed using standard criteria of acceptable model fit, such as a comparative fit index (CFI) > .95, root-mean-square error of approximation (RMSEA) < .06, and standardized root-mean-square residual (SRMR) < .08 (Schermelleh-Engel, Moosbrugger, & Muller, 2003). Due to the large sample size of the present data (N = 1,005), tests of model comparison using p-values will often lead to misleading results (cf. Raftery, 1995). Therefore, we used the Bayesian information criterion (BIC) for model comparisons. Models with lower BIC values are preferable because they are more parsimonious. This is especially important in new research areas because small effects are less likely to replicate. Following Raftery’s (1995) standards, a difference in BIC values greater than 10 can be interpreted as very strong evidence in favor of the model with the lower BIC value.
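The BIC decision rule can be made concrete with a minimal sketch. The function below is ours, not part of the paper; the two BIC values plugged in are the ones the paper reports in the Results section for the unconstrained model (31,102) and the equal-loadings model (30,993).

```python
# Minimal sketch of the Raftery (1995) cutoff used for model comparison:
# a BIC difference greater than 10 counts as very strong evidence for the
# model with the lower BIC. The function name and wording are illustrative.

def compare_bic(bic_a, bic_b):
    """Return the preferred model label, the BIC difference, and its strength."""
    diff = abs(bic_a - bic_b)
    preferred = "A" if bic_a < bic_b else "B"
    strength = "very strong" if diff > 10 else "weak to strong"
    return preferred, diff, strength

# Unconstrained model (A) vs. equal-loadings model (B), BICs from the Results
preferred, diff, strength = compare_bic(31102, 30993)
print(preferred, diff, strength)  # B preferred by 109 BIC points: very strong
```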
Participants were 335 students at the University of Toronto and their parents (335 triads; N = 1,005). Of the 335 students, 235 were women and 100 were men, and their ages ranged from 17 to 30 years (Mage = 19.56, SD = 2.23). The age of mothers ranged from 37 to 63 years (Mage = 48.25, SD = 5.08). The age of fathers ranged from 38 to 72 years (Mage = 51.67, SD = 5.67). Students were required to be living with both of their biological parents so that each member of the family had good knowledge of one another. Students from the university took part in the study for either $25 or course credit. Their parents each received $25 for participating in the study. Two hundred thirty-five students came to the laboratory with their parents to complete the study. One hundred students and their parents completed the study in their homes.
Participants who came into the laboratory filled out consent forms, and these participants were seated in separate rooms to ensure that reports were made independently. They filled out a series of questionnaires about themselves and about the other two members of their families. They were then debriefed and thanked for their participation. Students who took the questionnaires home met with a researcher who gave them detailed instructions and the questionnaire packages. Participants were asked to fill out the questionnaires in separate rooms and refrain from talking about their responses until all members of the family had completed the questionnaire. Each family member received an envelope, into which the family member placed his or her own completed questionnaire, and he or she sealed the envelope and signed it across the flap. Once the questionnaire packages were completed, participants returned the questionnaire packages, and they were debriefed and thanked for their participation.
Since well-being is defined as an evaluation of an individual’s actual life, the assessment of well-being has to be retrospective. For this reason, we asked participants to think about the past 6 months when answering the questions. Additionally, since global judgments of life satisfaction can be influenced by temporarily accessible information (Schimmack & Oishi, 2005; Schwarz & Strack, 1999), the global self-ratings of life satisfaction were assessed first.
Global life evaluation. For the global evaluative judgments, the first three items of the Satisfaction With Life Scale (SWLS; Diener, Emmons, Larsen, & Griffin, 1985) were used. The items ask participants to evaluate their lives on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). The first three items (“In most ways my life is close to my ideal”; “The conditions of my life are excellent”; “I am satisfied with my life”) were chosen because they have been shown to have better psychometric properties than the last two items of the scale (Oishi, 2006). Consistent with prior studies, the internal consistency of the three-item scale was good, αs > .80 (α = .83 for students; α = .89 for mothers; α = .89 for fathers). The items for the informant reports were virtually the same, but the wording was changed to an informant-report format (e.g., Kim et al., 2012). Informants were instructed to fill out the scale from the target’s perspective. For example, students serving as informants for their father would rate “In most ways my father thinks that his life is close to his ideal.” Ratings were made on 7-point Likert scales. The internal consistency of informant ratings was similar to the internal consistency of self-ratings (range: α = .85 to α = .93).
Averaged domain satisfaction. Domain satisfaction was assessed with single-item indicators for six important life domains, using satisfaction judgments (“I am satisfied with . . .”). The life domains were romantic life, work/academic life, health, recreational life, housing, and friendships. Responses were made on 7-point Likert scales ranging from 1 (strongly disagree) to 7 (strongly agree). The domains were chosen based on previous studies showing that these domains are rated as moderately to very important (Schimmack, Diener, & Oishi, 2002). We averaged these items to obtain an alternative measure of life evaluations. The informant version of the questionnaire changed the stem from “I am . . .” to “My son/daughter/mother/father is . . .” and “my” to “his” or “her.”
Positive and negative affect. Positive and negative affect were assessed using the Hedonic Balance Scale (Schimmack et al., 2002). The scale has three items for positive affect (pleasant, positive, good) and three items for negative affect (unpleasant, negative, bad). The items for positive and negative affect were averaged separately to create composites for positive and negative affect, respectively. All of the self-ratings for positive affect had a reliability of over .80 (α = .82 for students; α = .85 for mothers; α = .85 for fathers). Similarly, all of the self-ratings for negative affect had a reliability of .75 or higher (α = .80 for students; α = .75 for mothers; α = .78 for fathers). For the informant reports, “. . . how often do you experience the following feelings?” was replaced with “. . . how often does your mother/father/son/daughter experience the following feelings?” All of the informant reports had reliabilities of .75 or higher (range: α = .75 to α = .89).
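The reliabilities reported in this section are Cronbach's alphas. As a reminder of what these coefficients summarize, here is a minimal Python sketch of the standard alpha formula applied to hypothetical item scores; the data are invented for illustration and are not taken from the study.

```python
# Cronbach's alpha for a k-item scale (standard formula):
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).
# The item scores below are hypothetical, not the study's data.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores across persons."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_var_sum = sum(variance(it) for it in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Three hypothetical positive-affect items rated by five people (1-7 scale)
items = [[5, 6, 3, 7, 4],
         [5, 5, 4, 7, 4],
         [6, 6, 3, 6, 5]]
print(round(cronbach_alpha(items), 2))  # about .92, i.e., good consistency
```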
Table 3 shows the correlations among the 16 variables created by crossing the four indicators (life satisfaction, domain satisfaction, positive affect, and negative affect) with the four raters (self, student informant, mother informant, and father informant). Note that since the self cannot also serve as the informant for the self, correlations between self-reports and informant reports are based on 66% of all observations. The correlations between the self-report measures were based on 100% of the observations.
Correlations between the same construct assessed with different methods (i.e., convergent validity coefficients) are bolded. All of the convergent validity coefficients were significantly greater than zero and exceeded a minimum value of r = .25. Convergent validity correlations for affective indicators (positive affect and negative affect) were lower than correlations for the evaluative indicators (life satisfaction and domain satisfaction). These findings replicate the results of a meta-analysis (Schneider & Schimmack, 2009).
Table 3 can also be used to examine whether each indicator measures well-being in a slightly different manner. Twenty-two out of 24 cross-indicator, cross-rater correlations were weaker than the convergent validity coefficients, indicating that the different indicators have unique variance. This finding replicates Lucas et al.’s (1996) results. However, Table 3 also shows that all well-being measures are related to each other. This pattern of results is consistent with the assumption that all measures reflect a common construct.
Table 3 also shows stronger same-rater correlations than cross-rater correlations. This pattern is consistent with our assumption that ratings by a single rater are influenced by an evaluative bias (Anusic et al., 2009; Campbell & Fiske, 1959). Most important, Table 3 provides new information about informant–informant agreement. One notable pattern in the data is that the correlations between informant ratings by mothers and informant ratings by fathers were stronger than correlations of informant ratings by parents with those by students as informants. There are two possible explanations for this pattern. First, it is possible that students’ informant reports are less valid than parents’ informant ratings. However, this interpretation of the data is inconsistent with the finding that self-ratings were more highly correlated with students’ informant ratings than with parents’ informant ratings. Therefore, we favor the second explanation, that parents’ informant ratings share method variance. This interpretation is also consistent with other multirater studies that have demonstrated shared method variance between parents’ ratings of their children’s personality (Funder, Kolar, & Blackman, 1995).
Structural Equation Modeling
We fitted the measurement model in Figure 1 to our data. In the first model, we did not constrain coefficients. This model served as the base model for comparisons to more parsimonious models with constrained coefficients. The first model with unconstrained coefficients had acceptable fit to the data, χ²(df = 78) = 104.41, CFI = .995, RMSEA = .018, standardized root-mean-square residual (SRMR) = .026; BIC = 31,102. Ratings by different raters of the same measure (e.g., life satisfaction) showed very similar factor loadings. We therefore specified a model that constrained factor loadings and residuals for the four raters to be equal. This model implies that ratings by different raters are equally valid. The model with constrained parameters maintained good fit and had a lower (i.e., superior) BIC value, χ²(df = 102) = 148.18, CFI = .991, RMSEA = .021, SRMR = .041; BIC = 30,993. In the next model, we constrained the loadings on the rater-specific bias factors to be equal across raters. Again, model fit remained acceptable, and BIC decreased, indicating that rater bias is similar across raters, χ²(df = 117) = 188.48, CFI = .986, RMSEA = .025, SRMR = .068; BIC = 30,936. We retained this model as the final model. The parameter estimates of the final model and their 95% confidence intervals are listed in Table 4. For ease of interpretation, the main parameter estimates are also included in Figure 1.
The main finding was that the life-satisfaction factor and the average domain satisfaction factor had very high loadings on the well-being factor. Thus, our results provide no support for the hypothesis that focusing illusions undermine the validity of global life-satisfaction judgments. We also found a very strong effect of hedonic balance on the well-being factor. Yet, all three measures of well-being had significant residual variances, indicating that the measures are not redundant. Most important, about 20% of the variance in well-being was not accounted for by hedonic balance. This suggests that affective measures and evaluative judgments can show divergent patterns of correlations with predictor variables.
The factor loadings of the observed variables on the factor representing the shared variance among raters (e.g., self-ratings of life satisfaction [LS] on the LS factor) can be interpreted as validity coefficients for specific constructs (e.g., the validity of a self-rating of life satisfaction as a measure of life satisfaction; cf. Schimmack, 2010). The validity coefficients of the four types of indicators were very similar (see Table 4). The validity coefficients suggest that about one third (29% to 38%) of the variance in a single indicator by a single rater (e.g., self-ratings of life satisfaction) is valid variance.
It is important to keep in mind that these estimates examine the validity of a single rater with regard to a specific measure of well-being rather than the validity of these measures as measures of well-being. To examine the validity of specific measures as measures of the well-being factor in our measurement model, we need to estimate indirect effects of the well-being factor on specific measures. For example, self-ratings of life satisfaction load at .60 on the life satisfaction factor. However, this does not mean that self-ratings of life satisfaction capture 36% (.6 × .6) of valid variance of well-being, because life satisfaction is not a perfect indicator of well-being. Based on our model, the life satisfaction factor loads at .96 on the well-being factor. We also need to take this measurement error into account to examine the validity of self-ratings of life satisfaction in assessing well-being (.96 × .60 = .58, valid variance = 33%).
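The indirect-effect arithmetic in the preceding paragraph can be written out explicitly. This is a worked sketch using the loadings reported in the text; the variable names are ours.

```python
# Validity of a self-rating of life satisfaction as a measure of the
# well-being factor: the product of its loading on the life-satisfaction
# factor and that factor's loading on well-being; squaring gives the share
# of valid variance. Loadings are those reported in the text.

loading_on_ls_factor = 0.60   # self-rating -> life-satisfaction factor
ls_on_wellbeing = 0.96        # life-satisfaction factor -> well-being factor

indirect_validity = ls_on_wellbeing * loading_on_ls_factor
valid_variance = indirect_validity ** 2

print(round(indirect_validity, 2))  # 0.58
print(round(valid_variance * 100))  # 33 (% valid variance)
```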
Our study provides the first quantitative estimates of the validity of various well-being measures using a theoretically grounded model of well-being. Our main findings were that (a) about one third of the variance in a single well-being indicator is valid variance, (b) self-ratings are neither significantly more nor less valid than ratings by a single well-acquainted informant, (c) a large portion of the valid variance in a specific type of indicator is shared across indicators, and (d) hedonic balance and evaluative judgments have some unique variance.
We found no support for the focusing illusion hypothesis. If the distinction between hedonic balance and global life-satisfaction judgments were caused by a focusing illusion, the factor loading of life satisfaction on well-being should have been lower than the factor loading of the average domain satisfaction judgment. However, the actual results showed a slightly reversed pattern. This suggests that unique variance in evaluative judgments reflects valid well-being variance because individuals do not rely exclusively on hedonic balance to evaluate their lives. This finding provides empirical support for philosophical arguments against purely hedonistic definitions of well-being (Sumner, 1996). At the same time, the overlap between evaluative judgments and hedonic balance is substantial, indicating that positive experiences make an important contribution to well-being for most individuals. Another noteworthy finding was that global life-satisfaction judgments and averaged domain satisfaction judgments were approximately equally valid. This finding contradicts previous findings that averaged domain satisfaction judgments were more valid in a study with friends as informants (Schneider & Schimmack, 2010). Future research needs to examine whether the type of informant is a moderator. For example, it is possible that global life-satisfaction judgments are more difficult to make, which gives family members an advantage over friends. Subsequently, we discuss the main implications of our findings for the use of well-being measures in the assessment of individuals’ well-being and for the use of well-being measures in policy decisions.
Validity of Well-Being Indicators
Our results suggest that about one third of the variance in a single well-being indicator by a single rater is valid variance. This finding has important implications for the interpretation of studies that rely on a single well-being indicator as a measure of well-being. For example, many important findings about well-being are based on a single global life-satisfaction rating in the German Socio-Economic Panel (e.g., Lucas & Schimmack, 2009). It is well-known that observed effect sizes in these studies are attenuated by random measurement error and that it would be desirable to correct effect size estimates for unreliability (Schmidt & Hunter, 1996). However, systematic measurement error can further attenuate observed effect sizes. Schimmack (2010) proposed that quantitative estimates of validity could be used to disattenuate observed effect sizes for invalidity. To illustrate the implications of correcting for invalidity in well-being indicators, we use Kahneman et al.’s (2006) finding that household income was a moderate predictor of self-reported life satisfaction (r = .32). Our findings suggest that this observed relationship underestimates the relationship between household income and well-being. To disattenuate the observed relationship, the observed correlation has to be divided by the validity coefficient (i.e., .96 × .60 = .58). Thus, the corrected estimate of the true effect size would increase to r = .56 (.32/.58), which is considered a strong effect size (Cohen, 1992). Researchers may be reluctant to trust adjusted effect sizes because they rely on assumptions about validity. However, the common practice of relying on observed relationships as estimates of effect sizes also relies on an implicit assumption, namely, that the observed measure is perfectly valid. In comparison to an assumption of 100% valid variance in a single global life-satisfaction judgment, our estimate of about one-third valid variance is more realistic and supported by empirical evidence.
Nevertheless, our findings should only be treated as a first estimate and a benchmark for future studies. Future research needs to replicate our findings and examine moderating factors of validity in well-being measures.
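The disattenuation step used in the income example reduces to a one-line correction. This is a minimal sketch with our own function name; the numbers are those discussed in the text.

```python
# Disattenuation for invalidity: divide an observed correlation by the
# indicator's validity coefficient to estimate its correlation with the
# latent construct. Numbers follow the income example in the text.

def disattenuate(r_observed, validity):
    """Correct an observed correlation for invalidity in the measure."""
    return r_observed / validity

validity = 0.96 * 0.60            # validity of one life-satisfaction rating
r_corrected = disattenuate(0.32, validity)
print(round(r_corrected, 2))      # 0.56, a strong effect by Cohen's standards
```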
Self-Reports Versus Informant Reports
Schimmack (2009) noted that previous studies failed to compare the validity of self-ratings and informant ratings. Our results suggest that self-ratings and ratings by a single well-acquainted informant are approximately equally valid. While this is a surprising finding given the subjective nature of well-being, it is not uncommon in personality psychology to find evidence of equal or sometimes greater validity in informant ratings than self-ratings. For instance, informant reports of personality often provide better predictive validity than self-reports (e.g., Kolar, Funder, & Colvin, 1996). Since we did not have any outcome measure of well-being (e.g., suicide) in the present study, we could not test the predictive validity of self- and informant reports. However, this is an important avenue for future research. To our knowledge, no study has compared self-ratings and informant ratings using life events that are known to influence well-being, such as marriage, divorce, or unemployment (Diener, Lucas, & Scollon, 2006).
Informant ratings also have an important advantage over self-ratings. Namely, it is possible to obtain ratings from multiple informants, but there is only one self to provide self-ratings. Aggregation of informant ratings can substantially increase the validity of informant ratings. We computed well-being indicators for single raters and multiple raters using the following weights (Well-Being = 1.5 Life Satisfaction + 1.5 Domain Satisfaction + 2 Positive Affect − 1 Negative Affect) and computed the correlation with the well-being factor in Figure 1. The correlations were r = .62 for self-ratings, r = .77 for an aggregate of three informant ratings, and r = .81 for an aggregate of all four ratings. Although the difference between .62 and .77 may not seem impressive, it implies that aggregation across raters can increase the amount of valid variance from one third to two thirds of the observed variance. This finding suggests that clinicians can benefit considerably from obtaining well-being measures from multiple informants to assess individuals’ well-being.
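The weighted composite and the mechanics of aggregating across raters can be illustrated with a short sketch. The weights come from the text; the ratings below are hypothetical and serve only to show the computation, not to reproduce the study's results.

```python
# Weighted well-being composite (weights from the text) and aggregation
# across raters by averaging their composites. All ratings are hypothetical.

WEIGHTS = {"life_sat": 1.5, "domain_sat": 1.5,
           "pos_affect": 2.0, "neg_affect": -1.0}

def wb_composite(ratings):
    """Weighted well-being score from one rater's four indicator ratings."""
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

def aggregate(raters):
    """Average the weighted composite across several raters of one target."""
    return sum(wb_composite(r) for r in raters) / len(raters)

# Hypothetical ratings of one target by self and two informants (1-7 scales)
self_r = {"life_sat": 5, "domain_sat": 5, "pos_affect": 6, "neg_affect": 2}
mother = {"life_sat": 6, "domain_sat": 5, "pos_affect": 5, "neg_affect": 2}
father = {"life_sat": 5, "domain_sat": 6, "pos_affect": 5, "neg_affect": 3}

print(wb_composite(self_r))                 # single-rater composite: 25.0
print(aggregate([self_r, mother, father]))  # multi-rater aggregate
```

Averaging over raters cancels rater-specific error, which is why the paper finds the four-rater aggregate (r = .81 with the latent factor) to be substantially more valid than any single rater's composite.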
Our study has numerous limitations. The use of a convenience sample from a specific population means that the generalizability of our findings needs to be examined in samples drawn from other populations. However, our results are broadly consistent with meta-analytic findings (Schneider & Schimmack, 2009). Another limitation was that parents are not independent raters and appear to share rating biases. In the future, it would be desirable to obtain ratings from independent raters (e.g., friends and parents). Finally, our conclusions are limited by the assumptions of our model. While it is possible to fit other models to our data in Table 3 (e.g., Busseri & Sadava, 2011), the alternative models each have their own limitations. Future studies should test these alternative models to examine whether they reveal different or unique findings. We encourage readers to fit alternative models to the correlation matrix in Table 3 and examine whether these models provide better fit to our data. We consider our model merely a plausible first attempt to create a measurement model of well-being that can underpin empirical studies of well-being.
Although the study of happiness has been of great interest to many researchers and the general public, the validity of well-being measures has not improved over the past 50 years (Schneider & Schimmack, 2009). In order for well-being researchers to provide accurate information about the determinants of well-being, it is crucial to use valid methods to assess well-being. If invalid measures are used, findings that rely on such measures will also lack validity. In the current study, we found that only about one third of the variance in a single self-report measure of well-being is valid. In order to increase the validity of well-being measures, multiple methods of measuring well-being should be used. When better measures are used, researchers can also be more confident that their findings can be trusted.
References
Andrews, F. M., & Withey, S. B. (1976). Social indicators of well-being: America’s perception of life quality. New York, NY: Plenum.
Anusic, I., Schimmack, U., Pinkus, R. T., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97, 1142–1156. doi:10.1037/a0017159
Busseri, M. A., & Sadava, S. W. (2011). A review of the tripartite structure of subjective well-being: Implications for conceptualization, operationalization, analysis, and synthesis. Personality and Social Psychology Review, 15, 290–314. doi:10.1177/1088868310391271
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016
Cantril, H. (1965). The pattern of human concerns (Vol. 4). New Brunswick, NJ: Rutgers University Press.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75. doi:10.1207/s15327752jpa4901_13
Diener, E., Lucas, R. E., Schimmack, U., & Helliwell, J. F. (2009). Well-being for public policy. New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780195334074.001.0001
Diener, E., Lucas, R. E., & Scollon, C. N. (2006). Beyond the hedonic treadmill: Revising the adaptation theory of well-being. American Psychologist, 61, 305–314.
Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130–141.
Diener, E., Suh, E. M., Lucas, R. E., & Smith, H. L. (1999). Subjective well-being: Three decades of progress. Psychological Bulletin, 125, 276–302.
Funder, D. C., Kolar, D. C., & Blackman, M. C. (1995). Agreement among judges of personality: Interpersonal relations, similarity, and acquaintanceship. Journal of Personality and Social Psychology, 69, 656–672.
Gere, J., & Schimmack, U. (2011). A multi-occasion multi-rater model of affective dispositions and affective well-being. Journal of Happiness Studies, 12, 931–945. doi:10.1007/s10902-010-9237-3
Kahneman, D., Krueger, A. B., Schkade, D., Schwarz, N., & Stone, A. A. (2006). Would you be happier if you were richer? A focusing illusion. Science, 312, 1908–1910. doi:10.1126/science.1129688
Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations of well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102, 856–873. doi:10.1037/a0026803
Kolar, D. W., Funder, D. C., & Colvin, C. R. (1996). Comparing the accuracy of personality judgments by the self and knowledgeable others. Journal of Personality, 64, 311–337.
Lucas, R. E., Diener, E., & Suh, E. (1996). Discriminant validity of well-being measures. Journal of Personality and Social Psychology, 71, 616–628.
Lucas, R. E., & Schimmack, U. (2009). Income and well-being: How big is the gap between the rich and the poor? Journal of Research in Personality, 43, 75–78. doi:10.1016/j.jrp.2008.09.004
Muthén, L. K., & Muthén, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthén & Muthén.
Oishi, S. (2006). The concept of life satisfaction across cultures: An IRT analysis. Journal of Research in Personality, 40, 411–423. doi:10.1016/j.jrp.2005.02.002
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–164. doi:10.2307/271063
Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8, 23–74.
Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257.
Schimmack, U., Diener, E., & Oishi, S. (2002). Life-satisfaction is a momentary judgment and a stable personality characteristic: The use of chronically accessible and stable sources. Journal of Personality, 70, 345–384. doi:10.1111/1467-6494.05008
Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible information on life satisfaction judgments. Journal of Personality and Social Psychology, 89, 395–406.
Schimmack, U., Radhakrishnan, P., Oishi, S., Dzokoto, V., & Ahadi, S. (2002). Culture, personality, and subjective well-being: Integrating process models of life satisfaction. Journal of Personality and Social Psychology, 82, 582–593.
Schimmack, U., Schupp, J., & Wagner, G. G. (2008). The influence of environment and personality on the affective and cognitive component of subjective well-being. Social Indicators Research, 89, 41–60. doi:10.1007/s11205-007-9230-3
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223. doi:10.1037/1082-989X.1.2.199
Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376.
Schneider, L., & Schimmack, U. (2010). Examining sources of self-informant agreement in life-satisfaction judgments. Journal of Research in Personality, 44, 207–212.
Schwarz, N., & Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). New York, NY: Russell Sage Foundation.
Suh, E., Diener, E., Oishi, S., & Triandis, H. C. (1998). The shifting basis of life satisfaction judgments across cultures: Emotions versus norms. Journal of Personality and Social Psychology, 74, 482–493.
Sumner, L. W. (1996). Welfare, happiness, and ethics. New York, NY: Oxford University Press.
Veenhoven, R., & Jonkers, T. (1984). Conditions of happiness (Vol. 2).
Dordrecht, the Netherlands: Reidel.
Walker, S. S., & Schimmack, U. (2008).
Validity of a happiness implicit association test as a measure of subjective
well-being. Journal of Re- search in
Personality, 42, 490 – 497. doi:10.1016/j.jrp.2007.07.005
Most published psychological measures are unvalid*

*unvalid = the validity of the measure is unknown.
This blog post served as a first draft for a manuscript that is currently under review at Meta-Psychology. You can find the latest version here (pdf).
Eight years ago, psychologists started to realize that they have a replication crisis. Many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true or false.
The replication crisis is sometimes attributed to a lack of replication studies before 2011. However, replication studies were conducted, and most published results were replicated successfully. These successes were entirely predictable from the fact that only successful replications were published (Sterling, 1959). Such sham replication studies provided illusory evidence for theories that have been discredited over the past eight years by credible replication studies.
New initiatives, collectively known as open science, are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow.
This blog post addresses another problem in psychological science that I call the validation crisis. Replicability is only one necessary feature of a healthy science; another is the use of valid measures. This requirement is as obvious as the need for replicability. To test theories that relate theoretical constructs to each other (e.g., construct A influences construct B for individuals drawn from population P under conditions C), it is necessary to have valid measures of these constructs. However, it is unclear which criteria a measure has to fulfill to possess construct validity. Thus, even successful and replicable tests of a theory may be false because the measures that were used lacked construct validity.
The classic article on “Construct Validity” was written by two giants in psychology: Cronbach and Meehl (1955). Every graduate student of psychology, and surely every psychologist who has published a psychological measure, should be familiar with this article.
The article was the result of an APA task force that tried to establish criteria, now called psychometric properties, for tests to be published. The result of this project was the creation of the construct “construct validity.”
The chief innovation in the Committee’s report was the term construct validity. (p. 281).
Cronbach and Meehl provide their own definition of this construct.
Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” (p. 282).
In modern language, construct validity is the relationship between variation in observed test scores and a latent variable that reflects corresponding variation in a theoretical construct (Schimmack, 2010).
Thinking about construct validity in this way makes it immediately obvious why it is much easier to demonstrate predictive validity (the relationship between observed test scores and observed criterion scores) than to establish construct validity (the relationship between observed test scores and a latent, unobserved variable). To demonstrate predictive validity, one can simply obtain scores on a measure and a criterion and compute the correlation between the two variables. The correlation coefficient quantifies the predictive validity of the measure. However, because constructs are not observable, it is impossible to use simple correlations to examine construct validity.
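This asymmetry can be illustrated with a small simulation (all numbers are invented for illustration; the loading of .7 is an assumption, not an empirical estimate). Predictive validity is just a correlation between two observed columns; the test-construct correlation is only computable here because the simulation, unlike reality, exposes the construct.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved construct (e.g., intelligence) and a criterion it influences.
construct = rng.standard_normal(n)
criterion = 0.5 * construct + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)

# Observed test score: construct plus measurement error.
validity = 0.7  # true construct validity (loading), unknowable in practice
test = validity * construct + np.sqrt(1 - validity**2) * rng.standard_normal(n)

# Predictive validity is directly computable from observed data ...
r_pred = np.corrcoef(test, criterion)[0, 1]  # ~ 0.7 * 0.5 = 0.35

# ... but construct validity is not: the construct itself is never observed.
r_true = np.corrcoef(test, construct)[0, 1]  # ~ 0.70, known only in simulation
print(round(r_pred, 2), round(r_true, 2))
```

Note that the observed test-criterion correlation understates the construct-criterion relation by exactly the validity factor, which is why validity matters even for purely predictive uses.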
The problem of construct validation can be illustrated with the development of IQ scores. IQ scores can have predictive validity (e.g., for performance in graduate school) without making any claims about the construct that is being measured (IQ tests measure whatever they measure, and what they measure predicts important outcomes). However, IQ tests are often treated as measures of intelligence. For IQ tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed IQ scores are related to unobserved variation in intelligence. Thus, construct validation requires a clear definition of the construct that is independent of the measure being validated. Without such a definition, the meaning of a measure reverts to “whatever the measure is measuring,” as in the old saying “Intelligence is whatever IQ tests are measuring.” This saying illustrates the problem of research with measures that have no clear construct and no established construct validity.
In conclusion, the challenge in construct validation research is to relate a specific measure to a well-defined construct and to establish that variation in test scores is related to variation in the construct.
What are Constructs?
Construct validation starts with an assumption: individuals are assumed to have an attribute, or as we would say today, a personality trait. Personality traits are typically not directly observable (e.g., kindness, in contrast to height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). The first step is to develop a measure of this attribute (e.g., the self-report item “How kind are you?”). If the test is valid, variation in the observed scores on the measure should be related to variation in the personality trait.
A construct is some postulated attribute of people, assumed to be reflected in test performance (p. 283).
The term “reflected” is consistent with a latent variable model, in which unobserved traits are reflected in observable indicators. In fact, Cronbach and Meehl argue that factor analysis (not principal component analysis!) provides very important information for construct validity.
We depart from Anastasi at two points. She writes, “The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration.” We, however, regard such analysis as a most important type of validation. (p. 286).
Factor analysis is useful because factors are unobserved variables, and factor loadings show how strongly an observed measure is related to variation in an unobserved variable: the factor. If multiple measures of a construct are available, they should be positively correlated with each other, and factor analysis will extract a common factor. For example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (Schimmack, 2010).
In conclusion, factor analysis provides useful information about construct validity of measures because factors represent the construct and factor loadings show how strongly an observed measure is related to the construct.
It is clear that factors here function as constructs (p. 287).
The term convergent validity was introduced a few years later in another seminal article on validation research by Campbell and Fiske (1959). However, the basic idea of convergent validity was already specified by Cronbach and Meehl (1955) in the section “Correlation matrices and factor analysis.”
If two tests are presumed to measure the same construct, a correlation between them is predicted (p. 287).
If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated (p. 288)
Cronbach and Meehl realize the problem of using just two observed measures to examine convergent validity. For example, self-informant correlations are often used in personality psychology to demonstrate validity of self-ratings. However, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. The correlation could reflect very high validity of self-ratings and modest validity of informant ratings or the opposite could be true.
If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. (p. 300)
A multi-method approach avoids this problem and factor loadings on a common factor can be interpreted as validity coefficients. More valid measures should have higher loadings than less valid measures. Factor analysis requires a minimum of three observed variables, but more is better. Thus, construct validation requires a multi-method assessment.
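The logic of the multi-method approach can be sketched with a minimal simulation (the three rater sources and their loadings are invented for illustration). Under a single-factor model, the correlation between two measures equals the product of their loadings, so with three measures the pairwise correlations identify each measure’s validity, whereas with only two measures they do not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
trait = rng.standard_normal(n)  # unobserved trait, e.g., kindness

# Three independent measures (hypothetical sources) with different validities.
lam = {"self": 0.8, "friend": 0.6, "parent": 0.4}
obs = {k: v * trait + np.sqrt(1 - v**2) * rng.standard_normal(n)
       for k, v in lam.items()}

r = lambda a, b: np.corrcoef(obs[a], obs[b])[0, 1]

# Under one common factor, r_AB = lam_A * lam_B, so three measures identify
# each loading: lam_A = sqrt(r_AB * r_AC / r_BC).
lam_self = np.sqrt(r("self", "friend") * r("self", "parent")
                   / r("friend", "parent"))
print(round(lam_self, 2))  # recovers ~0.80
```

With only self and friend ratings, the single correlation of .48 could reflect loadings of .8 and .6, or .6 and .8, or even 1.0 and .48; the third measure is what breaks this indeterminacy.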
The term discriminant validity was also introduced later by Campbell and Fiske (1959). However, Cronbach and Meehl already point out that high or low correlations can support construct validity. Crucial for construct validity is that the correlations are consistent with theoretical expectations.
For example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. In contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem solving skills and intelligence is an ability to solve problems faster or better.
Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity (p. 288).
Quantifying Construct Validity
It is rare to see quantitative claims about construct validity. Most articles that claim construct validity of a measure simply state that the measure has demonstrated construct validity, as if a test were either valid or invalid. However, the previous discussion makes clear that construct validity is a quantitative construct: it is the relation between variation in a measure and variation in the construct, and this relation can vary in strength. If we use standardized coefficients like factor loadings to assess the construct validity of a measure, construct validity can range from -1 to 1.
Contrary to the current practices, Cronbach and Meehl assumed that most users of measures would be interested in a “construct validity coefficient.”
There is an understandable tendency to seek a “construct validity coefficient.” A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis (p. 289).
Cronbach and Meehl are well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because the factor may not correspond perfectly with the construct.
Rarely will it be possible to estimate definite “construct saturations,” because no factor corresponding closely to the construct will be available (p. 289).
And nobody today seems to remember Cronbach and Meehl’s (1955) warning that rejecting the null-hypothesis (that the test has zero validity) is not the end goal of validation research.
It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (p. 290)
The problem is not to conclude that the test “is valid” for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have (p. 290).
One reason why psychologists may not follow this sensible advice is that estimates of construct validity for many tests are likely to be low (Schimmack, 2010).
The Nomological Net – A Structural Equation Model
Some readers may be familiar with the term “nomological net” that was popularized by Cronbach and Meehl. In modern language a nomological net is essentially a structural equation model.
The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These “laws” may be statistical or deterministic.
It is probably no accident that at the same time as Cronbach and Meehl started to think about constructs as separate from observed measures, structural equation modeling was developed as a combination of factor analysis, which made it possible to relate observed variables to variation in unobserved constructs, and path analysis, which made it possible to relate variation in constructs to each other. Although laws in a nomological network can take on more complex forms than linear relationships, a structural equation model is a nomological network (but a nomological network is not necessarily a structural equation model).
As proper construct validation requires a multi-method approach and a demonstration of convergent and discriminant validity, SEM is ideally suited to examine whether the observed correlations among measures in a multi-trait-multi-method matrix are consistent with theoretical expectations. In this regard, SEM is superior to factor analysis. For example, it is possible to model shared method variance, which is impossible with factor analysis.
Cronbach and Meehl also realize that constructs can change as more information becomes available. It may also occur that the data fail to provide evidence for a construct. In this sense, construct validation is an ongoing process of improved understanding of unobserved constructs and of how they are related to observable measures.
Ideally this iterative process would start with a simple structural equation model that is fitted to some data. If the model does not fit, the model can be modified and tested with new data. Over time, the model would become more complex and more stable because core measures of constructs would establish the construct validity, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed.
When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network (p. 290).
Too often psychologists use SEM only to confirm an assumed nomological network and it is often considered inappropriate to change a nomological network to fit observed data. However, SEM is as much testing of an existing construct as exploration of a new construct.
The example from the natural sciences was the initial definition of gold as having a golden color. However, later it was discovered that the pure metal gold is actually silver or white and that the typical yellow color comes from copper impurities. In the same way, scientific constructs of intelligence can change depending on the data that are observed. For example, the original theory may assume that intelligence is a unidimensional construct (g), but empirical data could show that intelligence is multi-faceted with specific intelligences for specific domains.
However, given the lack of construct validation research in psychology, psychology has seen little progress in the understanding of basic constructs such as extraversion, self-esteem, or well-being. Often these constructs are still assessed with the measures that were originally proposed for them, as if divine intervention led to the creation of the best measure of these constructs and future research only confirmed their superiority.
Instead, many claims about construct validity rest more on conjecture than on empirical support by means of nomological networks. This was true in 1955. Unfortunately, it is still true more than 60 years later.
For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences (p. 291).
Given the difficulty of defining constructs and finding measures for them, even measures that showed promise in the beginning might fail to demonstrate construct validity later, and new measures should show higher construct validity than the early measures. However, psychology shows no development in measures of the same construct. The most widely used measure of self-esteem is still Rosenberg’s scale from 1965, and the most widely used measure of well-being is still Diener et al.’s scale from 1985. It is not clear how psychology can make progress if it does not make progress in the development of nomological networks that provide information about constructs and about the construct validity of measures.
Cronbach and Meehl are clear that nomological networks are needed to claim construct validity.
To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist (p. 291).
However, there are few attempts to examine construct validity with structural equation models (Connelly & Ones, 2010; Zou, Schimmack, & Gere, 2013). [please share more if you know some]
One possible reason is that construct validation research may reveal that authors initial constructs need to be modified or their measures have modest validity. For example, McCrae, Zonderman, Costa, Bond, and Paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of Big Five measures because it failed to support their conception of the Big Five as orthogonal dimensions with simple structure.
Recommendations for Users of Psychological Measures
The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test (p. 296).
It is very unlikely that all hunches by psychologists lead to the discovery of useful constructs and the development of valid tests of these constructs. Given our limited knowledge about the mind, it is more likely that many constructs will turn out to be non-existent and that many measures have low construct validity.
However, the history of psychological measurement has seen only the development of more and more constructs and more and more measures to assess this increasing universe of constructs. Since the 1990s, the number of constructs has doubled because every construct has been split into an explicit and an implicit version. Presumably, there is even an implicit political orientation or an implicit gender identity.
The proliferation of constructs and measures is not a sign of a healthy science. Rather, it shows the inability of empirical studies to demonstrate that a measure is not valid or that a construct may not exist. This is mostly due to self-serving biases and motivated reasoning of test developers. The gains from a widely used measure are immense. Thus, weak evidence is used to claim that a measure is valid, and consumers are complicit because they can use these measures to make new discoveries. Even when evidence shows that a measure may not work as intended (e.g., Bosson et al., 2000), it is often ignored (Greenwald & Farnham, 2000).
Just like psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures. Some of the measures that are currently used in psychology are likely to have insufficient construct validity. If the 2010s were the decade of replication, the 2020s may become the decade of validation, and maybe the 2030s will produce the first replicable studies with valid measures. Maybe this is overly optimistic, given the lack of improvement in validation research since Cronbach and Meehl (1955) outlined a program of construct validation research. Ample citations show that they were successful in introducing the term, but they failed to establish rigorous criteria of construct validity. The time to change this is now.
The Implicit Association Test (IAT) is 21 years old.
Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that it is insufficient. Most important, I show that few studies were able to test the discriminant validity of the IAT as a measure of implicit personality characteristics, and that a single-construct model fits multi-method data as well as or better than a dual-construct model. Thus, the IAT appears to be a measure of the same personality characteristics that are measured with explicit measures. I also show that the validity of the IAT varies across personality characteristics: it has low validity as a measure of self-esteem, moderate validity as a measure of racial bias, and high validity as a measure of political orientation. The existing evidence also suggests that the IAT measures stable characteristics rather than states and has low predictive validity for single behaviors. Based on these findings, it is important that users of the IAT clearly distinguish between implicit measures and implicit constructs. The IAT is an implicit measure, but there is no evidence that it measures implicit constructs.
The Implicit Association Test at Age 21: No Evidence for Construct Validity
Twenty-one years ago, Greenwald, McGhee, and Schwartz (1998) published one of the most influential articles in personality and social psychology. It is already the 4th most cited article (4,582 citations in Web of Science) in the Journal of Personality and Social Psychology and will be number 3 this year. As the title “Measuring Individual Differences in Implicit Cognition” suggests, the article introduced a new individual difference measure that has been used in hundreds of studies to measure attitudes, stereotypes, self-concepts, well-being, and personality traits. Henceforth, I will refer to these constructs as personality characteristics.
A Critical Evaluation of Greenwald et al.’s (1998) Evidence for Discriminant Validity
The Implicit Association Test (IAT) uses reaction times in classification tasks to measure individual differences in the strength of associations (Nosek et al., 2007). However, the main purpose of the IAT is not to measure associations or to provide an indirect measure of personality characteristics. The key constructs that the IAT was designed to measure are individual differences in implicit personality characteristics, as suggested in the title of Greenwald et al.’s (1998) seminal article “Measuring Individual Differences in Implicit Cognition.”
The notion of implicit cognition is based on a conception of human information processing that largely takes place outside of consciousness, and the IAT was supposed to provide a window into the unconscious. “There has been an increased interest in measuring aspects of thinking and feeling that may not be easily accessed or available to consciousness. Innovations in measurement have been undertaken with the purpose of bringing under scrutiny new forms of cognition and emotion that were previously undiscovered” (Nosek, Greenwald, & Banaji, 2007, p. 265).
Thus, the IAT was not just a new way of measuring the same individual differences that were already measured with self-report measures. It was designed to measure information that is “simply unreachable, in the same way that memories are sometimes unreachable [by introspection]” (Nosek et al., 2007, p. 266).
The promise to measure individual differences that are not accessible to introspection explains the appeal of the IAT, and many articles used the IAT to make claims about individual differences in implicit forms of self-esteem, prejudice, or craving for drugs. Thus, the hypothesis that the IAT measures something different from self-report measures is a fundamental feature of the construct validity of the IAT. In psychometrics, the science of test validation, this property of a measure is known as discriminant validity (Campbell & Fiske, 1959). If the IAT is a measure of implicit individual differences that are different from explicit individual differences, the IAT should demonstrate discriminant validity from self-report measures. Given the popularity of the IAT, one might expect ample evidence for its discriminant validity. However, due to methodological limitations, this is actually not the case.
Confusion about Convergent and Discriminant Validity
Greenwald et al.’s seminal article promised a measure of individual differences, but failed to provide evidence for the convergent or discriminant validity of the IAT. Study 1 with N = 32 participants showed that, on average, participants preferred flowers to insects and musical instruments to weapons. These average tendencies cannot be used to validate the IAT as a measure of individual differences. However, Greenwald et al. (1998) also reported correlations across the N = 32 participants between the IAT and explicit measures. These correlations were low. Greenwald et al. (1998) suggest that this finding provides evidence of discriminant validity: “This conceptual divergence between the implicit and explicit measures is of course expected from theorization about implicit social cognition” (p. 1470). However, these low correlations are uninformative because discriminant validity requires a multi-method approach. As the IAT was the only implicit measure, low correlations with explicit measures may simply show that the IAT has low validity as a measure of individual differences.
Experiment 2 used the IAT with 17 Korean and 15 Japanese American students to assess their attitudes towards Koreans versus Japanese. In this study, Greenwald et al. found “unexpectedly the feeling thermometer explicit rating was more highly correlated with the IAT measure (average r = .59) than it was with another explicit attitude measure, the semantic differential (r = .43)” (p. 1473). This finding actually contradicts the hypothesis that the IAT measures some construct that is not measured with self-ratings, because discriminant validity implies higher same-method than cross-method correlations (Campbell & Fiske, 1959).
Study 3 introduced the race IAT to measure prejudice with a sample of 26 participants. In this small sample, IAT scores were only weakly and not significantly correlated with explicit measures. The authors realize that this finding is open to multiple interpretations: “Although these correlations provide no evidence for convergent validity of the IAT, nevertheless because of the expectation that implicit and explicit measures of attitude are not necessarily correlated-neither do they damage the case for construct validity of the IAT” (p. 1476). In other words, the low correlations might reflect discriminant validity, but they could also show low convergent validity if the IAT and explicit measures measure the same construct.
The discussion has a section on “Discriminant Validity of IAT Attitude Measures,” although the design of the studies makes it impossible to provide evidence for discriminant validity. Nevertheless, Greenwald et al. (1998) claimed that they provided evidence for the discriminant validity of the IAT as a measure of implicit cognitions: “It is clear that these implicit-explicit correlations should be taken not as evidence for convergence among different methods of measuring attitudes but as evidence for divergence of the constructs represented by implicit versus explicit attitude measures” (p. 1477). The scientific interpretation of these correlations is that they provide no empirical evidence about the validity of the IAT, because multiple measures of a single construct are needed to examine construct validity (Campbell & Fiske, 1959). Thus, unlike most articles that introduce a new measure of individual differences, Greenwald et al. (1998) did not examine the psychometric properties of the IAT. In this article, I examine whether evidence gathered over the past 21 years has provided evidence of construct validity of the IAT as a measure of implicit personality characteristics.
First Problems for the Construct Validity of the IAT
The IAT was not the first implicit measure in social psychology. Several different measures had already been developed to measure self-esteem with implicit measures. A team of personality psychologists conducted the first multi-method validation study of the IAT as a measure of implicit self-esteem (Bosson, Swann, & Pennebaker, 2000). The main finding of this study was that several implicit measures, including the IAT, had low convergent validity. However, this finding has been largely ignored, and researchers started using the self-esteem IAT as a measure of some implicit form of self-esteem that operates outside of conscious awareness (Greenwald & Farnham, 2000).
At the same time, attitude researchers also found weak
correlations between the race IAT and other implicit measures of prejudice.
However, this lack of convergent validity was also ignored. An influential review article by Fazio and
Olson (2003) suggested that low correlations might be due to different
mechanisms. While it is entirely possible that evaluative priming and the IAT
have different mechanisms, it is not relevant for the ability of either measure
to be a valid measure of personality characteristics. Explicit ratings probably
also rely on a different mechanism than the IAT.
The mechanics of measurement have to be separated from the constructs
that the measures aim to measure.
Continued Confusion about Discriminant Validity
Nosek et al. (2007) examined evidence for the construct
validity of the IAT at age 7, that is, seven years after its introduction. The
section on convergent and discriminant validity lists a few studies as evidence
for discriminant validity. However, closer
inspection of these studies show that they suffer from the same methodological
limitation as Greenwald et al.’s (1998) seminal study. That is, constructs were assessed with a
single implicit method; the IAT. Thus,
it was impossible to examine construct validity of the IAT as a measure of
implicit personality characteristics.
Take Nosek and Smyth’s (2007) “A Multi-trait-multi-method validation
of the Implicit Association Test” as an example. The title clearly alludes to
Campbell and Fiske’s approach to construct validation. The data were 7 explicit ratings and 7 IATs
of 7 attitude pairs (e.g., flower vs. insect). The authors fitted several structural equation
models to the data and claimed that a model with separate, yet correlated, explicit
and implicit factors fitted the data better than a model with a single factor
for each attitude pair. This claim is invalid
because each attitude pair was assessed with a single IAT and parcels were used
to correct for unreliability. This
measurement model assumes that all of the reliable variance in an IAT that is
not shared with explicit ratings or with IATs of other attitudes reflects
implicit individual differences. However, it is also possible that this
variance reflects systematic measurement error that is unique to a specific
IAT. A proper multi-method approach
requires multiple independent measures of the same construct. As
demonstrated with real multi-method data below, there is consistent evidence
that the IAT has systematic method variance that is unique to a specific IAT.
Nevertheless, Nosek and Smyth’s (2007) multi-attitude study
provided some interesting information. The correlation of the 7 means of the
IAT and the 7 means of the explicit ratings was r = .86. For example, implicit
and explicit measures showed a preference for flowers over insects and a
dislike of evolution versus creation. If
implicit measures reflect distinct, unconscious processes, it is not clear why
the means correspond to those based on self-reports. However, this finding is
easily explained by a single-attitude model, where the mean structure depends
on the mean structure of the latent attitude variable.
In sum, Nosek et al.’s claim that the IAT has demonstrated discriminant
validity is based on a misunderstanding of Campbell and Fiske’s (1959) approach
to construct validation. A proper assessment of construct validity requires demonstration
of convergent validity before it is possible to demonstrate discriminant
validity, and to demonstrate convergent validity it is necessary to use
multiple independent measures of the same construct. Thus, to demonstrate construct validity of
the IAT as a measure of implicit personality characteristics requires multiple
independent implicit measures.
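The logic behind this requirement can be illustrated with a minimal simulation. Under a single-factor model, the expected correlation between two measures of the same construct equals the product of their validities (factor loadings), so convergence between independent methods is informative about validity. The loadings .7 and .5 below are illustrative values, not estimates from any study.

```python
import numpy as np

# Minimal sketch of convergent validity under a single-factor model:
# the expected correlation between two standardized measures of one
# construct is the product of their factor loadings (validities).
rng = np.random.default_rng(0)
n = 200_000
trait = rng.standard_normal(n)                       # latent construct

l1, l2 = 0.7, 0.5                                    # illustrative loadings
m1 = l1 * trait + np.sqrt(1 - l1**2) * rng.standard_normal(n)
m2 = l2 * trait + np.sqrt(1 - l2**2) * rng.standard_normal(n)

observed_r = np.corrcoef(m1, m2)[0, 1]
print(round(observed_r, 2), l1 * l2)                 # both close to .35
```

A low observed correlation therefore implies that at least one measure has low validity, which is why convergent validity must be established before discriminant validity can be interpreted.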
First Evidence of Discriminant Validity in a Multi-Method Study
Cunningham, Preacher, and Banaji (2001) reported the results
of the first multi-method study of prejudice. Participants were 93 students
with complete data. Each student completed a single explicit measure of
prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit
measures: (a) the standard race IAT (Greenwald et al., 1998), a response window
IAT (Cunningham et al., 2001), and a response-window evaluative priming task (Fazio
et al., 1986). The assessment was repeated on four occasions two weeks apart.
I used the published correlation matrix to reexamine the
claim that a single-factor model does not fit the data. First, I was able to reproduce
the model fit of the published dual-attitude model with MPLUS8.2 (original fit:
chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041 (90%
confidence interval: 0.00, 0.071); reproduced fit: chi2(100, N = 93) = 112, CFI
= .977, RMSEA = 0.036, 90%CI = .000 to .067).
Thus, the model fit of the reproduced model serves as a comparison
standard for the alternative models that I examined next.
The original model is a hierarchical model with an implicit
attitude factor as a second-order factor, and method-specific first-order
factors. Each first-order factor has four indicators for four repeated
measurements with the same method. This model
imposes constraints on the first-order loadings because they contribute both to the
first-order relations among indicators of the same method and to the second-order
relations of different implicit methods to each other.
An alternative way to model multi-method data is with bifactor
models (Chen, West, & Sousa, 2006). A bifactor model allows all measures
to be directly related to the general trait factor that corresponds to the second-order
factor in a hierarchical model. However,
bifactor models may not be identified if there are no method factors. Thus, a
first step is to allow for method-specific correlated residuals and to examine whether
these correlations are positive.
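The structure of such a model can be sketched numerically: the implied correlation matrix of a single-factor model with method-specific residual correlations is Sigma = lambda * lambda' + Theta, where Theta holds the residual variances and their correlations. All values below are made up for illustration; they are not estimates from Cunningham et al.'s data.

```python
import numpy as np

# Implied correlation matrix of a single-attitude model: four repeated
# measurements with one method share a trait factor plus correlated
# residuals (method-specific variance). Values are illustrative only.
loadings = np.array([0.45, 0.40, 0.50, 0.42])        # trait loadings
resid_sd = np.sqrt(1 - loadings**2)                  # standardized residuals

method_r = 0.2                                       # residual correlation
R_resid = np.full((4, 4), method_r) + (1 - method_r) * np.eye(4)
Theta = np.outer(resid_sd, resid_sd) * R_resid       # residual covariance

Sigma = np.outer(loadings, loadings) + Theta
print(np.round(Sigma, 2))                            # diagonal is exactly 1
```

If the method-specific residual correlations are positive, a method factor can replace them; if they are near zero, as for some measures below, the method factor is unnecessary.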
The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065. Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures. The response window IAT had no significant residual correlations. This explains the high factor loading of the response window IAT in the hierarchical model. It does not suggest that this is the most valid measure; rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062.
I then tried to create method factors for the remaining methods. For the IAT, a method factor could only be created for the first three occasions. However, model fit for this model decreased unless occasion 2 was allowed to correlate with occasion 4. This unexpected finding is unlikely to reflect a real relationship. Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064. I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065, and the first occasion did not load on the method factor. Model fit could be improved by fixing the loading to zero and by allowing for an additional correlation between occasions 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062. However, there is no rationale for this relationship and I retained the more parsimonious model. Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068.
This was the final model (Figure 1).
The most important results are the factor loadings of the
measures on the trait factor. Factor loadings for the Modern racism scale
ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged
from .43 to .54 (M = .47). Factor loadings for the response window IAT ranged
from .41 to .69 (M = .51). The
evaluative priming measures had the lowest factor loadings ranging from .13 to
.47 (M = .29). Thus, there is no
evidence that implicit measures are more strongly related to each other than to
explicit measures, as stated in the original article.
In terms of absolute validity, all of these validity
coefficients are low, suggesting that a single standard IAT measure on a single
occasion has .47^2 = 22% valid variance.
Most important, these results suggest that the Modern Racism Scale and
the IAT measure a single construct and that the low correlation between
implicit and explicit measures reflects low convergent validity rather than high
discriminant validity.
In conclusion, a reexamination of Cunningham et al.’s data
shows that the data do not provide evidence of discriminant validity and that
the IAT may simply be an alternative measure of the same construct that is
being measured with explicit measures like the Modern Racism Scale. Thus, the
study provides no evidence for the construct validity of the IAT as a measure
of implicit individual differences in race attitudes.
Meta-Analysis of Implicit – Explicit Correlations
Hofmann, Gawronski, Geschwendner, and Le (2005) conducted a
meta-analysis of 126 studies that had reported correlations between an IAT and
an explicit measure of the same construct. Notably, over one hundred studies had
been conducted without using multiple implicit measures. The mono-method
approach taken in these studies suggests that authors took construct validity
of the IAT for granted, and used the IAT as a measure of implicit constructs. As a result, these studies provide no test of
the construct validity of the IAT.
Nevertheless, the meta-analysis produced an interesting
result. Correlations between implicit
and explicit measures varied across personality characteristics. Correlations were lowest for self-esteem,
which is consistent with Bosson et al.’s (2000) finding, and highest for simple
attitude objects like consumer products (e.g. Pepsi vs. Coke). Any theory of implicit attitude measures has
to explain this finding. One explanation
could be that explicit measures of self-esteem are less valid than explicit-measures
of preferences for consumer goods. However, it is also possible that the validity
of the IAT varies. Once more, a
comparison of different personality characteristics with multiple methods is
needed to test these competing theories.
Problems with Predictive Validity
Ten years after the IAT was published another problem
emerged. Some critics voiced concerns
that the IAT, especially the race IAT, lacks predictive validity (Blanton,
Jaccard, Klick, Mellers, Mitchell, & Tetlock, 2009). To examine the predictive validity of the
IAT, Greenwald and colleagues (2009) published a meta-analysis of IAT-criterion
correlations. The key finding was that “for 32 samples with criterion measures
involving Black–White interracial behavior, predictive validity of IAT measures
significantly exceeded that of self-report measures” (p. 17). Specifically, the authors reported a
correlation of r = .24 between the IAT and criterion measures and a correlation of
r = .12 between explicit measures and criterion measures, and found that these
correlations were significantly different from each other.
A few years later, Oswald, Mitchell, Blanton, Jaccard, and Tetlock
(2013) published a critical reexamination of the literature and reported
different results. “IATs were poor predictors of every criterion category other
than brain activity, and the IATs performed no better than simple explicit
measures” (p. 171). The only exception
were fMRI studies with extremely small samples that produced extremely large
correlations, often exceeding the reliability of the IAT. It is well known that these correlations are
inflated and difficult to replicate (Vul, Harris, Winkielman, & Pashler,
2009). Moreover, correlations with
neural activity are not evidence that IAT scores predict behavior.
More recently, Greenwald and colleagues published a new
meta-analysis (Kurdi et al., 2018). This meta-analysis produced weaker
criterion correlations than the previous meta-analysis. The median IAT-criterion correlation was r = .050. This is also true if the analysis is limited
to studies with the race IAT. After correcting
for random measurement error, the authors report an average correlation of r = .14.
However, correction for unreliability yields hypothetical correlations
that could be obtained if the IAT were perfectly reliable, which it is not. Thus,
for the practical evaluation of the IAT as a measure of individual differences,
it is more important how much the actual IAT scores can predict some validation
criterion. With small IAT-criterion
correlations around r = .1, large
samples would be required to have sufficient power to detect effects,
especially incremental effects above and beyond explicit measures. Given that
most studies had sample sizes of less than 100 participants, “most studies were
vastly underpowered” (Kurdi et al., 2018, p. 1). Thus, it is now clear that IAT
scores have low predictive validity, but it is not clear whether IAT scores
have any predictive validity at all, under which conditions they have predictive
validity, and whether they have incremental predictive validity after
controlling for explicit predictors of the same criteria.
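The power problem can be quantified with a rough calculation. Assuming the standard Fisher z approximation for testing a correlation (used here only for illustration), a typical study with N = 100 has well under 20% power to detect a true correlation of r = .10:

```python
import math
from statistics import NormalDist

# Approximate power of a two-sided correlation test via Fisher's z.
def power_r(r, n, alpha=0.05):
    nd = NormalDist()
    z_stat = math.atanh(r) * math.sqrt(n - 3)        # expected test statistic
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_crit - z_stat)               # upper-tail power only

print(round(power_r(0.10, 100), 2))                  # roughly .17
```

With power this low, most published significant IAT-criterion correlations in small samples are likely to be inflated by selection for significance.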
Greenwald et al.’s (2009) 2008 US Election Study
In 2008, a historic event occurred in the United States. US
voters had the opportunity to elect the first Black president. Although the
outcome is now a historic fact, it was uncertain before the election how much
Barack Obama’s racial background would influence White voters. There was also considerable concern that
voters might not reveal their true feelings. This provided a great opportunity
to test the validity of implicit measures of racial bias. If White voters are influenced by racial
bias, IAT scores should predict voting intentions above and beyond explicit
measures. According to the abstract of the article, the results confirm this
prediction. “The implicit race attitude measures (Implicit Association Test and
Affect Misattribution Procedure) predicted vote choice independently of the
self-report race attitude measures, and also independently of political
conservatism and symbolic racism. These findings support construct validity of
the implicit measures” (p. 242).
These claims were based on results of multiple regression
analyses. “When entered after the self-report measures, the two implicit
measures incrementally explained 2.1% of vote intention variance, p=.001, and
when political conservatism was also included in the model, “the pair of
implicit measures incrementally predicted only 0.6% of voting intention
variance, p = .05.” (p. 247).
I tried to reproduce these results with the published
correlation matrix and failed to do so.
A multiple regression analysis with explicit measures, implicit
measures, and political orientation as predictors showed non-significant
effects for the IAT, b = .002, se = .024, t = .087, p = .930 and the AMP, b =
.033, se = .023, t = 1.470, p = .142. I also obtained the raw data from Anthony
Greenwald, but I was unable to recreate the sample size of N = 1,057. Instead I
obtained a similar sample size of N = 1,035. Performing the analysis on this sample also
produced non-significant results; IAT, b = -.003, se = .044, t = .070, p = .944
and the AMP, b = -.014, se = .042, t = 0.344, p = .731.
To fully explore the relationship among the variables in this valuable dataset, I fitted a structural equation model to the raw data (N = 1,035). The model had good fit, chi2(9) = 18.27, CFI = .995, RMSEA = .032 90%CI(.009-.052). As shown in Figure 2, the IAT did not have incremental predictive validity because its residual variance was unrelated to voting. There is also no evidence of discriminant validity because the residuals of the two implicit measures are not correlated. However, the model does show that a pro-White bias predicts voting above and beyond political orientation. Thus, the results do support the hypothesis that racial bias influenced voting in the 2008 election, and this bias is reflected in explicit and implicit measures alike. Interestingly, the validity coefficients in this study differ from those in Cunningham et al.’s study with undergraduate students. The factor loadings suggest that the IAT is the most valid measure of racial bias in this sample, with .59^2 = 36% valid variance. This makes the IAT as valid as the feeling thermometer, which is more valid than the Modern Racism Scale was in Cunningham’s study. This finding has been replicated in subsequent studies (Axt, 2018).
In conclusion, a reexamination of the 2008 election study shows
that the data are entirely consistent with a single-attitude model and that
there is no evidence for incremental predictive validity or discriminant
validity in these data. However, the study does show some predictive validity
of the IAT and convergent validity with explicit measures. Thus, the results
provide no construct validity of the IAT as a measure of implicit individual differences,
but the results can also be interpreted as evidence for validity as a measure
of the same construct that is measured with explicit measures. This shows that claims about validity vary as
a function of the construct that is being measured. A scale is a good measure of weight, but not
of intelligence. The results here
suggest that the race IAT is a moderately valid measure of racial bias, but an
invalid measure of implicit bias, a construct that may not even exist, because
scientific claims about implicit bias require valid measures of implicit bias.
Reexamining a Multi-Trait Multi-Method Study
The most recent and extensive multi-trait multi-method
validation study of the IAT was published last year (Bar-Anan & Vianello,
2018). The abstract claims that the
results provide clear support for the validity of the IAT as a measure of
implicit cognitions, including implicit self-esteem. “The evidence supports the
dual-attitude perspective, bolsters the validation of 6 indirect measures, and
clears doubts from countless previous studies that used only one indirect
measure to draw conclusions about implicit attitudes” (p. 1264).
Below I show that these claims are not supported by the
data, and that single-attitude models fit the data as well as dual-attitude
models. I also show that dual-attitude models show low convergent validity
across implicit measures, while IAT variants share method variance because they
rely on the same mechanisms to measure attitudes.
Bar-Anan and Vianello (2018) fitted a single model to measures
of self-esteem, racial bias, and political orientation. This makes the model
extremely complex and produced some questionable results (e.g., the implicit
and explicit method factors were highly correlated; some measures had negative
loadings on the method factors). In
structural equation modeling, it is good practice to fit smaller models before
creating a larger model. Thus, I first examined
construct validity for each domain separately before fitting a model that integrates
the domain-specific models into a single unified model.
I first fitted a dual-attitude model to measures of racial attitudes and included contact as the criterion variable. I did not specify a causal relationship between contact and attitudes because attitudes can influence contact and vice versa. The dual-attitude model had good fit, chi2(48) = 109.41; CFI = .975; RMSEA = 0.010 (90% confidence interval: 0.007, 0.012). The best indicator of the explicit factor was the preference rating (Figure 3). The best indicator of the implicit factor was the BIAT. However, all IAT-variants had moderate to high loadings on the implicit factor. In contrast, the evaluative priming measure had a low loading on the implicit factor and the AMP had a moderate loading on the explicit factor and no significant loading on the implicit factor. These results show that Bar-Anan and Vianello’s model failed to distinguish between IAT-specific method variance and method variance for implicit measures in general. The present results show that IAT-variants share little valid variance or method variance with conceptually distinct implicit measures.
Not surprisingly, a single-attitude model with an IAT method factor (Figure 4) also fit the data well, chi2(46) = 112.04; CFI = .973; RMSEA = 0.010 (90% confidence interval: 0.008, 0.013). Importantly, the model has no shared method variance between conceptually different explicit measures like preference ratings and the Modern Racism Scale (MRS). The AMP and the EP are both valid measures of attitudes, but with relatively modest validity. The BIAT has a validity of .46, with 22% explained variance. This result is more consistent with Cunningham et al.’s (2001) data than with Greenwald et al.’s (2009) data. The model also shows a clear relationship between contact and less pro-White bias. Finally, the model shows that the IAT method factor is unrelated to contact. Thus, any relationship between IAT scores and contact is explained by the shared variance with explicit measures.
These results show that Bar-Anan and Vianello’s (2018)
conclusions are not supported by the data. Although a dual-attitude model can be
fitted to the data, it shows low convergent validity across different implicit
measures, and a single-attitude model fits the data as well as a dual-attitude model.
Figure 5 shows the dual-attitude model for political orientation. The explicit factor is defined by a simple rating of preference for republicans versus democrats, the modern racism scale, the right-wing-authoritarianism scale, and ratings of Hillary Clinton. The implicit factor is defined by the IAT, the brief IAT, the Go-NoGo Task, and single category IATs. The remaining two implicit measures, the Affect Misattribution Task, and Evaluative Priming are allowed to load on both factors. Voting in the previous election is predicted by explicit attitudes. The model has good fit to the data, chi2(48) = 99.34; CFI = .991; RMSEA = 0.009 (90% confidence interval: 0.006, 0.011). The loading pattern shows that the AMP and EP load on the implicit factor. This supports the hypothesis that all implicit measures have convergent validity. However, the loadings for the IATs are much higher. In the dual-attitude framework this would imply that the IAT is a much more valid measure of implicit attitudes than the AMP or EP. Evidence for discriminant validity is weak. The correlation between the explicit and the implicit factor is r = .89. The correlation in the original article was r = .91. Nevertheless, the authors concluded that the data favor the two-factor model because constraining the correlation to 1 reduced model fit.
However, it is possible to fit a single-construct model by allowing for an IAT-variant method factor, chi2(50) = 86.25; CFI = .993; RMSEA = 0.007 (90% confidence interval: 0.005, 0.010). This model (Figure 6) shows that voting is predicted by a single latent factor that represents political orientation and that simple self-report measures of political orientation are the most valid measure of political orientation. The IAT shows stronger correlations with explicit measures because it is a more valid measure of political orientation, .74^2 = 55% valid variance, than the race IAT (22% valid variance).
Figure 7 shows the results for a dual-attitude model of
self-esteem. Model fit was good,
although CFI was lower than in the previous model due to weaker factor
loadings, chi2(16) = 28.62; CFI = .950; RMSEA = 0.008 (90% confidence interval:
0.003, 0.013). The model showed a
moderate correlation between the explicit and implicit factors, r = .46, which
is stronger than in the original article, r = .29, but clearly suggestive of
two distinct factors. However, the nature of these two factors is less clear.
The implicit factor is defined by the three IAT measures, whereas the AMP and
EP have very low loadings on this factor.
This is also true in the original article with loadings of .24 for AMP
and .13 for EP. Thus, the results
confirm Bosson et al.’s (2000) seminal finding that different implicit measures have low convergent validity.
As the implicit factor was mostly defined by the IAT measures, it was also possible to fit a single-factor model with an IAT method factor (Figure 8), chi2(16) = 31.50; CFI = .938; RMSEA = 0.009 (90% confidence interval: 0.004, 0.013). However, some of the results of this model are surprising.
According to this model, the validity coefficient of the widely used Rosenberg self-esteem scale is only r = .35, suggesting that only 12% of the variance in the Rosenberg self-esteem scale is valid variance. In addition, the IAT and the BIAT would be equally valid measures of self-esteem. Thus, previous findings of low implicit-explicit correlations for self-esteem (Bosson et al., 2000; Hofmann et al., 2005) would imply low validity of both implicit and explicit measures. This finding would have dramatic implications for the interpretation of low self-esteem-criterion correlations. A true self-esteem-criterion correlation of r = .30 would produce an observed correlation of only r = .30*.35 = .11 with the Rosenberg self-esteem scale or the IAT. Correlations of this magnitude require large samples (N = 782) to have an 80% probability of obtaining a significant result with alpha = .05, or N = 1,325 with alpha = .005. Thus, most studies that tried to predict performance criteria from self-esteem were underpowered. However, the results of this study are limited by the use of an online sample and the lack of proper criterion variables to examine predictive validity. The main conclusion from this analysis is that a single-factor model with an IAT method factor fit the data well and that the dual-attitude model failed to demonstrate convergent validity across different implicit measures; a finding that replicates Bosson et al. (2000), which Bar-Anan and Vianello do not cite.
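The sample-size figures above can be checked with the Fisher z approximation (assumed here; exact power software may differ slightly in the last digits):

```python
import math
from statistics import NormalDist

# N needed to detect a correlation r with the given power and alpha,
# via the standard Fisher z approximation: N = ((z_a + z_b)/atanh(r))^2 + 3.
def n_required(r, power=0.80, alpha=0.05):
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_beta = nd.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)

print(n_required(0.10, alpha=0.05))    # close to the ~782 reported above
print(n_required(0.10, alpha=0.005))   # close to the ~1,325 reported above
```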
A Unified Model
After establishing well-fitting models for each personality
characteristic, it is possible to fit a unified model. Importantly, no changes
to the individual models should be made because a decrease in fit can be attributed
to the new relationships across different personality characteristics. Without any additional modifications, the
overall model in Figure 9 had good fit, XX. Correlations among the IAT method factors
showed significant positive correlations of the method factor for race with the
method factor for self-esteem (r = .4) and political orientation (r = .2), but
a negative correlation for the method factors for self-esteem and political
orientation (r = -.3). This pattern of
correlations is inconsistent with a simple method factor that is expected to
produce positive correlations. Thus, it is impossible to fit a general method
factor to different IATs. This finding replicates Nosek and Smyth’s (2007) results.
Correlations among the personality characteristics replicate
the finding with Greenwald et al.’s (2009) data that Republicans are more
likely to have a pro-white bias, r = .4.
Political orientation is unrelated to self-esteem, r = .0, but Pro-White
bias tends to be positively related to self-esteem, r = .2.
In conclusion, the present results show that Bar-Anan and Vianello’s
claims are not supported by the data.
Their data do not provide clear evidence for discriminant validity of
implicit and explicit constructs. The
data are fully consistent with the alternative hypothesis that the IAT and
other implicit measures measure the same construct that is being measured with
explicit measures. Thus, the data provide no support for the construct validity
of the IAT as a measure of implicit personality characteristics.
Validity of the Self-Esteem IAT
Bosson et al.’s (2000) seminal article raised the first concerns
about the construct validity of the self-esteem IAT. Since then, other critical
articles have been published; none of which are cited in Kurdi et al. (2018).
Gawronski, LeBel, and Peters (2007) wrote a PoPS article on the construct
validity of implicit self-esteem. They found no conclusive evidence that (a) the
self-esteem IAT measures unconscious self-esteem or that (b) low correlations
are due to self-report biases in explicit measures of self-esteem. Walker and
Schimmack (2008) used informant ratings to examine predictive validity of the
self-esteem IAT. Informant ratings are the most widely used validation
criterion in personality research, but have not been used by social psychologists.
One advantage of informant ratings is that they also measure general
personality characteristics rather than specific behaviors, which ensures
higher construct-criterion correlations due to the power of aggregation (Epstein,
1980). Walker and Schimmack (2008) found
that informant ratings of well-being were more strongly correlated with
explicit self-ratings of well-being than with a happiness or a self-esteem IAT.
The most recent and extensive review was conducted by Falk
and Heine (2014) who found that “the validity evidence for the IAT in measuring
ISE [implicit self-esteem] is strikingly weak” (p. 6). They also point out that implicit measures of
self-esteem “show a remarkably consistent lack of predictive validity” (p.
6). Thus, an unbiased assessment of the
evidence is fully consistent with the analyses of Bar-Anan and Vianello’s data
that also found low validity of the self-esteem IAT as a measure of self-esteem.
Currently, a study by Falk, Heine, Takemura, Zhang, and Hsu
(2013) provides the most comprehensive examination of convergent and
discriminant validity of self-esteem measures. I therefore used structural
equation modeling of their data to see how consistent the data are with a
dual-attitude model or a single-attitude model.
The biggest advantage of the study was the inclusion of informant ratings
of self-esteem, which makes it possible to model method-variance in
self-ratings (Anusic et al., 2009). Previous
research showed that self-ratings of self-esteem have convergent validity with informant
ratings of self-esteem (Simms, Zelazny, Yam, & Gros, 2010; Walker &
Schimmack, 2008). I also included the self-report
measures of positive affect and negative affect to examine criterion validity.
It was possible to fit a single-factor model to the data (Figure 10), chi2(67) = 115.85; CFI = .964; RMSEA = 0.050 (90% confidence interval: 0.034, 0.065). Factor loadings show the highest loadings for self-ratings on the self-competence scale and the Rosenberg self-esteem scale. However, informant ratings also had significant loadings on the self-esteem factor, as did self-ratings on the Narcissistic Personality Inventory. A measure of halo bias in self-ratings of personality (SEL) also had moderate loadings, which confirms previous findings that self-esteem is related to evaluative biases in personality ratings (Anusic et al., 2009). The false uniqueness measure (FU; Falk et al., 2015) had modest validity. In contrast, the implicit measures had no significant loadings on this factor. In addition, the residual correlations among the implicit measures were weak and not significant. Given the lack of positive relations among the implicit measures, it was impossible to fit a dual-attitude model to these data.
It is not clear why Bar-Anan and Vianello’s data failed to
show higher validity of explicit measures, but the current results are
consistent with moderate validity of explicit self-ratings in the personality
literature (Simms et al., 2010). Thus, there is consistent evidence that implicit
self-esteem measures have low validity as measures of self-esteem and there is
no evidence that they are measures of implicit self-esteem.
Explaining Variability in Explicit-Implicit Correlations
One well-established phenomenon in the literature is that
correlations between IAT scores and explicit measures vary across domains
(Bar-Anan & Vianello, 2018; Hofmann et al., 2005). As shown earlier, correlations for political
orientation are strong, correlations for racial attitudes are moderate, and
correlations for self-esteem are weak. Greenwald
and Banaji (2017) offer a dual-attitude explanation for this finding. “The
plausible interpretations of the more common pattern of weak implicit– explicit
correlations are that (a) implicit and explicit measures tap distinct
constructs or (b) they might be affected differently by situational influences in
the research situation (cf. Fazio & Towles-Schwen, 1999; Greenwald et al.,
2002) or (c) at least one of the measures, plausibly the self-report measure in
many of these cases, lacks validity” (p. 868).
The evidence presented here offers a different explanation. IAT-explicit correlations and IAT-criterion
correlations increase with the validity of the IAT as a measure of the same
personality characteristics that are measured with explicit measures. Thus, low correlations of the self-esteem IAT
with explicit measures of self-esteem show low validity of the self-esteem IAT.
High correlations of the political orientation IAT with explicit measures of political orientation show high validity of the IAT as a measure of political orientation, not of implicit political orientation. Finally, the modest correlations between the race IAT and explicit measures of racial bias show moderate validity of the race IAT as a measure of racial bias. However, the validity of the race IAT as a measure of racial bias (not implicit racial bias!) varies considerably across studies. This variation may reflect differences in the variability of racial bias across samples; student samples, for example, may be more homogeneous. Thus, contrary to Greenwald and Banaji's claims, the problem is not with the explicit measures but with the IAT.
An important question is why the self-esteem IAT is less
valid than the political orientation IAT.
I propose that one cause of variation in the validity of the IAT is
related to the proportion of respondents on the two ends of a personality
characteristic. To test this hypothesis, I used Bar-Anan and Vianello’s
data. To determine the direction of the
IAT score, I used a value of 0 as the neutral point. As predicted, 90% of participants associated the self with good, 78% associated White with good, and 69% associated Democrat with good. Thus, validity decreases as the proportion of participants on one side of the bipolar dimension increases.
Next, I regressed the preference measure on a simple
dichotomous predictor that coded the direction of the IAT. I standardized the preference measure and
report standardized and unstandardized regression coefficients. Standardized regression coefficients are
influenced by the distribution of the predictor variable and should show the
expected pattern. In contrast, unstandardized coefficients are not sensitive to
the proportions and should not show the pattern. I also added the IAT scores as
predictors in a second step to examine the incremental predictive validity that
is provided by the reaction times.
The standardized coefficients are consistent with
predictions (Table 1). However, the unstandardized coefficients also show the
same pattern. Thus, other factors also play a role. The amount of incremental
explained variance by reaction times shows no differences between the race and
the political orientation task. Most of
the differences in validity are due to the direction of the attitude (4% explained
variance for race bias vs. 38% explained variance for political orientation).
Table 1
Regression of Explicit Preferences on IAT Direction (Step 1) and IAT Scores (Step 2)

Domain   B      SE     b      se     r2     Δr2    z
SE       .310   .142   .093   .043   .009   .002   1.09
Race     .467   .010   .193   .041   .041   .060   5.79
PO       1.380  .080   .637   .037   .380   .070   7.83

Note. SE (first column) = self-esteem; PO = political orientation; B, SE = unstandardized coefficient and its standard error; b, se = standardized coefficient and its standard error; Δr2 = incremental variance explained by IAT scores in Step 2.
The results show the importance of taking the proportion of
respondents with opposing personality characteristics into account. The IAT is
least valid when most participants are high or low on a personality
characteristic, and it is most valid when participants are split into two
equally large groups.
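The Δr² values in Table 1 come from this kind of two-step hierarchical regression. A minimal sketch with simulated data (all names and effect sizes are hypothetical, not the actual data):

```python
import numpy as np

def r2(X, y):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 50_000
direction = (rng.random(n) < 0.5).astype(float)    # dichotomous IAT direction
iat = direction + rng.normal(size=n)               # continuous IAT score
pref = direction + 0.3 * iat + rng.normal(size=n)  # explicit preference

r2_step1 = r2(direction, pref)                          # Step 1: direction only
r2_step2 = r2(np.column_stack([direction, iat]), pref)  # Step 2: add IAT score
delta_r2 = r2_step2 - r2_step1  # incremental variance from reaction times
```

Because the continuous IAT score carries unique information beyond its direction in this simulation, delta_r2 is positive, mirroring the Δr2 column in Table 1.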
In conclusion, I provided an alternative explanation of
variation in explicit-implicit correlations that is consistent with the
data. Implicit-explicit correlations vary
at least partially as a function of the validity of the IAT as a measure of the
same construct that is measured with explicit measures, and the validity of the
IAT varies as a function of the proportion of respondents who are high versus
low on a personality characteristic. As most respondents associate the self
with good, and reaction times contribute little to the validity of the IAT, the
IAT has particularly low validity as a measure of self-esteem.
The Elusive Malleability of Implicit Attitude Measures
Numerous experimental studies have tried to manipulate
situational factors in order to change scores on implicit attitude measures
(Lai, Hoffman, & Nosek, 2013). Many
of these studies focused on implicit measures of prejudice in order to develop
interventions that could reduce prejudice. However, most studies were limited
to brief manipulations with immediate assessment of attitudes (Lai et al.,
2013). The results of these studies are
mixed. In a seminal study, Dasgupta and
Greenwald (2001) exposed participants to images of admired Black exemplars and
disliked White exemplars. They reported that this manipulation had a large
effect on IAT scores. However, these days the results of this study are less
convincing because it has become apparent that large effect sizes from small
samples often do not replicate (Open Science Collaboration, 2015). Consistent
with this skepticism, Joy-Gaba and Nosek (2010) had difficulties replicating
this effect with much larger samples and found only an average effect size of d
= .08. With effect sizes of this
magnitude, other reports of successful experimental manipulations were
extremely underpowered. Another study with large samples found
stronger effects (Lai et al., 2016). The
strongest effect was observed for instructions to fake the IAT. However, Lai et al. also found that none of these manipulations had lasting effects in a follow-up assessment. This finding suggests that even when changes are observed, they reflect context-specific method variance rather than actual changes in the construct that is being measured.
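The power problem can be quantified with a standard sample-size approximation (a sketch assuming a two-sided test at α = .05 with 80% power; the d = .8 benchmark for a "large" effect is a convention, not a value taken from the original studies):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-group mean comparison,
    using the normal approximation to the two-sided t-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 / d ** 2)

n_tiny = n_per_group(0.08)   # d = .08, Joy-Gaba and Nosek's estimate
n_large = n_per_group(0.80)  # d = .8, a conventional "large" effect
# n_tiny is in the thousands per group; n_large is a couple dozen.
```

With d = .08, thousands of participants per condition are needed, which is why the earlier small-sample demonstrations were extremely underpowered.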
This conclusion is also supported by one of the few
longitudinal IAT studies. Cunningham et al.’s (2001) multi-method study repeated
the measurement of racial bias on four separate occasions. The model shown in Figure 1 includes no occasion-specific relationships among measures taken at the same time, and adding such relationships produced only non-significant correlated residuals. Thus, in this sample
naturally occurring factors did not change race bias. This finding suggests
that the IAT and explicit measures measure stable personality characteristics rather
than context-specific states.
Only a few serious intervention studies with the IAT have
been conducted (Lai et al., 2013). The
most valuable evidence so far comes from studies that examined the influence of
living with an African American roommate on White students’ racial attitudes
(Shook & Fazio, 2008; Shook, Hopkins, & Koech, 2016). One study found effects on an implicit measure,
F(1,236) = 4.33, p = .04 (Shook & Fazio, 2008), but not on an explicit measure
(Shook, 2007). The other study found
effects on explicit attitudes, F(1,107) = 7.34, p = .008 but no results for
implicit measures were reported (Shook, Hopkins, & Koech, 2016). Given the
small sample sizes of these studies, inconsistent results are to be expected.
In conclusion, the existing evidence shows that implicit and
explicit attitude measures are highly stable over time (Cunningham et al.,
2001). I also concur with Joy-Gaba and Nosek that moving scores on implicit
bias measures “may not be as easy as implied by the existing experimental demonstrations”
(p. 145), and a multi-method assessment is needed to distinguish effects on
specific measures from effects on personality characteristics (Olson & Fazio).
Future studies of attitude change need a multi-method
approach, powerful interventions, adequate statistical power, and multiple
repeated measurements of attitudes to distinguish mere occasion-specific variability
(malleability) from real attitude change (Anusic & Schimmack, 2016). Ideally,
the study would also include informant ratings. For example, intervention
studies with roommates could use African Americans as informants to rate their
White roommates’ racial attitudes and behaviors. The single-attitude model predicts that
implicit and explicit measures will show consistent results and that variation
in effect sizes is explained by the validity of each measure.
Does the IAT Measure Implicit Constructs?
Construct validation is a difficult and iterative process
because scientific evidence can alter the understanding of constructs. However, construct validation research has to
start with a working definition of a construct.
The IAT was introduced as a measure of individual differences in
implicit social cognition, and implicit social cognitions were defined as aspects
of thinking and feeling that may not be easily accessed or available to
consciousness (Nosek, Greenwald, & Banaji, 2007, p. 265). This definition is vague, but it makes a clear
prediction that the IAT should measure personality characteristics that cannot
be measured with self-reports. This
leads to the prediction that explicit measures and the IAT have discriminant validity. To demonstrate discriminant validity, unique
variance in the IAT has to be related to other indicators of implicit
personality characteristics. This can be
demonstrated with incremental predictive validity or convergent validity with
other measures of implicit personality characteristics. Consistent with this line of reasoning,
numerous articles have claimed that the IAT has construct validity as a measure
of implicit personality characteristics because it shows incremental predictive
validity (Greenwald et al., 2009; Kurdi et al., 2018) or because the IAT shows convergent
validity with other implicit measures and discriminant validity with explicit
measures (Bar-Anan & Vianello, 2018).
I demonstrated that all of these claims were false and that the existing evidence does not support the construct validity of the IAT as a measure of implicit personality characteristics. The main problem is that most studies that
used the IAT assumed construct validity rather than testing it. Hundreds of studies used the IAT as a single
measure of implicit personality characteristics and made claims about implicit
personality traits based on variation in IAT scores. Thus, hundreds of studies made claims that
are not supported by empirical evidence simply because it has not been
demonstrated that the IAT measures implicit personality constructs. In this regard the IAT is not alone. Aside from the replication crisis in
psychology (OSC, 2015), psychological science suffers from an even more serious
validation crisis. All empirical claims rest on the validity of measures that
are used to test theoretical claims. However, many measures in psychology are
used without proper validation evidence.
Personality research is a notable exception. In response to criticism of low predictive
validity (Mischel, 1968), personality psychologists embarked on a program of
research that demonstrated predictive validity and convergent validity with
informant ratings (Funder, $$$). Another problem is that psychologists treat validity as a qualitative construct, so that any evidence of validity is taken to support claims that a measure is valid, as if it were 100% valid. However, most measures in psychology have only
moderate validity (Schimmack, 2010). Thus, it is important to quantify validity
and to use a multi-method approach to increase validity. The popularity of the IAT reveals the problems
with using measures without proper validation evidence. Social psychologists have influenced public discourse,
if not public policy, about implicit racial bias. Most of these claims are based on findings
with the IAT, assuming that IAT scores reflect implicit bias. As demonstrated
here, these claims are not valid because the IAT lacks construct validity as a
measure of implicit bias. In the future,
psychologists need to be more careful when they make claims based on new
measures with limited knowledge about their validity. Maybe psychological organizations should
provide clear guidelines about minimal standards that need to be met before a
measure can be used, just like there are guidelines for validity evidence for
personality assessment. In conclusion, psychology
suffers as much from a validation crisis as it suffers from a replication crisis. Fixing the replication crisis will not
improve psychology if replicable results are obtained with invalid measures.
The Silver Lining
Psychologists are often divided into opposing camps (e.g.
nature vs. nurture; person vs. situation; the IAT is valid vs. invalid). Many fans of implicit measures are likely to
dislike what I had to say about the IAT.
However, my position is different from previous criticisms of the IAT as
being entirely invalid (Oswald et al., 2013).
I have demonstrated with several multi-method studies that the IAT has convergent
validity with other measures of some personality characteristics. In some domains
this validity is too low to be meaningful.
In other domains, the validity of explicit measures is so high that
using the IAT is not necessary. However, for sensitive attitudes like racial
attitudes, the IAT offers a promising complementary measure to explicit
measures of racial attitudes. Estimates of the valid variance in race IAT scores ranged from 20% to 40%. As
the IAT does not appear to share method variance with explicit measures, it is
possible to improve the measurement of racial bias by using explicit and
implicit measures and to aggregate scores to obtain a more valid measure of
racial bias than either an explicit or an implicit measure can provide. The IAT may also offer benefits in situations
where socially desirable responding is a concern. Thus, the IAT might complement other measures
of personality characteristics. This changes the interpretation of explicit-IAT
correlations. Rather than (mis)interpreting low correlations as evidence of discriminant
validity, high correlations can reveal convergent validity. Similarly,
improvements in implicit measures should produce higher correlations with explicit
measures. How useful the IAT and other
implicit measures are for the measurement of other personality characteristics
has to be examined on a case by case basis. Just like it is impossible to make
generalized statements about the validity of self-reports, the validity of the
IAT can vary across personality characteristics.
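The aggregation argument can be made concrete. If explicit and implicit measures share no method variance, their implied correlation is the product of their validities, and the unit-weighted composite is more valid than either measure alone (a sketch; the .70 and .55 validities are illustrative values, not estimates from the data):

```python
from math import sqrt

def composite_validity(v1, v2):
    """Correlation of the unit-weighted average of two standardized measures
    with the trait, when the trait is their only shared variance."""
    r12 = v1 * v2                        # implied correlation between measures
    return (v1 + v2) / sqrt(2 + 2 * r12)

v = composite_validity(0.70, 0.55)  # hypothetical explicit/implicit validities
# v exceeds the better single measure (.70)
```

The gain is largest exactly when the two methods do not share method variance, which is the situation described above for explicit and implicit measures of racial bias.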
Social psychologists have always distrusted self-report,
especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes
and other constructs with indirect methods.
The IAT was a major breakthrough because it has relatively high
reliability compared to other methods. Thus, creating the IAT was a major achievement that should not be dismissed simply because the IAT lacks construct validity as a measure of implicit personality characteristics. Even creating an indirect
measure of attitudes is a formidable feat. However, in the early 1990s, social
psychologists were enthralled by work in cognitive psychology that demonstrated
unconscious or uncontrollable processes. Implicit measures were based on this
work and it seemed reasonable to assume that they might provide a window into
the unconscious. However, the processes that are involved in the measurement of
personality characteristics with implicit measures are not the personality
characteristics that are being measured.
There is nothing implicit about being a Republican or Democrat, gay or
straight, or having low self-esteem. Conflating
implicit processes in the measurement of personality constructs with implicit personality
constructs has created a lot of confusion. It is time to end this confusion.
The IAT is an implicit measure of personality with varying validity. It is not a window into people’s unconscious
feelings, attitudes or personalities.
Anusic, I., &
Schimmack, U. (2016). Stability and change of personality traits, self-esteem,
and well-being: Introducing the meta-analytic stability and change model of
retest correlations. Journal of Personality and Social Psychology, 110(5),
Anusic, I., Schimmack, U., Pinkus, R., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97(6), 1142-1156.
Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264-1272.
Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567-582.
Bosson, J. K.,
Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure
of implicit self-esteem: The blind men and the elephant revisited? Journal of
Personality and Social Psychology, 79(4), 631-643.
Campbell, D. T.,
& Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod
matrix. Psychological Bulletin, 56(2), 81-105.
Chen, F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189-225.
Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12(2), 163-170.
Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 81, 800-814.
Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56-68. doi:10.1111/jopy.12082
Falk, C., &
Heine, S.J. (2015). What is implicit self-esteem, and does it vary across
cultures? Personality and Social Psychology Review, 19, 177-98.
Greenwald, A. G.,
& Farnham, S. D. (2000). Using the Implicit Association Test to measure
self-esteem and self-concept. Journal of Personality and Social Psychology,
79(6), 1022-1038.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74(6), 1464-1480.
Greenwald, A. G.,
Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and
using the Implicit Association Test: III. Meta-analysis of predictive validity.
Journal of Personality and Social Psychology, 97, 17–41.
Greenwald, A. G.,
Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Race
attitude measures predicted vote in the 2008 U. S. Presidential Election.
Analyses of Social Issues and Public Policy, 9, 241–253.
Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369-1385. http://dx.doi.org/10.1177/0146167205275613
Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist. Advance online publication.
McConahay, J. B. (1986). Modern racism, ambivalence, and the Modern Racism Scale. In J. F. Dovidio & S. L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91-125). Orlando, FL: Academic Press.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1-8.
Oswald, F. L.,
Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting
ethnic and racial discrimination: A meta-analysis of IAT criterion studies.
Journal of Personality and Social Psychology, 105(2), 171-192.
Pelham, B. W., & Swann, W. B. (1989). From
self-conceptions to self-worth: On the sources and structure of global
self-esteem. Journal of Personality and Social Psychology, 57, 672– 680
Rosenberg, M. (1965). Society and the Adolescent Self-Image. Princeton, NJ: Princeton University Press.
Simms, L. J., Zelazny, K., Yam, W. H., & Gros, D. F. (2010). Self-informant agreement for personality and evaluative person descriptors: Comparing methods for creating informant measures. European Journal of Personality, 24(3), 207-221.
Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274-290.
Walker, S. S.,
& Schimmack, U. (2008). Validity of a happiness implicit association test
as a measure of subjective well-being. Journal of Research in Personality,
42(2), 490-497. http://dx.doi.org/10.1016/j.jrp.2007.07.005