Replicability Audit of John A. Bargh

“Trust is good, but control is better”  

INTRODUCTION

Information about the replicability of published results is important because empirical results can only be used as evidence if they can be replicated. However, the replicability of published results in social psychology is doubtful. Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of individual psychologists to estimate the replicability of their studies.

John A. Bargh

Bargh is an eminent social psychologist (H-Index in WebofScience = 61). He is best known for his claim that unconscious processes have a strong influence on behavior. Some of his most cited articles used subliminal or unobtrusive priming to provide evidence for this claim.

Bargh also played a significant role in the replication crisis in psychology. In 2012, a group of researchers failed to replicate his famous “elderly priming” study (Doyen et al., 2012). He responded with a personal attack that was covered in various news reports (Bartlett, 2013). It also triggered a response by psychologist and Nobel Laureate Daniel Kahneman, who wrote an open letter to Bargh (Young, 2012).

As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.

Kahneman also asked Bargh and other social priming researchers to conduct credible replication studies to demonstrate that the effects are real. However, seven years later neither Bargh nor other prominent social priming researchers have presented new evidence that their old findings can be replicated.

Instead, other researchers have conducted replication studies and produced further replication failures. As a result, confidence in social priming is decreasing, as reflected in Bargh’s citation counts (Figure 1).


Figure 1. John A. Bargh’s citation counts in Web of Science (3/17/19)

In this blog post, I examine the replicability and credibility of John A. Bargh’s published results using a statistical approach, z-curve (Brunner & Schimmack, 2018). It is well known that psychology journals only publish confirmatory evidence with statistically significant results, p < .05 (Sterling, 1959). This selection for significance is the main cause of the replication crisis in psychology because it makes it impossible to distinguish results that can be replicated from results that cannot: selection ensures that all published results are significant successes, so we never see replication failures in the published literature.

While selection for significance makes success rates uninformative, the strength of evidence against the null-hypothesis (signal/noise or effect size / sampling error) does provide information about replicability. Studies with higher signal to noise ratios are more likely to replicate. Z-curve uses z-scores as the common metric of signal-to-noise ratio for studies that used different test statistics. The distribution of observed z-scores provides valuable information about the replicability of a set of studies. If most z-scores are close to the criterion for statistical significance (z = 1.96), replicability is low.

Given the requirement to publish significant results, researchers have two options to meet this goal. One option is to obtain large samples, which reduce sampling error and thereby increase the signal-to-noise ratio. The other is to conduct studies with small samples and carry out multiple statistical tests. Multiple testing increases the probability of obtaining a significant result with the help of chance. This strategy is more efficient at producing significant results, but the results are less replicable because a replication study cannot capitalize on chance again. The latter strategy is a questionable research practice (John et al., 2012), and it produces questionable results because it is unknown how much chance contributed to the observed significant result. Z-curve reveals how much a researcher relied on questionable research practices to produce significant results.
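To see why the second strategy works, here is a minimal, generic simulation in Python (an illustration of multiple testing when all true effects are zero, not a reconstruction of any of the audited studies):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def chance_success_rate(n_per_group=20, n_tests=5, n_sims=10_000, alpha=0.05):
    # Proportion of simulated "studies" that obtain at least one p < alpha
    # when every true effect is zero and n_tests outcomes are tried.
    hits = 0
    for _ in range(n_sims):
        p_values = []
        for _ in range(n_tests):
            group_a = rng.normal(size=n_per_group)
            group_b = rng.normal(size=n_per_group)
            p_values.append(stats.ttest_ind(group_a, group_b).pvalue)
        hits += min(p_values) < alpha
    return hits / n_sims

print(chance_success_rate(n_tests=1))   # about .05, the nominal false positive rate
print(chance_success_rate(n_tests=5))   # about .23, the "success" rate with five chances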

Data

I used WebofScience to identify the most cited articles by John A. Bargh (datafile). I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 43 empirical articles (H-Index = 41). The 43 articles reported 111 studies (an average of 2.6 studies per article). The total number of participants was 7,810, with a median of 56 participants per study. For each study, I identified the most focal hypothesis test (MFHT). The result of the test was converted into an exact p-value, and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 100 results that were significant at p < .05 (two-tailed). Four studies did not produce a significant result. The remaining 7 results were interpreted as evidence using lower standards of significance. Thus, the success rate for the 111 reported hypothesis tests was 96%. This is a typical finding in psychology journals (Sterling, 1959).
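For readers who want to verify the conversion, here is a minimal sketch in Python of the p-to-z transformation (the actual analysis used the z-curve method and the shiny app mentioned in the disclaimer below, not this code; the p-values here are made up):

from scipy import stats

# Hypothetical two-tailed p-values from focal hypothesis tests
p_values = [0.049, 0.003, 0.0001, 0.20]

# Convert each two-tailed p-value into an absolute z-score
z_scores = [stats.norm.isf(p / 2) for p in p_values]
print(z_scores)  # p = .049 maps to z = 1.97, just above the 1.96 significance criterion

# Only significant results (p < .05, i.e., z > 1.96) enter the estimate of mean power
significant_z = [z for z, p in zip(z_scores, p_values) if p < 0.05]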

Results

The z-curve estimate of replicability is 29%, with a 95% CI ranging from 15% to 38%. Even at the upper end of the 95% confidence interval this is a low estimate. The average replicability is lower than for social psychology articles in general (44%; Schimmack, 2018) and for other social psychologists. At present, only one audit has produced an even lower estimate (Replicability Audits, 2019).

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve in this range is an estimate of the file drawer of studies that would need to be conducted to achieve 100% successes if hiding replication failures were the only questionable research practice being used. The ratio of the area of non-significant results to the area of all significant results (including z-scores greater than 6) is called the File Drawer Ratio. Although this is just a projection, and other questionable practices may have been used, the file drawer ratio of 7.53 suggests that for every published significant result about 7 studies with non-significant results remained unpublished. Moreover, the null-hypothesis may often be false while the effect size is so small that the result is still difficult to replicate. When the definition of a false positive includes studies with very low power, the false positive estimate increases to 50%. Thus, about half of the published studies are expected to produce replication failures.
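Expressed as a back-of-the-envelope calculation (this merely restates the numbers above; it is not an additional analysis):

\[
\text{File Drawer Ratio} = \frac{\text{projected area of non-significant results}}{\text{area of significant results}} = 7.53,
\]

so with 100 published significant results, the projection implies roughly 7.53 × 100 ≈ 753 unpublished non-significant studies.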

Finally, z-curve examines heterogeneity in replicability. Studies with p-values close to .05 are less likely to replicate than studies with p-values below .0001. This fact is reflected in the replicability estimates for segments of studies that are provided below the x-axis. Without selection for significance, z-scores of 1.96 correspond to 50% replicability. However, we see that selection for significance lowers this value to just 14% replicability. Thus, we would not expect published results with p-values that are just significant to replicate in actual replication studies. Even z-scores in the range from 3 to 3.5 average only 32% replicability. Thus, only studies with z-scores greater than 3.5 can be considered to provide some empirical evidence for the tested hypothesis.

Inspection of the datafile shows that z-scores greater than 3.5 were consistently obtained in 2 out of the 43 articles. Both articles used a more powerful within-subject design.

The automatic evaluation effect: Unconditional automatic attitude activation with a pronunciation task (JPSP, 1996)

Subjective aspects of cognitive control at different stages of processing (Attention, Perception, & Psychophysics, 2009).

Conclusion

John A. Bargh’s work on unconscious processes with unobtrusive priming tasks is at the center of the replication crisis in psychology. This replicability audit suggests that this is not an accident. The low replicability estimate and the large file-drawer estimate suggest that replication failures are to be expected. As a result, published results cannot be interpreted as evidence for these effects.

So far, John Bargh has ignored criticism of his work. In 2017, he published a popular book about his work on unconscious processes. The book did not mention doubts about the reported evidence, while a z-curve analysis showed low replicability of the cited studies (Schimmack, 2017).

Recently, another study by John Bargh failed to replicate (Chabris et al., in press). Jesse Singal wrote a blog post about this replication failure (Research Digest), and John Bargh wrote a lengthy comment in response.

In the commentary, Bargh lists several studies that successfully replicated the effect. However, listing studies with significant results does not provide evidence for an effect unless we know how many studies failed to demonstrate the effect, and often we do not know this because such studies are not published. Thus, Bargh continues to ignore the pervasive influence of publication bias.

Bargh then suggests that the replication failure was caused by a hidden moderator which invalidates the results of the replication study.

One potentially important difference in procedure is the temperature of the hot cup of coffee that participants held: was the coffee piping hot (so that it was somewhat uncomfortable to hold) or warm (so that it was pleasant to hold)? If the coffee was piping hot, then, according to the theory that motivated W&B, it should not activate the concept of social warmth – a positively valenced, pleasant concept. (“Hot” is not the same as just more “warm”, and actually participates in a quite different metaphor – hot vs. cool – having to do with emotionality.) If anything, an uncomfortably hot cup of coffee might be expected to activate the concept of anger (“hot-headedness”), which is antithetical to social warmth. With this in mind, there are good reasons to suspect that in C&S, the coffee was, for many participants, uncomfortably hot. Indeed, C&S purchased a hot or cold coffee at a coffee shop and then immediately handed that coffee to passersby who volunteered to take the study. Thus, the first few people to hold a hot coffee likely held a piping hot coffee (in contrast, W&B’s coffee shop was several blocks away from the site of the experiment, and they used a microwave for subsequent participants to keep the coffee at a pleasantly warm temperature). Importantly, C&S handed the same cup of coffee to as many as 7 participants before purchasing a new cup. Because of that feature of their procedure, we can check if the physical-to-social warmth effect emerged after the cups were held by the first few participants, at which point the hot coffee (presumably) had gone from piping hot to warm.

He overlooks that his original study produced only weak evidence for the effect, with a p-value of .0503 that is technically not below the .05 criterion for significance. As shown in the z-curve plot, results with a p-value of .0503 have an average replicability of only 13%. Moreover, the 95% CI for the effect size touches 0. Thus, the original study did not rule out that the effect size is extremely small and has no practical significance. Any claim that the effect of holding a warm cup on affection is theoretically relevant for our understanding of affection would require studies with larger samples and more convincing evidence.

At the end of his commentary, John A. Bargh assures readers that he is purely motivated by a search for the truth.

Let me close by affirming that I share your goal of presenting the public with accurate information as to the state of the scientific evidence on any finding I discuss publicly. I also in good faith seek to give my best advice to the public at all times, again based on the present state of evidence. Your and my assessments of that evidence might differ, but our motivations are the same.

Let me be crystal clear. I have no reason to doubt that John A. Bargh believes what he says. His conscious mind sees himself as a scientist who employs the scientific method to provide objective evidence. However, Bargh himself would be the first to acknowledge that our conscious mind is not fully aware of the actual causes of human behavior. I submit that his response to criticism of his work shows that he is less capable of being objective than he thinks he is. I would be happy to be proven wrong in a response by John A. Bargh to my scientific criticism of his work. So far, eminent social psychologists have preferred to remain silent about the results of their replicability audits.

Disclaimer

It is nearly certain that I made some mistakes in the coding of John A. Bargh’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust, and everybody can audit this audit. The data are openly available and can be submitted to a z-curve analysis using a shiny app. Thus, this replicability audit is fully transparent and open to revision.

Postscript

Many psychologists do not take this work seriously because it has not been peer-reviewed. However, nothing is stopping them from conducting a peer review of this work and publishing the results of their review as a commentary here or elsewhere. Thus, the lack of peer review is not a reflection of the quality of this work, but rather a reflection of the unwillingness of social psychologists to take criticism of their work seriously.

If you found this audit interesting, you might also be interested in other replicability audits of eminent social psychologists.



Psychological Science is Self-Correcting

It is easy to say that science is self-correcting.  The notion of a self-correcting science is based on the naive model of science as an objective process that incorporates new information and updates beliefs about the world depending on the available evidence.  When new information suggests that old beliefs are false, the old beliefs are replaced by new beliefs.   

It has been a while since I read Kuhn’s book on paradigm shifts, but I do remember that a main point of the book was that science doesn’t work this way for a number of reasons.  

Thus, self-correction cannot be taken for granted. Rather, it is an attribute that needs to be demonstrated for a discipline to be an actual science. If psychological science wants to be a science, there should be empirical evidence that it is self-correcting. 

One piece of evidence for self-correction is that theories that are in doubt receive fewer citations.  Fortunately, modern software like the database WebofScience makes it very easy to count citations by year of publication.

Social Priming

In the past years, research on social priming has come under attack. Several replication studies failed to replicate key findings in this literature. In 2012, Nobel Laureate Daniel Kahneman wrote an open letter to John A. Bargh calling social priming “the poster child for doubts about the integrity of psychological research” (cf. Train Wreck blog post). I have demonstrated with statistical methods that many of the published results in this literature were obtained with questionable research methods that inflate the risk of false positive results (Before You Know It).

If science is self-correcting, we should see a decrease in citations of social priming articles.

John A. Bargh

The graph below shows the citations of John A. Bargh’s articles by year. 2019 does not count because it has just started. 2018 citations are still being added, but at a very low rate, so the 2018 data can be interpreted.

The graph shows that John A. Bargh’s citation counts still increased after 2012, when Kahneman published the open letter. However, publishing is a slow process, and many articles published in 2013 and 2014 had been written before 2012. Starting with 2015, we see a decrease in citations, and this decrease continues to 2018. The decrease seems to be accelerating, with a drop of 200 citations from 2017 to 2018.

In conclusion, there is some evidence of self-correction in psychology. However, Bargh may be an exception because an open letter by a Nobel Laureate is a rare and powerful impetus for self-correction.

Ap Dijksterhuis

Dijksterhuis is also known for work on unconscious processes and social priming. Importantly, a large replication study failed to replicate his professor-priming results in 2018 (Registered Replication Report).

The increase in citation counts stalled in 2011, even before the citation counts of John A. Bargh started to decrease. However, there was no clear decrease in the years from 2012 to 2017, while citation counts decreased by over 100 citations in 2018. Thus, there are some signs of self-correction here as well.

Fritz Strack

The work by Fritz Strack was also featured in Kahneman’s book. There have been two registered replication reports of work by Fritz Strack and both failed to replicate the original results (facial feedback, item-order effects).

Strack’s citation counts increased dramatically after 2012. However, in 2018 they decreased by 150 citations. We need the 2019 data to see whether this is a blip or the beginning of a downward trend.

Susan T. Fiske

To make sure that the trends for social priming researchers are not just general trends, we need a control condition. I picked Susan T. Fiske because she is an eminent social psychologist, but her work is different from social priming experiments. Her work is also more replicable than work by social priming researchers (social psychologists’ replicability rankings).

Fiske’s graph shows no decrease in 2018. Thus, the decreases seen for social priming researchers do not reflect a general trend in social psychology.

Conclusion

This blog post shows how citation counts can be used to examine whether psychological science is self-correcting, which is an essential feature of a science. There are some positive signs that the recent replication crisis in social psychology has triggered a process of self-correction. I suggest that further investigation of changes in citation counts is a fruitful area of research for meta-psychologists.

The Limited Utility of Network Models

This blog post is based on a commentary that was published in the European Journal of Personality in 2012. Republishing it as a blog post makes it openly accessible.

The Utility of Network Analysis for Personality Psychology
ULRICH SCHIMMACK and JUDITH GERE
European Journal of Personality, 26: 446–447 (2012)
DOI: 10.1002/per.1876

Abstract

We note that network analysis provides some new opportunities but also has some limitations: (i) network analysis relies on observed measures such as single items or scale scores; (ii) it is a descriptive method and, as such, cannot test causal hypotheses; and (iii) it does not test the influence of outside forces on the network, such as dispositional influences on behaviour. We recommend structural equation modelling as a superior method that overcomes limitations of exploratory factor analysis and network analysis.

Article

Cramer et al. (2012) introduce network analysis (NA) as a new statistical tool for the study of personality that addresses some limitations of exploratory factor analysis (EFA). We concur with the authors that NA provides valuable new opportunities but feel forced by the situational pressure of a 1000-word limit to focus on some potential limitations of NA.

We also compare NA to structural equation modelling (SEM) because we agree with the authors that SEM is currently the most powerful statistical method for the testing of competing (causal) theories of personality.

One limitation of EFA and NA is that these methods rely on observed measures to examine relationships between personality constructs. For example, Cramer et al. (2012) apply NA to correlations among ratings of single items. The authors recognize this limitation but do not present an alternative to this suboptimal approach.

A major advantage of SEM is that it allows researchers to create measurement models that can remove random and systematic measurement error from observed measures of personality constructs. Measurement models of multimethod data are particularly helpful to separate perception and rater biases from actual personality traits (e.g. Gere & Schimmack, 2011; Schimmack, 2010).

Our second concern is that NA is presented as a statistical tool that can test dynamic process models of personality. Yet, NA is a descriptive method that provides graphical representations of patterns in correlation matrices. Thus, NA is akin to other descriptive methods (e.g. multidimensional scaling, cluster analysis and principal component analysis) that reveal patterns in complex data. These descriptive methods make no assumptions about causality. In contrast, SEM forces researchers to make a priori assumptions about causal processes and provides information about the ability of a causal theory to explain the observed pattern of correlations. Thus, we recommend SEM for theory testing and do not think it is appropriate to use NA for this purpose.

Specifically, we think it is questionable to make inferences about the Big Five model based on network graphs. Cramer et al. (2012) highlight the ability to visualize the centrality of items in a network as a major strength of NA. However, factor loading patterns and communalities in EFA provide similar information. In our opinion, the authors go beyond the statistical method of NA when they propose that activation of central components will increase the chances that neighbouring components will also become more activated. This assumption is problematic for several reasons.

First, it is not clear what the authors mean by the notion of activation of personality components. Second, the connections in a network graph are not causal paths. An item could be central because it is influenced by many personality components (e.g. life satisfaction is influenced by neuroticism, extraversion, agreeableness and conscientiousness) or because it is the cause of neighbouring items (life satisfaction influences neuroticism, extraversion, agreeableness and conscientiousness). Researchers interested in testing causal relationships should collect data that are informative about causality (e.g. twin data) and use SEM to test whether the data favour one causal theory over another.

We are also concerned about the suggestion of Cramer et al. (2012) that NA provides an alternative account of classic personality constructs such as extraversion and neuroticism. It is important to make clear that this alternative view challenges the core assumption of many personality theories that behaviour is influenced by personality dispositions.

That is, whereas the conception of neuroticism as a personality trait assumes that neuroticism has causal force (Funder, 1991), the conceptualization of neuroticism as a personality component implies that it does not have causal force. The authors compare personality constructs such as neuroticism with the concept of a flock. The term flock in the expression a flock of birds does not refer to an independent entity that exists apart from the individual birds, and it makes no sense to attribute the gathering of birds to the causal effect of flocking (the birds are gathered in the same place because they are a flock of birds). We prefer to compare neuroticism with the causal force of seasonal changes that make individual birds flock together.

END


Since we published this commentary, network models have become even more popular for making claims about important constructs like depression. So far, we have only seen pretty pictures of item clusters, but no evidence that network models provide new insights into the causes of depression or dynamic developments over time. The reason is that the statistical tool is merely descriptive, whereas the articles talk a lot about things that go well beyond the empirical contribution of plotting correlations or partial correlations. In this regard, network articles remind me of the old days in personality psychology, when researchers told stories about their principal components. Instead, researchers interested in individual differences should learn how to use structural equation modeling to test causality and to study stability and change of personality traits and states. Unfortunately, learning structural equation modeling is a bit more difficult than network analysis, which requires no theory and does not test model fit. Maybe that is the reason for the popularity of network models. Easy to do and pretty pictures. Who can resist?

Ulrich Schimmack, March 1, 2019

Schachter and Singer (1962): The Experiment that Never Happened

One of the most famous experiments in psychology is Schachter and Singer’s experiment that was used to support the two-factor theory of emotions: emotion is sympathetic arousal plus a cognition about the cause of the arousal (see Dror, 2017; Reisenzein, 2017, for historical reviews).

The classic article “Cognitive, social, and physiological determinants of emotional state” has been cited 2,799 times in WebofScience, and is a textbook classic.

Schachter and Wheeler (1962) summarize the “take-home message” of Schachter and Singer (1962).

In their study of cognitive and physiological determinants of emotional states, Schachter and Singer (1962) have demonstrated that cognitive processes play a major role in the development of emotional states (p. 121).

The “demonstration” was an experiment in which participants were injected either with epinephrine, to create a state of arousal, or with a placebo. This manipulation was crossed with a confederate who displayed either euphoric or angry behavior.

Schachter and Wheeler summarize the key findings.

In experimental situations designed to make subjects euphoric, those subjects who received injections of epinephrine were, on a variety of indices, somewhat more euphoric than subjects who received a placebo injection.

Similarly, in situations designed to make subjects angry and irritated, those who received epinephrine were somewhat angrier than subjects who received placebo.

[Note the discrepancy between the claim “play a major role” and “somewhat more”]

They proceed to make clear that this pattern, although expected, could also have been produced by chance alone.

In both sets of conditions, however, these differences between epinephrine and placebo subjects were significant, at best, at borderline levels of statistical significance.

[Note the discrepancy between “demonstrated” and “borderline significance”]

Schachter and Wheeler conducted another test of the two-factor theory. The study was essentially a conceptual replication and an extension of Schachter and Singer. The replication part of the study was that participants were again injected with a placebo or epinephrine. It is a conceptual replication because the target emotion was amusement, rather than anger or euphoria. Finally, the extension was a third condition in which participants were injected with Chlorpromazine, a sedative. This should suppress activation of sympathetic arousal and dampen amusement.

One dependent variable was observer ratings of amusement. As shown in Table 3, the means were in the predicted direction, but the difference between the placebo and epinephrine conditions was not significant.

Ratings of the film were additional dependent variables. Means are again in the same direction, but p-values are not reported, and the text mentions that some differences were significant only at borderline levels. The pattern makes clear that this would be the case for the contrasts of the Chlorpromazine condition with the other conditions, but not for the epinephrine–placebo contrast.

Based on these underwhelming and non-significant results, the authors concluded:

The overall pattern of experimental results of this study and the Schachter and Singer (1962) experiment gives consistent support to a general formulation of emotion as a function of a state of physiological arousal and of an appropriate cognition (p. 127). 

This claim is false. The replication study actually showed that an epinephrine injection had no statistically reliable influence on the intensity of emotions.

Dror (2017) made an interesting historical observation that Schachter was angry (presumably without an injection of epinephrine) that editors added “non-significant” to some of the results in the Schachter and Singer (1962) article.

“Since the paper has appeared students have tittered at me, my colleagues look down at their plates.” The most serious issue, among several, was that Tables 6–9 were totally misleading. The “notation ‘ns’ in the p column,” as Schachter explained, “is meaningless. Nothing was tested” (Schachter, S., 1962, Schachter to R. Solomon, May 3, 1962). (Dror, 2017)

Nothing was tested and nothing was proven, but a theory was born and it lives on in the imagination of hundreds of contemporary psychologists. The failure to provide evidence for it in Schachter and Wheeler was largely ignored. The article has been cited only 145 times compared to 2,799 for Schachter and Singer.

One reason for the impact of Schachter and Singer is that it was published in Psychological Review, while Schachter and Wheeler was published in the Journal of Abnormal and Social Psychology, which later became the Journal of Personality and Social Psychology.

Psychological Review is the journal where a select few psychologists can make sweeping claims with very little evidence, in the hope that other researchers will provide evidence for them. Given that psychology only publishes confirmatory evidence, every Psychological Review article is a self-fulfilling prophecy: every proposed theory will receive empirical support (even if only with marginal significance) and will live forever.

So, what are the take-home messages from this blog post?

  1. The two-factor theory of emotions was never empirically supported.
  2. Just because it was published in Psych Review, doesn’t mean it is true.
  3. Psychology is not an evidence-based science, until it stops worshiping historically important articles as evidence for some eternal truth.
  4. It is not bullying if the target of scientific criticism is deceased.

References

Dror, O. E. (2017). Deconstructing the “Two Factors”: The Historical Origins of the Schachter–Singer Theory of Emotions. Emotion Review, 9(1), 7–16. https://doi.org/10.1177/1754073916639663

Reisenzein, R. (2017). Varieties of Cognition-Arousal Theory. Emotion Review, 9(1), 17–26. https://doi.org/10.1177/1754073916639665

The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model

Zou, C., Schimmack, U., & Gere J. (2013). The Validity of Well-Being Measures: A Multiple-Indicator–Multiple-Rater Model. Psychological Assessment, 25(4), 1247–1254.

ABSTRACT

In the subjective indicators tradition, well-being is defined as a match between an individual’s actual life and his or her ideal life. Common well-being indicators are life-satisfaction judgments, domain satisfaction judgments, and measures of positive and negative affect (hedonic balance). These well-being indicators are routinely used to study well-being, but a formal measurement model of well-being is lacking. This article introduces a measurement model of well-being and examines the validity of self-ratings and informant ratings of well-being. Participants were 335 families (1 student with 2 parents, N = 1,005). The main findings were that (a) self-ratings and informant ratings are equally valid, (b) global life-satisfaction judgments and averaged domain satisfaction judgments are about equally valid, and (c) about 1/3 of the variance in a single indicator is valid. The main implication is that researchers should demonstrate convergent validity across multiple indicators by multiple raters.

Keywords: life satisfaction, affect, self-reports, informant-reports, multitrait–multimethod

Well-being is an important goal for many people, thus, social scientists from a variety of disciplines study well-being. A major problem for well-being scientists is that well-being is difficult to define and measure (Diener, Lucas, Schimmack, & Helliwell, 2009). These difficulties may threaten the validity of well-being measures. The aim of the present study is to examine the validity of the most commonly used measures of well-being.

A measure is valid if it measures what it is intended to measure. This definition of validity implies that it is important to define a construct (i.e., what is being measured?) before it is possible to evaluate the validity of a measure (Schimmack, 2010). Unfortunately, there is no agreement about the definition of the term well-being (Diener et al., 2009). It is therefore necessary to explain how we define the term well-being before we can examine the validity of well-being measures. We agree with philosophical arguments that well-being is a subjective concept (Diener, 1984; Sumner, 1996; see Diener, Suh, Lucas, & Smith, 1999, for a detailed discussion). A key criterion of a subjective definition of well-being is that the evaluation has to take the subjective values, motives, and ideals of individuals into account; that is, is his or her life going well for him or her? Accordingly, we define well-being as a match between an individual’s actual life and his or her ideal life. This definition is consistent with the prevalent definition of well-being in the social indicators tradition (Andrews & Withey, 1976; Cantril, 1965; Diener, 1984; Veenhoven & Jonkers, 1984). This definition of well-being led to the creation of subjective well-being indicators such as life-satisfaction judgments (Diener, 1984). These measures are routinely used to make inferences about the determinants of well-being. These inferences implicitly assume that well-being measures are valid, but the literature on the validity of these measures is sparse and controversial (Schwarz & Strack, 1999; Schimmack & Oishi, 2005; Schneider & Schimmack, 2009). Since there is no gold standard to validate well-being measures, convergent validity between self-ratings and informant ratings of well-being has been used as the primary evidence for the validity of well-being measures (Diener et al., 2009). However, a major limitation of previous studies is that they did not provide quantitative information about the amount of valid variance in different well-being measures (cf. Schneider & Schimmack, 2009). Our study addresses this problem and provides the first quantitative estimates of the amount of valid variance in the most widely used measures of well-being.

One problem in the estimation of effect sizes is that estimates based on small samples are imprecise because sampling error is substantial. To obtain data from a large sample, we used a round-robin design. In this design, participants are both targets and informants, thus, increasing the number of targets. To ensure that informants have valid information about targets’ well-being, we used families as units of analysis. Specifically, we recruited university students and their biological parents (see Table 1).

A round-robin design creates two problems for a standard structural equation model. First, observations are not independent because participants are recruited as triads rather than as individuals. Second, the distinction between the three raters (student, mother, & father) does not provide information about the validity of self-ratings because self-ratings are a function of rater and target (i.e., the diagonal in Table 1).

To overcome these problems, we made use of advanced features in the structural equation modeling program Mplus 5.0 (Muthén & Muthén, 2007). First, we used the CLUSTER command to obtain adjusted standard errors and fit indices that take the interdependence among family members into account. Second, we rearranged the data to create variables with self-ratings (see Table 2). This creates missing data in the diagonal of the traditional round-robin design. To analyze these data with missing values we used the MODEL = COMPLEX function of Mplus (Muthén & Muthén, 2007). Thus, our model included 16 (4 raters × 4 measures) observed variables.

A Measurement Model of Well-Being

Quantitative estimates of validity require a formal measurement model in which variation in well-being (the match between individuals’ actual and ideal lives) is an unobserved cause that produces variation in observed well-being measures (e.g., self-ratings of life-satisfaction; cf. Schimmack, 2010). Our measurement model of well-being (see Figure 1) is similar to Diener et al.’s (1999) theoretical model of well-being. It is also related to the causal systems model of subjective well-being (Busseri & Sadava, 2011). In this model, positive affect and negative affect are distinct affective experiences. For most people, feeling good and not feeling bad is an important part of an ideal life, and the balance of positive versus negative affect serves as an important basis for life-satisfaction judgments (Schimmack, Radhakrishnan, Oishi, Dzokoto, & Ahadi, 2002; Suh, Diener, Oishi, & Triandis, 1998). Consistent with these assumptions, positive affect and negative affect are distinct components of hedonic balance (using a formative measurement model), and hedonic balance influences well-being. The formative measurement model of hedonic balance makes no assumptions about the correlation between its components. As prior research often reveals a moderate negative correlation between positive affect and negative affect, our model allows for the two components to correlate with each other (Diener, Smith, & Fujita, 1995; Gere & Schimmack, 2011). The well-being factor is identified by two satisfaction measures, global life-satisfaction judgments and averaged domain satisfaction judgments. Prior studies often relied exclusively on global life-satisfaction judgments (Lucas, Diener, & Suh, 1996; Walker & Schimmack, 2008). The problem with this approach is that global life-satisfaction judgments can be influenced by focusing illusions (Kahneman, Krueger, Schkade, Schwarz, & Stone, 2006; but see Schimmack & Oishi, 2005). Focusing illusions could produce systematic measurement error in global life-satisfaction judgments that could attenuate the influence of hedonic balance on well-being. To address this concern, our model included averaged domain satisfaction judgments as a second indicator of well-being. As averaged domain satisfaction judgments are not susceptible to focusing illusions, the focusing illusion hypothesis predicts that averaged domain satisfaction judgments have a higher loading on the well-being factor (i.e., are more valid) than global life-satisfaction judgments.

Figure 1 does not show how our model incorporated systematic rater biases. For each rater, we created a single bias factor. This factor represents general evaluative biases in self-ratings and ratings of others that influence personality and well-being ratings (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Kim, Schimmack, & Oishi, 2012; Schimmack, Schupp, & Wagner, 2008).

The Present Study

Model fit was assessed using standard criteria of acceptable model fit such as a comparative fit index (CFI) > .95, root-mean-square error of approximation (RMSEA) < .06, and standardized root-mean-square residual (SRMR) < .08 (Schermelleh-Engel, Moosbrugger, & Muller, 2003). Due to the large sample size of the present data (N = 1,005), tests of model comparison using p-values will often lead to misleading results (cf. Raftery, 1995). Therefore, we used the Bayesian information criterion (BIC) for model comparisons. Models with lower BIC values are preferable because they are more parsimonious. This is especially important in new research areas because small effects are less likely to replicate. Following Raftery’s (1995) standards, a difference in BIC values greater than 10 can be interpreted as very strong evidence to support the model with the lower BIC value.
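For reference, the BIC used in these comparisons has the standard form (the formula is added here for readers; it is not spelled out in the original article):

\[
\mathrm{BIC} = -2 \ln \hat{L} + k \ln N,
\]

where \(\hat{L}\) is the maximized likelihood, \(k\) is the number of free parameters, and \(N\) is the sample size. Lower values indicate a better trade-off between fit and parsimony, and, following Raftery (1995), a difference greater than 10 counts as very strong evidence for the model with the lower value.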

Method

Participants were 335 students at the University of Toronto and their parents (335 triads; N = 1,005). Of the 335 students, 235 were women and 100 were men, and the age ranged from 17 to 30 years (Mage = 19.56, SD = 2.23). The age of mothers ranged from 37 to 63 years (Mage = 48.25, SD = 5.08). The age of fathers ranged from 38 to 72 years (Mage = 51.67, SD = 5.67). Students were required to be living with both of their biological parents so that each member of the family had good knowledge of one another. Students from the university took part in the study for either $25 or course credit. Their parents each received $25 for participating in the study. Two hundred thirty-five students came to the laboratory with their parents to complete the study. One hundred students and their parents completed the study in their homes.

Participants who came into the laboratory filled out consent forms, and these participants were seated in separate rooms to ensure that reports were made independently. They filled out a series of questionnaires about themselves and about the other two members of their families. They were then debriefed and thanked for their participation. Students who took the questionnaires home met with a researcher who gave them detailed instructions and the questionnaire packages. Participants were asked to fill out the questionnaires in separate rooms and refrain from talking about their responses until all members of the family had completed the questionnaire. Each family member received an envelope, into which the family member placed his or her own completed questionnaire, and he or she sealed the envelope and signed it across the flap. Once the questionnaire packages were completed, participants returned the questionnaire packages, and they were debriefed and thanked for their participation.

Measures

Since well-being is defined as an evaluation of an individual’s actual life, the assessment of well-being has to be retrospective. For this reason, we asked participants to think about the past 6 months when answering the questions. Additionally, since global judgments of life satisfaction can be influenced by temporarily accessible information (Schimmack & Oishi, 2005; Schwarz & Strack, 1999), the global self-ratings of life satisfaction were assessed first.

Global life evaluation. For the global evaluative judgments, the first three items of the Satisfaction With Life Scale were used (SWLS; Diener, Emmons, Larsen, & Griffin, 1985). The items ask participants to evaluate their lives on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). The first three items (“In most ways my life is close to my ideal”; “The conditions of my life are excellent”; “I am satisfied with my life”) were chosen because they have been shown to have better psychometric properties than the last two items of the scale (Oishi, 2006). Consistent with prior studies, the internal consistency of the three-item scale was good, alphas > .80 (α = .83 for students; α = .89 for mothers; α = .89 for fathers). The items for the informant reports were virtually the same, but the wording was changed to an informant-report format (e.g., Kim et al., 2012). Informants were instructed to fill out the scale from the target’s perspective. For example, students serving as informants for their father would rate “In most ways my father thinks that his life is close to his ideal.” Ratings were made on 7-point Likert scales. The internal consistency of informant ratings was similar to the internal consistency of self-ratings (ranging from α = .85 to α = .93).

Averaged domain satisfaction. Domain satisfaction was assessed with single-item indicators for six important life domains, using satisfaction judgments (“I am satisfied with . . .”). The life domains were romantic life, work/academic life, health, recreational life, housing, and friendships. Responses were made on 7-point Likert scales ranging from 1 (strongly disagree) to 7 (strongly agree). The domains were chosen based on previous studies showing that these domains are rated as moderately to very important (Schimmack, Diener, & Oishi, 2002). We averaged these items to obtain an alternative measure of life evaluations. The informant version of the questionnaire changed the stem from “I am . . .” to “My son/daughter/mother/father is . . .” and “my” to “his/her.”

Positive and negative affect. Positive and negative affect were assessed using the Hedonic Balance Scale (Schimmack et al., 2002). The scale has three items for positive affect (pleasant, positive, good) and three items for negative affect (unpleasant, negative, bad). The items for positive and negative affect were averaged separately to create composites for positive and negative affect, respectively. All of the self-ratings for positive affect had a reliability of over .80 (α = .82 for students; α = .85 for mothers; α = .85 for fathers). Similarly, all of the self-ratings for negative affect had a reliability of over .75 (α = .80 for students; α = .75 for mothers; α = .78 for fathers). For the informant reports, “. . . how often do you experience the following feelings?” was replaced with “. . . how often does your mother/father/son/daughter experience the following feelings?” All of the informant reports had reliabilities of over .75 (ranging from α = .75 to α = .89).

Results

Multitrait–Multimethod Matrix

Table 3 shows the correlations among the 16 variables created by crossing the four indicators (life satisfaction, domain satisfaction, positive affect, and negative affect) with the four raters (self, student informant, mother informant, and father informant). Note that since the self cannot also serve as the informant for the self, correlations between self-reports and informant reports are based on 66% of all observations. The correlations between the self-report measures were based on 100% of the observations.

Correlations between the same construct assessed with different methods (i.e., convergent validity coefficients) are bolded. All of the convergent validity coefficients were significantly greater than zero and exceeded a minimum value of r = .25. Convergent validity correlations for affective indicators (positive affect and negative affect) were lower than correlations for the evaluative indicators (life satisfaction and domain satisfaction). These findings replicate the results of a meta-analysis (Schneider & Schimmack, 2009).

Table 3 can also be used to examine whether each indicator measures well-being in a slightly different manner. Twenty-two out of 24 cross-indicator–cross-rater correlations were weaker than the convergent validity coefficients, indicating that the different indicators have unique variance. This finding replicates Lucas et al.’s (1996) results. However, Table 3 also shows that all well-being measures are related to each other. This pattern of results is consistent with the assumption that all measures reflect a common construct.

Table 3 also shows stronger same-rater correlations than cross-rater correlations. This pattern is consistent with our assumption that ratings by a single rater are influenced by an evaluative bias (Anusic et al., 2009; Campbell & Fiske, 1959). Most important, Table 3 provides new information about informant–informant agreement. One notable pattern in the data is that the correlations between informant ratings by mothers (mother informant) and fathers (father informant) were stronger than correlations of informant ratings by parents with those by students as informants. There are two possible explanations for this pattern. First, it is possible that students’ informant reports are less valid than parents’ informant ratings. However, this interpretation of the data is inconsistent with the finding that self-ratings were more highly correlated with students’ informant ratings than with parents’ informant ratings. Therefore, we favor the second explanation that parents’ informant ratings share method variance. This interpretation is also consistent with other multirater studies that have demonstrated shared method variance between parents’ ratings of their children’s personality (Funder, Kolar, & Blackman, 1995).

Structural Equation Modeling

We fitted the measurement model in Figure 1 to our data. In the first model, we did not constrain coefficients. This model served as the base model for model comparisons to more parsimonious models with constrained coefficients. The first model with unconstrained coefficients had acceptable fit to the data, χ2(df = 78) = 104.41, CFI = 0.995, RMSEA = 0.018, standardized root-mean-square residual (SRMR) = 0.026; BIC = 31,102. Factor loadings of ratings by different raters of the same measure (e.g., life-satisfaction) showed very similar loadings. We therefore specified a model that constrained factor loadings and residuals for the four raters to be equal. This model implies that ratings by different raters are equally valid. The model with constrained parameters maintained good fit and had a lower (i.e., superior) BIC value, χ2(df = 102) = 148.18, CFI = 0.991, RMSEA = 0.021, SRMR = 0.041; BIC = 30,993. In the next model, we constrained the loadings on the rater-specific bias factors to be equal across raters. Again, model fit remained acceptable, and BIC decreased, indicating that rater bias is similar across raters, χ2(df = 117) = 188.48, CFI = 0.986, RMSEA = 0.025, SRMR = 0.068; BIC = 30,936. We retained this model as the final model. The parameter estimates of the final model and their 95% confidence intervals are listed in Table 4. For ease of interpretation, the main parameter estimates are also included in Figure 1.

The main finding was that the life-satisfaction factor and the average domain satisfaction factor had very high loadings on the well-being factor. Thus, our results provide no support for the hypothesis that focusing illusions undermine the validity of global life-satisfaction judgments. We also found a very strong effect of hedonic balance on the well-being factor. Yet, all three measures of well-being had significant residual variances, indicating that the measures are not redundant. Most important, about 20% of the variance in well-being was not accounted for by hedonic balance. This suggests that affective measures and evaluative judgments can show divergent patterns of correlations with predictor variables.

The factor loadings of the observed variables on the factor representing the shared variance among raters (e.g., self-ratings of life satisfaction [LS] on the LS factor) can be interpreted as validity coefficients for specific constructs (e.g., the validity of a self-rating of life-satisfaction as a measure of life-satisfaction; cf. Schimmack, 2010). The validity coefficients of the four types of indicators were very similar (see Table 4). The validity coefficients suggest that about one third (29% to 38%) of the variance in a single indicator by a single rater (e.g., self-ratings of life satisfaction) is valid variance.

It is important to keep in mind that these estimates examine the validity of a single rater with regard to a specific measure of well-being rather than the validity of these measures as measures of well-being. To examine the validity of specific measures as measures of the well-being factor in our measurement model, we need to estimate indirect effects of the well-being factor on specific measures. For example, self-ratings of life satisfaction load at .60 on the life satisfaction factor. However, this does not mean that self-ratings of life satisfaction capture 36% (.6 × .6) of valid variance of well-being, because life satisfaction is not a perfect indicator of well-being. Based on our model, the life satisfaction factor loads at .96 on the well-being factor. We also need to take this measurement error into account to examine the validity of self-ratings of life satisfaction in assessing well-being (.96 × .60 = .58; valid variance = 33%).
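In equation form (restating the computation above; the \(\lambda\) notation for standardized loadings is added here and does not appear in the original article):

\[
\lambda_{\mathrm{WB} \rightarrow \mathrm{LS}} \times \lambda_{\mathrm{LS} \rightarrow \mathrm{self}} = .96 \times .60 \approx .58,
\qquad
\text{valid variance} = .58^{2} \approx .33 .
\]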

Discussion

Our study provides the first quantitative estimates of the validity of various well-being measures using a theoretically grounded model of well-being. Our main findings were that (a) about one third of the variance in a single well-being indicator is valid variance, (b) self-ratings are neither significantly more nor less valid than ratings by a single well-acquainted informant, (c) a large portion of the valid variance in a specific type of indicator is shared across indicators, and (d) hedonic balance and evaluative judgments have some unique variance.

We found no support for the focusing illusion hypothesis. If the distinction between hedonic balance and global life-satisfaction judgments were caused by a focusing illusion, the factor loading of life satisfaction on well-being should have been lower than the factor loading of the average domain satisfaction judgment. However, the actual results showed a slightly reversed pattern. This suggests that unique variance in evaluative judgments reflects valid well-being variance because individuals do not rely exclusively on hedonic balance to evaluate their lives. This finding provides empirical support for philosophical arguments against purely hedonistic definitions of well-being (Sumner, 1996). At the same time, the overlap between evaluative judgments and hedonic balance is substantial, indicating that positive experiences make an important contribution to well-being for most individuals. Another noteworthy finding was that global life-satisfaction judgments and averaged domain satisfaction judgments were approximately equally valid. This finding contradicts previous findings that averaged domain satisfaction judgments were more valid in a study with friends as informants (Schneider & Schimmack, 2010). Future research needs to examine whether the type of informant is a moderator. For example, it is possible that global life-satisfaction judgments are more difficult to make, which gives family members an advantage over friends. Subsequently, we discuss the main implications of our findings for the use of well-being measures in the assessment of individuals’ well-being and for the use of well-being measures in policy decisions.

Validity of Well-Being Indicators

Our results suggest that about one third of the variance in a single well-being indicator by a single rater is valid variance. This finding has important implications for the interpretation of studies that rely on a single well-being indicator as a measure of well-being. For example, many important findings about well-being are based on a single global life-satisfaction rating in the German Socio-Economic Panel (e.g., Lucas & Schimmack, 2009). It is well known that observed effect sizes in these studies are attenuated by random measurement error and that it would be desirable to correct effect size estimates for unreliability (Schmidt & Hunter, 1996). However, systematic measurement error can further attenuate observed effect sizes. Schimmack (2010) proposed that quantitative estimates of validity could be used to disattenuate observed effect sizes for invalidity. To illustrate the implications of correcting for invalidity in well-being indicators, we use Kahneman et al.’s (2006) finding that household income was a moderate predictor of self-reported life-satisfaction (r = .32). Our findings suggest that this observed relationship underestimates the relationship between household income and well-being. To disattenuate the observed relationship, the observed correlation has to be divided by the validity coefficient (i.e., .96 × .60 = .58). Thus, the corrected estimate of the true effect size would increase to r = .56 (.32/.58), which is considered a strong effect size (Cohen, 1992). Researchers may be reluctant to trust adjusted effect sizes because they rely on assumptions about validity. However, the common practice of relying on observed relationships as estimates of effect sizes also relies on an implicit assumption, namely, that the observed measure is perfectly valid. In comparison to an assumption of 100% valid variance in a single global life-satisfaction judgment, our estimate of about one-third valid variance is more realistic and supported by empirical evidence. Nevertheless, our findings should only be treated as a first estimate and a benchmark for future studies. Future research needs to replicate our findings and examine moderating factors of validity in well-being measures.
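In equation form (restating the correction described above; the notation is added here):

\[
r_{\text{corrected}} = \frac{r_{\text{observed}}}{\text{validity coefficient}} = \frac{.32}{.96 \times .60} = \frac{.32}{.58} \approx .56 .
\]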

Self-Reports Versus Informant Reports

Schneider and Schimmack (2009) noted that previous studies failed to compare the validity of self-ratings and informant ratings. Our results suggest that self-ratings and ratings by a single well-acquainted informant are approximately equally valid. While this is a surprising finding given the subjective nature of well-being, it is not uncommon in personality psychology to find evidence of equal or sometimes greater validity in informant ratings than self-ratings. For instance, informant reports of personality often provide better predictive validity than self-reports (e.g., Kolar, Funder, & Colvin, 1996). Since we did not have any outcome measure of well-being (e.g., suicide) in the present study, we could not test the predictive validity of self- and informant reports. However, this is an important avenue for future research. To our knowledge, no study has compared self-ratings and informant ratings using life events that are known to influence well-being, such as marriage, divorce, or unemployment (Diener, Lucas, & Scollon, 2006).

Informant ratings also have an important advantage over self-ratings. Namely, it is possible to obtain ratings from multiple informants, but there is only one self to provide self-ratings. Aggregation of informant ratings can substantially increase the validity of informant ratings. We computed well-being indicators for single raters and multiple raters using the following weights (Well-Being = 1.5 × Life Satisfaction + 1.5 × Domain Satisfaction + 2 × Positive Affect – 1 × Negative Affect) and computed the correlation with the well-being factor in Figure 1. The correlations were r = .62 for self-ratings, r = .77 for an aggregate of three informant ratings, and r = .81 for an aggregate of all four ratings. Although the difference between .62 and .77 may not seem impressive, it implies that aggregation across raters can increase the amount of valid variance from one third to two thirds of the observed variance. This finding suggests that clinicians can benefit considerably from obtaining well-being measures from multiple informants to assess individuals’ well-being.
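As an illustration of this aggregation step, the sketch below computes the weighted composite for each rater and averages the informant composites. The data file, column names, and informant labels are hypothetical; only the weights are taken from the text.

import pandas as pd

def wb_composite(df, rater):
    """Well-Being = 1.5*LS + 1.5*DS + 2*PA - 1*NA for a single rater."""
    return (1.5 * df[f"{rater}_ls"] + 1.5 * df[f"{rater}_ds"]
            + 2.0 * df[f"{rater}_pa"] - 1.0 * df[f"{rater}_na"])

ratings = pd.read_csv("wellbeing_ratings.csv")  # hypothetical multi-rater data
self_score = wb_composite(ratings, "self")
informant_mean = sum(wb_composite(ratings, r) for r in ["mother", "father", "sibling"]) / 3
# In the article, these composites were correlated with the latent well-being factor.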

Limitations

Our study has numerous limitations. The use of a convenience sample from a specific population means that the generalizability of our findings needs to be examined in samples drawn from other populations. However, our results are broadly consistent with meta-analytic findings (Schneider & Schimmack, 2009). Another limitation was that parents are not independent raters and appear to share rating biases. In the future, it would be desirable to obtain ratings from independent raters (e.g., friends and parents). Finally, our conclusions are limited by the assumptions of our model. While it is possible to fit other models to our data in Table 3 (e.g., Busseri & Sadava, 2011), the alternative models each have their own limitations. Future studies should test these alternative models to examine whether they reveal different or unique findings. We encourage readers to fit alternative models to the correlation matrix in Table 3 and to examine whether these models provide a better fit to our data. We consider our model merely a plausible first attempt to create a measurement model of well-being that can underpin empirical studies of well-being.

Conclusions

Although the study of happiness has been of great interest to many researchers and the general public, the validity of well-being measures has not improved over the past 50 years (Schneider & Schimmack, 2009). In order for well-being researchers to provide accurate information about the determinants of well-being, it is crucial to use valid methods to assess well-being. If invalid measures are used, findings that rely on such measures will also lack validity. The current study found that only about one third of the variance in a self-report measure of well-being is valid. To increase the validity of well-being measures, multiple methods of well-being assessment should be used. When better measures are used, researchers can also be more confident that their findings can be trusted.

References

Andrews, F. M., & Withey, S. B. (1976). Social indicators of well-being: America’s perception of life quality. New York, NY: Plenum.

Anusic, I., Schimmack, U., Pinkus, R. T., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97, 1142–1156. doi:10.1037/a0017159

Busseri, M. A., & Sadava, S. W. (2011). A review of the tripartite structure of subjective well-being: Implications for conceptualization, operationalization, analysis, and synthesis. Personality and Social Psychology Review, 15, 290–314. doi:10.1177/1088868310391271

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait–multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

Cantril, H. (1965). The pattern of human concerns (Vol. 4). New Brunswick, NJ: Rutgers University Press.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159. doi:10.1037/0033-2909.112.1.155

Diener, E. (1984). Subjective well-being. Psychological Bulletin, 95, 542–575. doi:10.1037/0033-2909.95.3.542

Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75. doi:10.1207/s15327752jpa4901_13

Diener, E., Lucas, R. E., Schimmack, U., & Helliwell, J. F. (2009). Well-being for public policy. New York, NY: Oxford University Press. doi:10.1093/acprof:oso/9780195334074.001.0001

Diener, E., Lucas, R. E., & Scollon, C. N. (2006). Beyond the hedonic treadmill: Revising the adaptation theory of well-being. American Psychologist, 61, 305–314.

Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69, 130–141. doi:10.1037/0022-3514.69.1.130

Diener, E., Suh, E. M., Lucas, R. E., & Smith, H. L. (1999). Subjective well-being: Three decades of progress. Psychological Bulletin, 125, 276–302.

Funder, D. C., Kolar, D. C., & Blackman, M. C. (1995). Agreement among judges of personality: Interpersonal relations, similarity, and acquaintanceship. Journal of Personality and Social Psychology, 69, 656–672. doi:10.1037/0022-3514.69.4.656

Gere, J., & Schimmack, U. (2011). A multi-occasion multi-rater model of affective dispositions and affective well-being. Journal of Happiness Studies, 12, 931–945. doi:10.1007/s10902-010-9237-3

Kahneman, D., Krueger, A. B., Schkade, D., Schwarz, N., & Stone, A. A. (2006). Would you be happier if you were richer? A focusing illusion. Science, 312, 1908–1910. doi:10.1126/science.1129688

Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations of well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102, 856–873. doi:10.1037/a0026803

Kolar, D. W., Funder, D. C., & Colvin, C. R. (1996). Comparing the accuracy of personality judgments by the self and knowledgeable others. Journal of Personality, 64, 311–337. doi:10.1111/j.1467-6494.1996.tb00513.x

Lucas, R. E., Diener, E., & Suh, E. (1996). Discriminant validity of well-being measures. Journal of Personality and Social Psychology, 71, 616–628. doi:10.1037/0022-3514.71.3.616

Lucas, R. E., & Schimmack, U. (2009). Income and well-being: How big is the gap between the rich and the poor? Journal of Research in Personality, 43, 75–78. doi:10.1016/j.jrp.2008.09.004

Muthén, L. K., & Muthén, B. O. (2007). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthén & Muthén.

Oishi, S. (2006). The concept of life satisfaction across cultures: An IRT analysis. Journal of Research in Personality, 40, 411–423. doi:10.1016/j.jrp.2005.02.002

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–164. doi:10.2307/271063

Schermelleh-Engel, K., Moosbrugger, H., & Muller, H. (2003). Evaluating the fit of structural equation models: Tests of significance and descriptive goodness-of-fit measures. Methods of Psychological Research, 8, 23–74.

Schimmack, U. (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257. doi:10.1002/per.771

Schimmack, U., Diener, E., & Oishi, S. (2002). Life-satisfaction is a momentary judgment and a stable personality characteristic: The use of chronically accessible and stable sources. Journal of Personality, 70, 345–384. doi:10.1111/1467-6494.05008

Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible information on life satisfaction judgments. Journal of Personality and Social Psychology, 89, 395–406. doi:10.1037/0022-3514.89.3.395

Schimmack, U., Radhakrishnan, P., Oishi, S., Dzokoto, V., & Ahadi, S. (2002). Culture, personality, and subjective well-being: Integrating process models of life satisfaction. Journal of Personality and Social Psychology, 82, 582–593.

Schimmack, U., Schupp, J., & Wagner, G. G. (2008). The influence of environment and personality on the affective and cognitive component of subjective well-being. Social Indicators Research, 89, 41–60. doi:10.1007/s11205-007-9230-3

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199–223. doi:10.1037/1082-989X.1.2.199

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376. doi:10.1007/s11205-009-9440-y

Schneider, L., & Schimmack, U. (2010). Examining sources of self-informant agreement in life-satisfaction judgments. Journal of Research in Personality, 44, 207–212. doi:10.1016/j.jrp.2010.01.004

Schwarz, N., & Strack, F. (1999). Reports of subjective well-being: Judgmental processes and their methodological implications. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: The foundations of hedonic psychology (pp. 61–84). New York, NY: Russell Sage.

Suh, E., Diener, E., Oishi, S., & Triandis, H. C. (1998). The shifting basis of life satisfaction judgments across cultures: Emotions versus norms. Journal of Personality and Social Psychology, 74, 482–493.

Sumner, L. W. (1996). Welfare, happiness, and ethics. New York, NY: Oxford University Press.

Veenhoven, R., & Jonkers, T. (1984). Conditions of happiness (Vol. 2). Dordrecht, the Netherlands: Reidel.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42, 490–497. doi:10.1016/j.jrp.2007.07.005

The Validation Crisis in Psychology

Most published psychological measures are unvalid. (subtitle)
*unvalid = the validity of the measure is unknown.

This blog post served as a first draft for a manuscript that is currently under review at Meta-Psychology. You can find the latest version here (pdf).

Introduction

Eight years ago, psychologists started to realize that they have a replication crisis. Many published results do not replicate in honest replication attempts that allow the data to decide whether a hypothesis is true or false.

The replication crisis is sometimes attributed to a lack of replication studies before 2011. However, this is not the case: replication studies were conducted and published before 2011, and most of them were successful. These successes were entirely predictable from the fact that only successful replications would be published (Sterling, 1959). These sham replication studies provided illusory evidence for theories that have been discredited over the past eight years by credible replication studies.

New initiatives that are called open science are likely to improve the replicability of psychological science in the future, although progress towards this goal is painfully slow.

This blog post addresses another problem in psychological science. I call it the validation crisis. Replicability is only one necessary feature of a healthy science. Another necessary feature of a healthy science is the use of valid measures. This feature of a healthy science is as obvious as the need for replicability. To test theories that relate theoretical constructs to each other (e.g., construct A influences construct B for individuals drawn from population P under conditions C), it is necessary to have valid measures of constructs. However, it is unclear which criteria a measure has to fulfill to have construct validity. Thus, even successful and replicable tests of a theory may be false because the measures that were used lacked construct validity.

Construct Validity

The classic article on “Construct Validity” was written by two giants in psychology: Cronbach and Meehl (1955). Every graduate student of psychology, and surely every psychologist who has published a psychological measure, should be familiar with this article.

The article was the result of an APA task force that tried to establish criteria, now called psychometric properties, for tests to be published. The result of this project was the creation of the construct “construct validity.”

The chief innovation in the Committee’s report was the term construct validity. (p. 281).

Cronbach and Meehl provide their own definition of this construct.

Construct validation is involved whenever a test is to be interpreted as a measure of some attribute or quality which is not “operationally defined” (p. 282).

In modern language, construct validity is the relationship between variation in observed test scores and a latent variable that reflects corresponding variation in a theoretical construct (Schimmack, 2010).

Thinking about construct validity in this way makes it immediately obvious why it is much easier to demonstrate predictive validity, which is the relationship between observed test scores and observed criterion scores, than to establish construct validity, which is the relationship between observed test scores and a latent, unobserved variable. To demonstrate predictive validity, one can simply obtain scores on a measure and a criterion and compute the correlation between the two variables. The correlation coefficient shows the amount of predictive validity of the measure. However, because constructs are not observable, it is impossible to use simple correlations to examine construct validity.
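For concreteness, computing a predictive validity coefficient takes only a few lines; the scores below are placeholders, not real data.

import numpy as np

# predictive validity: correlation between observed test scores and an observed criterion
test_scores = np.array([3.2, 4.1, 2.8, 3.9, 4.5, 3.1])
criterion = np.array([2.9, 4.3, 2.5, 3.6, 4.8, 3.0])
print(round(float(np.corrcoef(test_scores, criterion)[0, 1]), 2))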

The problem of construct validation can be illustrated with the development of IQ scores. IQ scores can have predictive validity (e.g., for performance in graduate school) without making any claims about the construct that is being measured (IQ tests measure whatever they measure, and what they measure predicts important outcomes). However, IQ tests are often treated as measures of intelligence. For IQ tests to be valid measures of intelligence, it is necessary to define the construct of intelligence and to demonstrate that observed IQ scores are related to unobserved variation in intelligence. Thus, construct validation requires clear definitions of constructs that are independent of the measure that is being validated. Without a clear definition of a construct, the meaning of a measure reverts essentially to “whatever the measure is measuring,” as in the old saying “Intelligence is whatever IQ tests are measuring.” This saying shows the problem of research with measures that have no clear construct and no construct validity.

In conclusion, the challenge in construct validation research is to relate a specific measure to a well-defined construct and to establish that variation in test scores is related to variation in the construct.

What Are Constructs?

Construct validation starts with an assumption. Individuals are assumed to have an attribute; today we might call it a personality trait. Personality traits are typically not directly observable (e.g., kindness rather than height), but systematic observation suggests that the attribute exists (some people are kinder than others across time and situations). The first step is to develop a measure of this attribute (e.g., a self-report measure “How kind are you?”). If the test is valid, variation in the observed scores on the measure should be related to the personality trait.

A construct is some postulated attribute of people, assumed to be reflected in test performance (p. 283).

The term “reflected” is consistent with a latent variable model, where unobserved traits are reflected in observable indicators. In fact, Cronbach and Meehl argue that factor analysis (not principal component analysis!) provides very important information about construct validity.

We depart from Anastasi at two points. She writes, “The validity of a psychological test should not be confused with an analysis of the factors which determine the behavior under consideration.” We, however, regard such analysis as a most important type of validation. (p. 286)

Factor analysis is useful because factors are unobserved variables, and factor loadings show how strongly an observed measure is related to variation in an unobserved variable, the factor. If multiple measures of a construct are available, they should be positively correlated with each other, and factor analysis will extract a common factor. For example, if multiple independent raters agree in their ratings of individuals’ kindness, the common factor in these ratings may correspond to the personality trait kindness, and the factor loadings provide evidence about the degree of construct validity of each measure (Schimmack, 2010).

In conclusion, factor analysis provides useful information about construct validity of measures because factors represent the construct and factor loadings show how strongly an observed measure is related to the construct.

It is clear that factors here function as constructs (p. 287).
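A minimal sketch of this logic, using the Python SEM package semopy and hypothetical kindness ratings by four independent informants (the data file and column names are made up for illustration):

import pandas as pd
from semopy import Model

data = pd.read_csv("kindness_ratings.csv")  # hypothetical informant ratings

# one common factor; the loadings indicate how strongly each rating reflects it
model = Model("kindness =~ rater1 + rater2 + rater3 + rater4")
model.fit(data)
print(model.inspect())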

Convergent Validity

The term convergent validity was introduced a few years later in another seminal article on validation research by Campbell and Fiske (1959). However, the basic idea of convergent validity was already specified by Cronbach and Meehl (1955) in the section “Correlation matrices and factor analysis.”

If two tests are presumed to measure the same construct, a correlation between them is predicted (p. 287).

If a trait such as dominance is hypothesized, and the items inquire about behaviors subsumed under this label, then the hypothesis appears to require that these items be generally intercorrelated (p. 288)

Cronbach and Meehl realize the problem of using just two observed measures to examine convergent validity. For example, self-informant correlations are often used in personality psychology to demonstrate validity of self-ratings. However, a correlation of r = .4 between self-ratings and informant ratings is open to very different interpretations. The correlation could reflect very high validity of self-ratings and modest validity of informant ratings or the opposite could be true.

If the obtained correlation departs from the expectation, however, there is no way to know whether the fault lies in test A, test B, or the formulation of the construct. A matrix of intercorrelations often points out profitable ways of dividing the construct into more meaningful parts, factor analysis being a useful computational method in such studies. (p. 300)

A multi-method approach avoids this problem, and factor loadings on a common factor can be interpreted as validity coefficients. More valid measures should have higher loadings than less valid measures. Factor analysis requires a minimum of three observed variables, but more are better. Thus, construct validation requires a multi-method assessment.

Discriminant Validity

The term discriminant validity was also introduced later by Campbell and Fiske (1959). However, Cronbach and Meehl already point out that high or low correlations can support construct validity. Crucial for construct validity is that the correlations are consistent with theoretical expectations.

For example, low correlations between intelligence and happiness do not undermine the validity of an intelligence measure because there is no theoretical expectation that intelligence is related to happiness. In contrast, low correlations between intelligence and job performance would be a problem if the jobs require problem solving skills and intelligence is an ability to solve problems faster or better.

Only if the underlying theory of the trait being measured calls for high item intercorrelations do the correlations support construct validity (p. 288).

Quantifying Construct Validity

It is rare to see quantitative claims about construct validity. Most articles that claim construct validity of a measure simply state that the measure has demonstrated construct validity, as if a test were either valid or invalid. However, the previous discussion already made it clear that construct validity is a quantitative construct: construct validity is the relation between variation in a measure and variation in the construct, and this relation can vary in strength. If we use standardized coefficients like factor loadings to assess the construct validity of a measure, construct validity can range from -1 to 1.

Contrary to the current practices, Cronbach and Meehl assumed that most users of measures would be interested in a “construct validity coefficient.”

There is an understandable tendency to seek a “construct validity coefficient.” A numerical statement of the degree of construct validity would be a statement of the proportion of the test score variance that is attributable to the construct variable. This numerical estimate can sometimes be arrived at by a factor analysis. (p. 289)
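In the notation of a standardized factor model (our framing, not Cronbach and Meehl’s), this coefficient is the standardized loading λ of the measure on the construct factor, and the proportion of valid variance is its square:

x = \lambda \xi + \epsilon, \qquad \operatorname{Var}(x) = \lambda^{2} + \operatorname{Var}(\epsilon) = 1, \qquad \text{valid variance} = \lambda^{2}

where x is the standardized test score and ξ the standardized construct. For example, a loading of .60 implies that 36% of the observed variance is attributable to the construct.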

Cronbach and Meehl were well aware that it is difficult to quantify validity precisely, even if multiple measures of a construct are available, because the factor may not correspond perfectly to the construct.

Rarely will it be possible to estimate definite “construct saturations,” because no factor corresponding closely to the construct will be available (p. 289).

And nobody today seems to remember Cronbach and Meehl’s (1955) warning that rejecting the null hypothesis that a test has zero validity is not the end goal of validation research.

It should be particularly noted that rejecting the null hypothesis does not finish the job of construct validation (p. 290)

The problem is not to conclude that the test “is valid” for measuring the construct variable. The task is to state as definitely as possible the degree of validity the test is presumed to have (p. 290).

One reason why psychologists may not follow this sensible advice is that estimates of construct validity for many tests are likely to be low (Schimmack, 2010).

The Nomological Net – A Structural Equation Model

Some readers may be familiar with the term “nomological net,” which was popularized by Cronbach and Meehl. In modern language, a nomological net is essentially a structural equation model.

The laws in a nomological network may relate (a) observable properties or quantities to each other; or (b) theoretical constructs to observables; or (c) different theoretical constructs to one another. These “laws” may be statistical or deterministic.

It is probably no accident that, at the same time as Cronbach and Meehl started to think about constructs as separate from observed measures, structural equation modeling was developed as a combination of factor analysis, which made it possible to relate observed variables to variation in unobserved constructs, and path analysis, which made it possible to relate variation in constructs to each other. Although laws in a nomological network can take on more complex forms than linear relationships, a structural equation model is a nomological network (but a nomological network is not necessarily a structural equation model).

As proper construct validation requires a multi-method approach and a demonstration of convergent and discriminant validity, SEM is ideally suited to examine whether the observed correlations among measures in a multi-trait-multi-method matrix are consistent with theoretical expectations. In this regard, SEM is superior to exploratory factor analysis. For example, it is possible to model shared method variance, which is impossible with exploratory factor analysis.
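As a hedged sketch of what such a model can look like, the lavaan-style syntax below (here via the Python package semopy) defines two trait factors plus a factor for shared self-report method variance; all file and variable names are hypothetical.

import pandas as pd
from semopy import Model

data = pd.read_csv("mtmm_data.csv")  # hypothetical multi-trait multi-method data

desc = """
kindness =~ kind_self + kind_informant + kind_behavior
honesty =~ hon_self + hon_informant + hon_behavior
selfreport =~ kind_self + hon_self
"""
model = Model(desc)
model.fit(data)
print(model.inspect())  # trait loadings separate from the method factor loadings

In a full multi-trait-multi-method model, the trait-method covariances would typically be constrained to zero; the syntax for such constraints differs across SEM programs and is omitted here.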

Cronbach and Meehl also realized that constructs can change as more information becomes available. It may also occur that the data fail to provide evidence for a construct. In this sense, construct validation is an ongoing process of improving our understanding of unobserved constructs and how they are related to observable measures.

Ideally this iterative process would start with a simple structural equation model that is fitted to some data. If the model does not fit, the model can be modified and tested with new data. Over time, the model would become more complex and more stable because core measures of constructs would establish the construct validity, while peripheral relationships may be modified if new data suggest that theoretical assumptions need to be changed.

When observations will not fit into the network as it stands, the scientist has a certain freedom in selecting where to modify the network (p. 290).

Too often psychologists use SEM only to confirm an assumed nomological network and it is often considered inappropriate to change a nomological network to fit observed data. However, SEM is as much testing of an existing construct as exploration of a new construct.

The example from the natural sciences was the initial definition of gold as having a golden color. However, later it was discovered that the pure metal gold is actually silver or white and that the typical yellow color comes from copper impurities. In the same way, scientific constructs of intelligence can change depending on the data that are observed. For example, the original theory may assume that intelligence is a unidimensional construct (g), but empirical data could show that intelligence is multi-faceted with specific intelligences for specific domains.

However, given the lack of construct validation research in psychology, psychology has seen little progress in the understanding of such basic constructs as extraversion, self-esteem, or well-being. Often these constructs are still assessed with measures that were originally proposed as measures of these constructs, as if divine intervention had led to the creation of the best measure of these constructs and future research only confirmed their superiority.

Instead, many claims about construct validity are based more on conjecture than on empirical support by means of nomological networks. This was true in 1955. Unfortunately, it is still true over 50 years later.

For most tests intended to measure constructs, adequate criteria do not exist. This being the case, many such tests have been left unvalidated, or a finespun network of rationalizations has been offered as if it were validation. Rationalization is not construct validation. One who claims that his test reflects a construct cannot maintain his claim in the face of recurrent negative results because these results show that his construct is too loosely defined to yield verifiable inferences (p. 291).

Given the difficulty of defining constructs and finding measures for them, even measures that show promise in the beginning might fail to demonstrate construct validity later, and new measures should show higher construct validity than the early measures. However, psychology shows no development in measures of the same construct. The most widely used measure of self-esteem is still Rosenberg’s scale from 1965, and the most widely used measure of well-being is still Diener et al.’s scale from 1985. It is not clear how psychology can make progress if it doesn’t make progress in the development of nomological networks that provide information about constructs and about the construct validity of measures.

Cronbach and Meehl are clear that nomological networks are needed to claim construct validity.

To validate a claim that a test measures a construct, a nomological net surrounding the concept must exist (p. 291).

However, there are few attempts to examine construct validity with structural equation models (Connelly & Ones, 2010; Zou, Schimmack, & Gere, 2013). [please share more if you know some]

One possible reason is that construct validation research may reveal that authors’ initial constructs need to be modified or that their measures have modest validity. For example, McCrae, Zonderman, Costa, Bond, and Paunonen (1996) dismissed structural equation modeling as a useful method to examine the construct validity of Big Five measures because it failed to support their conception of the Big Five as orthogonal dimensions with simple structure.

Recommendations for Users of Psychological Measures

The consumer can accept a test as a measure of a construct only when there is a strong positive fit between predictions and subsequent data. When the evidence from a proper investigation of a published test is essentially negative, it should be reported as a stop sign to discourage use of the test pending a reconciliation of test and construct, or final abandonment of the test (p. 296).

It is very unlikely that all hunches by psychologists lead to the discovery of useful constructs and development of valid tests of these constructs. Given the lack of knowledge about the mind, it is rather more likely that many constructs turn out to be non-existent and that measures have low construct validity.

However, the history of psychological measurement has only seen the development of more and more constructs and more and more measures to assess this increasing universe of constructs. Since the 1990s, the number of constructs has doubled because every construct has been split into an explicit and an implicit version. Presumably, there is even an implicit political orientation or an implicit gender identity.

The proliferation of constructs and measures is not a sign of a healthy science. Rather, it shows the inability of empirical studies to demonstrate that a measure is not valid or that a construct may not exist. This is mostly due to self-serving biases and motivated reasoning of test developers. The gains from a measure that is widely used are immense. Thus, weak evidence is used to claim that a measure is valid, and consumers are complicit because they can use these measures to make new discoveries. Even when evidence shows that a measure may not work as intended (e.g., Bosson et al., 2000), it is often ignored (Greenwald & Farnham, 2001).

Conclusion

Just like psychologists have started to appreciate replication failures in the past years, they need to embrace validation failures. Some of the measures that are currently used in psychology are likely to have insufficient construct validity. If this was the decade of replication, the 2020s may become the decade of validation, and maybe the 2030s will produce the first replicable studies with valid measures. Maybe this is overly optimistic, given the lack of improvement in validation research since Cronbach and Meehl (1955) outlined a program of construct validation research. Ample citations show that they were successful in introducing the term, but they failed to establish rigorous criteria of construct validity. The time to change this is now.

The Implicit Association Test at Age 21: No Evidence for Construct Validity

PREPRINT (UNDER REVIEW)

Abstract

The Implicit Association Test (IAT) is 21 years old. Greenwald et al. (1998) proposed that the IAT measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim. Most important, I show that few studies were able to test the discriminant validity of the IAT as a measure of implicit personality characteristics and that a single-construct model fits multi-method data as well as or better than a dual-construct model. Thus, the IAT appears to be a measure of the same personality characteristics that are measured with explicit measures. I also show that the validity of the IAT varies across personality characteristics. It has low validity as a measure of self-esteem, moderate validity as a measure of racial bias, and high validity as a measure of political orientation. The existing evidence also suggests that the IAT measures stable characteristics rather than states and has low predictive validity for single behaviors. Based on these findings, it is important that users of the IAT clearly distinguish between implicit measures and implicit constructs. The IAT is an implicit measure, but there is no evidence that it measures implicit constructs.

Keywords:  Personality, Individual Differences, Social Cognition, Measurement, Construct Validity, Convergent Validity, Discriminant Validity, Structural Equation Modeling

The Implicit Association Test at Age 21: No Evidence for Construct Validity

Twenty-one years ago, Greenwald, McGhee, and Schwartz (1998) published one of the most influential articles in personality and social psychology. It is already the 4th most cited article (4,582 citations in Web of Science) in the Journal of Personality and Social Psychology and will be number 3 this year. As the title “Measuring Individual Differences in Implicit Cognition” suggests, the article introduced a new individual difference measure that has been used in hundreds of studies to measure attitudes, stereotypes, self-concepts, well-being, and personality traits. Henceforth, I will refer to these constructs as personality characteristics.

A Critical Evaluation of Greenwald et al.’s (1998) Evidence for Discriminant Validity

The Implicit Association Test (IAT) uses reaction times in classification tasks to measure individual differences in the strength of associations (Nosek et al., 2007).  However, the main purpose of the IAT is not to measure associations or to provide an indirect measure of personality characteristics.  The key constructs that the IAT was designed to measure are individual differences in implicit personality characteristics as suggested in the title of Greenwald et al.’s (1998) seminal article “Measuring Individual Differences in Implicit Cognition.” 

The notion of implicit cognition is based on a conception of human information processing that largely takes place outside of consciousness, and the IAT was supposed to provide a window into the unconscious. “There has been an increased interest in measuring aspects of thinking and feeling that may not be easily accessed or available to consciousness. Innovations in measurement have been undertaken with the purpose of bringing under scrutiny new forms of cognition and emotion that were previously undiscovered” (Nosek, Greenwald, & Banaji, 2007, p. 265). 

Thus, the IAT was not just a new way of measuring the same individual differences that were already measured with self-report measures.  It was designed to measure information that is “simply unreachable, in the same way that memories are sometimes unreachable [by introspection]” (Nosek et al., 2007, p. 266).

The promise to measure individual differences that were not accessible to introspection explains the appeal of the IAT, and many articles used the IAT to make claims about individual differences in implicit forms of self-esteem, prejudice, or craving for drugs. Thus, the hypothesis that the IAT measures something different from self-report measures is a fundamental feature of the construct validity of the IAT. In psychometrics, the science of test validation, this property of a measure is known as discriminant validity (Campbell & Fiske, 1959).  If the IAT is a measure of implicit individual differences that are different from explicit individual differences, the IAT should demonstrate discriminant validity from self-report measures.  Given the popularity of the IAT, one might expect ample evidence for the discriminant validity of the IAT.  However, due to methodological limitations this is actually not the case.

Confusion about Convergent and Discriminant Validity

Greenwald et al.’s seminal article promised a measure of individual differences, but failed to provide evidence for the convergent or discriminant validity of the IAT.  Study 1 with N = 32 participants showed that, on average, participants preferred flowers to insects and musical instruments to weapons. These average tendencies cannot be used to validate the IAT as a measure of individual differences. However, Greenwald et al. (1998) also reported correlations across N = 32 participants between the IAT and explicit measures.  These correlations were low.  Greenwald et al. (1998) suggest that this finding provides evidence of discriminant validity. “This conceptual divergence between the implicit and explicit measures is of course expected from theorization about implicit social cognition” (p. 1470).  However, these low correlations are uninformative because discriminant validity requires a multi-method approach.  As the IAT was the only implicit measure, low correlations with explicit measures may simply show that the IAT has low validity as a measure of individual differences.  

Experiment 2 used the IAT with 17 Korean and 15 Japanese American students to assess their attitudes towards Koreans vs. Japanese.  In this study, Greenwald et al. found “unexpectedly the feeling thermometer explicit rating was more highly correlated with the IAT measure (average r = .59) than it was with another explicit attitude measure, the semantic differential (r = .43)” (p. 1473). This finding actually contradicts the hypothesis that the IAT measures some construct that is not measured with self-ratings because discriminant validity implies higher same-method than cross-method correlations (Campbell & Fiske, 1959).

Study 3 introduced the race IAT to measure prejudice with the IAT with a sample of 26 participants.  In this small sample, IAT scores were only weakly and not significantly correlated with explicit measures.  The authors realize that this finding is open to multiple interpretations. “Although these correlations provide no evidence for convergent validity of the IAT, nevertheless because of the expectation that implicit and explicit measures of attitude are not necessarily correlated-neither do they damage the case for construct validity of the IAT” (p. 1476).  In other words, the low correlations might reflect discriminant validity, but it could also show low convergent validity if the IAT and explicit measures measure the same construct.

The discussion has a section on “Discriminant Validity of IAT Attitude Measures,” although the design of the studies makes it impossible to provide evidence for discriminant validity. Nevertheless, Greenwald et al. (1998) claimed that they provided evidence for the discriminant validity of the IAT as a measure of implicit cognitions. “It is clear that these implicit-explicit correlations should be taken not as evidence for convergence among different methods of measuring attitudes but as evidence for divergence of the constructs represented by implicit versus explicit attitude measures” (p. 1477).   The scientific interpretation of these correlations is that they provide no empirical evidence about the validity of the IAT because multiple measures of a single construct are needed to examine construct validity (Campbell & Fiske, 1959). Thus, unlike most articles that introduce a new measure of individual differences, Greenwald et al. (1998) did not examine the psychometric properties of the IAT.  In this article, I examine whether evidence gathered over the past 21 years has provided evidence of construct validity of the IAT as a measure of implicit personality characteristics.

First Problems for the Construct Validity of the IAT

The IAT was not the first implicit measure in social psychology. Several different measures had been developed to measure self-esteem with implicit measures. A team of personality psychologists conducted the first multi-method validation study of the IAT as a measure of implicit self-esteem (Bosson, Swann, & Pennebaker, 2000). The main finding in this study was that several implicit measures, including the IAT, had low convergent validity. However, this finding has been largely ignored, and researchers started using the self-esteem IAT as a measure of some implicit form of self-esteem that operates outside of conscious awareness (Greenwald & Farnham, 2000).

At the same time, attitude researchers also found weak correlations between the race IAT and other implicit measures of prejudice. However, this lack of convergent validity was also ignored. An influential review article by Fazio and Olson (2003) suggested that low correlations might be due to different mechanisms. While it is entirely possible that evaluative priming and the IAT rely on different mechanisms, this is not relevant for the ability of either measure to be a valid measure of personality characteristics. Explicit ratings probably also rely on a different mechanism than the IAT. The mechanics of measurement have to be separated from the constructs that the measures aim to measure.

Continued Confusion about Discriminant Validity

Nosek et al. (2007) examined evidence for the construct validity of the IAT at age 7. The section on convergent and discriminant validity lists a few studies as evidence for discriminant validity. However, closer inspection of these studies shows that they suffer from the same methodological limitation as Greenwald et al.’s (1998) seminal study. That is, constructs were assessed with a single implicit method, the IAT. Thus, it was impossible to examine the construct validity of the IAT as a measure of implicit personality characteristics.

Take Nosek and Smyth’s (2007) “A Multi-trait-multi-method validation of the Implicit Association Test” as an example. The title clearly alludes to Campbell and Fiske’s approach to construct validation.  The data were 7 explicit ratings and 7 IATs of 7 attitude pairs (e.g., flower vs. insect).  The authors fitted several structural equation models to the data and claimed that a model with separate, yet correlated, explicit and implicit factors fitted the data better than a model with a single factor for each attitude pair.  This claim is invalid because each attitude pair was assessed with a single IAT and parcels were used to correct for unreliability.  This measurement model assumes that all of the reliable variance in an IAT that is not shared with explicit ratings or with IATs of other attitudes reflects implicit individual differences. However, it is also possible that this variance reflects systematic measurement error that is unique to a specific IAT.  A proper multi-method approach requires multiple independent measures of the same construct.   As demonstrated with real multi-method data below, there is consistent evidence that the IAT has systematic method variance that is unique to a specific IAT. 

Nevertheless, Nosek and Smyth’s (2007) multi-attitude study provided some interesting information. The correlation of the 7 means of the IAT and the 7 means of the explicit ratings was r = .86. For example, implicit and explicit measures showed a preference for flowers over insects and a dislike of evolution versus creation.  If implicit measures reflect distinct, unconscious processes, it is not clear why the means correspond to those based on self-reports. However, this finding is easily explained by a single-attitude model, where the mean structure depends on the mean structure of the latent attitude variable.

In sum, Nosek et al.’s claim that the IAT has demonstrated discriminant validity is based on a misunderstanding of Campbell and Fiske’s (1959) approach to construct validation. A proper assessment of construct validity requires demonstration of convergent validity before it is possible to demonstrate discriminant validity, and to demonstrate convergent validity it is necessary to use multiple independent measures of the same construct.  Thus, to demonstrate construct validity of the IAT as a measure of implicit personality characteristics requires multiple independent implicit measures.

First Evidence of Discriminant Validity in a Multi-Method Study

Cunningham, Preacher, and Banaji (2001) reported the results of the first multi-method study of prejudice. Participants were 93 students with complete data. Each student completed a single explicit measure of prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit measures: (a) the standard race IAT (Greenwald et al., 1998), (b) a response window IAT (Cunningham et al., 2001), and (c) a response-window evaluative priming task (Fazio et al., 1986). The assessment was repeated on four occasions, two weeks apart.

I used the published correlation matrix to reexamine the claim that a single-factor model does not fit the data. First, I was able to reproduce the model fit of the published dual-attitude model with MPLUS 8.2 (original fit: chi2(100, N = 93) = 111.58, p = .20; NNFI = .96; CFI = .97; RMSEA = 0.041, 90% confidence interval: 0.00, 0.071; reproduced fit: chi2(100, N = 93) = 112, CFI = .977, RMSEA = 0.036, 90%CI = .000 to .067). Thus, the model fit of the reproduced model serves as a comparison standard for the alternative models that I examined next.

The original model is a hierarchical model with an implicit attitude factor as a second-order factor and method-specific first-order factors. Each first-order factor has four indicators for the four repeated measurements with the same method. This model imposes constraints on the first-order loadings because they contribute both to the first-order relations among indicators of the same method and to the second-order relations of the different implicit methods to each other.

An alternative way to model multi-method data is a bi-factor model (Chen, West, & Sousa, 2006). A bi-factor model allows all measures to be directly related to the general trait factor that corresponds to the second-order factor in a hierarchical model. However, bi-factor models may not be identified if there are no method factors. Thus, a first step is to allow for method-specific correlated residuals and to examine whether these correlations are positive.

The model with a single factor and method-specific residual correlations fit the data better than the hierarchical model, chi2(80, N = 93) = 87, CFI = .988, RMSEA = 0.029, 90%CI = .000 to .065. Inspection of the residual correlations showed high correlations for the Modern Racism Scale, but less evidence for method-specific variance for the implicit measures. The response window IAT had no significant residual correlations. This explains the high factor loading of the response window IAT in the hierarchical model. It does not suggest that this is the most valid measure; rather, it shows that there is little method-specific variance. Fixing these residual correlations to zero improved model fit, chi2(86, N = 93) = 91, CFI = .991, RMSEA = 0.025, 90%CI = .000 to .062. I then tried to create method factors for the remaining methods. For the IAT, a method factor could only be created for the first three occasions. However, model fit for this model decreased unless occasion 2 was allowed to correlate with occasion 4. This unexpected finding is unlikely to reflect a real relationship. Thus, I retained the model with a method factor for the first three occasions only, chi2(89, N = 93) = 97, CFI = .986, RMSEA = 0.029, 90%CI = .000 to .064. I was able to fit a method factor for evaluative priming, but model fit decreased, chi2(91, N = 93) = 101, CFI = .983, RMSEA = 0.033, 90%CI = .000 to .065, and the first occasion did not load on the method factor. Model fit could be improved by fixing this loading to zero and by allowing for an additional correlation between occasions 1 and 3, chi2(91, N = 93) = 98, CFI = .988, RMSEA = 0.027, 90%CI = .000 to .062. However, there is no rationale for this relationship, and I retained the more parsimonious model. Fitting the measurement model for the Modern Racism Scale also decreased fit, but fit was better than for the model in the original article, chi2(94, N = 93) = 107, CFI = .977, RMSEA = 0.038, 90%CI = .000 to .068. This was the final model (Figure 1).
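For readers who want to try this kind of reanalysis themselves, the sketch below shows the general setup of a single-factor model with method-specific residual correlations in lavaan-style syntax (here via the Python package semopy). It is only an illustration of the model structure, not the MPLUS model reported above; the data file and variable names are hypothetical stand-ins for the four measures across four occasions.

import pandas as pd
from semopy import Model

data = pd.read_csv("prejudice_mtmm.csv")  # hypothetical data with 16 indicators

desc = """
prejudice =~ mrs1 + mrs2 + mrs3 + mrs4 + iat1 + iat2 + iat3 + iat4 + rwiat1 + rwiat2 + rwiat3 + rwiat4 + ep1 + ep2 + ep3 + ep4
mrs1 ~~ mrs2
mrs1 ~~ mrs3
mrs1 ~~ mrs4
mrs2 ~~ mrs3
mrs2 ~~ mrs4
mrs3 ~~ mrs4
"""
model = Model(desc)
model.fit(data)
print(model.inspect())  # factor loadings and method-specific residual covariances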

The most important results are the factor loadings of the measures on the trait factor. Factor loadings for the Modern Racism Scale ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged from .43 to .54 (M = .47). Factor loadings for the response window IAT ranged from .41 to .69 (M = .51). The evaluative priming measures had the lowest factor loadings, ranging from .13 to .47 (M = .29). Thus, there is no evidence that implicit measures are more strongly related to each other than to explicit measures, as stated in the original article.

In terms of absolute validity, all of these validity coefficients are low, suggesting that a single standard IAT measure on a single occasion has .47^2 = 22% valid variance.  Most important, these results suggest that the Modern Racism Scale and the IAT measure a single construct and that the low correlation between implicit and explicit measures reflects low convergent validity rather than high discriminant validity. 

In conclusion, a reexamination of Cunningham et al.’s data shows that the data do not provide evidence of discriminant validity and that the IAT may simply be an alternative measure of the same construct that is being measured with explicit measures like the Modern Racism Scale. Thus, the study provides no evidence for the construct validity of the IAT as a measure of implicit individual differences in race attitudes.

Meta-Analysis of Implicit – Explicit Correlations

Hofmann, Gawronski, Gschwendner, Le, and Schmitt (2005) conducted a meta-analysis of 126 studies that had reported correlations between an IAT and an explicit measure of the same construct. Notably, over one hundred studies had been conducted without using multiple implicit measures. The mono-method approach taken in these studies suggests that authors took the construct validity of the IAT for granted and used the IAT as a measure of implicit constructs. As a result, these studies provide no test of the construct validity of the IAT.

Nevertheless, the meta-analysis produced an interesting result. Correlations between implicit and explicit measures varied across personality characteristics. Correlations were lowest for self-esteem, which is consistent with Bosson et al.’s (2000) finding, and highest for simple attitude objects like consumer products (e.g., Pepsi vs. Coke). Any theory of implicit attitude measures has to explain this finding. One explanation could be that explicit measures of self-esteem are less valid than explicit measures of preferences for consumer goods. However, it is also possible that the validity of the IAT varies. Once more, a comparison of different personality characteristics with multiple methods is needed to test these competing theories.

Problems with Predictive Validity

Ten years after the IAT was published, another problem emerged. Some critics voiced concerns that the IAT, especially the race IAT, lacks predictive validity (Blanton, Jaccard, Klick, Mellers, Mitchell, & Tetlock, 2009). To examine the predictive validity of the IAT, Greenwald and colleagues (2009) published a meta-analysis of IAT-criterion correlations. The key finding was that “for 32 samples with criterion measures involving Black–White interracial behavior, predictive validity of IAT measures significantly exceeded that of self-report measures” (p. 17). Specifically, the authors reported a correlation of r = .24 between the IAT and a criterion and a correlation of r = .12 between an explicit measure and a criterion, and these correlations were significantly different from each other. A few years later, Oswald, Mitchell, Blanton, Jaccard, and Tetlock (2013) published a critical reexamination of the literature and reported different results: “IATs were poor predictors of every criterion category other than brain activity, and the IATs performed no better than simple explicit measures” (p. 171). The only exception was fMRI studies with extremely small samples that produced extremely large correlations, often exceeding the reliability of the IAT. It is well known that these correlations are inflated and difficult to replicate (Vul, Harris, Winkielman, & Pashler, 2009). Moreover, correlations with neural activity are not evidence that IAT scores predict behavior.

More recently, Greenwald and colleagues published a new meta-analysis (Kurdi et al., 2018). This meta-analysis produced weaker criterion correlations than the previous meta-analysis. The median IAT-criterion correlation was r = .050. This is also true if the analysis is limited to studies with the race IAT. After correcting for random measurement error, the authors report an average correlation of r = .14. However, correcting for unreliability yields hypothetical correlations that could be obtained if the IAT were perfectly reliable, which it is not. Thus, for the practical evaluation of the IAT as a measure of individual differences, it is more important how well actual IAT scores can predict a validation criterion. With small IAT-criterion correlations around r = .1, large samples would be required to have sufficient power to detect effects, especially incremental effects above and beyond explicit measures. Given that most studies had sample sizes of less than 100 participants, “most studies were vastly underpowered” (Kurdi et al., 2018, p. 1). Thus, it is now clear that IAT scores have at most low predictive validity, but it remains unclear whether, and under what conditions, IAT scores have any predictive validity after controlling for explicit predictors of behavior.
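As background, the standard correction for attenuation that produces such hypothetical estimates divides the observed correlation by the square root of the product of the two reliabilities (the meta-analysis may have used a more elaborate procedure):

r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx}\, r_{yy}}}

where r_xx is the reliability of the IAT and r_yy the reliability of the criterion. The corrected value describes a hypothetical, error-free measure, not the IAT scores that are actually used.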

Greenwald et al.’s (2009) 2008 US Election Study

In 2008, a historic event occurred in the United States: US voters had the opportunity to elect the first Black president. Although the outcome is now a historic fact, it was uncertain before the election how much Barack Obama’s racial background would influence White voters. There was also considerable concern that voters might not reveal their true feelings. This provided a great opportunity to test the validity of implicit measures of racial bias. If White voters are influenced by racial bias, IAT scores should predict voting intentions above and beyond explicit measures. According to the abstract of the article, the results confirm this prediction: “The implicit race attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures” (p. 242).

These claims were based on results of multiple regression analyses: “When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p = .001,” and when political conservatism was also included in the model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so.  A multiple regression analysis with explicit measures, implicit measures, and political orientation as predictors showed non-significant effects for the IAT, b = .002, se = .024, t = .087, p = .930 and the AMP, b = .033, se = .023, t = 1.470, p = .142. I also obtained the raw data from Anthony Greenwald, but I was unable to recreate the sample size of N = 1,057. Instead I obtained a similar sample size of N = 1,035.  Performing the analysis on this sample also produced non-significant results; IAT, b = -.003, se = .044, t = .070, p = .944 and the AMP, b = -.014, se = .042, t = 0.344, p = .731.
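For readers who want to check such claims themselves, standardized regression weights can be computed directly from a published correlation matrix as beta = R⁻¹r, where R is the predictor intercorrelation matrix and r is the vector of predictor-criterion correlations. The sketch below uses placeholder values, not the published matrix from this study.

import numpy as np

# placeholder predictor intercorrelations: explicit attitude, IAT, political orientation
R = np.array([
    [1.00, 0.30, 0.40],
    [0.30, 1.00, 0.25],
    [0.40, 0.25, 1.00],
])
r_xy = np.array([0.45, 0.15, 0.55])  # placeholder correlations with vote intention

beta = np.linalg.solve(R, r_xy)  # standardized regression coefficients
print(np.round(beta, 3))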

To fully explore the relationships among the variables in this valuable dataset, I fitted a structural equation model to the raw data (N = 1,035). The model had good fit, chi2(9) = 18.27, CFI = .995, RMSEA = .032, 90%CI = .009 to .052. As shown in Figure 2, the IAT did not have incremental predictive validity, as its residual variance was unrelated to voting. There is also no evidence of discriminant validity because the residuals of the two implicit measures are not correlated. However, the model does show that a pro-White bias predicts voting above and beyond political orientation. Thus, the results do support the hypothesis that racial bias influenced voting in the 2008 election. This bias is reflected in explicit and implicit measures. Interestingly, the validity coefficients in this study differ from those in Cunningham et al.’s study with undergraduate students. The factor loadings suggest that the IAT is the most valid measure of racial bias, with .59^2 = 36% valid variance as a measure of explicit attitudes. This makes the IAT as valid as the feeling thermometer, which is more valid than the Modern Racism Scale in Cunningham’s study. This finding has been replicated in subsequent studies (Axt, 2018).

In conclusion, a reexamination of the 2008 election study shows that the data are entirely consistent with a single-attitude model and that there is no evidence for incremental predictive validity or discriminant validity in these data. However, the study does show some predictive validity of the IAT and convergent validity with explicit measures. Thus, the results provide no support for the construct validity of the IAT as a measure of implicit individual differences, but they can be interpreted as evidence for its validity as a measure of the same construct that is measured with explicit measures. This shows that claims about validity depend on the construct that is supposed to be measured: a scale is a good measure of weight, but not of intelligence. The results here suggest that the race IAT is a moderately valid measure of racial bias, but not a valid measure of implicit bias; indeed, because scientific claims about implicit bias require valid measures of implicit bias, it is not even clear that implicit bias exists.

Reexamining a Multi-Trait Multi-Method Study

The most recent and extensive multi-trait multi-method validation study of the IAT was published last year (Bar-Anan & Vianello, 2018).  The abstract claims that the results provide clear support for the validity of the IAT as a measure of implicit cognitions, including implicit self-esteem. “The evidence supports the dual-attitude perspective, bolsters the validation of 6 indirect measures, and clears doubts from countless previous studies that used only one indirect measure to draw conclusions about implicit attitudes” (p. 1264). 

Below I show that these claims are not supported by the data, and that single-attitude models fit the data as well as dual-attitude models. I also show that dual-attitude models show low convergent validity across implicit measures, while IAT variants share method variance because they rely on the same mechanisms to measure attitudes.

Bar-Anan and Vianello (2018) fitted a single model to measures of self-esteem, racial bias, and political orientation. This makes the model extremely complex and produced some questionable results (e.g., the implicit and explicit method factors were highly correlated; some measures had negative loadings on the method factors). In structural equation modeling, it is good practice to fit smaller models before creating a larger model. Thus, I first examined construct validity for each domain separately before integrating the separate models into a single unified model.

Race IAT

I first fitted a dual-attitude model to measures of racial attitudes and included contact as the criterion variable. I did not specify a causal relationship between contact and attitudes because attitudes can influence contact and vice versa. The dual-attitude model had good fit, chi2(48) = 109.41; CFI = .975; RMSEA = 0.010 (90% confidence interval: 0.007, 0.012). The best indicator of the explicit factor was the preference rating (Figure 3). The best indicator of the implicit factor was the BIAT. However, all IAT variants had moderate to high loadings on the implicit factor. In contrast, the evaluative priming measure had a low loading on the implicit factor, and the AMP had a moderate loading on the explicit factor and no significant loading on the implicit factor. These results show that Bar-Anan and Vianello’s model failed to distinguish IAT-specific method variance from method variance that is shared by implicit measures in general; the IAT variants share little valid variance or method variance with conceptually distinct implicit measures.

Not surprisingly, a single-attitude model with an IAT method factor (Figure 4) also fit the data well, chi2(46) = 112.04; CFI = .973; RMSEA = 0.010 (90% confidence interval: 0.008, 0.013). Importantly, the model has no shared method variance between conceptually different explicit measures like preference ratings and the Modern Racism Scale (MRS). The AMP and the EP are both valid measures of attitudes, but with relatively modest validity. The BIAT has a validity of .46, with 22% explained variance. This result is more consistent with Cunningham et al.’s (2001) data than with Greenwald et al.’s (2009) data. The model also shows a clear relationship between contact and less pro-White bias. Finally, the model shows that the IAT method factor is unrelated to contact. Thus, any relationship between IAT scores and contact is explained by the shared variance with explicit measures.
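For readers who want to try this kind of model comparison themselves, the sketch below shows how a single-attitude model with an IAT method factor can be specified in lavaan-style syntax with the Python package semopy. The variable names and the data file are hypothetical placeholders for the Bar-Anan and Vianello indicators, and the exact syntax for fixing the factor covariance to zero may differ across semopy versions.

import pandas as pd
import semopy

# Hypothetical column names standing in for the indicators in the data set
model_desc = """
# one attitude factor measured by explicit and implicit indicators
Attitude =~ preference + mrs + thermometer + iat + biat + sciat + amp + ep
# method factor shared only by the IAT variants
IATMethod =~ iat + biat + sciat
# the method factor is orthogonal to the attitude factor
IATMethod ~~ 0*Attitude
# contact is allowed to covary with the attitude factor (no causal direction)
Attitude ~~ contact
"""

data = pd.read_csv("race_attitude_indicators.csv")  # hypothetical file name
model = semopy.Model(model_desc)
model.fit(data)
print(semopy.calc_stats(model).T)  # chi2, df, CFI, RMSEA, etc.
print(model.inspect())             # loadings, (co)variances, and their tests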

These results show that Bar-Anan and Vianello’s (2018) conclusions are not supported by the data. Although a dual-attitude model can be fitted to the data, it shows low convergent validity across different implicit measures, and a single-attitude model fits the data as well as a dual-attitude model.

Political Orientation

Figure 5 shows the dual-attitude model for political orientation. The explicit factor is defined by a simple rating of preference for Republicans versus Democrats, the Modern Racism Scale, the right-wing authoritarianism scale, and ratings of Hillary Clinton. The implicit factor is defined by the IAT, the brief IAT, the Go/No-Go Association Task, and single-category IATs. The remaining two implicit measures, the Affect Misattribution Procedure and evaluative priming, are allowed to load on both factors. Voting in the previous election is predicted by explicit attitudes. The model has good fit to the data, chi2(48) = 99.34; CFI = .991; RMSEA = 0.009 (90% confidence interval: 0.006, 0.011). The loading pattern shows that the AMP and EP load on the implicit factor. This supports the hypothesis that all implicit measures have convergent validity. However, the loadings for the IATs are much higher. In the dual-attitude framework this would imply that the IAT is a much more valid measure of implicit attitudes than the AMP or EP. Evidence for discriminant validity is weak: the correlation between the explicit and the implicit factor is r = .89, and the correlation in the original article was r = .91. Nevertheless, the authors concluded that the data favor the two-factor model because constraining the correlation to 1 reduced model fit.

However, it is possible to fit a single-construct model by allowing for an IAT-variant method factor, chi2(50) = 86.25; CFI = .993; RMSEA = 0.007 (90% confidence interval: 0.005, 0.010).  This model (Figure 6) shows that voting is predicted by a single latent factor that represents political orientation and that simple self-report measures of political orientation are the most valid measure of political orientation.  The IAT shows stronger correlations with explicit measures because it is a more valid measure of political orientation,  .74^2 = 55% valid variance, than the race IAT (22% valid variance).   

Self-Esteem

Figure 7 shows the results for a dual-attitude model of self-esteem.  Model fit was good, although CFI was lower than in the previous model due to weaker factor loadings, chi2(16) = 28.62; CFI = .950; RMSEA = 0.008 (90% confidence interval: 0.003, 0.013).  The model showed a moderate correlation between the explicit and implicit factors, r = .46, which is stronger than in the original article, r = .29, but clearly suggestive of two distinct factors. However, the nature of these two factors is less clear. The implicit factor is defined by the three IAT measures, whereas the AMP and EP have very low loadings on this factor.  This is also true in the original article with loadings of .24 for AMP and .13 for EP.  Thus, the results confirm Bosson’s seminal finding that different implicit measures have low convergent validity. 


As the implicit factor was mostly defined by the IAT measures, it was also possible to fit a single-factor model with an IAT method factor (Figure 8), chi2(16) = 31.50; CFI = .938; RMSEA = 0.009 (90% confidence interval: 0.004, 0.013). However, some of the results of this model are surprising.

According to this model, the validity coefficient of the widely used Rosenberg self-esteem scale is only r = .35, suggesting that only 12% of the variance in the Rosenberg self-esteem scale is valid variance. In addition, the IAT and the BIAT would be equally valid measures of self-esteem. Thus, previous findings of low implicit-explicit correlations for self-esteem (Bosson et al., 2000; Hofmann et al., 2005) would imply low validity of implicit and explicit measures. This finding would have dramatic implications for the interpretation of low self-esteem-criterion correlations. A true self-esteem-criterion correlation of r = .30 would produce an observed correlation of only r = .30 × .35 ≈ .11 with the Rosenberg self-esteem scale or the IAT. Correlations of this magnitude require large samples (N = 782) to have an 80% probability of obtaining a significant result with alpha = .05, or N = 1,325 with alpha = .005. Thus, most studies that tried to predict performance criteria from self-esteem were underpowered. However, the results of this study are limited by the use of an online sample and the lack of proper criterion variables to examine predictive validity. The main conclusion from this analysis is that a single-factor model with an IAT method factor fits the data well and that the dual-attitude model failed to demonstrate convergent validity across different implicit measures; a finding that replicates Bosson et al. (2000), which Bar-Anan and Vianello do not cite.
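These sample size figures can be checked with the standard Fisher z approximation for the power of a correlation test. The sketch below is my own illustration (it assumes an observed validity correlation of about r = .10), not part of the original analyses.

from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r, alpha=.05, power=.80):
    # Approximate N needed to detect a population correlation r with a
    # two-tailed test, based on the Fisher z transformation.
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

print(n_for_correlation(.10, alpha=.05))   # ~783, close to the N = 782 above
print(n_for_correlation(.10, alpha=.005))  # ~1326, close to the N = 1,325 above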

A Unified Model

After establishing well-fitting models for each personality characteristic, it is possible to fit a unified model. Importantly, no changes to the individual models were made, so that any decrease in fit can be attributed to the new relationships across different personality characteristics. Without any additional modifications, the overall model in Figure 9 had good fit, XX. Correlations among the IAT method factors showed significant positive correlations of the method factor for race with the method factors for self-esteem (r = .4) and political orientation (r = .2), but a negative correlation between the method factors for self-esteem and political orientation (r = -.3). This pattern of correlations is inconsistent with a simple method factor, which should produce positive correlations. Thus, it is impossible to fit a general method factor to different IATs. This result replicates Nosek and Smyth’s (2007) findings.

Correlations among the personality characteristics replicate the finding with Greenwald et al.’s (2009) data that Republicans are more likely to have a pro-White bias, r = .4. Political orientation is unrelated to self-esteem, r = .0, but pro-White bias tends to be positively related to self-esteem, r = .2.

In conclusion, the present results show that Bar-Anan and Vianello’s claims are not supported by the data. Their data do not provide clear evidence for discriminant validity of implicit and explicit constructs. The data are fully consistent with the alternative hypothesis that the IAT and other implicit measures measure the same construct that is measured with explicit measures. Thus, the data provide no support for the construct validity of the IAT as a measure of implicit personality characteristics.

Validity of the Self-Esteem IAT

Bosson et al.’s (2000) seminal article raised the first concerns about the construct validity of the self-esteem IAT. Since then, other critical articles have been published, none of which are cited in Kurdi et al. (2018). Gawronski, LeBel, and Peters (2007) wrote a Perspectives on Psychological Science article on the construct validity of measures of implicit self-esteem. They found no conclusive evidence that (a) the self-esteem IAT measures unconscious self-esteem or that (b) low correlations are due to self-report biases in explicit measures of self-esteem. Walker and Schimmack (2008) used informant ratings to examine the predictive validity of the self-esteem IAT. Informant ratings are the most widely used validation criterion in personality research, but they have not been used by social psychologists. One advantage of informant ratings is that they measure general personality characteristics rather than specific behaviors, which ensures higher construct-criterion correlations due to the power of aggregation (Epstein, 1980). Walker and Schimmack (2008) found that informant ratings of well-being were more strongly correlated with explicit self-ratings of well-being than with a happiness or a self-esteem IAT.

The most recent and extensive review was conducted by Falk and Heine (2014) who found that “the validity evidence for the IAT in measuring ISE [implicit self-esteem] is strikingly weak” (p. 6).  They also point out that implicit measures of self-esteem “show a remarkably consistent lack of predictive validity” (p. 6).  Thus, an unbiased assessment of the evidence is fully consistent with the analyses of Bar-Anan and Vianello’s data that also found low validity of the self-esteem IAT as a measure of self-esteem.

Currently, a study by Falk, Heine, Takemura, Zhang, and Hsu (2015) provides the most comprehensive examination of the convergent and discriminant validity of self-esteem measures. I therefore used structural equation modeling of their data to see how consistent the data are with a dual-attitude model or a single-attitude model. The biggest advantage of the study is the inclusion of informant ratings of self-esteem, which makes it possible to model method variance in self-ratings (Anusic et al., 2009). Previous research showed that self-ratings of self-esteem have convergent validity with informant ratings of self-esteem (Simms, Zelazny, Yam, & Gros, 2010; Walker & Schimmack, 2008). I also included self-report measures of positive affect and negative affect to examine criterion validity.

It was possible to fit a single-factor model to the data (Figure 10), chi2(67) = 115.85; CFI = .964; RMSEA = 0.050 (90% confidence interval: 0.034, 0.065). Factor loadings show the highest loadings for self-ratings on the self-competence scale and the Rosenberg self-esteem scale. However, informant ratings also had significant loadings on the self-esteem factor, as did self-ratings on the Narcissistic Personality Inventory. A measure of halo bias in self-ratings of personality (SEL) also had moderate loadings, which confirms previous findings that self-esteem is related to evaluative biases in personality ratings (Anusic et al., 2009). The false-uniqueness measure (FU; Falk et al., 2015) had modest validity. In contrast, the implicit measures had no significant loadings on this factor. In addition, the residual correlations among the implicit measures were weak and not significant. Given the lack of positive relations among the implicit measures, it was impossible to fit a dual-attitude model to these data.

It is not clear why Bar-Anan and Vianello’s data failed to show higher validity of explicit measures, but the current results are consistent with moderate validity of explicit self-ratings in the personality literature (Simms et al., 2010). Thus, there is consistent evidence that implicit self-esteem measures have low validity as measures of self-esteem and there is no evidence that they are measures of implicit self-esteem.

Explaining Variability in Explicit-Implicit Correlations

One well-established phenomenon in the literature is that correlations between IAT scores and explicit measures vary across domains (Bar-Anan & Vianello, 2018; Hofmann et al., 2005).  As shown earlier, correlations for political orientation are strong, correlations for racial attitudes are moderate, and correlations for self-esteem are weak.  Greenwald and Banaji (2017) offer a dual-attitude explanation for this finding. “The plausible interpretations of the more common pattern of weak implicit– explicit correlations are that (a) implicit and explicit measures tap distinct constructs or (b) they might be affected differently by situational influences in the research situation (cf. Fazio & Towles-Schwen, 1999; Greenwald et al., 2002) or (c) at least one of the measures, plausibly the self-report measure in many of these cases, lacks validity” (p. 868). 

The evidence presented here offers a different explanation. IAT-explicit correlations and IAT-criterion correlations increase with the validity of the IAT as a measure of the same personality characteristic that is measured with explicit measures. Thus, low correlations of the self-esteem IAT with explicit measures of self-esteem reflect the low validity of the self-esteem IAT. High correlations of the political orientation IAT with explicit measures of political orientation reflect the high validity of the IAT as a measure of political orientation; not implicit political orientation. Finally, the moderate correlation between the race IAT and explicit measures of racial bias reflects the moderate validity of the race IAT as a measure of racial bias. However, the validity of the race IAT as a measure of racial bias (not implicit racial bias!) varies considerably across studies. This variation may be due to differences in the variability of racial bias across samples, which may be lower in student samples. Thus, contrary to Greenwald and Banaji’s claims, the problem is not with the explicit measures, but with the IAT.
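Under a single-attitude model with uncorrelated unique variances, this explanation has a simple quantitative form (a generic implication of the factor model, not a formula taken from the cited articles): the observed implicit-explicit correlation is the product of the two validity coefficients,

\[ r_{\mathrm{IAT,\ explicit}} \;=\; v_{\mathrm{IAT}} \times v_{\mathrm{explicit}} , \]

so implicit-explicit correlations rise and fall with the validity of the IAT for the common attitude, without requiring a second, implicit construct.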

An important question is why the self-esteem IAT is less valid than the political orientation IAT. I propose that one cause of variation in the validity of the IAT is the proportion of respondents on the two ends of a personality characteristic. To test this hypothesis, I used Bar-Anan and Vianello’s data. To determine the direction of the IAT score, I used a value of 0 as the neutral point. As predicted, 90% of participants associated self with good, 78% associated White with good, and 69% associated Democrat with good. Thus, validity decreases with the proportion of participants who are on one side of the bipolar dimension.

Next, I regressed the preference measure on a simple dichotomous predictor that coded the direction of the IAT.  I standardized the preference measure and report standardized and unstandardized regression coefficients.  Standardized regression coefficients are influenced by the distribution of the predictor variable and should show the expected pattern. In contrast, unstandardized coefficients are not sensitive to the proportions and should not show the pattern. I also added the IAT scores as predictors in a second step to examine the incremental predictive validity that is provided by the reaction times.
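The sketch below shows the general form of this two-step analysis in Python. The data file and variable names are hypothetical placeholders, not the actual Bar-Anan and Vianello variables.

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("preference_and_iat.csv")  # hypothetical file with 'preference' and 'iat' columns

# standardize the criterion (the explicit preference measure)
y = (df["preference"] - df["preference"].mean()) / df["preference"].std()

# dichotomous predictor coding the direction of the IAT score (0 as neutral point)
direction = (df["iat"] > 0).astype(int).rename("direction")

# Step 1: direction only; B = unstandardized coefficient of the 0/1 predictor
m1 = sm.OLS(y, sm.add_constant(direction)).fit()

# standardized coefficient b: standardize the predictor as well
direction_z = (direction - direction.mean()) / direction.std()
b = sm.OLS(y, sm.add_constant(direction_z)).fit().params.iloc[1]

# Step 2: add the continuous IAT score to estimate the incremental variance
# explained by the reaction-time information (delta r2)
X2 = sm.add_constant(pd.DataFrame({"direction": direction, "iat": df["iat"]}))
m2 = sm.OLS(y, X2).fit()

print("B =", round(m1.params.iloc[1], 3), "b =", round(b, 3))
print("r2 =", round(m1.rsquared, 3), "delta r2 =", round(m2.rsquared - m1.rsquared, 3))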

The standardized coefficients are consistent with predictions (Table 1). However, the unstandardized coefficients also show the same pattern. Thus, other factors also play a role. The amount of incremental explained variance by reaction times shows no differences between the race and the political orientation task.  Most of the differences in validity are due to the direction of the attitude (4% explained variance for race bias vs. 38% explained variance for political orientation).

Table 1

Self-esteem (SE):            B = .310, SE = .142; b = .093, se = .043; r2 = .009, Δr2 = .002, z = 1.09

Race:                        B = .467, SE = .010; b = .193, se = .041; r2 = .041, Δr2 = .060, z = 5.79

Political orientation (PO):  B = 1.380, SE = .080; b = .637, se = .037; r2 = .380, Δr2 = .070, z = 7.83

The results show the importance of taking the proportion of respondents with opposing personality characteristics into account. The IAT is least valid when most participants are high or low on a personality characteristic, and it is most valid when participants are split into two equally large groups. 

In conclusion, I provided an alternative explanation of variation in explicit-implicit correlations that is consistent with the data.  Implicit-explicit correlations vary at least partially as a function of the validity of the IAT as a measure of the same construct that is measured with explicit measures, and the validity of the IAT varies as a function of the proportion of respondents who are high versus low on a personality characteristic. As most respondents associate the self with good, and reaction times contribute little to the validity of the IAT, the IAT has particularly low validity as a measure of self-esteem.

The Elusive Malleability of Implicit Attitude Measures

Numerous experimental studies have tried to manipulate situational factors in order to change scores on implicit attitude measures (Lai, Hoffman, & Nosek, 2013). Many of these studies focused on implicit measures of prejudice in order to develop interventions that could reduce prejudice. However, most studies were limited to brief manipulations with an immediate assessment of attitudes (Lai et al., 2013). The results of these studies are mixed. In a seminal study, Dasgupta and Greenwald (2001) exposed participants to images of admired Black exemplars and disliked White exemplars. They reported that this manipulation had a large effect on IAT scores. However, in hindsight the results of this study are less convincing because it has become apparent that large effect sizes from small samples often do not replicate (Open Science Collaboration, 2015). Consistent with this skepticism, Joy-Gaba and Nosek (2010) had difficulties replicating this effect with much larger samples and found an average effect size of only d = .08. With effect sizes of this magnitude, other reports of successful experimental manipulations were extremely underpowered. Another study with large samples found stronger effects (Lai et al., 2016). The strongest effect was observed for instructions to fake the IAT. However, Lai et al. also found that none of these manipulations had lasting effects in a follow-up assessment. This finding suggests that even when changes are observed, they reflect context-specific method variance rather than actual changes in the construct that is being measured.
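To illustrate what an effect size of d = .08 implies for study planning (my own calculation, not one reported by Joy-Gaba and Nosek), a standard power routine shows the per-group sample size a simple two-group design would need:

from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.08, alpha=0.05,
                                          power=0.80, ratio=1.0,
                                          alternative="two-sided")
print(round(n_per_group))   # roughly 2,450 participants per group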

This conclusion is also supported by one of the few longitudinal IAT studies. Cunningham et al.’s (2001) multi-method study repeated the measurement of racial bias on four separate occasions. The model shown in Figure 1 includes no systematic relationships between measures taken on the same occasion, and adding these relationships produces non-significant correlated residuals. Thus, in this sample naturally occurring factors did not change racial bias. This finding suggests that the IAT and explicit measures assess stable personality characteristics rather than context-specific states.

Only a few serious intervention studies with the IAT have been conducted (Lai et al., 2013).  The most valuable evidence so far comes from studies that examined the influence of living with an African American roommate on White students’ racial attitudes (Shook & Fazio, 2008; Shook, Hopkins, & Koech, 2016).  One study found effects on an implicit measure, F(1,236) = 4.33, p = .04 (Shook & Fazio, 2008), but not on an explicit measure (Shook, 2007).  The other study found effects on explicit attitudes, F(1,107) = 7.34, p = .008 but no results for implicit measures were reported (Shook, Hopkins, & Koech, 2016). Given the small sample sizes of these studies, inconsistent results are to be expected. 

In conclusion, the existing evidence shows that implicit and explicit attitude measures are highly stable over time (Cunningham et al., 2001). I also concur with Joy-Gaba and Nosek that moving scores on implicit bias measures “may not be as easy as implied by the existing experimental demonstrations” (p. 145), and a multi-method assessment is needed to distinguish effects on specific measures from effects on the underlying personality characteristics (Olson & Fazio, 2003).

Future studies of attitude change need a multi-method approach, powerful interventions, adequate statistical power, and multiple repeated measurements of attitudes to distinguish mere occasion-specific variability (malleability) from real attitude change (Anusic & Schimmack, 2016). Ideally, the study would also include informant ratings. For example, intervention studies with roommates could use African Americans as informants to rate their White roommates’ racial attitudes and behaviors.  The single-attitude model predicts that implicit and explicit measures will show consistent results and that variation in effect sizes is explained by the validity of each measure. 

Discussion

Does the IAT Measure Implicit Constructs?

Construct validation is a difficult and iterative process because scientific evidence can alter the understanding of constructs. However, construct validation research has to start with a working definition of a construct. The IAT was introduced as a measure of individual differences in implicit social cognition, and implicit social cognitions were defined as aspects of thinking and feeling that may not be easily accessed or available to consciousness (Nosek, Greenwald, & Banaji, 2007, p. 265). This definition is vague, but it makes a clear prediction: the IAT should measure personality characteristics that cannot be measured with self-reports. This leads to the prediction that explicit measures and the IAT have discriminant validity. To demonstrate discriminant validity, unique variance in the IAT has to be related to other indicators of implicit personality characteristics. This can be demonstrated with incremental predictive validity or with convergent validity with other measures of implicit personality characteristics. Consistent with this line of reasoning, numerous articles have claimed that the IAT has construct validity as a measure of implicit personality characteristics because it shows incremental predictive validity (Greenwald et al., 2009; Kurdi et al., 2018) or because the IAT shows convergent validity with other implicit measures and discriminant validity with explicit measures (Bar-Anan & Vianello, 2018). I demonstrated that all of these claims were false and that the existing evidence provides no support for the construct validity of the IAT as a measure of implicit personality characteristics.

The main problem is that most studies that used the IAT assumed construct validity rather than testing it. Hundreds of studies used the IAT as a single measure of implicit personality characteristics and made claims about implicit personality traits based on variation in IAT scores. Thus, hundreds of studies made claims that are not supported by empirical evidence simply because it has not been demonstrated that the IAT measures implicit personality constructs.

In this regard, the IAT is not alone. Aside from the replication crisis (OSC, 2015), psychological science suffers from an even more serious validation crisis. All empirical claims rest on the validity of the measures that are used to test theoretical claims, yet many measures in psychology are used without proper validation evidence. Personality research is a notable exception: in response to criticism of low predictive validity (Mischel, 1968), personality psychologists embarked on a program of research that demonstrated predictive validity and convergent validity with informant ratings (Funder, $$$). Another problem is that psychologists treat validity as a qualitative construct, which leads researchers to treat any evidence of validity as support for the claim that a measure is valid, as if it were 100% valid. However, most measures in psychology have only moderate validity (Schimmack, 2010). Thus, it is important to quantify validity and to use a multi-method approach to increase validity.

The popularity of the IAT reveals the problems of using measures without proper validation evidence. Social psychologists have influenced public discourse, if not public policy, about implicit racial bias. Most of these claims are based on findings with the IAT, assuming that IAT scores reflect implicit bias. As demonstrated here, these claims are not valid because the IAT lacks construct validity as a measure of implicit bias.

In the future, psychologists need to be more careful when they make claims based on new measures with limited knowledge about their validity. Maybe psychological organizations should provide clear guidelines about minimal standards that need to be met before a measure can be used, just like there are guidelines for validity evidence in personality assessment. In conclusion, psychology suffers as much from a validation crisis as it suffers from a replication crisis. Fixing the replication crisis will not improve psychology if replicable results are obtained with invalid measures.

The Silver Lining

Psychologists are often divided into opposing camps (e.g., nature vs. nurture; person vs. situation; the IAT is valid vs. invalid). Many fans of implicit measures are likely to dislike what I had to say about the IAT. However, my position differs from previous criticisms of the IAT as being entirely invalid (Oswald et al., 2013). I have demonstrated with several multi-method studies that the IAT has convergent validity with other measures of some personality characteristics. In some domains this validity is too low to be meaningful. In other domains, the validity of explicit measures is so high that using the IAT is not necessary. However, for sensitive attitudes like racial attitudes, the IAT offers a promising complement to explicit measures. Estimates of valid variance ranged from about 20% to 40%. As the IAT does not appear to share method variance with explicit measures, it is possible to improve the measurement of racial bias by combining explicit and implicit measures and aggregating the scores to obtain a more valid measure of racial bias than either measure can provide on its own, as the back-of-the-envelope calculation below illustrates. The IAT may also offer benefits in situations where socially desirable responding is a concern. Thus, the IAT might complement other measures of personality characteristics. This changes the interpretation of explicit-IAT correlations: rather than (mis)interpreting low correlations as evidence of discriminant validity, high correlations can reveal convergent validity. Similarly, improvements in implicit measures should produce higher correlations with explicit measures. How useful the IAT and other implicit measures are for the measurement of other personality characteristics has to be examined on a case-by-case basis. Just as it is impossible to make generalized statements about the validity of self-reports, the validity of the IAT can vary across personality characteristics.
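To see why aggregation can help, consider a back-of-the-envelope calculation under the single-attitude model; the validities below are hypothetical round numbers, not estimates from any particular study. If an explicit measure and the IAT are standardized indicators of the same attitude with validities $v_E$ and $v_I$ and uncorrelated unique variances, the validity of their unit-weighted sum is

\[ v_{E+I} \;=\; \frac{v_E + v_I}{\sqrt{2 + 2\, v_E\, v_I}} . \]

For example, with $v_E = .70$ and $v_I = .50$, the composite has a validity of about $1.20 / \sqrt{2.70} \approx .73$, a modest but free gain over the better single measure.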

Conclusion

Social psychologists have always distrusted self-reports, especially for the measurement of sensitive topics like prejudice. Many attempts were made to measure attitudes and other constructs with indirect methods. The IAT was a major breakthrough because it has relatively high reliability compared to other indirect methods. Thus, creating the IAT was a major achievement that should not be dismissed simply because the IAT lacks construct validity as a measure of implicit personality characteristics; creating a reliable indirect measure of attitudes is a formidable feat in its own right. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes. Implicit measures were based on this work, and it seemed reasonable to assume that they might provide a window into the unconscious. However, the processes that are involved in the measurement of personality characteristics with implicit measures are not the personality characteristics that are being measured. There is nothing implicit about being a Republican or Democrat, being gay or straight, or having low self-esteem. Conflating implicit processes in the measurement of personality constructs with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of personality characteristics with varying validity. It is not a window into people’s unconscious feelings, attitudes, or personalities.

References

Axt, J. R. (2018). The Best Way to Measure Explicit Racial Attitudes Is to Ask About Them. Social Psychological and Personality Science, 9, 896-906. https://doi.org/10.1177/1948550617728995

Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766-781. http://dx.doi.org/10.1037/pspp0000066

Anusic, I., Schimmack, U., Pinkus, R., & Lockwood, P. (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97(6), 1142-1156.

Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264-1272. http://dx.doi.org/10.1037/xge0000383

Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567-582. http://dx.doi.org/10.1037/a0014665

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79(4), 631-643. http://dx.doi.org/10.1037/0022-3514.79.4.631

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81-105. http://dx.doi.org/10.1037/h0046016

Chen, F., West, S. G., & Sousa, K. H. (2006). A comparison of bifactor and second-order models of quality of life. Multivariate Behavioral Research, 41(2), 189-225. DOI: 10.1207/s15327906mbr4102_5

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163-170. http://dx.doi.org/10.1111/1467-9280.00328

Dasgupta, N., & Greenwald, A. G. (2001). On the malleability of automatic attitudes: Combating automatic prejudice with images of admired and disliked individuals. Journal of Personality and Social Psychology, 81, 800–814. doi:10.1037/0022-3514.81.5.800

Epstein, S. (1980). The stability of behavior: II. Implications for psychological research. American Psychologist, 35(9), 790-806. http://dx.doi.org/10.1037/0003-066X.35.9.790

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56-68. doi:10.1111/jopy.12082

Falk, C., & Heine, S.J. (2015). What is implicit self-esteem, and does it vary across cultures? Personality and Social Psychology Review, 19, 177-98.

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79(6), 1022-1038. http://dx.doi.org/10.1037/0022-3514.79.6.1022

Fazio, R. H., & Olson, M. A. (2003). Implicit measures in social cognition research: Their meaning and use. Annual Review of Psychology, 54, 297-327. http://dx.doi.org/10.1146/annurev.psych.54.101601.145225

Fazio, R.H., Sanbonmatsu, D.M., Powell, M.C., & Kardes, F.R. (1986). On the automatic activation of attitudes. Journal of Personality and Social Psychology, 50, 229–238.

Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability of implicit racial evaluations. Social Psychology, 41, 137–146. doi:10.1027/1864-9335/a000020

Gawronski, B., LeBel, E. P., & Peters, K. R. (2007). What do implicit measures tell us?: Scrutinizing the validity of three common assumptions. Perspectives on Psychological Science, 2(2), 181-193. http://dx.doi.org/10.1111/j.1745-6916.2007.00036.x

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Greenwald, A. G., Poehlman, T. A., Uhlmann, E. L., & Banaji, M. R. (2009). Understanding and using the Implicit Association Test: III. Meta-analysis of predictive validity. Journal of Personality and Social Psychology, 97, 17–41. http://dx.doi.org/10.1037/a0015575

Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Race attitude measures predicted vote in the 2008 U. S. Presidential Election. Analyses of Social Issues and Public Policy, 9, 241–253.

Hofmann, W., Gawronski, B., Gschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369 –1385. http://dx.doi.org/10.1177/0146167205275613

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2018). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist. Advance online publication. http://dx.doi.org/10.1037/amp0000364

Lai, C.K., Hoffman, K.M., & Nosek, B.A. (2013). Reducing Implicit Prejudice. XX

McConahay, J.B. (1986). Modern racism, ambivalence, and the modern racism scale. In J.F. Dovidio & S.L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91–125). Orlando, FL: Academic Press

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), 1-8.

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171-192. http://dx.doi.org/10.1037/a0032734

Pelham, B. W., & Swann, W. B. (1989). From self-conceptions to self-worth: On the sources and structure of global self-esteem. Journal of Personality and Social Psychology, 57, 672– 680

Rosenberg, M. (1965). Society and the Adolescent Self-image. Princeton, NJ: Princeton University Press.

Schneider, D. J. (1973). Implicit personality theory: A review. Psychological Bulletin, 79(5), 294-309. http://dx.doi.org/10.1037/h0034496

Simms, L. J., Zelazny, K., Yam, W. H., & Gros, D. F. (2010). Self-informant agreement for personality and evaluative person descriptors: Comparing methods for creating informant measures. European Journal of Personality, 24(3), 207-221.

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4, 274-290. doi: 10.1111/j.1745-6924.2009.01125.x

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness implicit association test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490-497. http://dx.doi.org/10.1016/j.jrp.2007.07.005