Category Archives: Validity

Brian Nosek explains the IAT

I spent 20 minutes, actually more than 20 minutes because I had to rewind to transcribe, listening to a recent podcast in which Brian Nosek was asked some questions about the IAT and implicit bias training (The Psychology Podcast, August 1, 2019).

Scott Barry Kaufman: How do you see the IAT now and how did you see it when you started work on Project Implicit? How discrepant are these states of mind?

Brian Nosek: I hope I have learned a lot from all the research that we have done on it over the years. In the big picture I have the same view that I have had since we did the first set of studies. It is a great tool for research purposes and we have been able to learn a lot about the tool itself and about human behavior and interaction with the tool and a lot about the psychology of things that are [gap] occur with less control AND less awareness than just asking people how they feel about topics. So that has been and continues to be a very productive research area for trying to understand better how humans work.

And then the main concern that we had at onset and that is actually a lot of the discussion of even creating the website is the same anticipated some of the concerns and overuses that happened with the IAT in the present and that is the natural – I don’t know if natural is the right word – the common desire that people have for simple solutions and thinking well a measure is a direct indicator of something that we care about and it shouldn’t have any error in measurement and it should be applicable to lots and lots of situations. And thus lots of potential of misuse of the IAT despite it being a very productive research tool and education tool. I like the experience of doing it and delivering to an audience and the discussion it provokes; what is it that it means, what does it mean about me, what does it mean about the world; those are really productive intellectual discussions and debates. But the risk part is the overapplication of the IAT for selection processes. We should use this. We should [?] use this for deciding who gets a job or not; we should [?] use this who is on a jury or not. Those are the kind of real-world applications of it as a measure that go far beyond its validity. And so this isn’t exactly answering your question because even at the very beginning when we launched the website we said explicitly it should not be used for these purposes and I still believe this to be true. What has changed over time is the refinement of where it is we understand the evidence base against some of the major questions. And what is amazing about it is that there has been so much research and we still don’t have a great handle on really big questions relating to the IAT and measures like it. So this is just part of [unclear] how hard it is to actually make progress in the study of human behavior.

Scott Barry Kaufman:  Let’s talk shop for a second [my translation; enough with the BS]. My dissertation at Yale a couple of year after years was looking at the question are there individual differences in implicit cognition.  And the idea was to ask this question because from a trait perspective I felt that was a huge gap in the literature. There was so much research on the reliability and validity of IQ tests for instance, but I wanted to ask the question if we adapt some of these implicit cognition measures from the social psychological experimental literature for an individual differences paradigm you know are they reliable and stable differences. And I have a whole appendix of failed experiments – by the way, you should tell how to publish that some day but we’ll get to that in a second, but so much of my dissertation, I am putting failed in quotes because you know I mean that was useful information … it was virtually impossible to capture reliable individual differences that cohered over time but I did find one that did and I published that as a serial reaction time task, but anyway, before we completely lose my audience which is a general audience I just want to say that I am trying to link this because for me one of the things that I am most wary about with the IAT is like – and this might be more of a feature than a bug – but it may be capturing at this given moment in time when a person is taking the test it is capturing a lot of the societal norms and influences are on that person’s associations but not capturing so much an intrinsic sort of stable individual differences variable. So I just wanted to throw that out and see what your current thoughts on that are.

Brian Nosek: Yeah, it is clear that it is not trait-like in the same way that a measure like the Big Five for personality is trait-like. It does show stability over time, but much more weakly than that. Across a variety of topics you might see a test-retest correlation for the IAT measuring the same construct of about .5. The curiosity for this is; I guess it is a few curiosities. One is does that mean we have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much. And there isn’t yet a real good insight on where the prediction components of the IAT are and how it anticipates behavior, right. If we could separate in a real reliable way the trait part, the state part, and the error part, then we should be able to uniquely predict different type of things between the trait, the state, and the error components. Another twist which is very interesting that is totally understudied in my view is the variations in which it is state or trait like seems to vary by the topic you are investigating. When you do a Democrat – Republican IAT, to what extent do people favor one over the other, the correlation with self-report is very strong and the stability over time is stronger than when you measure Black-White or some of the other types of topics. So there is also something about the attitude construct itself that you are assessing that is not as much measurement based but that is interacting with the measure that is anticipating the extent to which it is trait or state like. So these are all interesting things that if I had time to study them would be the problems I would be studying, but I had to leave that aside.

Scott Barry Kaufman: You touch on a really interesting point about this. How would you measure the outcome of this two-day or week- training thing? It seems that would not be a very good thing to then go back to the IAT and see a difference between the IAT, IAT pre and IAT-post, doesn’t seem like the best outcome you know you’d want, I mean you ….

Brian Nosek: I mean you could just change the IAT and that would be the end of it. But, of course, if that doesn’t actually shift behavior then what was the point?

Scott Barry Kaufman: To what extent are we making advances in demonstrating that there are these implicit influences on explicit behavior that are outside of our value system? Where are we at right now?

[Uli, coughs, Bargh, elderly priming]

Brian Nosek: Yeah, that is a good question. I cannot really comment on the micro-aggression literature. I don’t follow that as a distinct literature, but on the general point I think it is the big picture story is pretty clear with evidence which is we do things with automaticity, we do things that are counterproductive to our interests all the time, and sometimes we recognize we are doing it, sometimes we don’t, but a lot of time it is not controllable.  But that is a very big picture, very global, very non-specific point.

If you want to find out what 21 years of research on the IAT have shown, you can read my paper (Schimmack, in press, PoPS). In short,

  • Most of the variance in the race IAT (Black-White) is random and systematic measurement error.
  • Up to a quarter of the variance reflects racial attitudes that are also reflected in self-report measures of racial attitudes, most clearly in direct ratings of feelings towards Blacks and Whites.
  • There is little evidence that any of the variance in IAT scores reflects implicit attitudes that are outside of people’s awareness.
  • There is no reliable evidence that IAT scores predict discriminatory behavior in the real world.
  • Visitors to Project Implicit are given invalid feedback that they may hold unconscious biases and are not properly informed about the poor psychometric properties of the test.
  • The founders of Project Implicit have not disclosed how much money they make from speaking engagements related to Project Implicit or from royalties for the book “Blindspot,” and they do not declare conflicts of interest in IAT-related publications.
  • It is not without irony that educators on implicit bias may fail to realize that they have an implicit bias in reading the literature and in dismissing criticism.

How Valid are Short Big-Five Scales?

The first measures of the Big Five used a large number of items to measure personality. This made it difficult to include personality measures in studies as the assessment of personality would take up all of the survey time. Over time, shorter scales became available. One important short Big Five measure is the BFI-S (Lang et al., 2011). This 15-item measure has been used in several nationally representative, longitudinal studies such as the German Socio-Economic Panel (Schimmack, 2019a). These results provide unique insights into the stability of personality (Schimmack, 2019b) and the relationship of personality with other constructs such as life-satisfaction (Schimmack, 2019c). Some of these results overturn textbook claims about personality. However, critics argue that these results cannot be trusted because the BFI-S is an invalid measure of personality.

Thus, it is of critical importance to evaluate the validity of the BFI-S. Here I use Gosling and colleagues’ data to examine the validity of the BFI-S. Previously, I fitted a measurement model to the full 44-item BFI (Schimmack, 2019d). It is straightforward to evaluate the validity of the BFI-S by examining the correlation of the 3-item BFI-S scale scores with the latent factors based on all 44 BFI items. For comparison purposes, I also show the correlations for the BFI scale scores. The complete results for individual items are shown in the previous blog post (Schimmack, 2019d).

The measurement model for the BFI has seven independent factors. Five factors represent the Big Five and two factors represent method factors. One factor represents acquiescence bias. The other factor represents evaluative bias that is present in all self-ratings of personality (Anusic et al., 2009). As all factors are independent, the squared coefficients can be interpreted as the amount of variance that a factor explains in a scale score.

The results show that the BFI-S scales are nearly as valid as the longer BFI scales (Table 1).


For example, the factor-scale correlations for neuroticism, extraversion, and agreeableness are nearly identical. The biggest difference was observed for openness with a correlation of r = .76 for the BFI-scale and r = .66 for the BFI-S scale. The only other notable systematic variance in scales is the evaluative bias influence which tends to be stronger for the longer scales with the exception of conscientiousness. In the future, measurement models with an evaluative bias factor can be used to select items with low loadings on the evaluative bias factor to reduce the influence of this bias on scale scores. Given these results, one would expect that the BFI and BFI-S produce similar results. The next analyses tested this prediction.
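As a minimal sketch of what the openness difference means in terms of explained variance, the two factor-scale correlations reported above can be squared. This relies on the independence of the factors stated earlier; the function and variable names are illustrative and not part of the original analysis.

```python
# Factor-scale correlations for openness as reported in the text.
openness_correlations = {"BFI": 0.76, "BFI-S": 0.66}


def variance_explained(r: float) -> float:
    """Share of scale-score variance explained by an independent factor."""
    return r ** 2


shares = {scale: variance_explained(r) for scale, r in openness_correlations.items()}

for scale, share in shares.items():
    r = openness_correlations[scale]
    print(f"{scale}: r = {r:.2f}, variance explained = {share:.0%}")
```

Under these assumptions the openness factor accounts for a bit more than half of the variance in the longer scale and a bit less than half in the 3-item scale.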

Gender Differences

I examined gender differences three ways. First, I examined standardized mean differences at the level of latent factors in a model with scalar invariance (Schimmack, 2019d). Second, I computed standardized mean differences with the BFI scales. Finally, I computed standardized mean differences with the BFI-S scales. Table 2 shows the results. Results for the BFI and BFI-S scales are very similar. The latent mean differences show somewhat larger differences for neuroticism and agreeableness because these mean differences are not attenuated by random measurement error. The latent means also show very small gender differences for the method factors. Thus, mean differences based on scale scores are not biased by method variance.

Table 2. Standardized Mean Differences between Men and Women


Note. Positive values indicate higher means for women than for men.

In short, there is no evidence that using 3-item scales invalidates the study of gender differences.
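The attenuation of scale-score differences relative to latent differences can be sketched as follows. The sketch assumes that a standardized scale score consists of the intended factor plus components (error, method variance) that do not differ between the groups, so that the standardized mean difference shrinks by the factor-scale correlation. The numbers are hypothetical and are not taken from Table 2.

```python
def attenuated_d(latent_d: float, factor_scale_r: float) -> float:
    """Standardized mean difference expected for a scale score.

    Assumes the group difference exists only on the latent factor and that
    the remaining variance in the scale score is unrelated to group.
    """
    return latent_d * factor_scale_r


# Hypothetical values, for illustration only.
latent_d = 0.60        # difference in latent standard deviation units
factor_scale_r = 0.80  # correlation between scale score and latent factor

scale_d = attenuated_d(latent_d, factor_scale_r)
print(f"latent d = {latent_d:.2f}, expected scale-score d = {scale_d:.2f}")
```

With a factor-scale correlation below 1, the scale-score difference is always smaller in absolute value than the latent difference, which is the pattern described above.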

Age Differences

I demonstrated measurement invariance for different age groups (Schimmack, 2019d). Thus, I used simple correlations to examine the relationship between age and the Big Five. I restricted the age range from 17 to 70. Analyses of the full dataset suggest that older respondents have higher levels of conscientiousness and agreeableness (Soto, John, Gosling, & Potter, 2011).
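The analysis step described here (restrict the age range, then compute a simple correlation) can be sketched in a few lines. The records below are made up purely to show the mechanics and do not come from the dataset; only the 17 to 70 age window is taken from the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (age, conscientiousness scale score) pairs, for illustration only.
records = [(15, 2.8), (17, 3.0), (22, 3.1), (35, 3.6),
           (48, 3.7), (63, 4.0), (70, 3.9), (82, 3.5)]

MIN_AGE, MAX_AGE = 17, 70  # age window used in the text
kept = [(age, score) for age, score in records if MIN_AGE <= age <= MAX_AGE]

ages, scores = zip(*kept)
r_age = pearson_r(ages, scores)
print(f"n = {len(kept)}, r(age, conscientiousness) = {r_age:.2f}")
```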

Table 3 shows the results. The BFI and the BFI-S both show the predicted positive relationship with conscientiousness and the effect size is practically identical. The effect size for the latent variable model is stronger because the relationship is not attenuated by random measurement error. Other relationships are weaker and also consistent across measures except for Openness. The latent variable model reveals the reason for the discrepancies. Three items (#15 ingenious, #35 like routine work, and #10 sophisticated in art) showed unique relationships with age. The art-related items showed a unique relationship with age. The latent factor does not include the unique content of these items and shows a positive relationship between openness and age. The scale scores include this content and show a weaker relationship. The positive relationship of openness with age for the latent factor is rather surprising as it is not found in nationally representative samples (Schimmack, 2019b). One possible explanation for this relationship is that older individuals who take an online personality test are more open.


In sum, the most important finding is that the 3-item BFI-S conscientiousness scale shows the same relationship with age as the BFI-scale and the latent factor. Thus, the failure to find aging effects in the longitudinal SOEP data with the BFI-S cannot be attributed to the use of an invalid short measure of conscientiousness. The real scientific question is why the cross-sectional study by Soto et al. (2011) and my analysis of the longitudinal SOEP data show divergent results.


Science has changed now that researchers are able to communicate and discuss research findings on social media. I strongly believe that open science outside of peer-controlled journals is beneficial for the advancement of science. However, the downside of open science on social media is that it becomes more difficult to evaluate the expertise of online commentators. True experts are able to back up their claims with scientific evidence. This is what I did here. I showed that Brenton Wiernik’s comment has as much scientific validity as a Donald Trump tweet. Whatever the reason for the lack of personality change in the SOEP data turns out to be, it is not the use of the BFI-S to measure the Big Five.

Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in press article before it is actually published), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity-hacking. They also conduct some psychometric tests of popular personality measures and claim that they fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn’t meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity. Bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback (

For the present analyses I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases. One bias is an evaluative bias and the other bias is acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow for some items to show more or less evaluative bias. Second, secondary loadings were added to allow for some items to be influenced by more than one factor. Finally, items of the same construct were allowed to covary to allow for similar wording or shared meaning (e.g., three arts items from the openness factor were allowed to covary). The final model and the complete results can be found on OSF (

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit for all three fit indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that the model specification by Hussey and Hughes accounted for the bad fit. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes’s claim that the BFI lacks validity is based on an outdated and implausible measurement model.
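To make the comparison concrete, the two sets of fit indices can be checked against conventional cutoffs. The post does not state which criteria were used; the sketch below assumes the widely cited rules of thumb (CFI of at least .95, RMSEA of at most .06, SRMR of at most .08), and the function name is illustrative.

```python
import operator

# Assumed rule-of-thumb cutoffs; the post itself does not name its criteria.
CUTOFFS = {
    "CFI": (operator.ge, 0.95),
    "RMSEA": (operator.le, 0.06),
    "SRMR": (operator.le, 0.08),
}


def check_fit(fit):
    """Return, for each index, whether the value meets the assumed cutoff."""
    return {name: compare(fit[name], cutoff)
            for name, (compare, cutoff) in CUTOFFS.items()}


# Fit indices reported in the text.
seven_factor_model = {"CFI": 0.953, "RMSEA": 0.030, "SRMR": 0.032}
simple_structure_model = {"CFI": 0.734, "RMSEA": 0.068, "SRMR": 0.110}

for label, fit in [("seven-factor model", seven_factor_model),
                   ("simple structure", simple_structure_model)]:
    print(label, check_fit(fit))
```

Under these assumed cutoffs the seven-factor model passes on all three indices and the simple structure fails on all three, which matches the description above.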

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

Item (item number): non-empty loadings, left to right
emotionally stable (24): -0.61, 0.27, 0.18
full of energy (11): 0.34, -0.11, 0.58, 0.20
generate enthusiasm (16): 0.07, 0.44, 0.11, 0.50, 0.20
shy and inhibited (31): 0.18, 0.64, -0.22, 0.17
ingenious (15): 0.57, 0.09, 0.21
active imagination (20): 0.13, 0.53, -
value art (30): 0.12, 0.46, 0.09, 0.16, 0.18
like routine work (35): -
like reflecting (40): -0.08, 0.58, 0.27, 0.21
few artistic interests (41): -0.26, -0.09, 0.15
sophisticated in art (44): 0.07, 0.44, -
find faults w. others (2): 0.15, -0.42, -0.24, 0.19
helpful / unselfish (7): 0.44, 0.10, 0.29, 0.23
start quarrels (12): 0.13, 0.20, -0.50, -0.09, -0.24, 0.19
trusting (22): 0.15, 0.33, 0.26, 0.20
cold and aloof (27): -0.19, 0.14, -0.46, -0.35, 0.17
considerate and kind (32): 0.04, 0.62, 0.29, 0.23
like to cooperate (42): 0.15, -0.10, 0.44, 0.28, 0.22
thorough job (3): 0.59, 0.28, 0.22
careless (8): -0.17, -0.51, -0.23, 0.18
reliable worker (13): -
persevere until finished (28): 0.56, 0.26, 0.20
follow plans (38): 0.10, -0.06, 0.46, 0.26, 0.20
easily distracted (43): 0.19, 0.09, -0.52, -0.22, 0.17

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have the highest loading on the primary factor. Exceptions are the items blue/depressed, full of energy, and generate enthusiasm that have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients shows the amount of explained variance by each factor. By far the largest variance component is the intended construct with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion’s share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias with correlations ranging from .36 for openness to .44 for extraversion. Although this means that only about 13 to 19 percent (.36² to .44²) of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction, Schimmack, 2019c).
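How shared evaluative bias can create a spurious correlation can be sketched with the tracing rule for independent factors: the correlation implied between two standardized measures is the sum of the products of their loadings on the factors they share. The extraversion loadings below are the ones reported above; the loadings of the life-satisfaction measure are hypothetical and chosen only for illustration.

```python
def implied_correlation(loadings_x, loadings_y):
    """Correlation implied by shared, mutually independent factors.

    Both arguments map factor names to standardized loadings.
    """
    shared_factors = loadings_x.keys() & loadings_y.keys()
    return sum(loadings_x[f] * loadings_y[f] for f in shared_factors)


# Extraversion scale score: loadings reported in the text.
extraversion_scale = {"extraversion": 0.83, "evaluative_bias": 0.44}

# Hypothetical life-satisfaction rating that is unrelated to extraversion
# but shares evaluative bias with the personality ratings.
life_satisfaction = {"well_being": 0.70, "evaluative_bias": 0.40}

spurious_r = implied_correlation(extraversion_scale, life_satisfaction)
print(f"correlation implied by shared bias alone: {spurious_r:.3f}")
```

In this hypothetical case the two scale scores would correlate at about .18 even though the traits themselves are unrelated, which is why a measurement model that separates the bias factor is useful when both measures are self-reports.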

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age-groups and the two largest gender groups. They claimed that the BFI lacks measurement invariance, but this claim was based on a cunning misrepresentation of the results (Schimmack, 2019a). The claim is based on the fact that the simple-structure model does not fit in any group. However, fit did not decrease when measurement invariance was imposed across groups. Thus, all groups showed the same structure, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

Item (item number): male/female estimates for the non-empty cells, left to right. SUM rows list all seven factors in the order N, E, O, A, C, evaluative bias, acquiescence.
depressed/blue (4): .33/.30, -.18/-.11, .19/.20, -.45/-.50, .07/.05
relaxed (9): -.71/-.72, .24/.23, .19/.18
tense (14): .52/.49, -.17/-.14, .11/.13, -.27/-.32, .20/.20
worry (19): .58/.57, -.10/-.08, .05/.07, -.22/-.22, .17/.17
emotionally stable (24): -.58/-.58, .10/.06, .25/.30, .19/.17
moody (29): .41/.38, -.26/-.25, -.30/-.38, .18/.18
calm (34): -.55/-.59, -.02/-.03, .14/.13, .12/.13, -.27/-.24, .21/.19
nervous (39): .51/.49, -.21/.26, -.10/-.10, .08/.08, -.11/-.11, -.27/-.25, .18/.17
SUM: .78/.77, -.09/-.08, -.01/-.01, -.07/-.05, -.02/-.02, -.42/-.46, .05/.04
talkative (1): .09/.11, .69/.70, -.10/-.08, .24/.24, .19/.18
reserved (6): -.55/-.60, .08/.10, .21/.22, .19/.18
full of energy (11): .33/.32, -.09/-.04, .56/.59, .21/.20
generate enthusiasm (16): .04/.03, .44/.43, .12/.13, .48/.50, .20/.20
quiet (21): -.79/-.82, .03/.04, -.22/-.21, .17/.16
assertive (26): -.08/-.10, .39/.40, .12/.14, -.23/-.25, .18/.17, .26/.24, .20/.18
shy and inhibited (31): .19/.15, .61/.66, .23/.22, .18/.17
outgoing (36): .71/.71, .10/.07, .35/.38, .18/.18
SUM: -.02/-.02, .82/.82, .04/.05, -.04/-.06, .00/.00, .45/.44, .07/.06
original (5): .50/.54, -.12/-.12, .40/.39, .22/.20
curious (10): .40/.42, -.05/-.08, .32/.30, .25/.23
ingenious (15): 0.00/0.00, .60/.56, .18/.16, .10/.04, .22/.20
active imagination (20): .50/.55, -.07/-.06, -.17/-.18, .29/.26, .23/.21
inventive (25): -.07/-.08, .51/.55, -.12/-.10, .37/.34, .21/.19
value art (30): .10/.03, .43/.52, .08/.07, .17/.14, .18/.19
like routine work (35): -.27/-.27, .10/.10, .09/.15, -.22/-.21, .17/.16
like reflecting (40): -.09/-.08, .58/.58, .28/.26, .22/.20
few artistic interests (41): -.25/-.29, -.10/-.09, .16/.15
sophisticated in art (44): .03/.00, .42/.49, -.08/-.08, .09/.09, .16/.16
SUM: .01/-.01, -.01/-.01, .74/.78, -.05/-.05, -.03/-.06, .38/.34, .20/.19
find faults w. others (2): .14/.17, -.42/-.42, -.24/-.24, .19/.19
helpful / unselfish (7): .45/.43, .09/.11, .29/.29, .23/.23
start quarrels (12): .12/.16, .23/.18, -.49/-.49, -.07/-.08, -.24/-.24, .19/.19
forgiving (17): .49/.46, -.14/-.13, .25/.24, .20/.19
trusting (22): -.14/-.16, .38/.32, .27/.25, .21/.19
cold and aloof (27): -.20/-.18, .14/.12, .44/.46, -.34/-.37, .18/.17
considerate and kind (32): .02/.01, .62/.61, .28/.30, .22/.23
rude (37): .10/.12, .12/.12, -.62/-.62, -.13/-.08, -.23/-.23, .19/.18
like to cooperate (42): .18/.11, -.09/-.10, .43/.45, .28/.29, .23/.22
SUM: -.07/-.08, .00/.00, -.07/-.07, .78/.77, .03/.03, .43/.44, .04/.04
thorough job (3): .58/.59, .29/.28, .23/.22
careless (8): -0.16, -.49/-.51, .24/.23, .19/.18
reliable worker (13): -.10/-.09, .09/.10, .55/.55, .30/.31, .24/.24
disorganized (18): .13/.16, -.58/-.59, -.21/-.20, .17/.15
lazy (23): -.52/-.51, -.45/-.45, .18/.17
persevere until finished (28): .54/.58, .27/.25, .21/.19
efficient (33): -.11/-.07, .52/.58, .30/.29, .24/.23
follow plans (38): .00/.00, -.06/-.07, .45/.44, .27/.26, .21/.20
easily distracted (43): .17/.19, .07/.06, -.53/-.53, -.22/-.22, .18/.17
SUM: -.05/-.05, -.01/-.01, -.05/-.06, .04/.04, .81/.82, .43/.41, .03/.03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in fit might be considered evidence that the invariance assumption is violated. Table 2 shows that it is insufficient to examine changes in global fit indices. What matters is whether the decrease in fit has any substantial meaning. Given the results in Table 2, this is not the case.
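The contrast between the global fit change and the size of the actual parameter differences can be sketched as follows. The CFI values are the ones reported above. The male/female pairs are read from the SUM rows of Table 2, treating two entries without a slash (-.42-.46 and -.04-.06) as male/female pairs, which is an assumption about the extraction. The ΔCFI threshold of .01 is a commonly cited rule of thumb and is not a criterion stated in this post.

```python
# CFI values reported in the text.
cfi_male_only = 0.949
cfi_metric_invariance = 0.925
DELTA_CFI_RULE_OF_THUMB = 0.01  # assumed guideline, not from the post

cfi_drop = cfi_male_only - cfi_metric_invariance

# (male, female) estimates from the SUM rows of Table 2,
# in the order N, E, O, A, C, evaluative bias, acquiescence.
SUM_ROWS = {
    "neuroticism": [(.78, .77), (-.09, -.08), (-.01, -.01), (-.07, -.05),
                    (-.02, -.02), (-.42, -.46), (.05, .04)],
    "extraversion": [(-.02, -.02), (.82, .82), (.04, .05), (-.04, -.06),
                     (.00, .00), (.45, .44), (.07, .06)],
    "openness": [(.01, -.01), (-.01, -.01), (.74, .78), (-.05, -.05),
                 (-.03, -.06), (.38, .34), (.20, .19)],
    "agreeableness": [(-.07, -.08), (.00, .00), (-.07, -.07), (.78, .77),
                      (.03, .03), (.43, .44), (.04, .04)],
    "conscientiousness": [(-.05, -.05), (-.01, -.01), (-.05, -.06), (.04, .04),
                          (.81, .82), (.43, .41), (.03, .03)],
}

largest_gap = max(abs(m - f) for pairs in SUM_ROWS.values() for m, f in pairs)

print(f"CFI drop: {cfi_drop:.3f} (rule of thumb: {DELTA_CFI_RULE_OF_THUMB})")
print(f"largest male/female difference in SUM rows: {largest_gap:.2f}")
```

The drop in CFI exceeds the assumed rule of thumb, yet the largest male/female difference in the scale-level coefficients is only about .04, which illustrates why the global index alone says little about whether the misfit is substantively meaningful.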

The next model imposed scalar invariance. Before presenting the results, it is helpful to know what scalar invariance implies. Take extraversion as an example. Assume that there are no notable gender differences in extraversion. However, extraversion has multiple facets that are represented by items in the BFI. One facet is assertiveness and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. It is obvious that this is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it would require further examination of the pattern of mean differences at the level of the factors and the item residuals.
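The assertiveness example can be sketched with the usual linear measurement equation, in which the expected item mean in a group equals the item intercept plus the loading times the factor mean. Scalar invariance fixes intercepts (and loadings) to be equal across groups. All numbers below are hypothetical, and the sketch covers a single item on a single factor.

```python
def expected_item_mean(intercept: float, loading: float, factor_mean: float) -> float:
    """Expected item mean under a linear one-factor measurement model."""
    return intercept + loading * factor_mean


loading = 0.40  # hypothetical loading of the assertiveness item on extraversion
extraversion_means = {"men": 0.0, "women": 0.0}  # no latent gender difference

# Scalar invariance: the same intercept in both groups.
shared_intercept = 3.0
invariant_means = {
    group: expected_item_mean(shared_intercept, loading, factor_mean)
    for group, factor_mean in extraversion_means.items()
}
gap_under_invariance = invariant_means["men"] - invariant_means["women"]

# A facet-specific difference requires group-specific intercepts,
# which is exactly what scalar invariance rules out.
group_intercepts = {"men": 3.1, "women": 3.0}
free_means = {
    group: expected_item_mean(group_intercepts[group], loading, factor_mean)
    for group, factor_mean in extraversion_means.items()
}
gap_with_free_intercepts = free_means["men"] - free_means["women"]

print(f"item gap under scalar invariance: {gap_under_invariance:.2f}")
print(f"item gap with free intercepts:    {gap_with_free_intercepts:.2f}")
```

With equal factor means, the invariant model can only produce identical item means, so any real facet-level difference shows up as a violation of scalar invariance rather than as evidence that the factor-level comparison is meaningless.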

However, imposing scalar invariance produced only a negligible decrease in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 “valuing art” with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item level. The pattern of mean differences at the factor level is consistent with previous studies, showing higher levels of neuroticism (d = .64) and agreeableness (d = .31) for women, although the difference in agreeableness is relatively small compared to some other studies.

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the finding with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were statistically small. There were no notable differences in response styles.

In sum, the results show that the BFI shows reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in Hussey and Hughes’s supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.


Hussey and Hughes accused personality researchers of validity hacking. That is, they do not report results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection of this claim shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and shows reasonable invariance across gender and age groups. Thus, Hussey and Hughes’s claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors’ conclusions about other measures like Rosenberg’s self-esteem scale. I am still waiting for the authors to share all of their data to examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for rating biases. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures to control for shared rating biases.

Measuring Well-Being in the SOEP

Psychology has a measurement problem. Big claims about personality, self-esteem, or well-being are based on sum-scores of self-ratings; or sometimes a single rating. This would be a minor problem if thorough validation research had demonstrated that sum-scores of self-ratings are valid measures of the constructs they are intended to represent, but such validation research is often missing. As a result, the validity of widely used measures in psychology and claims based on these measures is unknown.

The well-being literature is an interesting example of the measurement crisis because two opposing views about the validity of well-being measures co-exist. On the one hand, experimental social psychologists argue that life-satisfaction ratings are invalid and useless (Schwarz & Strack, 1999); a view that has been popularized by Nobel Laureate Daniel Kahneman in his book “Thinking, Fast and Slow” (cf. Schimmack, 2018). On the other hand, well-being scientists often assume that life-satisfaction ratings are near perfect indicators of individuals’ well-being.

An editor of JPSP, which presumably means he or she is an expert, has no problem to mention both positions in the same paragraph without noting the contradiction.

“There is a huge literature on well-being. Since Schwarz and Strack (1999), to take that arbitrary year as a starting point, there have been more than 11,000 empirical articles with “wellbeing” (or well-being or well being) in the title, according to PsychInfo. The vast majority of them, I submit, take the subjective evaluation of one’s own life as a perfectly valid and perhaps the best way to assess one’s own evaluation of one’s life.”

So, since Schwarz and Strack concluded that life-satisfaction judgments are practically useless, 11,000 articles have used life-satisfaction judgments as perfectly valid measures of life-satisfaction and nobody thinks this is a problem. No wonder natural scientists don’t consider psychology a science.

The Validity of Well-Being Measures

Any attempt at validating well-being measures requires a definition of well-being that leads to testable predictions about correlations of well-being measures with other measures. Testing these predictions is called construct validation (Cronbach & Meehl, 1955; Schimmack, 2019).

The theory underlying the use of life-satisfaction judgments as measures of well-being assumes that well-being is subjective and that (healthy, adult) individuals are able to compare their actual lives to their ideal lives and to report the outcome of these comparison processes (Andrews & Withey, 1973; Diener, Lucas, Schimmack, & Helliwell, 2009).

One prediction that follows from this model is that global life-satisfaction judgments should be correlated with judgments of satisfaction in important life domains, but not in unimportant life domains. The reason is that satisfaction with life as a whole should be related to satisfaction with (important) parts. It would make little sense for somebody to say that they are extremely satisfied with their life as a whole, but not satisfied with their family life, work, health, or anything else that matters to them. The whole point of asking a global question is the assumption that people will consider all important aspects of their lives and integrate this information into a global judgment (Andrews & Withey, 1973). The main criticism of Schwarz and Strack (1999) was that this assumption does not describe the actual judgment process and that actual life-satisfaction judgments are based on transient and irrelevant information (e.g., current mood, Schwarz & Clore, 1983).

Top-Down vs. Bottom-Up Theories of Global and Domain Satisfaction

To muddy the waters, Diener (1984) proposed on the one hand that life-satisfaction judgments are, at least somewhat, valid indicators of life-satisfaction, while also proposing that correlations between satisfaction with life as a whole and satisfaction with domains might reflect a top-down effect.

A top-down effect implies that global life-satisfaction influences domain satisfaction. That is, health satisfaction is not a cause of life-satisfaction because good health is an important part of a good life. Instead, life-satisfaction is a content-free feeling of satisfaction that creates a halo in evaluations of specific life aspects independent of the specific evaluations of a life domain.

Diener overlooked that top-down processes invalidate life-satisfaction judgments as measures of wellbeing because a top-down model implies that global life-satisfaction judgments reflect only a general disposition to be satisfied without information about the actual satisfaction in important life domains. In the context of a measurement model, we can see that the top-down model implies that life-satisfaction judgments only capture the shared variance among specific life-satisfaction judgments, but fail to represent the part of satisfaction that reflects unique variance in satisfaction with specific life domains. In other words, top-down models imply that well-being does not encompass evaluations of the parts that make up an individual’s entire life.

It does not help that measurement models in psychology often treat unique or residual variances as error variances and omit them from figures. In the figure, the residual variances are shown and represent variation in life-aspects that is not shared across domains.

Some influential articles that examined top-down and bottom-up processes have argued in favor of top-down processes without noticing that this invalidates the use of life-satisfaction judgments as indicators of well-being or at least requires a radically different conception of well-being (well-being is being satisfied independent of how things are actually going in your life) (Heller, Watson, & Ilies, 2004).

An Integrative Top-Down vs. Bottom-Up Model

Brief et al. (1993) proposed an integrative model of top-down and bottom-up processes in life-satisfaction judgments. The main improvement of this model was to distinguish between a global disposition to be more satisfied and a global judgment of important aspects of life. As life-satisfaction judgments are meant to represent the latter, life-satisfaction judgments are the ultimate outcome of interest, not a measure of the global disposition. Brief et al. (1993) used neuroticism as an indicator for the global disposition to be less satisfied, but there are probably other factors that can contribute to a general disposition to be satisfied. The integrative model assumes that any influence of the general disposition is mediated by satisfaction with important life domains (e.g., health).

FIGURE 1. DisSat = Dispositional Satisfaction, DS1 = Domain Satisfaction 1 (e.g., health), DS2 = Domain Satisfaction 2, DS3 = Domain Satisfaction 3, LS = Life-Satisfaction.

It is important to realize that the mediation model separates two variances in domain satisfaction judgments, namely the variance that is explained by dispositional satisfaction and the variance that is not explained by dispositional satisfaction (residual variance). Both variances contribute to life-satisfaction. Thus, objective aspects of health that contribute to health satisfaction can also influence life-satisfaction. This makes the model an integrative model that allows for top-down and bottom-up effects.
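The separation of the two variances can be written out explicitly. The following is only a sketch in notation introduced here for illustration, not taken from Brief et al. (1993): the path coefficients a_i and b_i and the residuals u_i and e are assumed symbols for the arrows and residual variances in Figure 1, and all variables are treated as standardized.

```latex
\begin{align}
\mathit{DS}_i &= a_i \,\mathit{DisSat} + u_i, \qquad i = 1, 2, 3 \\
\mathit{LS} &= \sum_{i=1}^{3} b_i \,\mathit{DS}_i + e \\
            &= \Big( \sum_{i=1}^{3} a_i b_i \Big)\, \mathit{DisSat}
               + \sum_{i=1}^{3} b_i u_i + e
\end{align}
```

Under these assumptions, the first term in the last line is the top-down contribution (the disposition acting through the domains) and the second term is the bottom-up contribution of domain-specific variance that is unrelated to the disposition.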

One limitation of Brief et al.’s (1993) model was the use of neuroticism as sole indicator of dispositional satisfaction. While it is plausible that neuroticism is linked to more negative perceptions of all kinds of life-aspects, it may not be the only trait that matters.

Another limitation was the use of health satisfaction as the single life domain. If people also care about other life domains, other domain satisfactions should also contribute to life-satisfaction and they could be additional mediators of the influence of neuroticism on life-satisfaction. For example, neurotic individuals might also worry more about money and financial satisfaction could influence life-satisfaction, making financial satisfaction another mediator of the influence of neuroticism on life-satisfaction.

One advantage of structural equation modeling is the ability to study constructs that do not have a direct indicator. This makes it possible to examine top-down effects without “direct” indicators of dispositional satisfaction. The reason is that dispositional satisfaction should influence satisfaction with various life domains. Thus, dispositional satisfaction is reflected in the shared variance among different domain satisfaction judgments and domain satisfaction judgments serve as indicators that can be used to measure dispositional satisfaction (see Figure 2).
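A small simulation may help to show why shared variance among domain judgments can stand in for an unobserved disposition. This is a minimal sketch under stated assumptions, not the model fitted to the SOEP data: the loadings (.7, .6, .5), the function names, and the variable names are hypothetical, and each domain score is generated as a standardized mix of one common disposition and an independent domain-specific part.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_domains(loadings, n=20000, seed=1):
    """Simulate standardized domain scores driven by one common disposition."""
    rng = random.Random(seed)
    data = [[] for _ in loadings]
    for _ in range(n):
        disposition = rng.gauss(0, 1)  # unobserved dispositional satisfaction
        for scores, loading in zip(data, loadings):
            unique = rng.gauss(0, 1)  # domain-specific (bottom-up) part
            scores.append(loading * disposition + sqrt(1 - loading ** 2) * unique)
    return data


# Hypothetical loadings, chosen for illustration only
loadings = [0.7, 0.6, 0.5]
health, income, housing = simulate_domains(loadings)

# With a single common cause, each correlation should be close to the
# product of the two loadings (e.g., .7 * .6 = .42 for health and income).
for label, r in [
    ("health-income", pearson(health, income)),
    ("health-housing", pearson(health, housing)),
    ("income-housing", pearson(income, housing)),
]:
    print(label, round(r, 2))
```

The disposition itself never appears in the output, yet the pattern of correlations among the three domains is enough to recover its loadings, which is the logic a structural equation model exploits.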

Domain Satisfactions in the SOEP

It is fortunate that the creators of the Socio-Economic Panel in the 1980s included domain satisfaction measures and that these measures have been included in every wave from 1984 to 2017. This makes it possible to test the integrative top-down bottom-up model with the SOEP data.

The five domains that have been included in all surveys are health, household income, recreation, housing, and job satisfaction. However, job satisfaction is only available for those participants who are employed. To maximize the number of domains, I used all five domains and limited the analysis to working participants. The model can be used to build a model with four domains for all participants.

One limitation of the SOEP is the use of single-item indicators. This makes sense for expensive panel studies, but creates some psychometric problems. Fortunately, it is possible to estimate the reliability of single-item indicators in panel data by using Heise’s (1969) model which estimates reliability based on the pattern of retest correlations for three waves of data.

REL = r12 * r23 / r13
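The formula can be turned into a few lines of code. The sketch below assumes the usual conditions of Heise’s model (equal reliability at all three waves, uncorrelated measurement errors, and change in true scores that follows a lag-1 process); the function name and the example correlations are hypothetical and are not SOEP results.

```python
def heise_estimates(r12: float, r23: float, r13: float) -> dict:
    """Three-wave reliability and stability estimates (Heise, 1969).

    r12, r23, r13 are the observed retest correlations between
    waves 1-2, 2-3, and 1-3 of a single-item measure.
    """
    if min(r12, r23, r13) <= 0:
        raise ValueError("all three retest correlations must be positive")
    return {
        "reliability": r12 * r23 / r13,          # REL = r12 * r23 / r13
        "stability_12": r13 / r23,               # true-score stability, wave 1 to 2
        "stability_23": r13 / r12,               # true-score stability, wave 2 to 3
        "stability_13": r13 ** 2 / (r12 * r23),  # product of the two lag-1 stabilities
    }


# Hypothetical retest correlations, for illustration only
est = heise_estimates(r12=0.56, r23=0.56, r13=0.49)
for name, value in est.items():
    print(f"{name}: {value:.3f}")
```

The key idea is that unreliability attenuates all three correlations equally, whereas real change accumulates over time, so the longer-interval correlation r13 separates the two.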

More data would be better and are available, but the goal was to combine the well-being model with a model of personality ratings that are available for only three waves (2005, 2009, & 2013). Thus, the same three waves were used to create an integrative top-down bottom-up model that also examined how domain satisfaction is related to global life-satisfaction across time.

The data set consisted of 3 repeated measures of 5 domain satisfaction judgments and a single life-satisfaction judgment for a total of 18 variables. The data were analyzed with MPLUS (see OSF for syntax and detailed results).


Overall model fit was acceptable, CFI = .988, RMSEA = .023, SRMR = .029.

The first results are the reliability and stability estimates of the five domain satisfactions and global life satisfaction (Table 1). For comparison purposes, the last column shows the estimates based on a panel analysis with annual retests (Schimmack, Krause, Wagner, & Schupp, 2010). The results show fairly consistent stability across domains with the exception of job satisfaction. Job satisfaction is less stable than other domains. The four-year stability is high, but not as high as for personality traits (Schimmack, 2019). A comparison with the panel data shows higher stability, which indicates that some of the error variance in 4-year retest studies is reliable variance that fluctuates over the four-year retest period. However, the key finding is that there is high stability in domain satisfaction judgments and life-satisfaction judgments, which makes it theoretically interesting to examine the relationship between the stable variances in domain satisfaction and life-satisfaction.

Job Satisfaction          0.62   0.62   0.89
Health Satisfaction       0.67   0.79   0.94   0.93
Financial Satisfaction    0.74   0.81   0.95   0.91
Housing Satisfaction      0.66   0.81   0.95   0.89
Leisure Satisfaction      0.67   0.80   0.95   0.92
Life Satisfaction         0.66   0.78   0.94   0.89

Table 2 examines the influence of top-down processes on domain satisfaction. Results show the factor loadings of domain satisfaction on a common factor that reflects dispositional satisfaction; that is, a general disposition to report higher levels of satisfaction. The results show that somewhere between 30% and 50% of the reliable variance in domain satisfaction judgments is explained by a general disposition factor. While this leaves ample room for domain-specific factors to influence domain satisfaction judgments, the results show a strong top-down influence.

Job Satisfaction          0.69   0.68   0.68
Health Satisfaction       0.68   0.66   0.65
Financial Satisfaction    0.60   0.61   0.63
Housing Satisfaction      0.72   0.74   0.76
Leisure Satisfaction      0.61   0.61   0.61

Table 3 shows the unique contribution of the disposition and the five domains to life-satisfaction concurrently and longitudinally.


The first notable finding is that the disposition factor accounts for the lion’s share of the explained variance in life-satisfaction judgments. The second important finding is that the relationship is very stable over time. The disposition measured at time 1 is an equally good predictor of life-satisfaction at time 1 (r = .56), time 2 (r = .59), and at time 3 (r = .57). This suggests that about one-third of the reliable variance in life-satisfaction judgments reflects a stable disposition to report higher or lower levels of satisfaction.
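The step from the correlations to “about one-third” is simply squaring. A quick check, assuming the reported values are standardized coefficients so that the squared value is the proportion of explained variance (the variable names are illustrative):

```python
# Correlations of the time-1 disposition factor with life-satisfaction
# at times 1 to 3, as reported in the text above
correlations = {"time 1": 0.56, "time 2": 0.59, "time 3": 0.57}

# Squared correlation = proportion of variance explained
explained = {wave: r ** 2 for wave, r in correlations.items()}

for wave, share in explained.items():
    print(f"{wave}: r = {correlations[wave]:.2f}, r^2 = {share:.2f}")

mean_share = sum(explained.values()) / len(explained)
print(f"mean r^2 = {mean_share:.2f}")
```

All three squared values fall between .31 and .35, which is where the one-third figure comes from.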

Regarding domain satisfaction, health is the strongest predictor with correlations between .21 and .33. Finances is the second strongest predictor with correlations between .14 and .34. For health satisfaction there is high stability over time. That is, time 1 health satisfaction predicts time 1 life-satisfaction nearly as well (r = .23) as time 3 life-satisfaction (r = .21). In contrast, financial satisfaction shows a bit more change over time with concurrent correlations at time 1 of r = .34 and a drop to r = .14 for life-satisfaction at time 3. This suggests that changes in financial satisfaction produce changes in life-satisfaction.

Job satisfaction has a weak influence on life-satisfaction with correlations ranging from r = .14 to .05. Like financial satisfaction, there is some evidence that changes in job satisfaction predict changes in life-satisfaction.

Housing and leisure have hardly any influence on life-satisfaction judgments with most relationships being less than .10. There is also no evidence that changes in these domains produce changes in life-satisfaction judgments.

These results show that most of the reliable variance in global life-satisfaction judgments remains unexplained and that a stable disposition accounts for most of the explained variance in life-satisfaction judgments.

Implications for the Validity of Life-Satisfaction Judgments

There are two ways to interpret the results. One interpretation, which is common in the well-being literature and in hundreds of studies with the SOEP data, is that life-satisfaction judgments are valid measures of well-being. Accordingly, well-being in Germany is determined mostly by a stable disposition to be satisfied, and changing actual life-circumstances will have negligible effects on well-being. For example, Nakazato et al. (2011) used the SOEP data to examine the influence of moving on well-being. They found that decreasing housing satisfaction triggered a decision to move and that moving produced lasting increases in housing satisfaction. However, moving had no effect on life-satisfaction. This is not surprising given the present results that housing satisfaction has a negligible influence on life-satisfaction judgments. Thus, we would conclude that people are irrational by investing money in a better house, if we assume that life-satisfaction judgments are a perfectly valid measure of well-being.

The alternative interpretation is that life-satisfaction judgments are not as good as well-being researchers think they are. Rather than reflecting a weighted summary of all important aspects of life, they are based on accessible information that does not include all relevant information. The difference to Schwarz and Strack’s (1999) criticism is that bias is not due to temporarily accessible information (e.g., mood) that makes life-satisfaction judgments unreliable. As demonstrated here and elsewhere, a large portion of the variance in life-satisfaction judgments is stable. The problem is that the stable factors may be biases in life-satisfaction ratings rather than real determinants of well-being.

It is unfortunate that psychologists and other social scientists have neglected proper validation research of a measure that has been used to make major empirical claims about the determinants of well-being, and that this research has been used to make policy recommendations (Diener, Lucas, Schimmack, & Helliwell, 2009). The present results suggest that any policy recommendations based on life-satisfaction ratings alone are premature. It is time to take measurement more seriously and to improve the validity of measuring well-being.

Measuring Personality in the SOEP

The German Socio-Economic-Panel (SOEP) is a longitudinal study of German households. The core questions address economic issues, work, health, and well-being. However, additional questions are sometimes added. In 2005, the SOEP included a 15-item measure of the Big Five; the so-called BFI-S (Lang et al., 2011). As each personality dimension is measured with only three items, scale scores are rather unreliable measures of the Big Five. A superior way to examine personality in the SOEP is to build a measurement model that relates observed item scores to latent factors that represent the Big Five.

Anusic et al. (2009) proposed a latent variable model for an English version of the BFI-S.

The most important feature of this model is the modeling of method factors in personality ratings. An acquiescence factor accounts for general response tendencies independent of item content. In addition, a halo factor accounts for evaluative bias that inflates correlations between two desirable or two undesirable items and attenuates correlations between a desirable and an undesirable item. The Figure shows that the halo factor is bias because it correlates highly with evaluative bias in ratings of intelligence and attractiveness.

The model also includes a higher-order factor that accounts for a correlation between extraversion and openness.

Since the article was published I have modified the model in two ways. First, the Big Five are conceptualized as fully independent which is in accordance with the original theory. Rather than allowing for correlations among Big Five factors, secondary loadings are used to allow for relationships between extraversion and openness items. Second, halo bias is modeled as a characteristic of individual items rather than the Big Five. This approach is preferable because some items have low loadings on halo.

Figure 2 shows the new model.

I fitted this model to the 2005 data using MPLUS (syntax and output: ). The model had acceptable fit to the data, CFI = .962, RMSEA = .035, SRMR = .029.

Table 1 shows the factor loadings. It also shows the correlation of the sum scores with the latent factors.


The results show that all items load on their primary factor although some loadings are very small (e.g., forgiving). Secondary loadings tend to be small (< .2), although they are highly significant in the large sample. All items load on the evaluative bias factor, with some fairly large loadings for considerate, efficient, and talkative. Reserved is the most evaluatively neutral item. Acquiescence bias is rather weak.

The scale scores are most strongly related to the intended latent factor. The relationship is fairly strong for neuroticism and extraversion, suggesting that about 50% of the variance in scale scores reflects the intended construct. However, for the other three dimensions, correlations suggest that less than 50% of the variance reflects the intended construct. Moreover, the remaining variance is not just random measurement error. Evaluative bias contributes from 10% up to 25% of additional variance. Acquiescence bias plays a minor role because most scales have a reverse scored item. Openness is an exception and acquiescence bias contributes 10% of the variance in scores on the Openness scale.

Given the good fit of this model, I recommend it for studies that want to examine correlates of the Big Five or that want to compare groups. Using this model will produce better estimates of effect sizes and control for spurious relationships due to method factors.

A Quantitative Science Needs to Quantify Validity


This article was published in a special issue of the European Journal of Personality. It examines the unresolved issue of validating psychological measures from the perspective of a multi-method approach (Campbell & Fiske, 1959), using structural equation modeling.

I think it provides a reasonable alternative to the current interest in modeling residual variance in personality questionnaires (network perspective) and solves the problems of manifest personality measures that are confounded by systematic measurement error.

Although latent variable models of multi-method data have been used in structural analyses (Biesanz & West, 2004; DeYoung, 2006), these studies have rarely been used to estimate validity of personality measures. This article shows how this can be done and what assumptions need to be made to interpret latent factors as variance in true personality traits.

Hopefully, sharing this article openly on this blog can generate some discussion about the future of personality measurement in psychology.


What Multi-Method Data Tell Us About Construct Validity
University of Toronto Mississauga, Canada

European Journal of Personality
Eur. J. Pers. 24: 241–257 (2010)
DOI: 10.1002/per.771  [for original article]


Structural equation modelling of multi-method data has become a popular method to examine construct validity and to control for random and systematic measurement error in personality measures. I review the essential assumptions underlying causal models of multi-method data and their implications for estimating the validity of personality measures. The main conclusions are that causal models of multi-method data can be used to obtain quantitative estimates of the amount of valid variance in measures of personality dispositions, but that it is more difficult to determine the validity of personality measures of act frequencies and situation-specific dispositions.

Key words: statistical methods; personality scales and inventories; regression methods; history of psychology; construct validity; causal modelling; multi-method; measurement


Fifty years ago, Campbell and Fiske (1959) published the groundbreaking article Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. With close to 5000 citations (Web of Science, February 1, 2010), it is the most cited article in Psychological Bulletin. The major contribution of this article was to outline an empirical procedure for testing the validity of personality measures. It is difficult to overestimate the importance of this contribution because it is impossible to test personality theories empirically without valid measures of personality.

Despite its high citation count, Campbell and Fiske’s work is often neglected in introductory textbooks, presumably because validation is considered to be an obscure and complicated process (Borsboom, 2006). Undergraduate students of personality psychology learn little more than the definition of a valid measure as a measure that measures what it is supposed to measure.

However, they are not taught how personality psychologists validate their measures. One might hope that aspiring personality researchers learn about Campbell and Fiske’s multi-method approach during graduate school. Unfortunately, even handbooks dedicated to research methods in personality psychology pay relatively little attention to Campbell and Fiske’s (1959) seminal contribution (John & Soto, 2007; Simms & Watson, 2007). More importantly, construct validity is often introduced in qualitative terms.

In contrast, when Cronbach and Meehl (1955) introduced the concept of construct validity, they proposed a quantitative definition of construct validity as the proportion of construct-related variance in the observed variance of a personality measure. Although the authors noted that it would be difficult to obtain precise estimates of construct validity coefficients (CVCs), they stressed the importance of estimating ‘as definitely as possible the degree of validity the test is presumed to have’ (p. 290).

Campbell and Fiske’s (1959) multi-method approach paved the way to do so. Although Campbell and Fiske’s article examined construct validity qualitatively, subsequent developments in psychometrics allowed researchers to obtain quantitative estimates of construct validity based on causal models of multi-method data (Eid, Lischetzke, Nussbeck, & Trierweiler, 2003; Kenny & Kashy, 1992). Research articles in leading personality journals routinely report these estimates (Biesanz & West, 2004; DeYoung, 2006; Diener, Smith, & Fujita, 1995), but a systematic and accessible introduction to causal models of multi-method data is lacking.

The main purpose of this paper is to explain how causal models of multi-method data can be used to obtain quantitative estimates of construct validity and which assumptions these models make to yield accurate estimates.

I prefer the term causal model to the more commonly used term structural equation model because I interpret latent variables in these models as unobserved, yet real causal forces that produce variation in observed measures (Borsboom, Mellenbergh, & van Heerden, 2003). I make the case below that this realistic interpretation of latent factors is necessary to use multi-method data for construct validation research because the assumption of causality is crucial for the identification of latent variables with construct variance (CV).

Campbell and Fiske (1959) distinguished absolute and relative (construct) validity. To examine relative construct validity it is necessary to measure multiple traits and to look for evidence of convergent and discriminant validity in a multi-trait-multi-method matrix (Simms & Watson, 2007). However, to examine construct validity in an absolute sense, it is only necessary to measure one construct with multiple methods.

In this paper, I focus on convergent validity across multiple measures of a single construct because causal models of multi-method data rely on convergent validity alone to examine construct validity.

As discussed in more detail below, causal models of multi-method data estimate construct validity quantitatively with the factor loadings of observed personality measures on a latent factor (i.e. an unobserved variable) that represents the valid variance of a construct. The amount of valid variance in a personality measure can be obtained by squaring its factor loading on this latent factor. In this paper, I use the term construct validity coefficient (CVC) to refer to the factor loading and the term construct variance (CV) for the amount of valid variance in a personality measure.
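To make the relation between CVC and CV concrete, consider the simplest identified case: one construct measured with three methods whose method-specific variances are uncorrelated. Under that assumption each observed correlation equals the product of two factor loadings, so the loadings can be solved from the three correlations. The sketch below is an illustration under these assumptions only; the function name and the example correlations are hypothetical, and real applications would estimate the model with structural equation modelling software.

```python
from math import sqrt


def construct_validity(r12: float, r13: float, r23: float):
    """Solve a one-factor model with three indicators.

    Assumes the three methods share only the construct (no correlated
    method variance), so that r_ij = CVC_i * CVC_j.
    Returns (CVCs, CVs) for methods 1, 2, and 3.
    """
    if min(r12, r13, r23) <= 0:
        raise ValueError("all three correlations must be positive")
    cvc = (
        sqrt(r12 * r13 / r23),  # loading of method 1
        sqrt(r12 * r23 / r13),  # loading of method 2
        sqrt(r13 * r23 / r12),  # loading of method 3
    )
    cv = tuple(loading ** 2 for loading in cvc)  # CV = CVC squared
    return cvc, cv


# Hypothetical correlations among three methods (e.g., self-report and
# two informant reports), for illustration only
cvc, cv = construct_validity(r12=0.42, r13=0.35, r23=0.30)
for i, (loading, variance) in enumerate(zip(cvc, cv), start=1):
    print(f"method {i}: CVC = {loading:.2f}, CV = {variance:.2f}")
```

With these made-up correlations the first method has a CVC of .70 and therefore a CV of .49; that is, about half of its observed variance would be valid variance. Estimated loadings above 1 would signal that the assumptions of the model do not hold for the data.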


A measure is valid if it measures what it was designed to measure. For example, a thermometer is a valid measure of temperature in part because the recorded values covary with humans’ sensory perceptions of temperature (Cronbach & Meehl, 1955). A modern thermometer is a more valid measure of temperature than humans’ sensory perceptions, but the correlation between scores on a thermometer and humans’ sensory perceptions is necessary to demonstrate that a thermometer measures temperature. It would be odd to claim that highly reliable scores recorded by an expensive and complicated instrument measure temperature if these scores were unrelated to humans’ everyday perceptions of temperature.

The definition of validity as a property of a measure has important implications for empirical tests of validity. Namely, researchers first need a clearly defined construct before they can validate a potential measure of the construct. For example, to evaluate a measure of anxiety researchers first need to define anxiety and then examine the validity of a measure as a measure of anxiety. Although the importance of clear definitions for construct validation research may seem obvious, validation research often seems to work in the opposite direction; that is, after a measure has been created psychologists examine what it measures.

For example, the widely used Positive Affect and Negative Affect Schedule (PANAS) has two scales named Positive Affect (PA) and Negative Affect (NA). These scales are based on exploratory factor analyses of mood ratings (Watson, Clark, & Tellegen, 1988). As a result, Positive Affect and Negative Affect are merely labels for the first two VARIMAX rotated principal components that emerged in these analyses. Thus, it is meaningless to examine whether the PANAS scales are valid measures of PA and NA. They are valid measures of PA and NA by definition because PA and NA are mere labels of the two VARIMAX rotated principal components that emerge in factor analyses of mood ratings.

A construct validation study would have to start with an a priori definition of Positive Affect and Negative Affect that does not refer to the specific measurement procedure that was used to create the PANAS scales. For example, some researchers have defined Positive Affect and Negative Affect as the valence of affective experiences and have pointed out problems of the PANAS scales as measures of pleasant and unpleasant affective experiences (see Schimmack, 2007, for a review).

However, the authors of the PANAS do not view their measure as a measure of hedonic valence. To clarify their position, they proposed to change the labels of their scales from Positive Affect and Negative Affect to Positive Activation and Negative Activation (Watson, Wiese, Vaidya, & Tellegen, 1999). The willingness to change labels indicates that PANAS scales do not measure a priori defined constructs and as a result there is no criterion to evaluate the construct validity of the PANAS scales.

The previous example illustrates how personality measures assume a life of their own and implicitly become the construct; that is, a construct is operationally defined by the method that is used to measure it (Borsboom, 2006). A main contribution of Campbell and Fiske’s (1959) article was to argue forcefully against operationalism and for a separation of constructs and methods. This separation is essential for validation research because validation research has to allow for the possibility that some of the observed variance is invalid.

Other sciences clearly follow this approach. For example, physics has clearly defined concepts such as time or temperature. Over the past centuries, physicists have developed increasingly precise ways of measuring these concepts, but the concepts have remained the same. Modern physics would be impossible without these advances in measurement. However, psychologists do not follow this model of more advanced sciences. Typically, a measure becomes popular and after it becomes popular it is equated with the construct. As a result, researchers continue to use old measures and rarely attempt to create better measures of the same construct. Indeed, it is hard to find an example in which one measure of a construct has replaced another measure of the same construct based on an empirical comparison of the construct validity of competing measures of the same construct (Grucza & Goldberg, 2007).

One reason for the lack of progress in the measurement of personality constructs could be the belief that it is impossible to quantify the validity of a measure. If it were impossible to quantify the validity of a measure, then it also would be impossible to say which of two measures is more valid. However, causal models of multi-method data produce quantitative estimates of validity that allow comparisons of the validity of different measures.

One potential obstacle for construct validation research is the need to define psychological constructs a priori without reference to empirical data. This can be difficult for constructs that make reference to cognitive processes (e.g. working memory capacity) or unconscious motives (implicit need for power). However, the need for a priori definitions is not a major problem in personality psychology. The reason is that everyday language provides thousands of relatively well-defined personality constructs (Allport & Odbert, 1936). In fact, all measures in personality psychology that are based on the lexical hypothesis assume that everyday concepts such as helpful or sociable are meaningful personality constructs. At least with regard to these relatively simple constructs, it is possible to test the construct validity of personality measures. For example, it is possible to examine whether a sociability scale really measures sociability and whether a measure of helpfulness really measures helpfulness.

Convergent validity

I start with a simple example to illustrate how psychologists can evaluate the validity of a personality measure. The concept is people’s weight. Weight can be defined as ‘the vertical force exerted by a mass as a result of gravity’ ( In the present case, only the mass of human adults is of interest. The main question, which has real practical significance in health psychology (Kroh, 2005), is whether self-report measures of weight are valid, because it is more economical to use self-reports than to weigh people with scales.

To examine the validity of self-reported weight as a measure of actual weight, it is
possible to obtain self-reports of weight and an objective measure of weight from the same individuals. If self-reports of weight are valid, they should be highly correlated with the objective measure of weight. In one study, participants first reported their weight before their weight was objectively measured with a scale several weeks later (Rowland, 1990). The correlation in this study was r (N =11,284) =.98. The implications of this finding for the validity of self-reports of weight depend on the causal processes that underlie this correlation, which can be examined by means of causal modelling of correlational data.

It is well known that a simple correlation does not reveal the underlying causal process,
but that some causal process must explain why a correlation was observed (Chaplin, 2007). Broadly speaking, a correlation is determined by the strength of four causal effects, namely, the effect of observed variable A on observed variable B, the effect of observed variable B on observed variable A, and the effects of an unobserved variable C on observed variable A and on observed variable B.

In the present example, the observed variables are the self-reported weights and those recorded by a scale. To make inferences about the validity of self-reports of weight it is necessary to make assumptions about the causal processes that produce a correlation between these two methods. Fortunately, it is relatively easy to do so in this example. First, it is fairly certain that the values recorded by a scale are not influenced by individuals’ self-reports. No matter how much individuals insist that the scale is wrong, it will not change its score. Thus, it is clear that the causal effect of self-reports on
the objective measure is zero. It is also clear that self-reports of weight in this study were
not influenced by the objective measurement of weight in this study because self-reports
were obtained weeks before the actual weight was measured. Thus, the causal effect of the objectively recorded scores on self-rating is also zero. It follows that the correlation of r =.98 must have been produced by a causal effect of an unobserved third variable. A
plausible third variable is individuals’ actual mass. It is their actual mass that causes the
scale to record a higher or lower value and their actual mass also caused them to report a specific weight. The latter causal effect is probably mediated by prior objective
measurements with other scales, and the validity of these scales would influence the
validity of self-reports among other factors (e.g. socially desirable responding). In combination, the causal effects of actual mass on self-reports and on the scale produce the observed correlation of r = .98. This correlation is not sufficient to determine how strong the effects of weight on the two measures are. It is possible that the scale was a perfect measure of weight. In this case, the correlation between weight and the values recorded by the scale is 1. It follows that the size of the effect of weight on self-reports of weight (or the factor loading of self-reported weight on the weight factor) has to be .98 to produce the observed correlation of r = .98 (1 × .98 = .98). In this case, the CVC of the self-report measure of weight would be .98. However, it is also possible that the scale is a slightly imperfect measure of weight. For example, participants may not have removed their shoes before stepping on the scale and differences in the weight of shoes (e.g. boots versus sandals) could have produced measurement error in the objective measure of individuals’ true weight. It is also possible that changes in weight over time reduce the validity of objective scores as a validation criterion for self-ratings several weeks earlier. In either case, the estimate of .98 underestimates the validity of self-ratings.
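The arithmetic in this paragraph can be sketched in a few lines of Python. This is only an illustration under the stated independence assumption (the observed correlation equals the product of the two loadings on actual weight); the function name `implied_loading` and the alternative scale loading of .99 are hypothetical and not taken from Rowland (1990).

```python
def implied_loading(observed_r, criterion_loading=1.0):
    """Loading of the self-report on the latent variable, assuming the
    observed correlation is the product of the two loadings."""
    if not 0 < criterion_loading <= 1:
        raise ValueError("criterion loading must lie in (0, 1]")
    loading = observed_r / criterion_loading
    if loading > 1:
        raise ValueError("implied loading exceeds 1; assumptions are violated")
    return loading


observed_r = 0.98  # correlation reported in the text

# 1.00 = perfect scale (the conservative case); 0.99 = hypothetical imperfect scale
for scale_loading in (1.00, 0.99):
    cvc = implied_loading(observed_r, scale_loading)
    print(f"scale loading {scale_loading:.2f}: CVC = {cvc:.3f}, "
          f"valid variance = {cvc ** 2:.1%}")
```

Any imperfection in the scale pushes the implied CVC of the self-report above .98, which is why .98 is the conservative estimate.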

In the present context, the reasons for the lack of perfect convergent validity are irrelevant. The main point of this example was to illustrate how the correlation between two independent measures of the same construct can be used to obtain quantitative estimates of the validity of a personality measure. In this example, a conservative estimate of the CVC of self-reported weight as a measure of weight is .98 and the estimated amount of CV in the self-report measure is 96% (.98^2 = .96).

The example of self-reported weight was used to establish four important points about
construct validity. First, the example shows that convergent validity is sufficient to examine construct validity. The question of how self-reports of weight are related to measures of other constructs (e.g. height, socially desirable responding) can be useful to examine sources of measurement error, but correlations with measures of other constructs are not needed to estimate CVCs. Second, empirical tests of construct validity do not have to be an endless process without clear results (Borsboom, 2006). At least for some self-report measures it is possible to provide a meaningful answer to the question of their validity. Third, validity is a quantitative construct. Qualitative conclusions that a measure is valid because validity is not zero (CVC > 0, p < .05) or that a measure is invalid because validity is not perfect (CVC < 1.0, p < .05) are not very helpful because most measures are both valid and invalid (0 < CVC < 1). As a result, qualitative reviews of validity studies are often the source of fruitless controversies (Schimmack & Oishi, 2005). The validity of personality measures should be estimated quantitatively like other psychometric properties such as reliability coefficients, which are routinely reported in research articles (Schmidt & Hunter, 1996).

Validity is more important than reliability because reliable and invalid measures are
potentially more dangerous than unreliable measures (Blanton & Jaccard, 2006). Moreover, it is possible that a less reliable measure is more valid than a more reliable measure if the latter measure is more strongly contaminated by systematic measurement error (John & Soto, 2007). A likely explanation for the emphasis on reliability is the common tendency to equate constructs with measures. If a construct is equated with a measure, only random error can undermine the validity of a measure. The main contribution of Campbell and Fiske (1959) was to point out that systematic measurement error can also threaten the validity of personality measures. As a result, high reliability is insufficient evidence for the validity of a personality measure (Borsboom & Mellenbergh, 2002).

The fourth point illustrated in this example is that tests of convergent validity require independent measures. Campbell and Fiske (1959) emphasized the importance of independent measures when they defined convergent validity as the correlation between ‘maximally different methods’ (p. 83). In a causal model of multi-method data the independence assumption implies that the only causal effects that produce a correlation between two measures of the same construct are the causal effect of the construct on the two measures. This assumption implies that all the other potential causal effects that can produce correlations among observed measures have an effect size of zero. If this assumption is correct, the shared variance across independent methods represents CV. It is then possible to estimate the proportion of the shared variance relative to the total observed variance of a personality measure as an estimate of the amount of CV in this measure. For example, in the previous example I assumed that actual mass was the only causal force that contributed to the correlation between self-reports of weight and objective scale scores. This assumption would be violated if self-ratings were based on previous measurements with objective scales (which is likely) and objective scales share method variance that does not reflect actual weight (which is unlikely). Thus, even validation studies with objective measures implicitly make assumptions about the causal model underlying these correlations.

In sum, the weight example illustrated how a causal model of the convergent validity
between two measures of the same construct can be used to obtain quantitative estimates of the construct validity of a self-report measure of a personality characteristic. The following example shows how the same approach can be used to examine the construct validity of measures that aim to assess personality traits without the help of an objective measure that relies on well-established measurement procedures for physical characteristics like weight.


A Hypothetical Example

I use helpfulness as an example. Helpfulness is relatively easy to define as ‘providing
assistance or serving a useful function’ ( Helpful can be used to describe a single act or an individual. If helpful is used to describe a single act, helpful is not only a characteristic of a person because helping behaviour is also influenced by situational factors and interactions between personality and situational factors. Thus, it is still necessary to provide a clearer definition of helpfulness as a personality characteristic before it is possible to examine the validity of a personality measure of helpfulness.

Personality psychologists use trait concepts like helpful in two different ways. The most
common approach is to define helpful as an internal disposition. This definition implies
causality. There are some causal factors within an individual that make it more likely for
this individual to act in a helpful manner than other individuals. The alternative approach is to define helpfulness as the frequency with which individuals act in a helpful manner. An individual is helpful if he or she acted in a helpful manner more often than other people. This approach is known as the act frequency approach. The broader theoretical differences between these two approaches are well known and have been discussed elsewhere (Block, 1989; Funder, 1991; McCrae & Costa, 1995). However, the implications of these two definitions of personality traits for the interpretation of multi-method data have not been discussed. Ironically, it is easier to examine the validity of personality measures that aim to assess internal dispositions that are not directly observable than to do so for personality measures that aim to assess frequencies of observable acts. This is ironic because intuitively it seems to be easier to count the frequency of observable acts than to measure unobservable internal dispositions. In fact, not too long ago some psychologists doubted that internal dispositions even exist (cf. Goldberg, 1992).

The measurement problem of the act frequency approach is that it is quite difficult to
observe individuals’ actual behaviours in the real world. For example, it is no trivial task to establish how often John was helpful in the past month. In comparison it is relatively easy to use correlations among multiple imperfect measures of observable behaviours to make inferences about the influence of unobserved internal dispositions on behaviour.

Figure 1. Theoretical model of multi-method data. Note. T = trait (general disposition); AF-c, AF-f, AF-s = act frequencies with colleague, friend and spouse; S-c, S-f, S-s = situational and person × situation interaction effects on act frequencies; R-c, R-f, R-s = reports by colleague, friend and spouse; E-c, E-f, E-s = errors in reports by colleague, friend and spouse.

Figure 1 illustrates how a causal model of multi-method data can be used for this purpose. In Figure 1, an unobserved general disposition to be helpful influences three observed measures of helpfulness. In this example, the three observed measures are informant ratings of helpfulness by a friend, a co-worker and a spouse. Unlike actual informant ratings in personality research, informants in this hypothetical example are only asked to report how often the target helped them in the past month. According to Figure 1, each informant report is influenced by two independent factors, namely, the actual frequency of helpful acts towards the informant and (systematic and random) measurement error in the reported frequencies of helpful acts towards the informant. The actual frequency of helpful acts is also influenced by two independent factors. One factor represents the general disposition to be helpful that influences helpful behaviours across situations. The other factor represents situational factors and person-situation interaction effects. To fully estimate all coefficients in this model (i.e. effect sizes of the postulated causal effects), it would be necessary to separate measurement error and valid variance in act frequencies.

This is impossible if, as in Figure 1, each act frequency is measured with a single method, namely, one informant report. In contrast, the influence of the general disposition is reflected in all three informant reports. As a result, it is possible to separate the variance due to the general disposition from all other variance components such as random error, systematic rating biases, situation effects and person × situation interaction effects. It is then possible to determine the validity of informant ratings as measures of the general disposition, but it is impossible to (precisely) estimate the validity of informant ratings as measures of act frequencies because the model cannot distinguish reporting errors from situational influences on helping behaviour.
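To make the identification argument concrete, the following sketch recovers the loadings of three informant reports on the general disposition from their three intercorrelations. It assumes a standardized single-factor model in which each correlation is the product of two loadings; the function name `triad_loadings` and the input correlations are hypothetical and are not data from any study discussed here.

```python
import math


def triad_loadings(r12, r13, r23):
    """Standardized loadings of three indicators on one common factor,
    assuming r_ij = loading_i * loading_j (independent methods)."""
    if min(r12, r13, r23) <= 0:
        raise ValueError("all three correlations must be positive")
    loading_1 = math.sqrt(r12 * r13 / r23)
    loading_2 = math.sqrt(r12 * r23 / r13)
    loading_3 = math.sqrt(r13 * r23 / r12)
    return loading_1, loading_2, loading_3


# Hypothetical correlations among reports by colleague, friend and spouse
colleague, friend, spouse = triad_loadings(r12=0.56, r13=0.48, r23=0.42)
for name, loading in (("colleague", colleague), ("friend", friend), ("spouse", spouse)):
    print(f"{name}: CVC = {loading:.2f}, valid variance = {loading ** 2:.0%}")
```

Each recovered loading is the combined effect of the disposition on the report (disposition to act frequency, then act frequency to report). The two steps cannot be separated with these data, which is the limitation described above.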

The causal model in Figure 1 makes numerous independence assumptions that specify Campbell and Fiske’s (1959) requirement that traits should be assessed with independent methods. First, the model assumes that biases in ratings by one rater are independent of biases in ratings by other raters. Second, it assumes that situational factors and person × situation interaction effects that influence helping one informant are independent of the situational and person × situation factors that influence helping other informants. Third, it assumes that rating biases are independent of situation and person × situation interaction effects for the same rater and across raters. Finally, it assumes that rating biases and situation effects are independent of the global disposition. In total, this amounts to 21 independence assumptions (i.e. Figure 1 includes seven exogenous variables, that is, variables that do not have an arrow pointing at them, which implies 21 (7×6/2) relationships that the model assumes to be zero). If these independence assumptions are correct, the correlations among the three informant ratings can be used to determine the variation in the unobserved personality disposition to be helpful with perfect validity. This variance can then be used like the objective measure of weight in the previous example as the validation criterion for personality measures of the general disposition to be helpful (e.g. self-ratings of general helpfulness). In sum, Figure 1 illustrates that a specific pattern of correlations among independent measures of the same construct can be used to obtain precise estimates of the amount of valid variance in a single measure.
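The count of 21 independence assumptions can be verified by enumerating all pairs of exogenous variables. The labels follow the note to Figure 1; the code itself is only an illustrative check.

```python
from itertools import combinations

# Exogenous variables in Figure 1 (no arrow points at them)
exogenous = ["T", "S-c", "S-f", "S-s", "E-c", "E-f", "E-s"]

# Every pair is a relationship that the model fixes to zero
assumed_zero = list(combinations(exogenous, 2))

print(len(assumed_zero))  # 7 * 6 / 2 = 21
for first, second in assumed_zero:
    print(f"{first} independent of {second}")
```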

The main challenge for actual empirical studies is to ensure that the methods in a multi-method model fulfill the independence assumptions. The following examples demonstrate the importance of the neglected independence assumption for the correct interpretation of causal models of multi-method data. I also show how researchers can partially test the independence assumption if sufficient methods are available and how researchers can estimate the validity of personality measures that aggregate scores from independent methods. Before I proceed, I should clarify that strict independence of methods is unlikely, just like other null-hypotheses are likely to be false. However, small violations of the independence assumption will only introduce small biases in estimates of CVCs.

Example 1: Multiple response formats

The first example is a widely cited study of the relation between Positive Affect and
Negative Affect (Green, Goldman, & Salovey, 1993). I chose this paper because the authors
emphasized the importance of a multi-method approach for the measurement of affect,
while neglecting Campbell and Fiske’s requirement that the methods should be maximally different. A major problem for any empirical multi-method study is to find multiple independent measures of the same construct. The authors used four self-report measures with different response formats for this purpose. However, the variation of response formats can only be considered a multi-method study, if one assumes that responses on one response format are independent of responses on the other response formats so that correlations across response formats can only be explained by a common causal effect of actual momentary affective experiences on each response format. However, the validity of all self-report measures depends on the ability and willingness of respondents to report their experiences accurately. Violations of this basic assumption introduce shared method variance among self-ratings on different response formats. For example, socially desirable responding can inflate ratings of positive experiences across response formats. Thus, Green et al.’s (1993) study assumed rather than tested the validity of self-ratings of momentary affective experiences. At best, their study was able to examine the contribution of stylistic tendencies in the use of specific response formats to variance in mood ratings, but these effects are known to be small (Schimmack, Bockenholt, & Reisenzein, 2002). In sum, Green et al.’s (1993) article illustrates the importance of critically examining the similarity of methods in a multi-method study. Studies that use multiple self-report measures that vary response formats, scales, or measurement occasions should not be considered multi-method studies that can be used to examine construct validity.

Example 2: Three different measures

The second example of a multi-method study also examined the relation between Positive Affect and Negative Affect (Diener et al., 1995). However, it differs from the previous example in two important ways. First, the authors used more dissimilar methods that are less likely to violate the independence assumption, namely, self-report of affect in the past month, averaged daily affect ratings over a 6-week period and averaged ratings of general affect by multiple informants. Although these are different methods, it is possible that these methods are not strictly independent. For example, Diener et al. (1995) acknowledge that all three measures could be influenced by impression management. That is, retrospective and daily self-ratings could be influenced by socially desirable responding, and informant ratings could be influenced by targets’ motivation to hide negative emotions from others. A common influence of impression management on all three methods would inflate validity estimates of all three methods.

For this paper, I used Diener et al.’s (1995) multi-method data to estimate CVCs for the
three methods as measures of general dispositions that influence people’s positive and
negative affective experiences. I used the data from Diener et al.’s (1995) Table 15 that are reproduced in Table 1. I used MPLUS5.1 for these analyses and all subsequent analyses (Muthen & Muthen, 2008). I fitted a simple model with a single latent variable that represents a general disposition that has causal effects on the three measures. Model fit was perfect because a model with three variables and three parameters has zero degrees of freedom and can perfectly reproduce the observed pattern of correlations. The perfect fit implies that CVC estimates are unbiased if the model assumptions are correct, but it also implies that the data are unable to test model assumptions.

These results suggest impressive validity of self-ratings of affect (Table 2). In contrast,
CVC estimates of informant ratings are considerably lower, despite the fact that informant ratings are based on averages of several informants. The non-overlapping confidence intervals for self-ratings and informant ratings indicate that this difference is statistically significant. There are two interpretations of this pattern. On the one hand, it is possible that informants are less knowledgeable about targets’ affective experiences. After all, they do not have access to information that is only available introspectively. However, this privileged information does not guarantee that self-ratings are more valid because individuals only have privileged information about their momentary feelings in specific situations rather than the internal dispositions that influence these feelings. On the other hand, it is possible that retrospective and daily self-ratings share method variance and do not fulfill the independence assumption. In this case, the causal model would provide inflated estimates of the validity of self-ratings because it assumes that stronger correlations between retrospective and daily self-ratings reveal higher validity of these methods, when in reality the higher correlation is caused by shared method effects. A study with three methods is unable to test these alternative explanations.

Example 3: Informants as multiple methods

One limitation of Diener et al.’s (1995) study was the aggregation of informant ratings.
Although aggregated informant ratings provide more valid information than ratings by a
single informant, the aggregation of informant ratings destroys valuable information about the correlations among informant ratings. The example in Figure 1 illustrated that ratings by multiple informants provide one of the easiest ways to measure dispositions with multiple methods because informants are more likely to base their ratings on different situations, which is necessary to reveal the influence of internal dispositions.

Example 3 shows how ratings by multiple informants can be used in construct validation research. The data for this example are based on multi-method data from the Riverside Accuracy Project (Funder, 1995; Schimmack, Oishi, Furr, & Funder, 2004). To make the CVC estimates comparable to those based on the previous example, I used scores on the depression and cheerfulness facets of the NEO-PI-R (Costa & McCrae, 1992). These facets are designed to measure affective dispositions. The multi-method model used self-ratings and informant ratings by parents, college friends and hometown friends as different methods.

Table 3 shows the correlation matrices for cheerfulness and depression. I first fitted a causal model that assumed independence of all methods to the data. The model also included sum scores of observed measures to examine the validity of aggregated informant ratings and an aggregated measure of all four raters (Figure 2). Model fit was evaluated using standard criteria of model fit, namely, comparative fit index (CFI) > .95, root mean square error of approximation (RMSEA) < .06 and standardized root mean square residual (SRMR) < .08.

Neither cheerfulness, chi2 (df = 2, N = 222) = 11.30, p < .01, CFI = .860, RMSEA = .182, SRMR = .066, nor depression, chi2 (df = 2, N = 222) = 8.31, p = .02, CFI = .915, RMSEA = .150, SRMR = .052, had acceptable CFI and RMSEA values.
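The comparison of the reported fit indices with the stated cutoffs can be written as a small check. The cutoffs and index values are those given in the text; the function name `evaluate_fit` is a hypothetical helper for illustration.

```python
def evaluate_fit(cfi, rmsea, srmr):
    """Compare fit indices with the cutoffs used in the text:
    CFI > .95, RMSEA < .06, SRMR < .08."""
    return {"CFI": cfi > 0.95, "RMSEA": rmsea < 0.06, "SRMR": srmr < 0.08}


# (CFI, RMSEA, SRMR) for the models that assume independence of all methods
reported = {
    "cheerfulness": (0.860, 0.182, 0.066),
    "depression": (0.915, 0.150, 0.052),
}

for facet, indices in reported.items():
    checks = evaluate_fit(*indices)
    failed = [name for name, acceptable in checks.items() if not acceptable]
    print(f"{facet}: fails {', '.join(failed) if failed else 'no criterion'}")
```

For both facets the SRMR criterion is met, while CFI and RMSEA are not.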

One possible explanation for this finding is that self-ratings are not independent of informant ratings because self-ratings and informant ratings could be partially based on overlapping situations. For example, self-ratings of cheerfulness could be heavily influenced by the same situations that are also used by college friends to rate cheerfulness (e.g. parties). In this case, some of the agreement between self-ratings and informant ratings by college friends would reflect the specific situational factors of
overlapping situations, which leads to shared variance between these ratings that does not reflect the general disposition. In contrast, it is more likely that informant ratings are independent of each other because informants are less likely to rely on the same situations (Funder, 1995). For example, college friends may rely on different situations than parents.

To examine this possibility, I fitted a model that included additional relations between self-ratings and informant ratings (dotted lines in Figure 2). For cheerfulness, an additional relation between self-ratings and ratings by college friends was sufficient to achieve acceptable model fit, chi2 (df = 1, N = 222) = 0.08, p = .78, CFI = 1.00, RMSEA = .000, SRMR = .005. For depression, additional relations of self-ratings to ratings by college
friends and parents were necessary to achieve acceptable model fit. Model fit of this model was perfect because it has zero degrees of freedom. In these models, CVC can no longer be estimated by factor loadings alone because some of the valid variance in self-ratings is also shared with informant ratings. In this case, CVC estimates represent the combined total effect of the direct effect of the latent disposition factor on self-ratings and the indirect effects that are mediated by informant ratings.
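The idea of a total effect can be illustrated with standard path tracing: the direct path from the disposition to self-ratings plus, for each mediating informant rating, the product of the paths along that route. All coefficients below are hypothetical and chosen only for illustration; they are not estimates from the Riverside Accuracy Project, and the direction of the additional paths (informant rating to self-rating) is an assumption based on the description of indirect effects "mediated by informant ratings".

```python
def total_effect(direct, indirect_routes):
    """Total standardized effect of the latent disposition on self-ratings.

    direct: path from disposition to self-rating.
    indirect_routes: list of (disposition -> informant rating,
                              informant rating -> self-rating) path pairs.
    """
    indirect = sum(first_leg * second_leg for first_leg, second_leg in indirect_routes)
    return direct + indirect


# Hypothetical paths: direct effect .40; one indirect route via college friends
cvc_self = total_effect(direct=0.40, indirect_routes=[(0.60, 0.25)])
print(f"CVC of self-ratings = {cvc_self:.2f}")
print(f"valid variance = {cvc_self ** 2:.0%}")
```

With no additional paths the total effect reduces to the factor loading, which is the case for the models that assume full independence.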

I used the model indirect option of MPLUS5.1 to estimate the total effects in a model that  also included sum scores with equal weights for the three informant ratings and all four ratings.  Table 4 lists the CVC estimates for the four ratings and the two measures based on aggregated ratings.

The CVC estimates of self-ratings are considerably lower than those based on Diener
et al.’s (1995) data. Moreover, the results suggest that in this study aggregated informant
ratings are more valid than self-ratings, although the confidence intervals overlap. The
results for the aggregated measure of all four raters show that adding self-ratings to
informant ratings did not increase validity above and beyond the validity obtained by
aggregating informant ratings.

These results should not be taken too seriously because they are based on a single,
relatively small sample. Moreover, it is important to emphasize that these CVC estimates
depend on the assumption that informant ratings do not share method variance. Violation of this assumption would lead to an underestimation of the validity of self-ratings. For example, an alternative assumption would be that personality changes. As a result, parent ratings and ratings by hometown friends may share variance because they are based in part on situations before personality changed, whereas college friends’ ratings are based on more recent situations. This model fits the data equally well and leads to much higher estimates of CV in self-ratings. To test these competing models it would be necessary to include additional measures. For example, standardized laboratory tasks and biological measures could be added to the design to separate valid variance from shared rating biases by informants.

These wildly divergent findings might suggest that it is futile to seek quantitative estimates of construct validity. However, the same problem arises in other research areas and it can be addressed by designing better studies that test assumptions that cannot be tested in existing data sets. In fact, I believe that publication of conflicting validity estimates will stimulate research on construct validity, whereas the view of construct validation research as an obscure process without clear results has obscured the lack of knowledge about the validity of personality measures.


I used two multi-method datasets to illustrate how causal models of multi-method data can be used to estimate the validity of personality measures. The studies produced different results. It is not the purpose of this paper to examine the sources of disagreement. The results merely show that it is difficult to make general claims about the validity of commonly used personality measures. Until more precise estimates become available, the results suggest that about 30–70% of the variance in self-ratings and single informant ratings is CV, and I suggest 50 ± 20% as a rough estimate of the construct validity of personality ratings.

I suggest the verbal labels low validity for measures with less than 30% CV (e.g. implicit measures of well-being, Walker & Schimmack, 2008), moderate validity for measures with 30–70% CV (most self-report measures of personality traits) and high validity for measures with more than 70% CV (self-ratings of height and weight). Subsequently, I briefly discuss the practical implications of using self-report measures with moderate validity to study the causes and consequences of personality dispositions.
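The proposed labels translate directly into a small classification rule. The cutoffs are those stated above; the function names are hypothetical, and the conversion from a CVC to a percentage of CV simply squares the coefficient, as in the weight example.

```python
def validity_label(cv_percent):
    """Verbal label for a measure given its percentage of construct-valid variance."""
    if not 0 <= cv_percent <= 100:
        raise ValueError("percentage of valid variance must lie in [0, 100]")
    if cv_percent < 30:
        return "low"
    if cv_percent <= 70:
        return "moderate"
    return "high"


def cv_percent_from_cvc(cvc):
    """Percentage of valid variance implied by a construct validity coefficient."""
    return 100 * cvc ** 2


print(validity_label(cv_percent_from_cvc(0.98)))  # self-reported weight
print(validity_label(50))                         # rough estimate for personality ratings
```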

Correction for invalidity

Measurement error is nearly unavoidable, especially in the measurement of complex
constructs such as personality dispositions. Schmidt and Hunter (1996) provided
26 examples of how the failure to correct for measurement error can bias substantive
conclusions. One limitation of their important article was the focus on random
measurement error. The main reason is probably that information about random
measurement error is readily available. However, invalid variance due to systematic
measurement error is another factor that can distort research findings. Moreover, given
the moderate amount of valid variance in personality measures, corrections for invalidity are likely to have more dramatic practical implications than corrections for unreliability. The following examples illustrate this point.

Hundreds of twin studies have examined the similarity between MZ and DZ twins to examine the heritability of personality characteristics. A common finding in these studies is moderate to large MZ correlations (r = .3–.5) and small to moderate DZ correlations (r = .1–.3). This finding has led to the conclusion that approximately 40% of the variance is heritable and 60% of the variance is caused by environmental factors. However, this interpretation of twin data fails to take measurement error into account. As it turns out, MZ correlations approach, if not exceed, the amount of valid variance in personality measures as estimated by multi-method data. In other words, ratings by two different individuals of two different individuals (self-ratings by MZ twins) tend to correlate as highly with each other as those of a single individual (self-ratings and informant ratings of a single target). This finding suggests that heritability estimates based on mono-method studies severely underestimate heritability of personality dispositions (Riemann, Angleitner, & Strelau, 1997). A correction for invalidity would suggest that most of the valid variance is heritable (Lykken & Tellegen, 1996). However, it is problematic to apply a direct correction for invalidity to twin data because this correction assumes that the independence assumption is valid. It is better to combine a multi-method assessment with a twin design (Riemann et al., 1997). It is also important to realize that multi-method models focus on internal dispositions rather than act frequencies. It makes sense that heritability estimates of internal dispositions are higher than heritability estimates of act frequencies because act frequencies are also influenced by situational factors.
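As a rough illustration of what a direct correction for invalidity does, the sketch below divides an observed MZ correlation by the proportion of valid variance in the measure. This is exactly the direct correction that the text cautions against, because it assumes that the invalid variance is uncorrelated across twins; the function name and the input values (taken from the ranges mentioned above) are illustrative only.

```python
def corrected_twin_correlation(observed_r, valid_variance):
    """MZ similarity on the valid variance, assuming method variance
    is independent across twins (a strong, untested assumption)."""
    if not 0 < valid_variance <= 1:
        raise ValueError("valid variance must be a proportion in (0, 1]")
    return observed_r / valid_variance


# Observed MZ correlations in the typical range, with 50% valid variance assumed
for observed_r in (0.3, 0.4, 0.5):
    corrected = corrected_twin_correlation(observed_r, valid_variance=0.5)
    print(f"observed r = {observed_r:.1f} -> corrected r = {corrected:.1f}")
```

A corrected value at or above 1 would signal that the independence assumption does not hold, which is one reason a combined multi-method twin design is preferable.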

Stability of personality dispositions

The study of stability of personality has a long history in personality psychology (Conley, 1984). However, empirical conclusions about the actual stability of personality are hampered by the lack of good data. Most studies have relied on self-report data to examine this question. Given the moderate validity of self-ratings, it is likely that studies based on self-ratings underestimate the true stability of personality. Even corrections for unreliability alone are sufficient to achieve impressive stability estimates of r = .98 over a 1-year interval (Anusic & Schimmack, 2016; Conley, 1984). The evidence for stability of personality from multi-method studies is even more impressive. For example, one study reported a retest correlation of r = .46 over a 26-year interval for a self-report measure of neuroticism (Conley, 1985). It seems possible that personality could change considerably over such a long time period. However, the study also included informant ratings of personality. Self-informant agreement on the same occasion was also r = .46. Under the assumption that self-ratings and informant ratings are independent methods and that there is no stability in method variance, this pattern of correlations would imply that variation in neuroticism did not change at all over this 26-year period (.46/.46 = 1.00). However, this conclusion rests on the validity of the assumption that method variance is not stable. Given the availability of longitudinal multi-method data, it is possible to test this assumption. The relevant information is contained in the cross-informant, cross-occasion correlations. If method variance were unstable, these correlations should also be r = .46. In contrast, the actual correlations are lower, r = .32. This finding indicates that (a) personality dispositions changed and (b) there is some stability in the method variance. However, the actual stability of personality dispositions is still considerably higher (r = .32/.46 = .70) than one would have inferred from the observed retest correlation of r = .46 for self-ratings alone.

A retest correlation of r = .70 over a 26-year interval is consistent with other estimates that the stability of personality dispositions is about r = .90 over a 10-year period and r = .98 over a 1-year period (Conley, 1984; Terracciano, Costa, & McCrae, 2006) and that the majority of the variance is due to stable traits that never change (Anusic & Schimmack, 2016). The failure to realize that observed retest correlations underestimate the stability of personality dispositions can be costly because it gives personality researchers a false impression about the likelihood of finding empirical evidence for personality change. Given the true stability of personality, it is necessary to wait a long time or to use large sample sizes, and it is probably best to do both (Mroczek, 2007).
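
The two stability estimates for the 26-year neuroticism data can be checked with a few lines of Python. This is only a sketch of the ratio logic described in the text, not a fitted multi-method model; the function and variable names are hypothetical, and the same-occasion self-informant agreement (.46) is used as the estimate of valid variance in a single rating.

```python
def stability_estimate(retest_r: float, same_occasion_agreement: float) -> float:
    """Divide a retest correlation by same-occasion self-informant agreement.

    Assumption: self-ratings and informant ratings are independent methods,
    so their same-occasion correlation estimates the valid variance.
    """
    if not 0.0 < same_occasion_agreement <= 1.0:
        raise ValueError("agreement must be in the interval (0, 1]")
    return retest_r / same_occasion_agreement


agreement = 0.46          # self-informant agreement on the same occasion
self_retest = 0.46        # self-report retest correlation over 26 years
cross_retest = 0.32       # cross-informant, cross-occasion correlation

# Upper bound: treats all method variance as unstable.
no_method_stability = stability_estimate(self_retest, agreement)
# Estimate that is free of stable method variance.
with_method_stability = stability_estimate(cross_retest, agreement)

print(f"assuming unstable method variance: {no_method_stability:.2f}")
print(f"using cross-informant correlations: {with_method_stability:.2f}")
```

The first ratio equals 1.00 and the second is about .70, which matches the values in the text. The gap between them reflects the stable method variance that the self-report retest correlation includes and the cross-informant correlation does not.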

Prediction of behaviour and life outcomes

During the person-situation debate, it was proposed that a single personality trait predicts less than 10% of the variance in actual behaviours. However, most of these studies relied on self-ratings to measure personality. Given the moderate validity of self-ratings, the observed correlation severely underestimates the actual effect of personality traits on behaviour. For example, a recent meta-analysis reported an effect size of conscientiousness on GPA of r = .24 (Noftle & Robins, 2007). Ozer (2007) points out that, strictly speaking, the correlation between self-reported conscientiousness and GPA does not represent the magnitude of a causal effect.

Assuming 40% valid variance in self-report measures of conscientiousness (DeYoung, 2006), the true effect size of a conscientious disposition on GPA is r = .38 (.24/sqrt(.40)). As a result, the amount of explained variance in GPA increases from 6% to 14%. Once more, failure to correct for invalidity in personality measures can be costly. For example, a personality researcher might identify seven causal factors that independently produce observed effect size estimates of r = .24, which suggests that these seven factors explain less than 50% of the variance in GPA (7 * .24^2 = 40%). Suppose, however, that decades of further research fail to uncover additional predictors of GPA. The reason could be that the true amount of explained variance is nearly 100% and that the unexplained variance is due to invalid variance in personality measures (7 * .38^2 = 100%).
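
The correction used in this example divides the observed correlation by the square root of the proportion of valid variance in the predictor. The Python sketch below reruns the arithmetic with unrounded intermediate values; the function name is hypothetical, and the criterion (GPA) is assumed to be measured without error, as in the text.

```python
import math


def correct_for_invalidity(observed_r: float, valid_variance: float) -> float:
    """Disattenuate a correlation for invalid variance in the predictor.

    Assumption: only the predictor contains invalid variance; the
    criterion is treated as error-free.
    """
    if not 0.0 < valid_variance <= 1.0:
        raise ValueError("valid_variance must be in the interval (0, 1]")
    return observed_r / math.sqrt(valid_variance)


observed_r = 0.24       # self-reported conscientiousness and GPA
valid_variance = 0.40   # assumed share of valid variance in self-reports

true_r = correct_for_invalidity(observed_r, valid_variance)
print(f"corrected r: {true_r:.2f}")
print(f"explained variance, observed:  {observed_r ** 2:.1%}")
print(f"explained variance, corrected: {true_r ** 2:.1%}")

# Seven independent predictors with the same observed effect size.
n_predictors = 7
total_observed = n_predictors * observed_r ** 2
total_corrected = n_predictors * true_r ** 2
print(f"seven predictors, observed:  {total_observed:.0%}")
print(f"seven predictors, corrected: {total_corrected:.0%}")
```

With these inputs the corrected correlation is about .38, and seven such predictors account for roughly 40% of the variance before correction and roughly 100% after it.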


This paper provided an introduction to the logic of a multi-method study of construct validity. I showed how causal models of multi-method data can be used to obtain quantitative estimates of the construct validity of personality measures. I showed that accurate estimates of construct validity depend on the validity of the assumptions underlying a causal model of multi-method data, such as the assumption that methods are independent. I also showed that multi-method studies of construct validity require postulating a causal construct that can influence and produce covariances among independent methods. Multi-method studies for other constructs such as actual behaviours or act frequencies are more problematic because act frequencies do not predict a specific pattern of correlations across methods. Finally, I presented some preliminary evidence that commonly used self-ratings of personality are likely to have a moderate amount of valid variance that falls broadly in a range from 30% to 70% of the total variance. This estimate is consistent with meta-analyses of self-informant agreement (Connolly, Kavanagh, & Viswesvaran, 2007; Schneider & Schimmack, 2009). However, the existing evidence is limited and more rigorous tests of construct validity are needed. Moreover, studies with large, representative samples are needed to obtain more precise estimates of construct validity (Zou, Schimmack, & Gere, 2013). Hopefully, this paper will stimulate more research in this fundamental area of personality psychology by challenging the description of construct validity research as a Kafkaesque pursuit of an elusive goal that can never be reached (cf. Borsboom, 2006). Instead, empirical studies of construct validity are a viable and important scientific enterprise that faces the same challenges as other studies in personality psychology that try to make sense of correlational data.


Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1), 1–171.

Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766–781.

Biesanz, J. C., & West, S. G. (2004). Towards understanding assessments of the Big Five: Multitrait-multimethod analyses of convergent and discriminant validity across measurement occasion and type of observer. Journal of Personality, 72(4), 845–876.

Blanton, H., & Jaccard, J. (2006). Arbitrary metrics redux. American Psychologist, 61(1), 62–71.

Block, J. (1989). Critique of the act frequency approach to personality. Journal of Personality and Social Psychology, 56(2), 234–245.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30(6), 505–514.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Chaplin, W. F. (2007). Moderator and mediator models in personality research: A basic introduction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 602–632). New York, NY: Guilford Press.

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5(1), 11–25.

Conley, J. J. (1985). Longitudinal stability of personality traits: A multitrait-multimethod-multioccasion analysis. Journal of Personality and Social Psychology, 49(5), 1266–1282.

Connolly, J. J., Kavanagh, E. J., & Viswesvaran, C. (2007). The convergent validity between self and observer ratings of personality: A meta-analytic review. International Journal of Selection and Assessment, 15(1), 110–117.

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and Five Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91(6), 1138–1151.

Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69(1), 130–141.

Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8(1), 38–60.
[Assumes a gold standard method without systematic measurement error (e.g., an objective measure of height or weight is available).]

Funder, D. C. (1991). Global traits—a Neo-Allportian approach to personality. Psychological Science, 2(1), 31–39.

Funder, D. C. (1995). On the accuracy of personality judgment—a realistic approach. Psychological Review, 102(4), 652–670.

Goldberg, L. R. (1992). The social psychology of personality. Psychological Inquiry, 3, 89–94.

Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64(6), 1029–1041.

Grucza, R. A., & Goldberg, L. R. (2007). The comparative validity of 11 modern personality inventories: Predictions of behavioral acts, informant reports, and clinical indicators. Journal of Personality Assessment, 89(2), 167–187.

John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461–494). New York, NY: Guilford Press.

Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112(1), 165–172.

Kroh, M. (2005). Effects of interviews during body weight checks in general population surveys. Gesundheitswesen, 67(8–9), 646–655.

Lykken, D., & Tellegen, A. (1996). Happiness is a stochastic phenomenon. Psychological Science, 7(3), 186–189.

McCrae, R. R., & Costa, P. T. (1995). Trait explanations in personality psychology. European Journal of Personality, 9(4), 231–252.

Mroczek, D. K. (2007). The analysis of longitudinal data in personality research. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 543–556). New York, NY: Guilford Press.

Muthen, L. K., & Muthen, B. O. (2008). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthen & Muthen. 

Noftle, E. E., & Robins, R. W. (2007). Personality predictors of academic outcomes: Big Five correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), 116–130.

Ozer, D. J. (2007). Evaluating effect size in personality research. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology. New York, NY: Guilford Press.

Riemann, R., Angleitner, A., & Strelau, J. (1997). Genetic and environmental influences on personality: A study of twins reared together using the self- and peer-report NEO-FFI scales. Journal of Personality, 65(3), 449–475.

Robins, R. W., & Beer, J. S. (2001). Positive illusions about the self: Short-term benefits and long-term costs. Journal of Personality and Social Psychology, 80(2), 340–352.

Rowland, M. L. (1990). Self-reported weight and height. American Journal of Clinical Nutrition, 52(6), 1125–1133.

Schimmack, U. (2007). The structure of subjective well-being. In M. Eid & R. J. Larsen (Eds.), The science of subjective well-being (pp. 97–123). New York, NY: Guilford Press.

Schimmack, U., Bockenholt, U., & Reisenzein, R. (2002). Response styles in affect ratings: Making a mountain out of a molehill. Journal of Personality Assessment, 78(3), 461–483.

Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible information on life satisfaction judgments. Journal of Personality and Social Psychology, 89(3), 395–406.

Schimmack, U., Oishi, S., Furr, R. M., & Funder, D. C. (2004). Personality and life satisfaction: A facet-level analysis. Personality and Social Psychology Bulletin, 30(8), 1062–1075.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223.

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376.

Simms, L. J., & Watson, D. (2007). The construct validation approach to personality scale construction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 240–258). New York, NY: Guilford Press.

Terracciano, A., Costa, P. T., Jr., & McCrae, R. R. (2006). Personality plasticity after age 30. Personality and Social Psychology Bulletin, 32, 999–1009.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness Implicit Association Test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490–497.

Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS Scales. Journal of Personality and Social Psychology, 54(6), 1063–1070.

Watson, D., Wiese, D., Vaidya, J., & Tellegen, A. (1999). The two general activation systems of affect: Structural findings, evolutionary considerations, and psychobiological evidence. Journal of Personality and Social Psychology, 76(5), 820–838.

Zou, C., Schimmack, U., & Gere, J. (2013). The validity of well-being measures: A multiple-indicator–multiple-rater model. Psychological Assessment, 25, 1247–1254.