Personality Change in the MIDUS

In 2000, Costa, Herbst, McCrae, and Siegler published the article “Personality at midlife: Stability, intrinsic maturation, and response to life events.” At the time, it was the largest study of personality stability and change.

Over 1,000 participants (N = 1,779) completed the NEO, a measure of the Big Five personality traits, twice, nine years apart. Participants were 39 to 45 years old at Time 1. The main finding was that mean levels of personality hardly changed. If anything, all scales except agreeableness showed a small decrease. This finding led to the conclusion that personality is largely stable in adulthood.

Six years later, Roberts, Walton, and Viechtbauer reported the results of a meta-analysis of personality change over the life course. The results of this meta-analysis were dramatically different. In particular, conscientiousness showed marked increases throughout adulthood. According to this meta-analysis, conscientiousness increases by about half a standard deviation from age 30 to age 75.

Sometimes, meta-analyses are considered superior to original studies because they incorporate all of the available evidence. However, meta-analyses are also problematic because they combine a heterogeneous set of studies. The main limitation of Roberts et al.’s (2006) meta-analysis was the lack of good data. Costa et al.’s (2000) article provided by far the largest sample of adult (age > 30) participants. Other studies sometimes had samples of fewer than 100 participants or examined very brief time intervals that leave little time for changes in personality. For example, one study was based on 37 participants with a two-year retest interval (Weinryb et al., 1992). Thus, the amount of (mean-level) change in personality in adulthood remains an open empirical question that can only be answered with better data.

Fortunately, longitudinal data from large samples are now available to shed new light on personality change in adulthood. A few days ago, I posted results based on three waves spanning 8 years in the German Socio-Economic Panel. The results showed mainly cohort effects and little evidence of personality change with age. The figure below shows the results for conscientiousness. Only the youngest cohort (on the right) shows some increases from 2005 to 2013.


Here I present the results of an analysis of the MIDUS data. To examine age and cohort effects, I fitted a measurement model (Schimmack, 2019) to the three waves of the MIDUS. I also divided the sample into three cohorts: 30- to 40-year-olds (born 1965-75), 40- to 50-year-olds (born 1955-65), and 50- to 60-year-olds (born 1945-55) in 1995. The measurement model had metric and scalar invariance across all 9 groups (3 cohorts x 3 waves) and acceptable fit to the data, CFI = .952, RMSEA = .027, SRMR = .055. The MPLUS syntax can be found on OSF. The sample sizes for the three cohorts were N = 1,625, N = 1,674, and N = 1,279, although not all participants completed all three waves. Results were similar when data were analyzed with listwise deletion. The standardized means of the latent variables were centered so that all group means are deviations from the overall mean.



The results for conscientiousness are difficult to interpret. Unlike in the SOEP data, conscientiousness scores increase from Wave 1 to Wave 3 in all three cohorts. The effect size is modest for the 18-year interval but would double for a longer period from age 30 to age 70. Thus, an exclusive focus on change over time would be consistent with Roberts et al.’s findings. However, the figure also shows that there are no cohort differences in conscientiousness. That is, the oldest cohort (50- to 60-year-olds in 1995; born 1945-55) did not score higher in 1995 than the younger cohorts, despite being 10 to 20 years older. One possible explanation for this finding would be a cohort effect that offsets the age effect, but this cohort effect would imply that younger generations are more conscientious than older generations. The problem with this explanation is that there is no evidence or theory that would suggest such a cohort effect.

The alternative explanation would be period effects. Period effects would change conscientiousness scores of all cohorts in the same direction. However, there are also no theories or data to suggest that conscientiousness has increased from 1995 to 2009.

In conclusion, it remains unclear whether and how much conscientiousness levels increase with age. Although new and better data are available, the data are inconsistent and inconclusive.


The results for agreeableness are similar to those for conscientiousness. A focus on the longitudinal trends suggests that agreeableness increases with age, which mirrors Roberts et al.’s (2006) meta-analysis. This time, the oldest cohort also shows a pattern that is consistent with an age effect. However, other interpretations are possible. The SOEP data suggested a small cohort effect with younger cohorts being less agreeable. Thus, the differences between cohorts may not be age effects. The effect sizes over an 18-year interval are small, but might add up to the d = .4 effect size from age 30 to 75 suggested by Roberts et al.’s (2006) meta-analysis.

Roberts et al.’s (2006) meta-analysis also suggested that neuroticism decreases with age, while the SOEP data didn’t show an age-trend for neuroticism. The MIDUS data also show little evidence that neuroticism decreases with age. Longitudinal trends were only notable for two cohorts and the effect size of d = .2 over an 18-year period is small.



At least the results for openness are consistent with previous findings that openness is fairly stable during adulthood.


This is also the case for extraversion.


The bedrock of science is objective empirical observation that produces a consistent picture of a phenomenon. Obtaining such consistent evidence can be difficult. Studying personality change is difficult for many reasons. Following a large sample of participants over time is hard and costly. Even cross-sectional and longitudinal information in combination is not sufficient to disentangle age effects from period effects or cohort effects. It does not help that effect sizes are small. Even a moderate effect size of d = .5 over a period of 10 years implies only a tiny effect size of d = .05 over a one-year period. Moreover, personality measures have only modest validity and are influenced by systematic measurement error that can produce spurious evidence of personality change.

The study of mean differences also has the problem that many causal factors can explain a time-trend in the data at the mean level, and that mean level changes are most likely the aggregated effects of several causal factors at the individual level (e.g., work experiences or health problems may have opposite effects on conscientiousness). Thus, progress is more likely to be made by focusing on individuals’ trajectories rather than mean levels.

The broader implication of these findings is that there is no evidence that personality changes in substantial ways throughout adulthood. This conclusion is limited to the Big Five, although Costa and McCrae also found little evidence for age effects at the level of more specific personality traits. Of course, 20-year-olds behave differently than 40-year-olds or 60-year-olds. However, these changes in actual behaviors are more likely the result of changing life circumstances than changes in personality traits.

Open-SOEP: Cohort vs. Age Effects on Personality

The German Socio-Economic Panel (SOEP) is a unique and amazing project. Since 1984, representative samples of German families have been surveyed annually. This project has produced a massive amount of data and hundreds of publications. The traditional journal publications make it difficult to keep track of developments and to find related articles. A better way to make use of these data may be open science, where researchers can quickly share information.

In 2005, the SOEP included a brief, 15-item measure of the Big Five personality traits. These data were used for cross-sectional studies that related personality to other variables measured in the SOEP, such as well-being (Rammstedt, 2007). In 2009, the SOEP repeated the measurement of the Big Five. This provided longitudinal data for analyses of stability and change of personality. Researchers rushed to analyze the data and to report their findings. JPSP published two independent articles based on the same data (Lucas & Donnellan, 2011; Specht, Egloff, & Schmukle, 2011). Both articles examined age-differences across birth-cohorts and over time. Ideally, age effects would show up in both analyses and produce similar trends in the data. Both articles also paid little attention to cohort differences in personality (i.e., Germans born in 1920 who grew up during Nazi times might differ from Germans born in 1950 who grew up during the revolutionary 60s).

In 2017, the Big Five questions were administered again, which makes it easier to spot age trends and to distinguish age effects from cohort effects. Recently, the first article based on the three waves of data was published in JPSP (Wagner, Lüdtke, & Robitzsch, 2019). The article focused on retest correlations (consistency of individual differences over time) and did not examine mean levels of personality. The article does not mention cohort effects.

Cohort/Culture Effects

Like many Western countries, German culture has changed tremendously during the 20th century. In addition, German culture has been shaped by unique historical events such as the rise and fall of Hitler, the Second World War, followed by the Wirtschaftswunder, the division of the country into a democratic and a socialist state, and the unification of Germany after the fall of the Berlin Wall. The SOEP data provide a unique opportunity to examine whether personality is shaped by culture.

So far, studies of cultural influences on personality have mostly relied on cross-cultural comparisons of Western cultures with non-Western cultures. The main finding of these studies is that citizens of modern, individualistic nations tend to be more extraverted and open to experiences than citizens in traditional, collectivistic cultures.

Based on these findings, one might expect higher levels of extraversion and openness in younger generations of Germans who grew up in a more individualistic culture than their parents and grandparents.


The data are the Big Five ratings from the three waves of the SOEP (vp, zp, & bdp). Data were prepared and analyzed using R (see OSF for R code). The three items for each of the Big Five scales were summed and analyzed as a function of 7 ten-year birth cohorts (from 1918-1928, aged 77-87 in 2005, to 1978-1988, aged 17-27 in 2005) and three waves (2005, 2009, 2013). The overall mean was subtracted from each of the 21 means, and the mean differences were divided by the pooled standard deviation. Thus, the mean differences in the figures are standardized mean differences, to ease interpretation of effect sizes.
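The original analysis was done in R (see OSF), but the standardization step itself is simple. The following Python snippet is an illustration, not the original code; the data structure (a dict of cohort-by-wave cells) is hypothetical:

```python
from statistics import mean, stdev

def standardized_cell_means(scores):
    """Express each (cohort, wave) cell mean as a standardized
    deviation from the grand mean.

    scores: dict mapping (cohort, wave) -> list of scale scores.
    Subtracts the grand mean from each cell mean and divides by
    the pooled (overall) standard deviation, as described above.
    """
    pooled = [x for cell in scores.values() for x in cell]
    grand_mean = mean(pooled)
    pooled_sd = stdev(pooled)
    return {cell: (mean(vals) - grand_mean) / pooled_sd
            for cell, vals in scores.items()}
```

Each returned value can then be read directly as a standardized mean difference (d) relative to the overall mean, as in the figures.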


Openness to Experience

Openness to experience showed a clear cohort effect (Figure 1), with the lowest scores for the oldest cohort (1918-28) and the highest scores for the youngest cohort (1978-88). The difference between the youngest and oldest cohorts is d = .72, which is considered a large effect size. In comparison, there is no clear age trend in Figure 1. While scores decrease from t1 to t2, they increase from t2 to t3. All differences between t1 and t2 are small, |d| < .2.


Extraversion also shows a cohort effect in the predicted direction, but the effect size is smaller, d = .34.

In contrast, there are no age effects and the overall difference between 2005 and 2013 is d = -0.01.


I next examined conscientiousness because studies of age effects tend to show the largest age effects for this Big Five dimension. Regarding cohort effects, one might expect a decrease because older generations worked very hard to rebuild post-war Germany.

Consistent with the developmental literature, the youngest age cohort shows an increase in conscientiousness from 2005 to 2013, although the effect size is small (d = .21). The other age cohorts show very small decreases in conscientiousness, except for the oldest age cohort, which shows a small decrease, d = -.22. Regarding cohort effects, there is no general trend, but the youngest cohort shows very low levels of conscientiousness even in 2013, when they are 25 to 35 years old.


Developmental studies suggest that agreeableness increases as people get older. However, the SOEP data do not confirm this trend.

Within each cohort, agreeableness scores decrease, although the effect sizes are very small. The overall decrease from 2005 to 2013 is d = -.09. In contrast, there is a clear cohort effect, with agreeableness being highest in the oldest generation. The decrease tends to level off for the last three generations. The effect size is moderate, d = -.38.


The main result for neuroticism is that there is neither a pronounced cohort effect, d = -.09, nor age effect, d = -.13.


Previous analyses of personality data in the SOEP have focused on age effects and interpreted cross-sectional differences between older and younger Germans as age effects. However, these analyses were based on only two waves of data, which makes it difficult to interpret changes in personality scores over time. The third wave shows that some of the trends did not continue and suggests that there are no notable effects of aging in the SOEP data. The only age effect consistent with the literature is an increase in conscientiousness in the youngest cohort of 17- to 27-year-olds.

However, the data do show cohort effects that are consistent with cross-cultural studies. The more individualistic a culture becomes, the more open and extraverted its members become. Deeper analysis might help to elucidate which factors contribute to these changes (e.g., education level). The results also suggested that agreeableness decreased, which might be another consequence of increasing individualism.

Overall, the results suggest that personality is influenced by cultural factors during adolescence and early adulthood, but that personality remains fairly stable throughout adulthood. This conclusion is also supported by other longitudinal studies (e.g., MIDUS) that show little change in Big Five scores over time. Maybe Costa and McCrae were not entirely wrong when they compared personality to plaster that can be shaped while it is setting but remains stable once it has dried.

The Hierarchy of Consistency Revisited

In 1984, James J. Conley published one of the most interesting studies of personality stability. However, this important article was published in Personality and Individual Differences and has been largely ignored. Even today, the article has only 184 citations in Web of Science. In contrast, the more recent meta-analysis of personality stability by Roberts and DelVecchio (2000) has 1,446 citations.

Sometimes more recent and more cited does not mean better. The biggest problem in studies of stability is that random and occasion-specific measurement error attenuates observed retest correlations. Thus, observed retest correlations are prone to underestimate the true stability of personality traits. With a single retest correlation, it is impossible to separate measurement error from real change. However, when more than two repeated measurements are available, it is possible to separate random measurement error from true change using a statistical approach developed by Heise (1969).

The basic idea of Heise’s model is that change accumulates over time. Thus, if traits change from T1 to T2 and from T2 to T3, the trait changed even more from T1 to T3.

Without going into mathematical details, the observed retest correlation from T1 to T3 should match the product of the retest correlations from T1 to T2 and T2 to T3.

For example, if r12 = .8 and r23 = .8, r13 should be .8 * .8 = .64.

The same is also true if the retest correlations are not identical. Maybe more change occurred from T1 to T2 than from T2 to T3. The total stability is still a function of the product of the two partial stabilities. For example, r12 = .8 and r23 = .5 yields r13 = .8 * .5 = .4.

However, if there is random measurement error, the r13 correlation will be larger than the product of the r12 and r23 correlations. For example, with true stabilities s1 = .8 and s2 = .4 and a reliability of .8, we get r12 = .8 * .8 = .64 and r23 = .4 * .8 = .32; the product is .64 * .32 = .20, while the actual r13 correlation is (.8 * .4) * .8 = .256. Assuming that reliability is constant, we have three equations with three unknowns, and it is possible to solve the equations to estimate reliability.

(1) r12 = rel*s1; s1 = r12/rel
(2) r23 = rel*s2; s2 = r23/rel
(3) r13 = rel*s1*s2, rel = r13/(s1*s2)

rel = (r12*r23)/r13

With r12 = .64, r23 = .32, and r13 = .256, we get rel = (.64*.32)/.256 = .8.
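Under the constant-reliability assumption, the three equations can be solved directly. A minimal sketch (an illustration of the algebra above, not the original analysis code):

```python
def heise_decomposition(r12, r23, r13):
    """Separate reliability from true stability (Heise, 1969),
    assuming constant reliability and autoregressive change:
        r12 = rel * s1,  r23 = rel * s2,  r13 = rel * s1 * s2.
    Returns (reliability, stability T1->T2, stability T2->T3).
    """
    rel = (r12 * r23) / r13  # reliability
    s1 = r12 / rel           # true stability from T1 to T2
    s2 = r23 / rel           # true stability from T2 to T3
    return rel, s1, s2
```

Applied to the worked example, `heise_decomposition(0.64, 0.32, 0.256)` recovers a reliability of .8 and true stabilities of .8 and .4.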

Heise’s model is called an autoregressive model, which implies that retest correlations become smaller and smaller over time until they approach zero. However, if stability is high, this can take a long time. For example, Conley (1984) estimated that the annual stability of IQ tests is r = .99. With this high stability, the implied retest correlation over 40 years is still r = .67. Consistent with Conley’s prediction, a study found a retest correlation of r = .67 from age 11 to age 70 (ref), which is even higher than predicted by Conley.
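In an autoregressive model, stability compounds multiplicatively, so the implied long-term retest correlation is just the annual stability raised to the number of years. A quick check of Conley’s numbers:

```python
def implied_retest(annual_stability, years):
    """Retest correlation over a given interval implied by a
    constant annual stability in an autoregressive model."""
    return annual_stability ** years

# Conley's estimated annual stability of r = .99 for IQ implies:
print(round(implied_retest(0.99, 40), 2))  # 0.67 over 40 years
print(round(implied_retest(0.99, 59), 2))  # 0.55 from age 11 to age 70
```

The observed age-11-to-70 correlation of .67 exceeds the implied value of about .55, which is why it is even higher than Conley’s prediction.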

The Figure below shows Conley’s estimate for personality traits like extraversion and neuroticism. The figure shows that reliability varies across studies and instruments from as low as .4 to as high as .9. After correcting for unreliability, the estimated annual stability of personality traits is s = .98.

The figure also shows that most studies in this meta-analysis of retest correlations covered short time intervals, from a few months up to 10 years. Studies with intervals of 10 or more years are rare. As a result, Conley’s estimates are not very precise.

To test Conley’s predictions, I used the three waves of the Midlife in the US study (MIDUS). Each wave was approximately 10 years apart with a total time span of 20 years. To analyze the data, I fitted a measurement model to the personality items in the MIDUS. The fit of the measurement model has been examined elsewhere (Schimmack, 2019). The measurement model was constrained for all three waves (see OSF for syntax). The model had acceptable overall fit, CFI = .963, RMSEA = .018, SRMR = .035 (see OSF for output).

The key findings are the retest correlations r12, r23, and r13 for the Big Five and two method factors: a factor for evaluative bias (halo) and a factor for acquiescence bias.


For all traits except acquiescence bias, the r13 correlation is lower than the r12 or r23 correlation, indicating some real change. However, for all traits, the r13 correlation is higher than the product of r12 and r23, indicating the presence of random measurement error or occasion specific variance.

The next table shows the decomposition of the retest-correlations into a reliability component and a stability component.

Table: Reliability | 20-Year Stability | 1-Year Stability

The reliability estimates range from .84 to .92 for the Big Five scales. Reliability of the method factors is estimated to be lower. After correcting for unreliability, 20-year stability estimates increase from observed levels of .72 to .85 to estimated levels of .83 and higher. The implied annual stability estimates are above .99, which is higher than Conley’s estimate of .98.
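Going the other way, the implied annual stability is the 20th root of the disattenuated 20-year stability. For example, under the autoregressive model, even the lowest corrected 20-year stability of .83 already implies an annual stability above .99:

```python
def annual_stability(long_term_stability, years):
    """Annual stability implied by a retest correlation over a
    longer interval, assuming autoregressive change."""
    return long_term_stability ** (1 / years)

print(round(annual_stability(0.83, 20), 3))  # 0.991
```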

Unfortunately, three time points are not enough to test the assumptions of Heise’s model. Maybe reliability increases over time. Another possibility is that some of the variance in personality is influenced by stable factors that never change (e.g., genetic variance). In this case, retest correlations do not approach zero but level off at a value that is set by the influence of the stable factors.

Anusic and Schimmack’s meta-analysis suggested that for the oldest age group, the amount of stable variance is 80%, and that this asymptote is reached very quickly (see picture). However, this model predicts that 10-year retest correlations are equivalent to 20-year retest correlations, which is not consistent with the results in Table 1. Thus, the MIDUS data suggest that the model in Figure 1 overestimates the amount of stable trait variance in personality. More data are needed to model the contribution of stable factors to the stability of personality traits. However, both models predict high stability of personality over a long period of 20 years.


Science can be hard. Astronomy required telescopes to study the universe. Psychologists need longitudinal studies to examine the stability of personality and personality development. The first telescopes were imperfect and led to false beliefs about canals and life on Mars. Similarly, longitudinal data are messy and provide imperfect glimpses into the stability of personality. However, the accumulating evidence shows impressive stability in personality differences. Many psychologists are dismayed by this finding because they have a fixation on disorders and negative traits. However, the Big Five traits are not disorders or undesirable traits. They are part of human diversity. When it comes to normal diversity, stability is actually desirable. Imagine you train for a job and after ten years of training you don’t like it anymore. Imagine you marry a quiet introvert and five years later he is a wild party animal. Imagine you never know who you are because your personality is constantly changing. The grass on the other side of the fence is often greener, but self-acceptance and building on one’s true strengths may be a better way to live a happy life than trying to change your personality to fit cultural norms or parental expectations. Maybe stability and predictability aren’t so bad after all.

The results also have implications for research on personality change and development. If natural variation in the factors that influence personality produces only very small changes over periods of a few years, it will be difficult to study personality change. Moreover, small real changes will be contaminated with relatively large amounts of random measurement error. Good measurement models that can separate real change from noise are needed.


Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5, 11-25.

Heise, D. R. (1969). Separating reliability and stability in test-retest correlation. American Sociological Review, 34, 93-101.

Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.

Measuring Personality in the MIDUS

Although the replication crisis in psychology is far from over, a new crisis is emerging on the horizon: the validation crisis. Despite a proud tradition of psychological measurement, psychological science has ignored psychological measurement and treated sum scores of ratings or reaction times as valid without testing this assumption (Schimmack, 2019a, 2019b).

Even when psychometricians examine the validity of psychological measures, these studies are often ignored. For example, there is ample evidence that self-ratings are influenced by a general evaluative bias or halo (Thorndike, 1920; Campbell & Fiske, 1959; Biesanz & West, 2004; DeYoung, 2006; Anusic et al., 2009; Kim et al., 2012). Yet, psychometric studies of the Big Five tend to ignore this method factor (Zimprich, Allemand, & Lachman, 2012).

This is unfortunate because psychologists now have invaluable datasets that examine personality in large, nationally representative, longitudinal studies such as the German Socio-Economic Panel (Specht et al.) and the Midlife in the United States (MIDUS) study.

The aim of this blog post is to invite psychologists to take advantage of advances in psychometric methods when they analyze these datasets. Rather than computing sum scores with low reliability that are contaminated by method variance, it is preferable to use latent variable models that can test measurement invariance across samples and over time.

To examine age effects on personality in the MIDUS, I developed an open measurement model. Rather than arguing that this is the best measurement model, I consider it a starting point for further exploration. Exploring different measurement models and examining the theoretical consequences of different specifications is not validity hacking (v-hacking; cf. Schimmack, 2019c). Transparent, open debate about the specification of measurement models is open science and necessary for developing better measures.

Using a measurement model for the MIDUS is particularly important because the questionnaire has only a few items to represent some Big Five dimensions. Moreover, halo bias inflates factor loadings in models that do not control for it (EFA, or CFA without method factors), and the results would overestimate the validity of Big Five scales in the MIDUS.

The final model had acceptable overall fit and modification indices suggested no major further revisions to the model, CFI = .958, RMSEA = .039, 90%CI = .037 to .040, SRMR = .035.

Table 1 shows the factor loadings of items and scale scores on the latent Big Five factors.


Results show several items with notable secondary loadings (e.g., warm), and some primary factor loadings were modest (e.g., curious). Nevertheless, 50% or more of the variance in sum scores can be attributed to the primary content of a scale, except for conscientiousness. All scales also had considerable halo variance. For conscientiousness, halo variance was nearly as high as conscientiousness variance. Given these results, it is preferable to examine substantive questions with the latent factors of a measurement model rather than with manifest scale scores.

Age and Personality

The MIDUS data are some of the best data to examine the influence of age on personality because longitudinal studies with large samples and long retest intervals are rare (see meta-analysis by Anusic & Schimmack, 2016).

Age effects can be examined cross-sectionally and longitudinally. The problem with cross-sectional studies is that age is confounded with cohort effects. The problem with longitudinal studies is that age is confounded with period effects. Stronger evidence for robust age effects is obtained in longitudinal cohort studies. The MIDUS data make it possible to compare participants who were 45 (40 to 50) with participants who were 55 (50 to 60) at Time 1 and to compare their scores at Time 1 to their scores at Time 2. The older age group at Time 1 corresponds to the younger age group at Time 2 (age 50 to 60). Thus, these groups should be similar to each other but differ from the younger group at Time 1 and the older group at Time 2, if age influences personality.

To test this hypothesis, I fitted a multi-group model to the MIDUS data at Time 1 and Time 2. The model assumed metric and scalar invariance for all four groups. This model had good fit to the data, CFI = .957, RMSEA = .026, SRMR = .047.

The means of the latent Big Five factors and the two method factors were centered at the overall mean of the four groups, so that mean differences are presented as deviations from 0 (rather than using one group as an arbitrary reference group).

The results show no notable age effects for extraversion or openness. Neuroticism shows a decreasing trend with a standardized mean difference of .33 from age 40-50 to age 60-70. Agreeableness shows an even smaller increase by .21 standard deviations. The results for conscientiousness are difficult to interpret because the equivalent age groups differ more from each other than from other age groups. Overall, these results suggest that mean levels of personality are fairly stable from age 40 to age 70.

The halo factor shows a trend towards increasing with age. However, the increase is also modest, d = .35. The largest effect is a decrease in acquiescence. This effect is mostly driven by a retest effect, suggesting that acquiescence bias decreases with repeated testing.

These results suggest that most changes in personality may occur during adolescence and early adulthood, but that mean levels of personality are fairly stable throughout midlife.

The model also provides information about the rank-order consistency of personality over a 10-year period. Consistent with meta-analytic evidence, retest correlations are high: neuroticism, r = .81, extraversion r = .87, openness r = .78, agreeableness r = .84, and conscientiousness, r = .81. A novel finding is that halo bias is also stable over a 10-year period, r = .69. So is acquiescence bias, r = .57. Thus, even time-lagged correlations can be influenced by method factors. Thus, it is necessary to control for halo bias in studies that rely on self-reports.

Gender and Personality

I also fitted a multiple-group model to the data with gender as a between-group variable and time (T1 vs. T2) as a second grouping factor. This model examines age differences for groups aged 40-50 (T1) and 50-60 (T2). The model with metric and scalar invariance had acceptable fit, CFI = .952, RMSEA = .027, SRMR = .051. As before, the means of the latent factors were transformed so that the overall mean was zero.

The main finding is a large difference between men and women in agreeableness of nearly a full standard deviation. This difference was the same in both age groups. This finding is consistent with previous studies, including cross-cultural studies, suggesting that gender differences in agreeableness are robust and universal.

The results also showed consistent gender differences in neuroticism with an effect size of about 50% of a standard deviation. Again, the gender difference was observed in both age groups. This finding is also consistent with cross-cultural studies.

Hidden Invalidity of Personality Measures?

Sometimes journal articles have ironic titles. The article “Hidden invalidity among fifteen commonly used measures in social and personality psychology” (in press at AMPPS) is one of them. The authors (Ian Hussey & Sean Hughes) claim that personality psychologists engaged in validity hacking (v-hacking) and claimed validity for personality measures when actual validation studies show that these measures have poor validity. As it turns out, these claims are false, and the article is an example of invalidity hacking, where the authors ignore and hide evidence that contradicts their claims.

The authors focus on several aspects of validity. Many measures show good internal consistency and retest reliability. The authors ignore convergent and discriminant validity as important criteria of construct validity (Campbell & Fiske, 1959). Their claim that many personality measures are invalid is based on examinations of structural validity and of measurement invariance across age groups and genders.

“Yet, when validity was assessed comprehensively (via internal consistency, immediate and delayed test-retest reliability, factor structure, and measurement invariance for median age and gender) only 4% demonstrated good validity. Furthermore, the less commonly a test is reported in the literature, the more likely it was to be failed (e.g., measurement invariance). This suggests that the pattern of underreporting in the field may represent widespread hidden invalidity of the measures we use, and therefore pose a threat to many research findings. We highlight the degrees of freedom afforded to researchers in the assessment and reporting of structural validity. Similar to the better-known concept of p-hacking, we introduce the concept of validity hacking (v-hacking) and argue that it should be acknowledged and addressed.”

Structural validity is important when researchers rely on manifest scale scores to test theoretical predictions that hold at the level of unobserved constructs. For example, gender differences in agreeableness are assumed to exist at the level of the construct. If a measurement model is invalid, mean differences between men and women on an (invalid) agreeableness scale may not reveal the actual differences in agreeableness.

The authors claim that “rigorous tests of validity are rarely conducted or reported” and that “many of the measures we use appear perfectly adequate on the surface and yet fall apart when subjected to more rigorous tests of validity beyond Cronbach’s α.” This claim is neither supported by citations nor consistent with the general practice in the development of psychological measures to explore the factor structure of items. For example, the Big Five were not conceived theoretically, but found empirically by employing exploratory factor analysis (or principal component analysis). Thus, claims of widespread v-hacking by omitting structural analyses seem inconsistent with actual practices.

Based on this questionable description of the state of affairs, the authors suggest that they are the first to conduct empirical tests of structural validity.

“With this in mind, we examined the structural validity of fifteen well-known self-report measures that are often used in social and personality psychology using several best practices (see Table 1).”

The practice to present something as novel by omitting relevant prior studies has been called l-hacking (literature review hacking). It also makes it unnecessary to compare results with prior results and to address potentially inconsistent results.

This also allows the authors to make false claims about their data: “The sheer size of the sample involved (N = 81,986 individuals, N = 144,496 experimental sessions) allowed us to assess the psychometric properties of these measures with numbers that were far greater than those used in many earlier validation studies.” Contrary to this claim, Nye, Allemand, Gosling, and Roberts (2016) published a study of the structural validity of the same personality measure (BFI) with over 150,000 participants. Thus, their study was neither novel nor did it have a larger sample size than prior studies.

The authors also made important and questionable choices that highlight the problem of researchers’ degrees of freedom in validation studies. In this case, their choice to fit a simple-structure model to the data ensured that they would obtain relatively bad fit if scales included reverse-scored items, which is a good practice to reduce the influence of acquiescence bias on scale scores. However, the presence of acquiescence bias will also produce weaker correlations between direct and reverse-scored items. This response style can be modeled by including a method factor in the measurement model. Prior articles showed that acquiescence bias is present and that including an acquiescence factor improves model fit (Anusic et al., 2009; Nye et al., 2016). The choice not to include a method factor contributed to the authors’ conclusion that Big Five scales are structurally invalid. Thus, the authors’ conclusion is based on their own choice of a poor measurement model rather than hidden invalidity of the BFI.

The authors justify their choice of a simple structure with the claim that “most researchers who use these scales simply calculate sum scores and rely on these in their subsequent analyses. In doing so, they are tacitly endorsing simple measurement models (with no cross-loadings or method factors).” This claim is plainly wrong. The only purpose of reverse-scored items is to reduce the influence of acquiescence bias on scale scores, because aggregation of direct and reverse-scored items reduces the bias that is common to both types of items. If researchers assumed that acquiescence bias was absent, there would be no need for reverse-scored items. Moreover, aggregation of items does not imply that all items are pure indicators of the latent construct or that there are no additional relationships among items (see Nye et al., 2016). The main rationale for summing items is that they all have moderate to high loadings on the primary factor. When this is the case, most of the variance in sum scores reflects the common primary factor (see, e.g., Schimmack, 2019, for an example).
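The arithmetic behind this rationale can be sketched in a few lines (the loadings below are hypothetical, not values from any of the cited studies): for standardized items with loadings on a single common factor and uncorrelated residuals, the factor’s share of the sum-score variance is the squared sum of the loadings divided by the total sum-score variance.

```python
# Proportion of sum-score variance explained by a common primary factor.
# Assumes standardized items, one common factor, uncorrelated residuals.
def common_factor_share(loadings):
    common = sum(loadings) ** 2                  # variance from the factor
    unique = sum(1 - l ** 2 for l in loadings)   # residual item variance
    return common / (common + unique)

# Hypothetical scale: six items with moderate-to-high primary loadings.
share = common_factor_share([0.7, 0.65, 0.6, 0.6, 0.55, 0.5])
print(round(share, 2))  # → 0.77
```

Even with loadings no higher than .7, roughly three quarters of the sum-score variance reflects the common factor, which is why sum scores can serve as reasonable proxies.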

The authors also developed their own coding scheme to determine whether a scale has good, mixed, or poor fit to the data based on four fit indices. A scale was said to have good fit if CFI was above .95, TLI was above .95, RMSEA was below .06, and SRMR was below .09. That is, to have good fit, a scale had to meet all four criteria. A scale was said to have poor fit if it met none of the four criteria. All other possibilities were considered mixed fit. Only conscientiousness met all four criteria. Agreeableness met 3 out of 4 (RMSEA = .063). Extraversion met 3 out of 4 (RMSEA = .075). Neuroticism met 3 out of 4 (RMSEA = .065). And openness met 1 out of 4 (SRMR = .060), but was misclassified as poor. Thus, although the authors fitted a highly implausible simple-structure model, the fit indices suggested that this model fitted the data reasonably well. Experienced SEM researchers would also wonder about the classification of openness as poor fit given that CFI was .933 and RMSEA was .069.
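This coding scheme can be expressed as a short function (good fit requires meeting all four cut-offs, CFI ≥ .95, TLI ≥ .95, RMSEA ≤ .06, SRMR ≤ .09; poor fit requires meeting none; the example index values are hypothetical):

```python
# Classify model fit by four conventional cut-offs: good fit requires
# meeting ALL four criteria, poor fit meeting NONE, otherwise mixed.
def classify_fit(cfi, tli, rmsea, srmr):
    met = [cfi >= 0.95, tli >= 0.95, rmsea <= 0.06, srmr <= 0.09]
    if all(met):
        return "good"
    if not any(met):
        return "poor"
    return "mixed"

# Hypothetical values: three criteria met, RMSEA slightly too high.
print(classify_fit(cfi=0.96, tli=0.95, rmsea=0.063, srmr=0.05))  # → mixed
```

Under this scheme, a scale that misses only one cut-off by a small margin is classified the same as one that misses three, which illustrates how coarse such categorical verdicts are.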

More important than meeting conventional cut-off values is to examine problems with a measurement model. In this case, one obvious problem is the lack of a method factor for acquiescence bias; or the presence of substantive variance that reflects lower-order traits (facets).

It is instructive to compare these results to Nye et al.’s (2016) prior results on structural validity. They found slightly worse fit for the simple-structure model, but they also showed that model fit improved when they modeled the presence of lower-order factors or acquiescence bias (2 factors, pos./neg.) in the data. An even better model fit would have been obtained by modeling facets and acquiescence bias in a single model (Schimmack, 2019).

In short, the problem with the Big Five Inventory is not that it has poor validity as a measure of the Big Five. Poor fit of a simple-structure simply shows that other content and method factors contribute to variance in Big Five scales. A proper assessment of validity would require quantifying how much of the variance in Big Five scales can be attributed to the variance in the intended construct. That is, how much of the variance in extraversion scores on the BFI reflects actual variation in extraversion? This fundamental question was not addressed in the “hidden invalidity” article.

The “hidden invalidity” article also examined measurement invariance across age groups (median split) and the two largest gender groups (male, female). The actual results are only reported in a Supplement. Inspecting the Supplement shows hidden validity. Big Five measures passed most tests of metric and scalar invariance by the authors’ own criteria.

Scale | Group | Model | Δ fit indices | Decision
Big 5 Inventory – A | age | configural | NA NA NA NA | Poor
Big 5 Inventory – A | age | metric | 0.020 0.044 -0.013 0.001 | Passed
Big 5 Inventory – A | age | scalar | -0.013 0.000 0.000 0.002 | Passed
Big 5 Inventory – A | sex | configural | NA NA NA NA | Poor
Big 5 Inventory – A | sex | metric | 0.023 0.046 -0.014 0.001 | Passed
Big 5 Inventory – A | sex | scalar | -0.014 -0.003 0.001 0.003 | Passed
Big 5 Inventory – C | age | configural | NA NA NA NA | Poor
Big 5 Inventory – C | age | metric | 0.029 0.054 -0.017 0.001 | Passed
Big 5 Inventory – C | age | scalar | -0.013 -0.003 0.001 0.003 | Passed
Big 5 Inventory – C | sex | configural | NA NA NA NA | Poor
Big 5 Inventory – C | sex | metric | 0.031 0.055 -0.018 0.002 | Passed
Big 5 Inventory – C | sex | scalar | -0.005 0.006 -0.002 0.001 | Passed
Big 5 Inventory – E | age | configural | NA NA NA NA | Poor
Big 5 Inventory – E | age | metric | 0.045 0.081 -0.032 0.001 | Passed
Big 5 Inventory – E | age | scalar | -0.013 -0.001 0.000 0.003 | Passed
Big 5 Inventory – E | sex | configural | NA NA NA NA | Poor
Big 5 Inventory – E | sex | metric | 0.042 0.078 -0.030 0.002 | Passed
Big 5 Inventory – E | sex | scalar | -0.007 0.007 -0.003 0.001 | Passed
Big 5 Inventory – N | age | configural | NA NA NA NA | Poor
Big 5 Inventory – N | age | metric | 0.026 0.054 -0.020 0.002 | Passed
Big 5 Inventory – N | age | scalar | -0.022 -0.010 0.004 0.005 | Failed
Big 5 Inventory – N | sex | configural | NA NA NA NA | Poor
Big 5 Inventory – N | sex | metric | 0.032 0.061 -0.022 0.001 | Passed
Big 5 Inventory – N | sex | scalar | -0.011 0.001 -0.001 0.003 | Passed
Big 5 Inventory – O | age | configural | NA NA NA NA | Poor
Big 5 Inventory – O | age | metric | 0.036 0.068 -0.014 0.001 | Passed
Big 5 Inventory – O | age | scalar | -0.041 -0.024 0.005 0.006 | Failed
Big 5 Inventory – O | sex | configural | NA NA NA NA | Poor
Big 5 Inventory – O | sex | metric | 0.035 0.065 -0.014 0.002 | Passed
Big 5 Inventory – O | sex | scalar | -0.043 -0.028 0.006 0.006 | Failed

However, readers of the article don’t get to see this evidence. Instead they are presented with a table that suggests Big Five measures lack measurement invariance.

Aside from the misleading presentation of the results, the results are not very informative because they don’t reveal whether deviations from a simple-structure pose a serious threat to the validity of Big Five scales. Unfortunately, the authors’ data are currently not available to examine this question.

Own Investigation

Incidentally, I had just posted a blog post about measurement models of Big Five data (Schimmack, 2019), using open data from another study (Beck, Condon, & Jackson, 2019): a large online dataset with the IPIP-100 items. I showed that it is possible to fit a measurement model to the IPIP-100. To achieve model fit, the model included secondary loadings, some correlated residuals, and method factors for acquiescence bias and evaluative (halo) bias. These results show that a reasonable measurement model can fit Big Five data, as was demonstrated in several previous studies (Anusic et al., 2009; Nye et al., 2016).

Here, I examine measurement invariance for gender and age groups. I also modified and improved the measurement model, by using several of the rejected IPIP-100 items as indicators of the halo factor. Item analysis showed that the items “quick to understand things,” “carry conversations to a higher level,” “take charge,” “try to avoid complex people,” “wait for others to lead the way,” “will not probe deeply into a subject” loaded more highly on the halo factor than on the intended Big Five factor. This makes these items ideal candidates for the construction of a manifest measure of evaluative bias.

The sample consisted of 9,309 Canadians, 140,479 US Americans, 5,804 British, and 5,091 Australians between the ages of 14 and 60 (see for data). Data were analyzed with MPLUS 8.2 using robust maximum likelihood (see for complete syntax). The final model met the standard criteria for acceptable fit (CFI = .965, RMSEA = .015, SRMR = .032).

Table 1. Factor Loadings and Item Intercepts for Men (First) and Women (Second)

Item | loadings (men / women) | intercept (men / women)
3 | .48/.47 | -.07/-.08 | -.21/-.21 | .14/.13 | -0.34/-0.45
10 | -.61/-.60 | .16/.16 | .13/.13 | 0.21/0.22
17 | -.66/-.63 | .09/.09 | .14/.13 | .15/.14 | 0.71/0.70
46 | .73/.73 | .11/.09 | -.20/-.20 | .13/.12 | -0.19/-0.20
56 | .56/.54 | -.15/-.16 | -.20/-.19 | .13/.12 | -0.46/-0.46
SUM | .83/.83 | -.02/-.02 | .00/.00 | .03/.03 | -.06/-.07 | -.26/-.26 | .04/.04
16 | .12/.12 | -.72/-.69 | -.20/-.19 | .13/.12 | 0.31/0.32
33 | -.32/-.32 | .60/.61 | .24/.21 | .08/.09 | .19/.20 | .15/.14 | 0.56/0.58
38 | .20/.20 | -.71/-.70 | -.09/-.10 | -.24/-.24 | .13/.12 | -0.17/-0.18
60 | .09/.08 | -.69/-.67 | -.21/-.21 | .14/.13 | -0.07/-0.08
88 | -.10/-.10 | .72/.72 | .16/.14 | .26/.26 | .14/.14 | 0.44/0.46
93 | -.15/-.15 | .72/.72 | .10/.09 | .16/.16 | .12/.11 | 0.03/0.03
SUM | -.23/-.23 | .81/.81 | .00/.00 | .14/.12 | .05/.05 | .24/.24 | .08/.07
5 | .10/.09 | .40/.42 | -.07/-.08 | .50/.47 | .19/.17 | 1.32/1.29
27 | -.70/-.74 | -.23/-.22 | .15/.14 | -1.12/-1.09
52 | .08/.08 | .73/.76 | -.14/-.14 | .29/.27 | .17/.15 | 1.21/1.16
53 | -.64/-.65 | -.37/-.34 | .18/.16 | -1.49/-1.40
SUM | .03/.02 | .03/.03 | .79/.82 | .00/.00 | -.07/-.07 | .44/.40 | .00/.00
8 | -.59/-.53 | -.22/-.24 | .15/.15 | -0.65/-0.71
12 | -.59/-.51 | -.22/-.23 | .14/.14 | -0.49/-0.52
35 | -.63/-.54 | -.23/-.24 | .15/.14 | -0.79/-0.83
51 | .63/.61 | .02/.02 | .15/.16 | 0.62/0.72
89 | .74/.72 | .20/.23 | .16/.18 | 0.80/0.95
94 | .58/.53 | .10/.12 | .12/.10 | .16/.16 | 0.45/0.50
SUM | .00/.00 | .00/.00 | .00/.00 | .87/.84 | .02/.03 | .24/.26 | .00/.00
40 | .42/.46 | .08/.08 | .14/.13 | 0.15/0.38
43 | .09/.09 | .63/.66 | .14/.13 | .13/.12 | -0.26/-0.13
63 | -.76/-.79 | -.18/-.17 | .12/.10 | 0.19/0.18
64 | -.69/-.74 | -.04/-.04 | .12/.11 | -0.08/0.01
68 | .12/.12 | -.06/-.06 | -.06/-.07 | .14/.12 | .43/.48 | -.03/-.03 | .15/.14 | 0.38/0.39
79 | -.72/-.76 | -.19/-.18 | .12/.11 | -0.11/-0.11
SUM | .03/.02 | .01/.01 | -.01/-.01 | .03/.02 | .87/.89 | .15/.14 | .00/.00
15 | .53/.51 | .19/.18 | 1.38/1.38
23 | .10/.10 | .20/.20 | .61/.62 | .16/.16 | 0.79/0.82
90 | .43/.41 | .14/.15 | .36/.35 | .15/.14 | 0.65/0.64
95 | .13/.13 | -.49/-.46 | .15/.13 | -0.73/-0.70
97 | -.41/-.39 | .14/.12 | -.10/-.11 | -.41/-.40 | .14/.13 | -0.38/-0.38
99 | -.11/-.11 | -.54/-.53 | .16/.15 | -0.86/-0.87
SUM | .06/.05 | .29/.28 | .00/.00 | -.04/-.04 | .03/.03 | .77/.77 | .00/.00
SUM | .10/.10 | .03/.03 | -.04/-.05 | .14/.12 | -.19/-.21 | -.17/-.17 | .71/.69

The factor loadings show that items load on the primary factors and that these factor loadings are consistent for men and women. Secondary loadings tended to be weak, although even the small loadings were highly significant and consistent across both genders; so were loadings on the two method factors. The results for the sum scores show that most of the variance in sum scores was explained by the primary factor with effect sizes ranging from .71 to .89.

Item-intercepts show the deviation from the middle of the scale in standardized units (standardized mean differences from 3.5). The assumption of equal item-intercepts was relaxed for four items (#3, #40, #43, #64), but even for these items the standardized mean differences were small. The largest difference was observed for following a schedule (M = 0.15, F = 0.38). Constraining these coefficients would reduce fit, but it would have a negligible effect on gender differences on the Big Five traits.

Table 2 and Figure 1 show the standardized mean differences between men and women for latent variables and for sum scores. The results for sum scores were based on the estimated means and variances in the Tech4 output of MPLUS (see output file on OSF).


Given the high degree of measurement invariance and the fairly high correlations between latent scores and sum scores, the results are very similar and replicate previous findings that most gender differences are small, but that women score higher on neuroticism and agreeableness. These results show that these differences cannot be attributed to hidden invalidity of Big Five measures. In addition, the results show a small difference in evaluative bias. Men are more likely to describe their personality in an overly positive way. However, given the effect size and the modest contribution of halo bias to sum scores, it has only a small effect on mean differences in scales. Along with unreliability, it attenuates the gender difference in agreeableness from d = .80 to d = .56.
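How bias and unreliability attenuate an observed difference can be illustrated with a back-of-the-envelope calculation (the weights and the halo difference below are illustrative assumptions, not estimates from the tables): the observed standardized difference on a sum score is approximately a weighted sum of the latent differences, with the scale’s standardized loadings as weights.

```python
# Observed standardized mean difference on a sum score, approximated as a
# weighted sum of latent differences (weights = standardized loadings of
# the scale on the trait factor and the halo factor; values hypothetical).
def observed_d(d_trait, d_halo, w_trait, w_halo):
    return w_trait * d_trait + w_halo * d_halo

# Latent agreeableness difference of d = .80 (women higher) plus a modest
# halo difference favoring men (d = -.20), with illustrative weights:
print(round(observed_d(0.80, -0.20, 0.81, 0.24), 2))  # → 0.6
```

Because the trait weight is below 1 and the halo difference runs in the opposite direction, the observed scale difference is necessarily smaller than the latent one.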


Hussey and Hughes claim that personality psychologists were hiding invalidity of personality measures by not reporting tests of structural validity. They also claim that personality measures fail tests of structural validity. The first claim is false because personality psychologists have examined factor structures and measurement invariance for the Big Five (e.g., Anusic et al., 2009; Nye et al., 2016). Thus, Hussey and Hughes misrepresent the literature and fail to cite relevant work. The second claim is inconsistent with Nye et al.’s results and with my new examination of structural invariance in personality ratings. Thus, Hussey and Hughes’s article does not make a contribution to the advancement of psychological science. Rather, it is an example of poor scholarship, where authors make strong claims (validity hacking) with weak evidence.

The substantive conclusion is that men and women have similar measurement models of personality and that it is possible to use sum scores to compare them. Thus, past results that are based on sum scores reflect valid personality differences. This is not surprising because men and women speak the same language and are able to communicate with each other about personality traits of men and women. There is also no evidence to suggest that memory retrieval processes underlying personality ratings differ between men and women. Thus, there is no reason to expect a lack of measurement invariance in personality ratings.

A more important question is whether gender differences in self-ratings reflect actual differences in personality. One threat to validity could be social comparison processes, where women compare themselves with other women and men with other men. However, social comparison would attenuate gender differences and cannot explain the moderate to large differences in neuroticism and agreeableness. Nevertheless, future research should examine gender differences using measures of actual behavior and informant ratings. Although sum scores are mostly valid, it is preferable to use latent variable models for these studies because latent variable models make it possible to test assumptions that are merely assumed to hold in studies with sum scores.

The Misguided Attack of a Meta-Psychometrician

Webster’s online dictionary defines a psychometrician as (a) a person who is skilled in the administration and interpretation of objective  psychological tests or (b) a psychologist who devises, constructs, and standardizes psychometric tests.

Neither definition describes Denny Borsboom, who is better described as a theoretical psychometrician, a philosopher of psychometrics, or a meta-psychometrician. The reason is that Borsboom never developed a psychological test. His main contributions to discussions about psychological measurement have been meta-psychological articles that reflect on the methods that psychometricians use to interpret and validate psychological measures and test scores.

Thus, one problem with Borsboom’s (2006) article “The attack of the psychometricians” is that Borsboom is not a psychometrician who is concerned with developing psychological measures. The second problem with the title is the claim that a whole group of psychometricians is ready to attack, while he was mainly speaking for himself. Social psychologists call this the “false consensus” bias, where individuals overestimate how much their attitudes are shared with others.

When I first read the article, I became an instant fan of Borsboom. He shared my frustration with common practices in psychological measurement that persisted despite criticism of these practices by eminent psychologists like Cronbach, Meehl, Campbell, and Fiske in the 1950s. However, it later became apparent that I had misunderstood Borsboom’s article. What seemed like a call for improving psychological assessment turned out to be a criticism of the entire enterprise of measuring individuals’ traits. Apparently, psychometricians weren’t just using the wrong methods; they were actually misguided in their beliefs that there are traits that can be measured.

In 2006, Borsboom criticized leading personality psychologists for dismissing results that contradicted their a priori assumptions. When McCrae, Zonderman, Costa, Bond, and Paunonen (1996) used Confirmatory Factor Analysis (CFA) to explore personality structure and their Big Five model didn’t fit the data, they questioned CFA and did not consider the possibility that their measurement model of the Big Five was wrong.

“In actual analyses of personality data [. . .] structures that are known to be reliable [from principal components analyses] showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure” (McCrae et al., 1996, p. 563).

I fully agree with Borsboom that it is wrong to dismiss a method simply on the grounds that it does not support a preexisting theory. However, six years later Borsboom made the same mistake and used misfit of a measurement model to jump to the conclusion that the Big Five do not exist ( Cramer, van der Sluis, Noordhof, Wichers, Geschwind, Aggen, Kendler, & Borsboom, 2012).

“The central tenet of this paper is to consider the misfit of the untweaked model an indication that the latent variable hypothesis fails as an explanation of the emergence of normal personality dimensions, and to move on towards alternative model” (p. 417).

This conclusion is as ridiculous as McCrae et al.’s conclusion. After all, how would it be possible that personality items that were created to measure a personality attribute and that were selected to show internal consistency and convergent validity with informant ratings do not reflect a personality trait? It seems more likely that the specified measurement model was wrong and that a different measurement model is needed to fit the data.

The key problem with the measurement model is the ridiculous assumption that all items load only on the intended factor. However, exploratory factor analyses and principal component analyses typically show secondary loadings. It is therefore not surprising that omitting these secondary loadings from a CFA model produces bad fit. The key problem in fitting CFA models is that it is difficult to create content-pure items, not that the Big Five cannot be identified or do not exist.

Confirmatory Factor Analysis of Big Five Data

Idealistic vs. Realistic CFA

The key problems in fitting simple CFA models to data are psychometric assumptions that are neither theory-driven nor plausible. The worst assumption of standard CFA models is that each personality item loads on a single factor. As a result, all loadings on other factors are fixed at zero. To the extent that actual data have secondary loadings, these CFA models will show poor fit.

From a theoretical point of view, constraining all secondary loadings to zero makes no sense. Doing so implies that psychometricians are able to create perfect items that reflect only a single factor. In this idealistic scenario, which exists only in the world of meta-psychometricians with simulated data, the null hypothesis that there are no secondary loadings is true. However, psychometricians who work with real data know that the null hypothesis is always false (Cohen, 1994). Meehl called this the crud factor. All items will be correlated with all other items, even if these correlations are small and meaningless.
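Meehl’s point can be illustrated with a toy simulation (all numbers hypothetical): give one item a small secondary loading of .15 on a factor it is not assigned to, and its correlation with a “pure” marker of that factor is no longer zero — exactly the kind of covariance that a zero-cross-loading CFA must absorb as misfit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two uncorrelated latent factors.
f1 = rng.standard_normal(n)
f2 = rng.standard_normal(n)

# Item x is a pure indicator of f1; item y is assigned to f2 but has a
# small secondary loading of .15 on f1 (hypothetical values).
x = 0.7 * f1 + rng.standard_normal(n)
y = 0.6 * f2 + 0.15 * f1 + rng.standard_normal(n)

# Expected correlation: .7 * .15 / sqrt(1.49 * 1.3825), roughly .07.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

A correlation of .07 is substantively meaningless, but in a sample of 100,000 it is many standard errors away from zero, so an exact-fit test of the simple-structure model is guaranteed to reject it.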

McCrae et al. (1996) made the mistake of interpreting bad fit of the standard CFA model as evidence that CFA cannot be used to study the Big Five. Borsboom and colleagues made the mistake of claiming that bad fit of the standard CFA model implies that the Big Five do not exist. The right conclusion is that the standard CFA model without secondary loadings and without method factors is an idealistic and therefore unrealistic model that will not fit real data. It can only serve as a starting point for exploration to find better-fitting models that actually fit the data.

CFA models of the Big Five

Another problem with Borsboom et al.’s (2012) article is that the authors ignored studies that used CFA to model Big Five questionnaires with good fit to actual data. They do not cite Biesanz and West (2004), DeYoung (2006), or Anusic et al. (2009).

All three articles used CFA to model agreement and disagreement in Big Five ratings for self-ratings and informant ratings. The use of multi-method data is particularly useful to demonstrate that Big Five factors are more than mere self-perceptions. The general finding of these studies is that self and informant ratings show convergent validity and can be modeled with five latent factors that reflect the shared variance among raters. In addition, these models showed that unique variances in ratings by a single rater are systematically correlated. The pattern of these correlations suggests an evaluative bias in self-ratings. CFA makes it possible to model this bias as a separate method factor, which is not possible with exploratory factor analysis. Thus, these articles demonstrate the usefulness of examining personality measurement models with CFA and they show that the Big Five are real personality traits that are not mere artifacts of self-ratings.

Anusic et al. (2009) also developed a measurement model for self-ratings. The model assumes that variance in each Big Five item has at least four components: (a) the intended construct variance (valid variance), (b) evaluative bias variance, (c) acquiescence bias variance, and (d) item-specific unique variance.

Thus, even in 2012 it was wrong to claim that CFA models do not fit Big Five data and to suggest that the Big Five do not exist. This misguided claim could only arise from a meta-psychometric perspective that ignores the substantive literature on personality traits.

Advances in CFA Modeling of Big Five Data

The internet has made it easier to collect and share data. Thus, there are now large datasets with Big Five data. In addition, computing power has increased exponentially, which makes it possible to analyze larger sets of items with CFA.

Beck, Condon, and Jackson published a preprint that examined the structure of personality with a network model and made their data openly available ( ). The dataset contains responses to 100 Big Five items (the IPIP-100) from a total of 369,151 participants who voluntarily provided data on an online website.

As the questionnaire was administered in English, I focused on English-speaking countries for my analyses. I used the Canadian sample for exploratory analyses. This way, the much larger US sample and the samples from Great Britain and Australia can be used for cross-validation.

Out of the 100 items, 78 were retained for the final model. Two items were excluded because they had low coverage (that is, they were infrequently administered together with other items). Another 20 items were excluded because they had low loadings on the primary factor. The remaining 78 items were fitted to a model with the Big Five factors, an acquiescence factor, and an evaluative bias factor. Secondary loadings and correlated residuals were added by exploring modification indices. Loadings on the halo factor were initially fixed at 1. However, loadings for some items were freed if modification indices suggested that this would improve fit. This made it possible to identify items with high or low evaluative bias.

The precise specification of the model and the full results can be found in the OSF project MPLUS input file ( ). The model had excellent fit using the Root Mean Square Error of Approximation as a criterion, RMSEA = .017, 90%CI[.017,.017]. The Comparative Fit Index (CFI) was acceptable, CFI = .940, considering the use of single-item indicators (Anusic et al., 2009). Table 1 shows the factor loadings on the Big Five factors and the two method factors for individual items and for the Big Five scales.

Item | no. | loadings
easily disturbed | 3 | 0.44 | -0.25
not easily bothered | 10 | -0.58 | -0.12 | -0.11 | 0.25
relaxed most of the time | 17 | -0.61 | 0.19 | -0.17 | 0.27
change my mood a lot | 25 | 0.55 | -0.15 | -0.24
feel easily threatened | 37 | 0.50 | -0.25
get angry easily | 41 | 0.50 | -0.13
get caught up in my problems | 42 | 0.56 | 0.13
get irritated easily | 44 | 0.53 | -0.13
get overwhelmed by emotions | 45 | 0.62 | 0.30
stress out easily | 46 | 0.69 | 0.11
frequent mood swings | 56 | 0.59 | -0.10
often feel blue | 77 | 0.54 | -0.27 | -0.12
panic easily | 80 | 0.56 | 0.14
rarely get irritated | 82 | -0.52
seldom feel blue | 83 | -0.41 | 0.12
take offense easily | 91 | 0.53
worry about things | 100 | 0.57 | 0.21 | 0.09
hard to get to know | 7 | -0.45 | -0.23 | 0.13
quiet around strangers | 16 | -0.65 | -0.24 | 0.14
skilled handling social situations | 18 | 0.65 | 0.13 | 0.39 | 0.15
am life of the party | 19 | 0.64 | 0.16 | 0.14
don’t like drawing attention to self | 30 | -0.54 | 0.13 | -0.14 | 0.15
don’t mind being center of attention | 31 | 0.56 | 0.23 | 0.13
don’t talk a lot | 32 | -0.68 | 0.23 | 0.13
feel at ease with people | 33 | -0.20 | 0.64 | 0.16 | 0.35 | 0.16
feel comfortable around others | 34 | -0.23 | 0.65 | 0.15 | 0.27 | 0.16
find it difficult to approach others | 38 | -0.60 | -0.40 | 0.16
have little to say | 57 | -0.14 | -0.52 | -0.25 | 0.14
keep in the background | 60 | -0.69 | -0.25 | 0.15
know how to captivate people | 61 | 0.49 | 0.29 | 0.28 | 0.16
make friends easily | 73 | -0.10 | 0.66 | 0.14 | 0.25 | 0.15
feel uncomfortable around others | 78 | 0.22 | -0.64 | -0.24 | 0.14
start conversations | 88 | 0.70 | 0.12 | 0.27 | 0.16
talk to different people at parties | 93 | 0.72 | 0.22 | 0.13
full of ideas | 5 | 0.65 | 0.32 | 0.19
not interested in abstract ideas | 11 | -0.46 | -0.27 | 0.16
do not have good imagination | 27 | -0.45 | -0.19 | 0.16
have rich vocabulary | 50 | 0.52 | 0.11 | 0.18
have a vivid imagination | 52 | 0.41 | -
have difficulty imagining things | 53 | -0.48 | -0.31 | 0.18
difficulty understanding abstract ideas | 54 | 0.11 | -0.48 | -0.28 | 0.16
have excellent ideas | 55 | 0.53 | -0.09 | 0.37 | 0.22
love to read challenging materials | 70 | -0.18 | 0.40 | 0.23 | 0.14
love to think up new ways | 71 | 0.51 | 0.30 | 0.18
indifferent to feelings of others | 8 | -0.58 | -0.27 | 0.16
not interested in others’ problems | 12 | -0.58 | -0.26 | 0.15
feel little concern for others | 35 | -0.58 | -0.27 | 0.18
feel others’ emotions | 36 | 0.60 | 0.22 | 0.17
have a good word for everybody | 49 | 0.59 | 0.10 | 0.17
have a soft heart | 51 | 0.42 | 0.29 | 0.17
inquire about others’ well-being | 58 | 0.62 | 0.32 | 0.19
insult people | 59 | 0.19 | 0.12 | -0.32 | -0.18 | -0.25 | 0.15
know how to comfort others | 62 | 0.26 | 0.48 | 0.28 | 0.17
love to help others | 69 | 0.14 | 0.64 | 0.33 | 0.19
sympathize with others’ feelings | 89 | 0.74 | 0.30 | 0.18
take time out for others | 92 | 0.53 | 0.32 | 0.19
think of others first | 94 | 0.61 | 0.29 | 0.17
always prepared | 2 | 0.62 | 0.28 | 0.17
exacting in my work | 4 | -0.09 | 0.38 | 0.29 | 0.17
continue until everything is perfect | 26 | 0.14 | 0.49 | 0.13 | 0.16
do things according to a plan | 28 | 0.65 | -0.45 | 0.17
do things in a half-way manner | 29 | -0.49 | -0.40 | 0.16
find it difficult to get down to work | 39 | 0.09 | -0.48 | -0.40 | 0.14
follow a schedule | 40 | 0.65 | 0.07 | 0.14
get chores done right away | 43 | 0.54 | 0.24 | 0.14
leave a mess in my room | 63 | -0.49 | -0.21 | 0.12
leave my belongings around | 64 | -0.50 | -0.08 | 0.13
like order | 65 | 0.64 | -0.07 | 0.16
like to tidy up | 66 | 0.19 | 0.52 | 0.12 | 0.14
love order and regularity | 68 | 0.15 | 0.68 | -0.19 | 0.15
make a mess of things | 72 | 0.21 | -0.50 | -0.26 | 0.15
make plans and stick to them | 75 | 0.52 | 0.28 | 0.17
neglect my duties | 76 | -0.55 | -0.45 | 0.16
forget to put things back | 79 | -0.52 | -0.22 | 0.13
shirk my duties | 85 | -0.45 | -0.40 | 0.16
waste my time | 98 | -0.49 | -0.46 | 0.14

The results show that the selected items have their highest loading on the intended factor and that all but two loadings exceed |.4|. Secondary loadings are always lower than the primary loadings, and most secondary loadings are below |.2|. Consistent with previous studies, loadings on the acquiescence factor are weak. As acquiescence bias is reduced by including reverse-scored items, the effect of acquiescence on scales is trivial. However, halo bias accumulates: about 15% of the variance in scales is evaluative bias. Secondary loadings produce only negligible correlations between scales. Thus, scales are a mixture of the intended construct and evaluative bias.
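The asymmetry between the two biases can be sketched analytically (the loadings t, h, and a below are hypothetical): after reverse-scoring, an item’s trait and halo loadings point in the same direction for every item, whereas the acquiescence loading keeps the sign of the original keying. In a balanced scale the acquiescence loadings therefore cancel while the halo loadings add up.

```python
# Scale-level loadings after reverse-scoring (hypothetical item loadings).
# Each item loads k*t on its trait and k*h on halo (keyed with the item),
# and +a on acquiescence regardless of keying; k = +1/-1 is the key.
def scale_loadings(keys, t=0.6, h=0.3, a=0.15):
    trait = round(t * len(keys), 2)  # reverse-scoring aligns trait loadings
    halo = round(h * len(keys), 2)   # halo is keyed with the item: adds up
    acq = round(a * sum(keys), 2)    # acquiescence keeps its sign: cancels
    return trait, halo, acq

print(scale_loadings([+1, +1, +1, -1, -1, -1]))  # balanced: (3.6, 1.8, 0.0)
print(scale_loadings([+1] * 6))                  # unbalanced: (3.6, 1.8, 0.9)
```

This is why balanced keying neutralizes acquiescence but offers no protection against halo bias, which can only be separated out with a method factor or informant data.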

These results show that it is possible to fit a CFA model to a large set of Big Five items and to recover the intended structure. The results also show that sum scores can be used as reasonable proxies of the latent constructs. The main caveat is that scales are contaminated with evaluative bias.

Creating Short Scales

Even on a powerful computer, a model with 78 items takes a long time to converge. Thus, it is not very useful for further analyses such as cross-validation across samples or exploration of age and gender differences. Moreover, it is unnecessary to measure a latent variable with 18 items, as even 3 or 4 indicators are sufficient to identify a latent construct. Thus, I created short scales with high-loading items and an equal balance of positive and negative items. The goal was to have six items per scale, but for two scales only two negative items were available, so the total number of items was five.
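The claim that three or four indicators suffice follows from a simple counting argument: a one-factor model with k items has k(k+1)/2 observed variances and covariances, and (with the factor variance fixed at 1 for scaling) 2k free parameters. A sketch:

```python
# Degrees of freedom of a single-factor CFA with k items:
# k(k+1)/2 observed (co)variances minus 2k free parameters
# (k loadings + k residual variances; factor variance fixed at 1).
def one_factor_df(k):
    return k * (k + 1) // 2 - 2 * k

print([one_factor_df(k) for k in (2, 3, 4, 5, 6)])  # → [-1, 0, 2, 5, 9]
```

With two items the model is under-identified (negative df), with three it is just identified, and from four items on there are testable restrictions, so little is lost by shortening the scales.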

The results are presented in Table 2. Interestingly, even the results for scales (sum scores) are very similar, suggesting that administering only 28 items provides much of the same information as the full set of 78 items.

Item | no. | loadings
easily disturbed | 3 | 0.46 | -0.21 | 0.14
not easily bothered | 10 | -0.66 | 0.20 | 0.13
relaxed most of the time | 17 | -0.63 | 0.22 | 0.15
stress out easily | 46 | 0.74 | 0.11 | -0.19 | 0.13
frequent mood swings | 56 | 0.58 | -0.11 | -0.20 | 0.13
quiet around strangers | 16 | -0.69 | -0.20 | 0.15
at ease with people | 33 | -0.27 | 0.65 | 0.20 | 0.23 | 0.13
difficult to approach others | 38 | 0.16 | -0.69 | -0.19 | 0.14
keep in the background | 60 | -0.68 | -0.21 | 0.14
start conversations | 88 | 0.74 | 0.15 | 0.22 | 0.15
talk to a lot of different people | 93 | 0.72 | 0.18 | 0.12
full of ideas | 5 | 0.09 | 0.54 | 0.27 | 0.18
don’t have good imagination | 27 | -0.69 | -0.22 | 0.15
have vivid imagination | 52 | 0.11 | 0.72 | -
difficulty imagining things | 53 | -0.70 | -0.26 | 0.18
love to think up new ways | 71 | 0.41 | 0.24 | 0.16
indifferent to others’ feelings | 8 | -0.61 | -0.22 | 0.15
not interested in others’ problems | 12 | -0.60 | -0.22 | 0.15
feel little concern for others | 35 | -0.61 | -0.22 | 0.15
have a soft heart | 51 | 0.56 | 0.23 | 0.16
sympathize with others’ feelings | 89 | 0.76 | 0.25 | 0.17
think of others first | 94 | 0.56 | 0.24 | 0.16
follow a schedule | 40 | 0.45 | 0.20 | 0.13
get chores done right away | 43 | 0.64 | 0.20 | 0.13
leave a mess in my room | 63 | -0.74 | -0.17 | 0.12
leave my belongings around | 64 | -0.70 | -0.18 | 0.12
love order and regularity | 68 | 0.20 | -0.13 | -0.21 | 0.41 | 0.22 | 0.15
forget to put things back | 79 | -0.72 | -0.18 | 0.12

These results also show that Borsboom's criticism of scale scores as arbitrary sum scores overstated the problem with commonly used personality measures. While scale scores are not perfect indicators of constructs, sum scores of carefully selected items can serve as proxies of latent variables.

The main problem with sum scores is that they are contaminated with evaluative bias variance. This is not a problem if latent variable models are used, because evaluative bias can be separated from Big Five variance. To control for evaluative bias with manifest scale scores, it is necessary to regress outcomes on all Big Five traits. As evaluative bias is shared across the Big Five, it is largely removed from regression coefficients that reflect only the unique contribution of each Big Five trait.
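The logic of this regression approach can be checked with a small simulation. In the sketch below (all coefficients are hypothetical, not estimates from real data), five scale scores each mix one trait with a shared halo component, and the outcome depends on one trait plus halo:

```python
import numpy as np

# Illustrative setup: five uncorrelated traits, one shared halo factor.
rng = np.random.default_rng(0)
n = 100_000
traits = rng.normal(size=(n, 5))
halo = rng.normal(size=n)
scales = traits + 0.4 * halo[:, None]      # each scale = trait + shared halo
outcome = 0.5 * traits[:, 4] + 0.5 * halo + rng.normal(size=n)

# Zero-order slope for the target scale is inflated by halo.
b_simple = np.cov(scales[:, 4], outcome)[0, 1] / scales[:, 4].var()

# Regressing on all five scales partials out most of the shared halo.
X = np.column_stack([np.ones(n), scales])
b_multi = np.linalg.lstsq(X, outcome, rcond=None)[0]
print(round(b_simple, 2), np.round(b_multi[1:], 2))
```

With a true trait effect of 0.5, the zero-order slope comes out around .60, while the multiple-regression coefficient is about .57, and the spurious slopes of the other four scales shrink accordingly. The removal of halo is not perfect, because each scale still contains some halo, but the unique coefficients are far less contaminated.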

In conclusion, the results show that McCrae et al. were wrong to dismiss CFA as a method for personality psychologists and Borsboom et al. were wrong to claim that traits do not exist. CFA is ideally suited to create measurement models of personality traits and to validate personality scales. Ideally, these measurement models would use multiple methods such as personality ratings by multiple raters as well as measures of actual behaviors as indicators of personality traits.

Borsboom Comes to His Senses

In 2017, Borsboom and colleagues published another article on the Big Five (Epskamp, Rhemtulla, & Borsboom, 2017). The main focus of the article is to introduce a psychometric model that combines CFA, to model the influence of unobserved common causes, with network models that describe direct relationships between indicators. The model is illustrated with the Big Five.

As the figure shows, the model is essentially a CFA model (on the left) combined with a model of correlated residuals that is presented as a network (on the right). The authors note that the CFA model alone does not fit the data and that adding the residual network improves model fit. However, the authors do not explore alternative models with secondary loadings, and their model ignores the presence of method factors such as the acquiescence and evaluative bias factors identified in the model above (Anusic et al., 2009). Even though the authors consider their model a new psychometric model, adding residual covariances to a structural equation model is not really novel. Most importantly, the article shows a reversal in Borsboom's thinking. Rather than claiming that latent factors are illusory, he now seems to acknowledge that personality items are best modeled with latent factors.

It is interesting to contrast the 2017 article with the 2012 article. In 2012, Borsboom critiqued personality psychologists for "tweaking the model 'on the basis of the data' so that the basic latent variable hypothesis is preserved (e.g. by allowing cross-loadings, exploratory factor analysis with procrustes rotation; see also Borsboom, 2006 for an elaborate critique)." However, in 2017, Borsboom added a purely data-driven network of residual correlations to produce model fit. This shows a major shift in Borsboom's thinking about personality measurement.

I think the 2017 article is a move in the right direction. All we need to do is to add method factors and secondary loadings to the model and Borsboom’s model in 2017 converges with my measurement model of the Big Five.

Where are we now?

A decade has passed since Borsboom marshaled his attack, but most of the problems that triggered his article remain. In part, Borsboom is to blame for this lack of progress because his attack was directed at the existence of traits rather than at bad practices in measuring them.

The most pressing concern remains the deep-rooted tradition in psychology of working with operational definitions of constructs. That is, constructs are mere labels or vague statements that refer to a particular measure. Subjective well-being is the sum score on the Satisfaction with Life Scale; self-esteem is the sum score of Rosenberg's 10 self-esteem items; and implicit bias is the difference score between reaction times on the Implicit Association Test. At best, it is recognized that observed scores are unreliable, but the actual correspondence between constructs and measures is never challenged.

This is what distinguishes psychology from natural sciences. Even though valid measurement is a fundamental requirement for empirical data to be useful, psychologists pay little attention to the validity of their measures. Implicitly, psychological research is conducted as if psychological measures are as valid as measures of height, weight, or temperature. However, in reality psychological measures have much lower validity. To make progress as a science, psychologists need to pay more attention to the validity of their measures. Borsboom (2006) was right about the obstacles in the way towards this goal.

Operationalism Rules

There is no training in theory construction or in formalizing measurement models. Measures are considered valid if they have face validity and reliability. Many psychology programs have no psychometricians and offer no courses in psychological measurement. Measures are used because others used them before and because somebody said at some point that the measure had been validated. This has to stop. Every valid measure requires a measurement theory, and the assumptions made by that theory need to be tested. The measurement theory has to be specified as a formal model that can be tested; no measure should be considered valid without one. Most important, it is not sufficient to rely on mono-method data because mono-method data always contain method-specific variance. A multi-method approach is needed to separate construct variance from systematic method variance (Campbell & Fiske, 1959).

Classical Test Theory

Classical test theory may be sufficient for very specific applications such as performance or knowledge in a well-defined domain (e.g., multiplication, knowing facts about Canada). However, classical test theory does not work for abstract concepts like valuing power, achievement motivation, prejudice, or well-being. Students need to learn about latent variable models that can relate theoretical constructs to observed measures.
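A small simulation illustrates why high reliability under classical test theory does not guarantee validity. In the sketch below (all coefficients hypothetical), ten items each mix trait variance with halo bias and noise; Cronbach's alpha is excellent even though the sum score correlates substantially with the bias factor:

```python
import numpy as np

# Illustrative data: item = trait + halo bias + unique noise.
rng = np.random.default_rng(1)
n, n_items = 50_000, 10
trait = rng.normal(size=n)
halo = rng.normal(size=n)
items = (0.6 * trait[:, None] + 0.4 * halo[:, None]
         + 0.7 * rng.normal(size=(n, n_items)))

def cronbach_alpha(items):
    """Classical test theory reliability estimate for a sum score."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                          / items.sum(axis=1).var(ddof=1))

sum_score = items.sum(axis=1)
alpha = cronbach_alpha(items)                    # ~.91: looks "reliable"
validity = np.corrcoef(sum_score, trait)[0, 1]   # ~.80: true-trait validity
halo_r = np.corrcoef(sum_score, halo)[0, 1]      # ~.53: bias contamination
print(round(alpha, 2), round(validity, 2), round(halo_r, 2))
```

The sum score is highly reliable in the classical sense, yet a large share of its reliable variance is bias rather than the intended construct, which is exactly the distinction a latent variable model can make and a reliability coefficient cannot.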

The Catch-All of Construct Validity

Borsboom correctly observes that psychologists lack a clear understanding of validation research.

“Construct validity functions as a black hole from which nothing can escape: Once a question gets labeled as a problem of construct validity, its difficulty is considered superhuman and its solution beyond a mortal’s ken.”

However, he doesn't provide a solution to the problem and instead blames the inventors of construct validity for it. I vehemently disagree. Cronbach and Meehl (1955) did not only coin the term construct validity; they also outlined a clear program of research that is required to validate measures and to probe the meaning of constructs. The problem is that psychologists never followed their recommendations. To improve psychological science, psychologists must learn to create formal measurement models (nomological networks) and test them with empirical data. No matter how poor and simplistic these models are in the beginning, they are needed to initiate the process of construct validation. As data become available, measures and constructs need to be revised to accommodate new evidence. In this sense, construct validation is a never-ending process that is still ongoing even in the natural sciences (the definition of the meter was just changed); but just because the process is never-ending doesn't mean it should never be started.

Psychometrics is Risky

Borsboom correctly notes that it is not clear who should do actual psychometric work. Psychologists do not get rewarded for validation research because developing valid measures is not sexy, while showing unconscious bias with invalid measures is. Thus, measurement articles are difficult to publish in psychology journals. Actual psychometric work is also difficult to publish in methods journals, which focus on mathematical and statistical developments and do not care about applications to specific content areas. Finally, assessment journals focus on clinical populations and are not interested in measures of individual differences in normal populations. Thus, it is difficult to publish validation studies. To address this problem, it is important to demonstrate that psychology has a validation problem. Only when researchers realize that they are using measures with unknown or low validity will journals have an incentive to publish validation studies. As long as psychologists believe that any reliable sum score is valid, there is no market for validation studies.

It Shouldn’t Be Too Difficult

Psychologists would like to have the same status as the natural sciences or economics. However, students in those fields often have to learn complex techniques and math. In comparison, psychology is easy, which is part of its appeal. However, making scientific progress in psychology is a lot harder than it seems. Many students lack the training to do the hard work that would be required to move psychology forward. Structural equation modeling, for example, is not taught, and many students would not know how to develop a measurement model and how to test it. They may learn how to fit a cookie-cutter model to a dataset, but if the data do not fit the model, they would not know what to do. To make progress, training has to take measurement more seriously and prepare students to evaluate and modify measurement models.

But It’s Not in SPSS!

At least on this front, progress has been made. Many psychologists ran principal components analyses because this was the default option in SPSS. There were always other options, but users didn't know the differences and stuck with the default. Now a younger generation is trained to use R, and structural equation modeling is freely available with the lavaan package. Thus, students have access to statistical tools that were not as easily available a decade ago.

Thou Shalt Not. . .

Theoretical psychometricians are a special personality type. Many of them would like to stay as far away as possible from applications to real data, which never meet the strict assumptions of their models. This makes it difficult for them to teach applied researchers how to use psychometric models effectively. Ironically, Borsboom himself imposed the criteria of simple structure and local independence on data to argue that CFA is not appropriate for personality data. But if two items lack local independence (e.g., because they are antonyms), it doesn't imply that a measurement model is fundamentally flawed. The main advantage of structural equation modeling is that it is possible to test the assumption of local independence and to modify the model if it is violated. This is exactly what Borsboom did in the 2017 article. The idea that you shall not have correlated residuals in your models is unrealistic and not useful in applied settings. Thus, we need real psychometricians who are interested in the application of psychometric models to actual data, who care deeply about the substantive area, and who want to apply models to messy data to improve psychological measurement. Armchair criticism from meta-psychometricians is not going to move psychology forward.
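A quick simulation shows how a violation of local independence is detectable in data (loadings and effect sizes below are hypothetical): two "antonym" items sharing extra unique variance correlate well above what the common factor implies.

```python
import numpy as np

# Illustrative setup: five indicators of one factor; items 0 and 1 are
# antonyms whose unique parts share extra variance (a local independence
# violation).
rng = np.random.default_rng(3)
n, lam = 100_000, 0.7
factor = rng.normal(size=n)
unique = rng.normal(size=(n, 5))
pair_residual = rng.normal(size=n)       # shared by items 0 and 1 only
unique[:, 0] += 0.6 * pair_residual
unique[:, 1] += 0.6 * pair_residual
items = lam * factor[:, None] + unique

r = np.corrcoef(items, rowvar=False)
# The antonym pair correlates well above the factor-implied level; an SEM
# would flag this misfit and model it as a correlated residual.
print(round(r[0, 1], 2), round(r[0, 2], 2))
```

The excess correlation between the antonym pair is exactly the kind of local misfit that a correlated residual (or a residual network, as in the 2017 article) absorbs without abandoning the latent factor.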

Sample Size Issues

Even in 2006, sample size was an issue for the use of psychometric models. However, sample sizes have increased tremendously thanks to online surveys. There are now dozens of Big Five datasets with thousands of respondents. The IAT has been administered to millions of volunteers. Thus, sample size is no longer an issue and it is possible to fit complex measurement models to real data.

Substantive Factors

Borsboom again picks personality psychology to make his point.

“For instance, personality traits are usually taken to be continuously structured and conceived of as reflective latent variables (even though the techniques used do not sit well with this interpretation). The point, however, is that there is nothing in personality theory that motivates such a choice, and the same holds for the majority of the subdisciplines in psychology.”

This quote illustrates the problem of meta-psychometricians. They are not experts in a substantive area and are often unaware of substantive facts that may motivate a specific measurement model. Borsboom seems to be unaware that psychologists have tried to find personality types, but dimensional models won because it was impossible to identify clearly defined types. Moreover, people have no problem rating their personality along quantitative scales and indicating that they are slightly or strongly interested in art or that they worry sometimes or often. Not to mention that personality traits show evidence of heritability, and we would expect an approximately normal distribution for traits that are influenced by multiple randomly combined genes (as for height).
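The polygenic argument can be illustrated with a toy simulation (hypothetical genes and effect sizes): summing many small independent genetic effects produces an approximately normal trait distribution, which is one substantive reason to model such traits as continuous dimensions.

```python
import numpy as np

# Toy polygenic model (illustrative, not real genetic data): each of many
# loci contributes a small additive effect with 50% probability. By the
# central limit theorem, the sum is approximately normally distributed.
rng = np.random.default_rng(2)
n_people, n_genes = 100_000, 200
genotypes = rng.integers(0, 2, size=(n_people, n_genes))  # 0/1 alleles
effects = rng.uniform(0.0, 1.0, size=n_genes)             # small effects
trait = genotypes @ effects

# Standardized skewness and excess kurtosis should both be near zero.
z = (trait - trait.mean()) / trait.std()
skew = (z ** 3).mean()
excess_kurtosis = (z ** 4).mean() - 3
print(round(skew, 2), round(excess_kurtosis, 2))
```

No single "type-generating" gene is needed: the continuous, bell-shaped distribution emerges from the aggregation itself.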

Thus, to make progress, we need psychologists who have both substantive knowledge and statistical knowledge in order to develop and improve measurement models of personality and other constructs. What we do not need are meta-psychologists without substantive knowledge who comment on substantive issues.

Read and Publish Widely

Borsboom also gives some good advice for psychometricians.

The founding fathers of the Psychometric Society—scholars such as Thurstone, Thorndike, Guilford, and Kelley—were substantive psychologists as much as they were psychometricians. Contemporary psychometricians do not always display a comparable interest with respect to the substantive field that lends them their credibility. It is perhaps worthwhile to emphasize that, even though psychometrics has benefited greatly from the input of mathematicians, psychometrics is not a pure mathematical discipline but an applied one. If one strips the application from an applied science one is not left with very much that is interesting; and psychometrics without the “psycho” is not, in my view, an overly exciting discipline. It is therefore essential that a psychometrician keeps up to date with the developments in one or more subdisciplines of psychology.

I couldn't agree more, and I invite Denny to learn more about personality psychology if he wants to contribute to the measurement of personality. The 2017 paper is a step in the right direction, but finding the Big Five in a questionnaire that was developed to measure the Big Five is only a first step. Developing a measurement model of personality and assessing validity with multi-method data is a task worth attacking in the next decade.

Well-Being Science

Happiness has become a big topic in the social sciences. Many universities offer happiness courses that teach students how to become happier, yet many of the exercises taught in these courses are not based on evidence of effectiveness. I am teaching a different course: an introduction to the science of well-being. The aim of this course is to provide an overview of the empirical research on well-being.

Textbooks that cover well-being science are often written by professional textbook writers who are not experts on the topic, and they are often pretty bad. A better alternative is the free textbook published by well-being expert Ed Diener on his free textbook site, Noba publishing (link).

For my students, I wrote my own textbook. It is still a work in progress, but given the costly alternatives, I decided to make it public. I am always looking for ways to improve and correct it, so feel free to provide comments in the comment section or by email.

Wellbeing Science: In Search of the Good Life (Ulrich Schimmack)