A Psychometric Study of the NEO-PI-R

Galileo had the clever idea to turn a microscope into a telescope and to point it towards the night sky. His first discovery was that Jupiter had four massive moons that are now known as the Galilean moons (Space.com).

Now imagine what would have happened if Galileo had an a priori theory that Jupiter has five moons and after looking through the telescope, Galileo decided that the telescope was faulty because he could see only four moons. Surely, there must be five moons and if the telescope doesn’t show them, it is a problem of the telescope. Astronomers made progress because they created credible methods and let empirical data drive their theories. Eventually even better telescopes discovered many more, smaller moons orbiting around Jupiter. This is scientific progress.

Alas, psychologists don’t follow in the footsteps of the natural sciences. They mainly use the scientific method to provide evidence that confirms their theories and dismiss or hide evidence that disconfirms their theories. They also show little appreciation for methodological improvements and often use methods that are outdated. As a result, psychology has made little progress in developing theories that rest on solid empirical foundations.

An example of this ill-fated approach to science is McCrae et al.’s (1996) attempt to confirm their five-factor model with structural equation modeling (SEM). When they failed to find a fitting model, they decided that SEM is not an appropriate method to study personality traits because SEM didn’t confirm their theory. One might think that other personality psychologists realized this mistake. However, other personality psychologists were also motivated to find evidence for the Big Five. Personality psychologists had just recovered from an attack by social psychologists who claimed that personality traits do not even exist, and they were all too happy to rally around the Big Five as a unifying foundation for personality research. Early warnings were ignored (Block, 1995). As a result, the Big Five have become the dominant model of personality without the theory being subjected to rigorous tests, and evidence that theoretical models do not fit the data has even been dismissed (McCrae et al., 1996). It is time to correct this and to subject Big Five theory to a proper empirical test by means of a method that can falsify bad models.

I have demonstrated that it is possible to recover five personality factors, and two method factors, from Big Five questionnaires (Schimmack, 2019a, 2019b, 2019c). These analyses were limited by the fact that the questionnaires were designed to measure the Big Five factors. A real test of Big Five theory requires demonstrating that the Big Five factors explain the covariations among a large set of personality traits. This is what McCrae et al. (1996) tried and failed to do. Here I replicate their attempt to fit a structural equation model to the 30 personality traits (facets) in Costa and McCrae’s NEO-PI-R.

In a previous analysis I was able to fit an SEM model to the 30 facet-scales of the NEO-PI-R (Schimmack, 2019d). The results only partially supported the Big Five model. However, these results are inconclusive because facet-scales are only imperfect indicators of the 30 personality traits that the facets are intended to measure. A more appropriate way to test Big Five theory is to fit a hierarchical model to the data. The first level of the hierarchy uses items as indicators of 30 facet factors. The second level in the hierarchy tries to explain the correlations among the 30 facets with the Big Five. Only structural equation modeling is able to test hierarchical measurement models. Thus, the present analyses provide the first rigorous test of the five-factor model that underlies the use of the NEO-PI-R for personality assessment.

The complete results and the MPLUS syntax can be found on OSF (https://osf.io/23k8v/). The NEO-PI-R data are from Lew Goldberg’s Eugene-Springfield community sample. They are publicly available at the Harvard Dataverse.

Results

Items

The NEO-PI-R has 240 items. There are two reasons why I analyzed only a subset of items. First, 240 variables produce 28,680 covariances, which is too many for a latent variable model, especially with a modest sample size of 800 participants. Second, a reflective measurement model requires that all items measure the same construct. However, it is often not possible to fit a reflective measurement model to the eight items of a NEO facet. I therefore selected three core items that captured the content of a facet and that were moderately positively correlated with each other after reversing reverse-scored items. Accordingly, the results are based on 3 * 30 = 90 items. It has to be noted that the item-selection process was data-driven and needs to be cross-validated in a different dataset. I also provide information about the psychometric properties of the excluded items in an Appendix.
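The covariance count quoted above follows from the number of unique off-diagonal cells in a covariance matrix, k(k - 1)/2. A minimal Python sketch of that arithmetic (the function name is only illustrative and is not part of the Mplus syntax on OSF):

```python
def n_covariances(n_variables: int) -> int:
    """Number of unique off-diagonal elements of a covariance matrix."""
    return n_variables * (n_variables - 1) // 2


full_item_set = n_covariances(240)      # all NEO-PI-R items
selected_items = n_covariances(3 * 30)  # three core items per facet

print(full_item_set)   # 28680
print(selected_items)  # 4005
```

Reducing the item set from 240 to 90 therefore cuts the number of covariances the model has to reproduce to roughly one seventh.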

The first model did not impose a structural model on the correlations among the thirty facets. In this model, all facets were allowed to correlate freely with each other. A model with only primary factor loadings had poor fit to the data. This is not surprising because it is virtually impossible to create pure items that reflect only one trait. Thus, I added secondary loadings to the model until acceptable model fit was achieved and modification indices suggested no further secondary loadings greater than .10. This model had acceptable fit, considering the use of single-items as indicators, CFI = .924, RMSEA = .025, .035. Further improvement of fit could only be achieved by adding secondary loadings below .10, which have no practical significance. Model fit of this baseline model was used to evaluate the fit of a model with the Big Five factors as second-order factors.

To build the actual model, I started with a model with five content factors and two method factors. Item loadings on the evaluative bias factor were constrained to 1. Item loadings on the acquiescence factor were constrained to 1 or -1 depending on the scoring of the item. This model had poor fit. I then added secondary loadings. Next, I allowed for some correlations among residual variances of facet factors. Finally, I freed some loadings on the evaluative bias factor to allow for variation in desirability across items. This way, I was able to obtain a model with acceptable model fit, CFI = .926, RMSEA = .024, SRMR = .045. This model should not be interpreted as the best or final model of personality structure. Given the exploratory nature of the model, it merely serves as a baseline model for future studies of personality structure with SEM. That being said, it is also important to take effect sizes into account. Parameters with substantial loadings are likely to replicate well, especially in replication studies with similar populations.

Item Loadings

Table 1 shows the item-loadings for the six neuroticism facets. All primary loadings exceed .4, indicating that the three indicators of a facet measure a common construct. Loadings on the evaluative bias factors were surprisingly small and smaller than in other studies (Anusic et al., 2009; Schimmack, 2009a). It is not clear whether this is a property of the items or unique to this dataset. Consistent with other studies, the influence of acquiescence bias was weak (Rorer, 1965). Secondary loadings also tended to be small and showed no consistent pattern. These results show that the model identified the intended neuroticism facet-factors.

Table 2 shows the results for the six extraversion facets. All primary factor loadings exceed .40 and most are more substantial. Loadings on the evaluative bias factor tend to be below .20 for most items. Only a few items have secondary loadings greater than .2. Overall, this shows that the six extraversion facets are clearly identified in the measurement model.

Table 3 shows the results for Openness. Primary loadings are all above .4 and the six openness factors are clearly identified.

Table 4 shows the results for the agreeableness facets. In general, the results also show that the six factors represent the agreeableness facets. The exception is the Altruism facet, where only two items show substantial loadings. Other items also had low loadings on this factor (see Appendix). This raises some concerns about the validity of this factor. However, the high-loading items suggest that the factor represents variation in selfishness versus selflessness.

Table 5 shows the results for the conscientiousness facets. With one exception, all items have primary loadings greater than .4. The problematic item is the item “prudence and common sense” (#5) of the competence facet. However, none of the remaining five items were suitable (Appendix).

In conclusion, for most of the 30 facets it was possible to build a measurement model with three indicators. To achieve fit, the model included 76 out of 2,610 (3%) secondary loadings. Many of these secondary loadings were between .1 and .2, indicating that they have no substantial influence on the correlations of factors with each other.
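The share of freed secondary loadings can be checked with a few lines of Python. The sketch assumes that the 2,610 possible secondary loadings arise as 90 items times the 29 facets an item does not primarily belong to; the source gives the total but not this derivation.

```python
n_items = 90
n_facets = 30

# Assumption: each item could load on every facet except its own.
possible_secondary = n_items * (n_facets - 1)
freed_secondary = 76

share = freed_secondary / possible_secondary
print(possible_secondary)      # 2610
print(round(100 * share, 1))   # 2.9
```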

Facet Loadings on Big Five Factors

Table 6 shows the loadings of the 30 facets on the Big Five factors. Broadly speaking, the results provide support for the Big Five factors. 24 of the 30 facets (80%) have a loading greater than .4 on the predicted Big Five factor, and 22 of the 30 facets (73%) have the highest loading on the predicted Big Five factor. Many of the secondary loadings are small (< .3). Moreover, secondary loadings are not inconsistent with Big Five theory as facet factors can be related to more than one Big Five factor. For example, assertiveness has been related to extraversion and (low) agreeableness. However, some findings are inconsistent with McCrae et al.’s (1996) five-factor model. Some facets do not have the highest loading on the intended factor. Anger-hostility is more strongly related to low agreeableness than to neuroticism (-.50 vs. .42). Assertiveness is also more strongly related to low agreeableness than to extraversion (-.50 vs. .43). Activity is nearly equally related to extraversion and low agreeableness (-.43). Fantasy is more strongly related to low conscientiousness than to openness (-.58 vs. .40). Openness to feelings is more strongly related to neuroticism (.38) and extraversion (.54) than to openness (.23). Finally, trust is more strongly related to extraversion (.34) than to agreeableness (.28). Another problem is that some of the primary loadings are weak. The biggest problem is that excitement seeking is independent of extraversion (-.01). However, even the loadings for impulsivity (.30), vulnerability (.35), openness to feelings (.23), openness to actions (.31), and trust (.28) are low and imply that most of the variance in these facet factors is not explained by the primary Big Five factor.

The present results have important implications for theories of the Big Five, which differ in the interpretation of the Big Five factors. For example, there is some debate about the nature of extraversion. To make progress in this research area it is necessary to have a clear and replicable pattern of factor loadings. Given the present results, extraversion seems to be strongly related to experiences of positive emotions (cheerfulness), while the relationship with goal-driven or reward-driven behavior (action, assertiveness, excitement seeking) is weaker. This would suggest that extraversion is tied to individual differences in positive affect or energetic arousal (Watson et al., 1988). As factor loadings can be biased by measurement error, much more research with proper measurement models is needed to advance personality theory. The main contribution of this work is to show that it is possible to use SEM for this purpose.

The last column in Table 6 shows the amount of residual (unexplained) variance in the 30 facets. The average residual variance is 58%. This finding shows that the Big Five are an abstract level of describing personality, but many important differences between individuals are not captured by the Big Five. For example, measurement of the Big Five captures very little of the personality differences in Excitement Seeking or Impulsivity. Personality psychologists should therefore reconsider how they measure personality with few items. Rather than measuring only five dimensions with high reliability, it may be more important to cover a broad range of personality traits at the expense of reliability. This approach is especially recommended for studies with large samples where reliability is less of an issue.
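To make the link between loadings and residual variance concrete, here is a small Python sketch. It assumes standardized loadings on independent factors, so that explained variance is the sum of squared loadings. It uses only the three loadings for openness to feelings reported in the text above; any further loadings in Table 6 are ignored, so the result is an illustration and not the value in the table.

```python
def residual_variance(loadings):
    """Variance of a standardized facet left unexplained by independent factors."""
    explained = sum(loading ** 2 for loading in loadings)
    return 1.0 - explained


# Openness to feelings: neuroticism .38, extraversion .54, openness .23
print(round(residual_variance([0.38, 0.54, 0.23]), 2))  # 0.51

# Excitement seeking: primary loading on extraversion of -.01
print(round(residual_variance([-0.01]), 4))  # 0.9999
```

Even a facet with three non-trivial loadings keeps about half of its variance to itself, and a primary loading near zero leaves essentially all of it unexplained by that factor.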

Residual Facet Correlations

Traditional factor analysis can produce misleading results because the model does not allow for correlated residuals. When such residual correlations are present, they will distort the pattern of factor loadings; that is, two facets with a residual correlation will show higher factor loadings. The factor loadings in Table 6 do not have this problem because the model allowed for residual correlations. However, allowing for residual correlations can also be a problem because freeing different parameters can also affect the factor loadings. It is therefore crucial to examine the nature of residual correlations and to explore the robustness of factor loadings across different models. The present results are based on a model that appeared to be the best model in my explorations. These results should not be treated as a final answer to a difficult problem. Rather, they should encourage further exploration with the same and other datasets.
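The distortion described here can be sketched with two standardized facets that load on the same factor. Their model-implied correlation is the product of the loadings plus any residual covariance. All numbers below are hypothetical, and the sketch deliberately ignores the other facets that would also constrain the loadings in a full model.

```python
import math


def implied_correlation(loading_a, loading_b, residual_cov=0.0):
    """Model-implied correlation of two standardized facets on one factor."""
    return loading_a * loading_b + residual_cov


# Hypothetical generating model: loadings of .50 plus a residual covariance of .35
observed = implied_correlation(0.50, 0.50, residual_cov=0.35)

# If the residual covariance is fixed to zero, equal loadings must rise to
# sqrt(r) to reproduce the same correlation from the factor alone.
inflated_loading = math.sqrt(observed)

print(round(observed, 2))          # 0.6
print(round(inflated_loading, 2))  # 0.77
```

In this toy case, ignoring the residual covariance pushes the loadings from .50 to about .77, which is the kind of inflation the paragraph above warns about.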

Table 7 shows the residual correlations. The correlations among facets assigned to the same Big Five factor are listed first. These correlations have the strongest influence on the factor loading pattern. For example, there is a strong correlation between the warmth and gregariousness facets. Removing this correlation would increase the loadings of these two facets on the extraversion factor. In the present model, this would also produce lower fit, but in other models this might not be the case. Thus, it is unclear how central these two facets are to extraversion. The same is also true for anxiety and self-consciousness. However, here removing the residual correlation would further increase the loading of anxiety, which is already the highest loading facet. This justifies the use of anxiety as the most commonly used indicator of neuroticism.

Table 7. Residual Factor Correlations

It is also interesting to explore the substantive implications of these residual correlations. For example, warmth and gregariousness are both negatively related to self-consciousness. This suggests another factor that influences behavior in social situations (shyness/social anxiety). Thus, social anxiety would be not just high neuroticism and low extraversion, but a distinct trait that cannot be reduced to the Big Five.

Other relationships also make sense. Modesty is negatively related to competence beliefs, excitement seeking is negatively related to compliance, and positive emotions is positively related to openness to feelings (on top of the relationship between extraversion and openness to feelings).

Future research needs to replicate these relationships, but this is only possible with latent variable models. In comparison, network models rely on item levels and confound measurement error with substantial correlations, whereas exploratory factor analysis does not allow for correlated residuals (Schimmack & Grere, 2010).

Conclusion

Personality psychology has a proud tradition of psychometric research. The invention and application of exploratory factor analysis led to the discovery of the Big Five. However, since the 1990s, research on the structure of personality has been stagnating. Several attempts to use SEM (confirmatory factor analysis) in the 1990s failed and led to the impression that SEM is not a suitable method for personality psychologists. Even worse, some researchers even concluded that the Big Five do not exist and that factor analysis of personality items is fundamentally flawed (Borsboom, 2006). As a result, personality psychologists receive no systematic training in the most suitable statistical tool for the analysis of personality and for the testing of measurement models. At present, personality psychologists are like astronomers who have telescopes, but don’t point them to the stars. Imagine what discoveries can be made by those who dare to point SEM at personality data. I hope this post encourages young researchers to try. They have the advantage of unbelievable computational power, free software (lavaan), and open data. As they say, better late than never.

Appendix

Running the model with additional items is time consuming even on my powerful computer. I will add these results when they are ready.

What lurks beneath the Big Five?

Any mature science classifies the objects that it studies. Chemists classify atoms. Biologists classify organisms. It is therefore not surprising that personality psychologists have spent a lot of their effort on classifying personality traits; that is, psychological attributes that distinguish individuals from each other.

[It is more surprising that social psychologists have spent very little effort on classifying situations; a task that is now being carried out by personality psychologists (Rauthmann et al., 2014)]

After decades of analyzing correlations among self-ratings of personality items, personality psychologists came to a consensus that five broad factors can be reliably identified. Since the 1980s, the so-called Big Five have dominated theories and measurement of personality. However, most theories of personality also recognize that the Big Five are not a comprehensive description of personality. That is, unlike colors that can be produced by mixing three basic colors, specific personality traits are not just a mixture of the Big Five. Rather, the Big Five represent an abstract level in a hierarchy of personality traits. It is possible to compare the Big Five to the distinction of five classes of vertebrate animals: mammals, birds, reptiles, fish, and amphibians. Although there are important distinctions between these groups, there are also important distinctions among the animals within each class; cats are not dogs.

Although the Big Five are a major achievement in personality psychology, they also have some drawbacks. As early as 1995, personality psychologists warned that focusing on the Big Five would be a mistake because the Big Five are too broad to be good predictors of important life outcomes (Block, 1995). However, this criticism has been ignored and many researchers seem to assume that they measure personality when they administer a Big Five questionnaire. Relying on the Big Five in this way would be warranted only if the Big Five captured most of the meaningful variation in personality. In this blog post, I use open data to test this implicit assumption that is prevalent in contemporary personality science.

Confirmatory Factor Analysis

In 1996, McCrae et al. published an article that may have contributed to the stagnation in research on the structure of personality. In this article, the authors argued that structural equation modeling (SEM), specifically confirmatory factor analysis (CFA), is not suitable for personality researchers. However, CFA is the only method that can be used to test structural theories and to falsify structural theories that are wrong. Even worse, McCrae et al. (1996) demonstrated that a simple-structure model did not fit their data. However, rather than concluding that personality structure is not simple, they concluded that CFA is the wrong method to study personality traits. The problem with this line of reasoning is self-evident and was harshly criticized by Borsboom (2006). If we dismiss methods because they do not show a theoretically predicted pattern, we lose the ability to test theories empirically.

To understand McCrae et al.’s (1996) reaction to CFA, it is necessary to understand the development of CFA and how it was used in psychology. In theory, CFA is a very flexible method that can fit any dataset. The main empirical challenge is to find plausible models and to find data that can distinguish between competing plausible models. However, when CFA was introduced, certain restrictions were imposed on models that could be tested. The most restrictive model imposed that a measurement model should have only primary loadings and no correlated residuals. Imposing these restrictions led to the foregone conclusion that the data are inconsistent with the model. At this point, researchers were supposed to give up, create a new questionnaire with better items, retest it with CFA and find out that there were still secondary loadings that produced poor fit to the data. The idea that actual data could have a perfect structure must have been invented by an anal-retentive statistician who never analyzed real data. Thus, CFA was doomed to be useless because it could only show that data do not fit a model.

It took some time and courage to decide that the straitjacket of simple structure has to go. Rather than giving up after a simple-structure model was rejected, the finding should encourage further exploration of the data to find models that actually fit the data. Maybe biologists initially classified whales as fish, but so what. Over time, further testing suggested that they are mammals. However, if we never get started in the first place, we will never be able to develop a structure of personality traits. So, here I present a reflective measurement model of personality traits. I don’t call it CFA, because I am not confirming anything. I also don’t call it EFA because this term is used for a different statistical technique that imposes other restrictions (e.g., no correlated residuals, local independence). We might call it exploratory modeling (EM) or, because it relies on structural equation modeling, we could call it ESEM. However, the term ESEM has already been used for a blind computer-based version of CFA. Thus, the term EM seems appropriate.

The Big Five and the 30 Facets

Costa and McCrae developed a personality questionnaire that assesses personality at two levels. One level is the Big Five. The other level consists of 30 more specific personality traits.

The 30 facets are often presented as if they are members of a domain, just like dogs, cats, pigs, horses, elephants, and tigers are mammals and have nothing to do with reptiles or birds. However, this is an oversimplification. Actual empirical data show that personality structure is more complex and that specific facets can be related to more than one Big Five factor. In fact, McCrae et al. (1996) published the correlations of the 30 facets with the Big Five factors and the table shows many, and a few substantial, secondary loadings; that is, correlations with a factor other than the main domain. For example, Impulsiveness is not just positively related to neuroticism. It is also positively related to extraversion, and negatively related to conscientiousness.

Thus, McCrae et al.’s (1996) results show that Big Five data do not have a simple structure. It is therefore not clear what model a CONFIRMATORY factor analysis tries to confirm, when the CFA model imposes a simple structure. McCrae et al. (1996) agree: “If, however, small loadings are in fact meaningful, CFA with a simple structure model may not fit well” (p. 553). In other words, if an exploratory factor analysis shows a secondary loading of Anger/Hostility on Agreeableness (r = -.40), indicating that agreeable people are less likely to get angry, it makes no sense to confirm a model that sets this parameter to zero. McCrae et al. also point out that simple structure makes no theoretical sense for personality traits. “There is no theoretical reason why traits should not have meaningful loadings on three, four, or five factors:” (p. 553). The logical consequence of this insight is to fit models that allow for meaningful secondary loadings; not to dismiss modeling personality data with structural equations.

However, McCrae et al. (1996) were wrong about the correct way of modeling secondary loadings. “It is possible to make allowances for secondary loadings in CFA by fixing the loadings at a priori values other than zero” (p. 553). Of course, it is possible to fix loadings to a non-zero value, but even for primary loadings, the actual magnitude of a loading is estimated from the data. It is not clear why this approach could not be used for secondary loadings. The only thing that is impossible is to let all secondary loadings be freely estimated; there is no need to fix the loading of anger/hostility on the agreeableness factor to an a priori value to model the structure of personality.

Personality psychologists in the 1990s also seemed to not fully understand how sensitive SEM is to deviations between model parameters and actual data. McCrae et al. (1996) critically discuss a model by Church and Burke (1994) because it “regarded loadings as small as ± .20 as salient secondaries” (p. 553). However, fixing a loading of .20 to a value of 0 introduces a large discrepancy that will hurt overall fit. One either has to free parameters or lower the criterion for acceptable fit. Fixing loadings greater than .10 to zero and hoping to meet standard criteria of acceptable fit is impossible. Effect sizes of r = .2 (d = .4) are not zero, and treating them as such will hurt model fit.
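The equivalence of r = .2 and d = .4 can be checked with the standard conversion between a correlation and a standardized mean difference, d = 2r / sqrt(1 - r^2), which assumes two groups of equal size. The post does not say which conversion was used, so treat this Python sketch as one plausible reading.

```python
import math


def r_to_d(r: float) -> float:
    """Convert a correlation to Cohen's d, assuming two equal-sized groups."""
    return 2 * r / math.sqrt(1 - r ** 2)


print(round(r_to_d(0.2), 2))  # 0.41
print(round(r_to_d(0.1), 2))  # 0.2
```

A loading of .20 thus corresponds to what is conventionally labeled a small-to-medium effect, which is why fixing it to zero leaves a visible mark on model fit.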

In short, exploratory studies of the relationship between the Big Five and facets show a complex pattern with many non-trivial (r > .1) secondary loadings. Any attempt to model these data with SEM needs to be able to account for this finding. As many of these secondary loadings are theoretically expected and replicable, allowing for these secondary loadings makes theoretical sense and cannot be dismissed as overfitting of data. Rather, imposing a simple structure that makes no theoretical sense should be considered underfitting of the data, which of course results in bad fit.

Correlated Residuals are not Correlated Errors

Another confusion in the use of structural equation modeling is the interpretation of residual variances. In the present context, residuals represent the variance in a facet scale that is not explained by the Big Five factors. Residuals are interesting for two reasons. First, they provide information about unique aspects of personality that are not explained by the Big Five. To use the analogy of animals, although cats and dogs are both animals, they also have distinct features. Residuals are analogous to these distinct features, and we would think that personality psychologists would be very interested in exploring this question. However, statistics textbooks tend to present residual variances as error variance in the context of measurement models where items are artifacts that were created to measure a specific construct. As the only purpose of the item is to measure a construct, any variance that does not reflect the intended construct is error variance. If we were only interested in measuring the Big Five, we would think about residual facet-variance as error variance. It does not matter how depressed people are. We only care about their neuroticism. However, the notion of a hierarchy implies that we do care about the valid variance in facets that is not explained by the Big Five. Thus, residual variance is not error variance.

The mistake of treating residual variance as error variance becomes especially problematic when residual variance in one facet is related to residual variance in another facet. For example, how angry people get (the residual variance in anger) could be related to how compliant people are (the residual variance in compliance). After all, anger could be elicited by a request to comply with some silly norms (e.g., no secondary loadings) that make no sense. There is no theoretical reason why facets could only be linked by means of the Big Five. In fact, a group of researchers has attempted to explain all relations among personality facets without the Big Five because they don’t believe in broader factors (cf. Schimmack, 2019b). However, this approach has difficulties explaining the consistent primary loadings of facets on their predicted Big Five factor.

The confusion of residuals with errors accounts at least partially for McCrae et al.’s (1996) failure to fit a measurement model to the correlations among the 30 facets.

“It would be possible to specify a correlated error term between these two scales, but the interpretation of such a term is unclear. Correlated error usually refers to a nonsubstantive source of variance. If Activity and Achievement Striving were, say, observer ratings, whereas all other variables were self-reports, it would make sense to control for this difference in method by introducing a correlated error term. But there are no obvious sources of correlated error among the NEO-PI-R facet scales in the present study” (p. 555).

The Big Five Are Independent Factors, but Evaluative Bias produces correlations among Big Five Scales

Another decision researchers have to make is whether they specify models with independent factors or whether they allow factors to be correlated. That is, are extraversion and openness independent factors or are extraversion and openness correlated? A model with correlated Big Five factors has 10 additional free parameters to fit the data. Thus, the model is likely to fit better than a model with independent factors. However, the Big Five were discovered using a method that imposed independence (EFA and Varimax rotation). Thus, allowing for correlations among the factors seems atheoretical, unless an explanation for these correlations can be found. On this front, personality researchers have made some progress by using multi-method data (self-ratings and ratings by informants). As it turns out, correlations among the Big Five are only found in ratings by a single rater, but not in correlations across raters (e.g., self-rated Extraversion and informant-rated Agreeableness). Additional research has further validated that most of this variance reflects response styles in ratings by a single rater. These biases can be modeled with two method factors. One factor is an acquiescence factor that leads to higher or lower ratings independent of item content. The other factor is an evaluative bias (halo) factor. It represents responses to the desirability of items. I have demonstrated in several datasets that it is possible to model the Big Five as independent factors and that correlations among Big Five scales are mostly due to the contamination of scale scores with evaluative bias. As a result, neuroticism scales tend to be negatively related to the other scales because neuroticism is undesirable and the other traits are desirable (see Schimmack, 2019a).

Although the presence of evaluative biases in personality ratings has been known for decades, previous attempts at modeling Big Five data with SEM often failed to specify method factors; not surprisingly, they failed to find good fit (McCrae et al., 1996). In contrast, models with method factors can have good fit (Schimmack, 2019a).
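The claim that independent factors plus a shared evaluative bias produce correlated scale scores can be illustrated with a small simulation. This is a sketch with made-up values (sample size, seed, and a halo weight of 0.4 are all hypothetical), not a reanalysis of the Eugene-Springfield data, and it uses NumPy in place of the Mplus syntax used for the actual analyses.

```python
import numpy as np

rng = np.random.default_rng(2019)
n_raters = 100_000

# Independent content factors in the order N, E, O, A, C.
traits = rng.standard_normal((n_raters, 5))

# One evaluative bias (halo) score per rater.
halo = rng.standard_normal(n_raters)

# Desirability direction of each scale: neuroticism is undesirable.
desirability = np.array([-1.0, 1.0, 1.0, 1.0, 1.0])
halo_weight = 0.4  # hypothetical strength of the bias

# Scale scores = content plus halo contamination in the desirable direction.
scales = traits + halo_weight * halo[:, None] * desirability

trait_r = np.corrcoef(traits, rowvar=False)
scale_r = np.corrcoef(scales, rowvar=False)

print(np.round(trait_r[0, 1], 2), np.round(scale_r[0, 1], 2))  # N with E
print(np.round(trait_r[1, 3], 2), np.round(scale_r[1, 3], 2))  # E with A
```

With these settings the content factors stay uncorrelated, while the scale scores correlate at about 0.16 / 1.16 ≈ .14 in the population, negatively whenever neuroticism is involved and positively otherwise. That is the pattern described above, produced without any correlation among the Big Five factors themselves.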

Other Problems in McCrae et al.’s Attempt

There are other problems with McCrae et al.’s (1996) conclusion that CFA cannot be used to test personality structure. First, the sample size was small for a rigorous study of personality structure with 30 observed variables (N = 229). Second, the evaluation of model fit was still evolving and some of the fit indices that they reported would be considered acceptable fit today. Most importantly, an exploratory Maximum Likelihood model produced reasonable fit, chi2/df = 1.57, RMS = .04, TLI = .92, CFI = .92. Their best fitting CFA model, however, did not fit the data. This merely shows a lack of effort and not the inability of fitting a CFA model to the 30 facets. In fact, McCrae et al. (1996) note “a long list of problems with the technique [SEM], ranging from technical difficulties in estimation of some models to the cost in time and effort involved.” However, no science has made progress by choosing cheap and quick methods over costly and time-consuming methods simply because researchers lack the patience to learn a more complex method. I have been working on developing measurement models of personality for over a decade (Anusic et al., 2009). I am happy to demonstrate that it is possible to fit an SEM model to the Big Five data, to separate content variance from method variance, and to examine how big the Big Five factors really are.

The Data

One new development in psychology is that data are becoming more accessible and are openly shared. Lew Goldberg has collected an amazing dataset of personality data with a sample from Oregon (the Eugene-Springfield community sample). The data are now publicly available at the Harvard Dataverse. With N = 857 participants, the dataset is nearly four times larger than the dataset used by McCrae et al. (1996), and the ratio of 857 observations to 30 variables (about 28:1) is considered good for structural equation modeling.

It is often advised to use different samples for exploration and then for cross-validation. However, I used the full sample for a mix of confirmation and exploration. The reason is that there is little doubt about the robustness of the data structure (the covariance/correlation matrix). The bigger issue is that a well-fitting model does not mean that it is the right model. Alternative models could also account for the same pattern of correlations. Cross-validation does not help with this bigger problem. The only way to address this is a systematic program of research that develops and tests different models. I see the present model as the beginning of such a line of research. Other researchers can use the same data to fit alternative models and they can use new data to test model assumptions. The goal is merely to start a new era of research on the structure of personality with structural equation modeling, which could have started 20 years ago, if McCrae et al. (1996) had been more positive about the benefits of testing models and being able to falsify them (a.k.a. doing science).

Results

I started with a simple model that had five independent personality factors (the Big Five) and an evaluative bias factor. I did not include an acquiescence factor because facets are measured with scales that include reverse scored items. As a result, acquiescence bias is negligible (Schimmack, 2019a).

In the initial model facet loadings on the evaluative bias factor were fixed at 1 or -1 depending on the direction or desirability of a facet. This model had poor fit. I then modified the model by adding secondary loadings and by freeing loadings on the evaluative bias factor to allow for variation in desirability of facets. For example, although agreeableness is desirable, the loading for the modesty facet actually turned out to be negative. I finally added some correlated residuals to the model. The model was modified until it reached criteria of acceptable fit, CFI = .951, RMSEA = .044, SRMR = .034. The syntax and the complete results can be found on OSF (https://osf.io/23k8v/).

Table 3 shows the standardized loadings of the 30 facets on the Big Five and the two method factors.

There are several notable findings that challenge prevalent conceptions of personality.

The Big Five are not so big

First, the loadings of facets on the Big Five factors are notably weaker than in McCrae et al.’s Table 4 reproduced above (Table 2). There are two reasons for this discrepancy. First, evaluative bias is often shared between facets that belong to the same factor. For example, anxiety and depression have strong negative loadings on the evaluative bias factor. This shared bias will push up the correlation between the two facets and inflate factor loadings in a model without an evaluative bias factor. Another reason can be correlated residuals. If this extra shared variance is not modeled, it pushes up loadings of these facets on the shared factor. The new and more accurate estimates in Table 3 suggest that the Big Five are not as big as the name implies. The loading of anxiety on neuroticism (r = .49) implies that only about 24% of the variance in anxiety (.49^2) is captured by the neuroticism factor. Loadings greater than .71 are needed for a Big Five factor to explain more than 50% of the variance in a facet. There are only two facets where the majority of the variance in a facet is explained by a Big Five factor (order, self-discipline).

Secondary loadings can explain additional variance in some facets. For example, for anger/hostility neuroticism explains .48^2 = 23% of the variance and agreeableness explains another (-.43)^2 = 18% of the variance for a total of 23+18 = 41% explained variance. However, even with secondary loadings many facets have substantial residual variance. This is of course predicted by a hierarchical model of personality traits with more specific factors underneath the global Big Five traits. However, it also implies that Big Five measures fail to capture substantial personality variance. It is therefore not surprising that facet measures often predict additional variance in outcomes that is not predicted by the Big Five (e.g., Schimmack, Oishi, Furr, & Funder, 2004). Personality researchers need to use facet level or other more specific measures of personality in addition to Big Five measures to capture all of the personality variance in outcomes.
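The arithmetic behind these percentages can be sketched in a few lines of Python. This is only an illustration, assuming standardized loadings on independent factors, so that explained variance is the sum of squared loadings. The function name is made up for this example.

```python
import math


def variance_explained(*loadings):
    """Share of facet variance explained by independent factors.

    Assumes standardized loadings on uncorrelated factors, so the
    explained share is the sum of the squared loadings.
    """
    return sum(loading ** 2 for loading in loadings)


# Loadings quoted in the text.
anxiety = variance_explained(0.49)         # neuroticism only
anger = variance_explained(0.48, -0.43)    # neuroticism plus agreeableness

# Loading needed for a single factor to explain half of the variance.
majority_threshold = math.sqrt(0.5)

print(f"anxiety: {anxiety:.3f}")
print(f"anger/hostility: {anger:.3f}")
print(f"loading needed for 50%: {majority_threshold:.2f}")
```

Summing the unrounded squares for anger/hostility gives about .415, so the total can differ by a percentage point depending on whether each term is rounded before or after adding.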

What are the Big Five?

Factor loadings are often used to explore the constructs underlying factors. The terms neuroticism, extraversion, or openness are mere labels for the shared variance among facets with primary loadings on a factor. There has been some discussion about the Big Five factors and their meaning is still far from clear. For example, there has been a debate about the extraversion factor. Lucas, Diener, Grob, Suh, and Shao (2000) argued that extraversion is the disposition to respond strongly to rewards. Ashton, Lee, and Paunonen disagreed and argued that social attention underlies extraversion. Empirically it would be easy to answer these questions if one facet showed a very high loading on a Big Five factor. The closer a loading is to one, the more closely the factor corresponds to that facet. However, the loading pattern does not suggest that a single facet captures the meaning of a Big Five factor. The strongest relationship is found for self-discipline and conscientiousness. Thus, good self-regulation may be the core aspect of conscientiousness that also influences achievement striving or orderliness. However, more generally the results suggest that the nature of the Big Five factors is not obvious. It requires more work to uncover the glue that ties facets belonging to a single factor together. Theories range from linguistic structures to shared neurotransmitters.

Evaluative Bias

The results for evaluative bias are novel because previous studies failed to model evaluative bias in responses to the NEO-PI-R. It would be interesting to validate the variation in loadings on the evaluative bias factor with ratings of item- or facet-desirability. However, intuitively the variation makes sense. It is more desirable to be competent (C1, r = .66) and not depressed (N3, r = -.69) than to be an excitement seeker (E5, r = .03) or compliant (A4, r = .09). The negative loading for modesty also makes sense and validates self-ratings of modesty (A5, r = -.33). Modest individuals are not supposed to exaggerate their desirable attributes and apparently they refrain from doing so also when they complete the NEO-PI-R.

Recently, McCrae (2018) acknowledged the presence of evaluative biases in NEO scores, but presented calculations that suggested the influence is relatively small. He suggested that facet-facet correlations might be inflated by .10 due to evaluative bias. However, this average is uninformative. It could imply that all facets have a loading of .33 or -.33 on the evaluative bias factor, which introduces a bias of roughly .33*.33 = .10 or .33*-.33 = -.10 in facet-facet correlations. In fact, the average absolute loading on the evaluative bias factor is .30. However, this masks the fact that some facets have no evaluative bias and others have much more evaluative bias. For example, the measure of competence beliefs (self-efficacy) C1 has a loading of .66 on the evaluative bias factor, which is higher than the loading on conscientiousness (.52). It should be noted that the NEO-PI-R is a commercial instrument and that it is in the interest of McCrae to claim that the NEO-PI-R is a valid measure for personality assessment. In contrast, I have no commercial interest in finding more or less evaluative bias in the NEO-PI-R. This may explain the different conclusions about the practical significance of evaluative bias in NEO-PI-R scores.
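To see why an average inflation of .10 hides a lot of variation, here is a small Python sketch. It assumes that the evaluative bias factor is independent of the content factors, so the correlation it adds between two facets is simply the product of their bias loadings. The loadings are the ones quoted above; the function name is made up for this illustration.

```python
def bias_induced_correlation(loading_a, loading_b):
    """Correlation between two facets produced by a shared bias factor alone.

    Assumes standardized loadings and a bias factor that is independent
    of the content factors.
    """
    return loading_a * loading_b


# Hypothetical case of uniform loadings of .33 on the bias factor.
uniform = bias_induced_correlation(0.33, 0.33)

# Facets with strong bias loadings: competence (C1) and depression (N3).
competence_depression = bias_induced_correlation(0.66, -0.69)

# Facets with almost no bias: excitement seeking (E5) and compliance (A4).
excitement_compliance = bias_induced_correlation(0.03, 0.09)

for label, value in [
    ("uniform .33 loadings", uniform),
    ("C1 with N3", competence_depression),
    ("E5 with A4", excitement_compliance),
]:
    print(f"{label}: {value:+.2f}")
```

Under these assumptions, the bias-induced correlation ranges from practically zero for some facet pairs to more than four times the .10 average for others.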

In short, the present analysis suggests that the amount of evaluative bias varies across facet scales. While the influence of evaluative bias tends to be modest for many scales, scales with highly desirable traits show rather strong influence of evaluative bias. In the future it would be interesting to use multi-method data to separate evaluative bias from content variance (Anusic et al., 2009).

Measurement of the Big Five

Structural equation modeling can be used to test substantive theories with a measurement model or to develop and evaluate measurement models. Unfortunately, personality psychologists have not taken advantage of structural equation modeling to improve personality questionnaires. The present study highlights two ways in which SEM analysis of personality ratings is beneficial. First, it is possible to model evaluative bias and to search for items with low evaluative bias. Minimizing the influence of evaluative bias increases the validity of personality scales. Second, the present results can be used to create better measures of the Big Five. Many short Big Five scales focus exclusively on a single facet. As a result, these measures do not actually capture the Big Five. To measure the Big Five efficiently, a measure requires several facets with high loadings on the Big Five factor. Three facets are sufficient to create a latent variable model that separates the facet-specific residual variance from the shared variance that reflects the Big Five. Based on the present results, the following facets seem good candidates for the measurement of the Big Five.

Neuroticism: Anxiety, Anger, and Depression. The shared variance reflects a general tendency to respond with negative emotions.

Extraversion: Warmth, Gregariousness, Positive Emotions. The shared variance reflects a mix of sociability and cheerfulness.

Openness: Aesthetics, Action, Ideas. The shared variance reflects an interest in a broad range of activities that includes arts, intellectual stimulation, as well as travel.

Agreeableness: Straightforwardness, Altruism, Compliance. The shared variance represents respecting others.

Conscientiousness: Order, Self-Discipline, Dutifulness. I do not include achievement striving because it may be less consistent across the life span. The shared variance represents following a fixed set of rules.

This is of course just a suggestion. More research is needed. What is novel is the use of reflective measurement models to examine this question. McCrae et al. (1996) and some others before them tried and failed. Here I show that it is possible and useful to fit facet correlations with a structural equation model. Thus, twenty years after McCrae et al. suggested we should not use SEM/CFA, it is time to reconsider this claim and to reject it. Most personality theories are reflective models. It is time to test these models with the proper statistical method.

When Personality Psychologists are High

One area of personality psychology aims to classify personality traits. I compare this activity to research in biology where organisms are classified into a large taxonomy.

In a hierarchical taxonomy, the higher levels are more abstract, less descriptive, but also comprise a larger group of items. For example, there are more mammals (class) than dogs (species).

In the 1980s, personality psychologists agreed on the Big Five. The Big Five represent a rather abstract level of description that combines many distinct traits into traits that are predominantly related to one of the Big Five dimensions. For example, talkative falls into the extraversion group.

To illustrate the level of abstraction, we can compare the Big Five to the levels in biology. After distinguishing vertebrate and invertebrate animals, there are five classes of vertebrate animals: mammals, fish, reptiles, birds, and amphibians. This suggests that the Big Five are a fairly high level of abstraction that covers a broad range of distinct traits within each dimension.

The Big Five were found using factor or principal component analysis (PCA). PCA is a mathematical method that reduces the covariances among personality ratings to a smaller number of factors. The goal of PCA is to capture as much of the variance as possible with the smallest number of components. Evidently there is a trade-off. However, often the first components account for most of the variance while additional components add very little additional information. Using various criteria, five components seemed to account for most of the variance in personality ratings and the first five components could be identified in different datasets. So, the Big Five were born.

One important feature of PCA is that the components are independent (orthogonal). This is helpful to maximize the information that is captured with five dimensions. If the five dimensions were correlated, they would represent overlapping variances and this redundancy would reduce the amount of explained variance. Thus, the Big Five are conceptually independent because they were discovered with a method that enforced independence.

Scale Scores are not Factors

While principal component analysis is useful to classify personality traits, it is not useful to do basic research on the causes and consequences of personality. For this purpose, personality psychologists create scales. Scales are usually created by summing items that belong to a common factor. For example, responses to the items “talkative,” “sociable,” and “reserved” are added up to create an extraversion score. Ratings of the item “reserved” are reversed so that higher scores reflect extraversion. Importantly, sum scores are only proxies of the components or factors that were identified in a factor analysis or a PCA. Thus, we need to distinguish between extraversion-factors and extraversion-scales. They are not the same thing. Unfortunately, personality psychologists often treat scales as if they were identical with factors.
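A minimal sketch of how such a scale score is computed may help to make the distinction concrete. The 1-to-5 rating format and the example responses are assumptions made for this illustration; the three items are the ones mentioned above.

```python
def scale_score(responses, reverse_keyed, low=1, high=5):
    """Sum item ratings into a scale score, reversing reverse-keyed items.

    The 1-to-5 rating format is an assumption for this illustration.
    """
    total = 0
    for item, rating in responses.items():
        if item in reverse_keyed:
            # On a 1-to-5 scale a rating of 2 becomes 4, 1 becomes 5, and so on.
            rating = low + high - rating
        total += rating
    return total


# Hypothetical responses of one person.
responses = {"talkative": 4, "sociable": 5, "reserved": 2}
extraversion_score = scale_score(responses, reverse_keyed={"reserved"})
print(extraversion_score)
```

The sum treats every item as an equally good and pure indicator of extraversion. Any evaluative bias, acquiescence, or secondary loading in an item ends up in the scale score, which is why a scale is only a proxy for the factor.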

Big Five Scales are not Independent

Now something strange happened when personality psychologists examined the correlations among Big Five SCALES. Unlike the factors that were independent by design, Big Five Scales were not independent. Moreover, the correlations among Big Five scales were not random. Digman (1997) was the first to examine these correlations. The article has garnered over 800 citations.

Digman examined these correlations by conducting another principal component analysis of the correlations. He found two factors: one factor for extraversion and openness and the other factor for agreeableness and conscientiousness (and maybe low neuroticism). He proposed that these two factors represent an even higher level in a hierarchy of personality traits, maybe like moving from the level of classes (mammals, fish, reptiles) to the level of phylum, a level that is so abstract that few people who are not biologists are familiar with it.

Digman’s article stimulated further research on higher-order factors of personality, where higher means even higher than the Big Five, which are already at a fairly high level of abstraction. Nobody stopped to wonder how there could be higher-order factors if the Big Five are actually independent factors, and why Big Five scales show systematic correlations that were not present in factor analyses.

Instead personality psychologists speculated about the biological underpinning of the higher order factors. For example, Jordan B. Peterson (yes, them) and colleagues proposed that serotonin is related to higher stability (high agreeableness, high conscientiousness, and low neuroticism) (DeYoung, Peterson, and Higgins, 2002).

Rather than interpreting this finding as evidence that response tendencies contribute to correlations among Big Five scales, they interpreted it as a substantive finding about personality and society in the context of psychodynamic theories.

Only a few years later, separated from the influence of his advisor, deYoung (2006) published a more reasonable article that used a multi-method approach to separate personality variance from method variance. This article provided strong evidence that a general evaluative bias (socially desirable responding) contributes to correlations among Big Five Scales, which was formalized in Anusic et al.’s (2009) model with an explicit evaluative bias (halo) factor.

However, the idea of higher-order factors was sustained by finding cross-method correlations that were consistent with the higher-order model.

After battling Colin as a reviewer, when we submitted a manuscript on halo bias in personality ratings, we finally were able to publish a compromise model that also included the higher order factors (stability/alpha; plasticity/beta), although we had problems identifying the alpha factor in some datasets.

The Big Mistake

Meanwhile, another article built on the 2002 model that did not control for rating biases and proposed that the correlation between the two higher-order factors implies that there is an even higher level in the hierarchy. The Big Trait of Personality makes people actually have more desirable personalities; they are less neurotic, more sociable, open, agreeable, and conscientious. Who wouldn’t want one of them as a spouse or friend? However, the 2006 article by deYoung showed that the Big One only exists in the imagination of individuals and is not shared with perceptions by others. This finding was replicated in several datasets by Anusic et al. (2009).

Although claims about the Big One were already invalidated when the article was published, it appealed to some personality psychologists. In particular, white supremacist Phillip Rushton found the idea of a generally good personality very attractive and spent the rest of his life promoting it (Rushton & Irwing, 2011; Rushton, Bons, & Hur, 2008). He never realized the distinction between a personality factor, which is a latent construct, and a personality scale, which is the manifest sum-score of some personality items, and ignored deYoung’s (2006) and others’ (Anusic et al., 2009) evidence that the evaluative portion in personality ratings is a rating bias and not substantive covariance among the Big Five traits.

Peterson and Rushton are examples of pseudo-science that mixes some empirical findings with grand ideas about human nature that are only loosely related. Fortunately, interest in the general factor of personality seems to be decreasing.

Higher Order Factors or Secondary Loadings?

Ashton, Lee, Goldberg, and deVries (2009) put some cold water on the idea of higher-order factors. They pointed out that correlations between Big Five Scales may result from secondary loadings of items on Big Five Factors. For example, the item adventurous may load on extraversion and openness. If the item is used to create an extraversion scale, the openness and extraversion scale will be positively correlated.
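The point can be illustrated with a small calculation. The sketch below assumes two independent factors (extraversion and openness), items with unit variance and uncorrelated residuals, and two scales that do not share items. All loadings are made up for the illustration; only the item "adventurous" comes from the example above.

```python
import math


def implied_scale_correlation(scale_x, scale_y):
    """Correlation between two sum scores implied by independent factors.

    Each item is a tuple of standardized loadings, one per factor. Items
    are assumed to have unit variance and uncorrelated residuals, and the
    two scales are assumed to share no items.
    """
    def item_cov(a, b):
        # Covariance of two different items under independent factors.
        return sum(la * lb for la, lb in zip(a, b))

    def sum_variance(items):
        # Every item contributes variance 1; every ordered pair of
        # distinct items contributes its covariance.
        pair_cov = sum(
            item_cov(a, b)
            for i, a in enumerate(items)
            for j, b in enumerate(items)
            if i != j
        )
        return len(items) + pair_cov

    cross_cov = sum(item_cov(a, b) for a in scale_x for b in scale_y)
    return cross_cov / math.sqrt(sum_variance(scale_x) * sum_variance(scale_y))


# Hypothetical loadings on (extraversion, openness).
extraversion_items = [(0.6, 0.0), (0.6, 0.0), (0.6, 0.3)]  # last item: "adventurous"
extraversion_pure = [(0.6, 0.0)] * 3                       # same scale without the secondary loading
openness_items = [(0.0, 0.6)] * 3

r_with = implied_scale_correlation(extraversion_items, openness_items)
r_without = implied_scale_correlation(extraversion_pure, openness_items)
print(f"with secondary loading: {r_with:.2f}")
print(f"without secondary loading: {r_without:.2f}")
```

With these made-up values, a single secondary loading of .3 is enough to produce a scale correlation of about .10, although the two factors are independent by construction.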

As it turns out, it is always possible to model the Big Five as independent factors with secondary loadings to avoid correlations among factors. After all, this is how exploratory factor analysis or PCA are able to account for correlations among personality items with independent factors or components. In an EFA, all items have secondary loadings on all factors, although some of these loadings may be small.

There are only two ways to distinguish empirically between a higher-order model and a secondary-loading model. One solution is to obtain measures of the actual causes of personality (e.g., genetic markers, shared environment factors, etc.). If there are higher order factors, some of the causes should influence more than one Big Five dimension. The problem is that it has been difficult to identify causes of personality traits.

The second approach is to examine the number of secondary loadings. If all openness items load on extraversion in the same direction (e.g., adventurous, interest in arts, interest in complex issues), it suggests that there is a real common cause. However, if secondary loadings are unique to one item (adventurous), it suggests that the general factors are independent. This is by no means a definitive test of the structure of personality, but it is instructive to examine how many items from one trait have secondary loadings on another trait. Even more informative would be the use of facet-scales rather than individual items.

I have examined this question in two datasets. One dataset is an online sample with items from the IPIP-100 (Johnson). The other dataset is an online sample with the BFI (Gosling and colleagues). The factor loading matrices have been published in separate blog posts and the syntax and complete results have been posted on OSF (Schimmack, 2019b; 2019c).

IPIP-100

Neuroticism items show 8 out of 16 secondary loadings on agreeableness, and 4 out of 16 secondary loadings on conscientiousness.

Item | # | Loadings in column order N, E, O, A, C, EVB, ACQ (blank cells omitted)
Neuroticism
easily disturbed | 3 | 0.44, -0.25
not easily bothered | 10 | -0.58, -0.12, -0.11, 0.25
relaxed most of the time | 17 | -0.61, 0.19, -0.17, 0.27
change my mood a lot | 25 | 0.55, -0.15, -0.24
feel easily threatened | 37 | 0.50, -0.25
get angry easily | 41 | 0.50, -0.13
get caught up in my problems | 42 | 0.56, 0.13
get irritated easily | 44 | 0.53, -0.13
get overwhelmed by emotions | 45 | 0.62, 0.30
stress out easily | 46 | 0.69, 0.11
frequent mood swings | 56 | 0.59, -0.10
often feel blue | 77 | 0.54, -0.27, -0.12
panic easily | 80 | 0.56, 0.14
rarely get irritated | 82 | -0.52
seldom feel blue | 83 | -0.41, 0.12
take offense easily | 91 | 0.53
worry about things | 100 | 0.57, 0.21, 0.09
SUM |  | 0.83, -0.05, 0.00, 0.07, -0.02, -0.38, 0.12

Agreeableness items show only one secondary loading on conscientiousness and one on neuroticism.

Agreeableness
indifferent to feelings of others | 8 | -0.58, -0.27, 0.16
not interested in others’ problems | 12 | -0.58, -0.26, 0.15
feel little concern for others | 35 | -0.58, -0.27, 0.18
feel others’ emotions | 36 | 0.60, 0.22, 0.17
have a good word for everybody | 49 | 0.59, 0.10, 0.17
have a soft heart | 51 | 0.42, 0.29, 0.17
inquire about others’ well-being | 58 | 0.62, 0.32, 0.19
insult people | 59 | 0.19, 0.12, -0.32, -0.18, -0.25, 0.15
know how to comfort others | 62 | 0.26, 0.48, 0.28, 0.17
love to help others | 69 | 0.14, 0.64, 0.33, 0.19
sympathize with others’ feelings | 89 | 0.74, 0.30, 0.18
take time out for others | 92 | 0.53, 0.32, 0.19
think of others first | 94 | 0.61, 0.29, 0.17
SUM |  | -0.03, 0.07, 0.02, 0.84, 0.03, 0.41, 0.09

Finally, conscientiousness items show only one secondary loading on agreeableness.

Conscientiousness
always prepared | 2 | 0.62, 0.28, 0.17
exacting in my work | 4 | -0.09, 0.38, 0.29, 0.17
continue until everything is perfect | 26 | 0.14, 0.49, 0.13, 0.16
do things according to a plan | 28 | 0.65, -0.45, 0.17
do things in a half-way manner | 29 | -0.49, -0.40, 0.16
find it difficult to get down to work | 39 | 0.09, -0.48, -0.40, 0.14
follow a schedule | 40 | 0.65, 0.07, 0.14
get chores done right away | 43 | 0.54, 0.24, 0.14
leave a mess in my room | 63 | -0.49, -0.21, 0.12
leave my belongings around | 64 | -0.50, -0.08, 0.13
like order | 65 | 0.64, -0.07, 0.16
like to tidy up | 66 | 0.19, 0.52, 0.12, 0.14
love order and regularity | 68 | 0.15, 0.68, -0.19, 0.15
make a mess of things | 72 | 0.21, -0.50, -0.26, 0.15
make plans and stick to them | 75 | 0.52, 0.28, 0.17
neglect my duties | 76 | -0.55, -0.45, 0.16
forget to put things back | 79 | -0.52, -0.22, 0.13
shirk my duties | 85 | -0.45, -0.40, 0.16
waste my time | 98 | -0.49, -0.46, 0.14
SUM |  | -0.03, -0.01, 0.01, 0.03, 0.84, 0.36, 0.00

Of course, there could be additional relationships that are masked by fixing most secondary loadings to zero. However, it also matters how strong the secondary loadings are. Weak secondary loadings will produce weak correlations among Big Five scales. Even the secondary loadings in the model are weak. Thus, there is little evidence that neuroticism, agreeableness, and conscientiousness items are all systematically related as predicted by a higher-order model. At best, the data suggest that neuroticism has a negative influence on agreeable behaviors. That is, people differ in their altruism, but agreeable neurotic people are less agreeable when they are in a bad mood.

Results for extraversion and openness are similar. Only one extraversion item loads on openness.

Extraversion
hard to get to know | 7 | -0.45, -0.23, 0.13
quiet around strangers | 16 | -0.65, -0.24, 0.14
skilled handling social situations | 18 | 0.65, 0.13, 0.39, 0.15
am life of the party | 19 | 0.64, 0.16, 0.14
don’t like drawing attention to self | 30 | -0.54, 0.13, -0.14, 0.15
don’t mind being center of attention | 31 | 0.56, 0.23, 0.13
don’t talk a lot | 32 | -0.68, 0.23, 0.13
feel at ease with people | 33 | -0.20, 0.64, 0.16, 0.35, 0.16
feel comfortable around others | 34 | -0.23, 0.65, 0.15, 0.27, 0.16
find it difficult to approach others | 38 | -0.60, -0.40, 0.16
have little to say | 57 | -0.14, -0.52, -0.25, 0.14
keep in the background | 60 | -0.69, -0.25, 0.15
know how to captivate people | 61 | 0.49, 0.29, 0.28, 0.16
make friends easily | 73 | -0.10, 0.66, 0.14, 0.25, 0.15
feel uncomfortable around others | 78 | 0.22, -0.64, -0.24, 0.14
start conversations | 88 | 0.70, 0.12, 0.27, 0.16
talk to different people at parties | 93 | 0.72, 0.22, 0.13
SUM |  | -0.04, 0.88, 0.02, 0.06, -0.02, 0.37, 0.01

And only one openness item loads on extraversion, and this loading is in the opposite direction from the prediction by the higher-order model. While open people tend to like reading challenging materials, extraverts do not.

Openness
full of ideas | 5 | 0.65, 0.32, 0.19
not interested in abstract ideas | 11 | -0.46, -0.27, 0.16
do not have good imagination | 27 | -0.45, -0.19, 0.16
have rich vocabulary | 50 | 0.52, 0.11, 0.18
have a vivid imagination | 52 | 0.41, -0.11, 0.28, 0.16
have difficulty imagining things | 53 | -0.48, -0.31, 0.18
difficulty understanding abstract ideas | 54 | 0.11, -0.48, -0.28, 0.16
have excellent ideas | 55 | 0.53, -0.09, 0.37, 0.22
love to read challenging materials | 70 | -0.18, 0.40, 0.23, 0.14
love to think up new ways | 71 | 0.51, 0.30, 0.18
SUM |  | -0.02, -0.04, 0.75, -0.01, -0.02, 0.40, 0.09

The next table shows the correlations among the Big Five SCALES.

Scale Correlations | N | E | O | A | C
Neuroticism (N)
Extraversion (E) | -0.21
Openness (O) | -0.16 | 0.13
Agreeableness (A) | -0.13 | 0.27 | 0.17
Conscientiousness (C) | -0.17 | 0.11 | 0.14 | 0.20

The pattern mostly reflects the influence of the evaluative bias factor that produces negative correlations of neuroticism with the other scales and positive correlations among the other scales. There is no evidence that extraversion and openness are more strongly correlated in the IPIP-100. Overall, these results are rather disappointing for higher-order theorists.
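The link between the factor loadings and the scale correlations can be checked with a short Python sketch. It rests on two assumptions that are mine and not stated in the tables: the SUM rows are standardized loadings of the scale scores on seven independent factors in the order N, E, O, A, C, EVB, ACQ, and residuals of different scales are uncorrelated. Under these assumptions the implied correlation between two scales is the sum of the products of their loadings, and the part due to evaluative bias is the product of the two EVB loadings.

```python
from itertools import combinations

FACTORS = ("N", "E", "O", "A", "C", "EVB", "ACQ")

# SUM rows of the IPIP-100 tables (assumed column order: see FACTORS).
sum_loadings = {
    "N": (0.83, -0.05, 0.00, 0.07, -0.02, -0.38, 0.12),
    "E": (-0.04, 0.88, 0.02, 0.06, -0.02, 0.37, 0.01),
    "O": (-0.02, -0.04, 0.75, -0.01, -0.02, 0.40, 0.09),
    "A": (-0.03, 0.07, 0.02, 0.84, 0.03, 0.41, 0.09),
    "C": (-0.03, -0.01, 0.01, 0.03, 0.84, 0.36, 0.00),
}

# Scale correlations reported in the table above.
reported = {
    ("N", "E"): -0.21, ("N", "O"): -0.16, ("N", "A"): -0.13, ("N", "C"): -0.17,
    ("E", "O"): 0.13, ("E", "A"): 0.27, ("E", "C"): 0.11,
    ("O", "A"): 0.17, ("O", "C"): 0.14, ("A", "C"): 0.20,
}


def implied_correlation(a, b):
    """Scale correlation implied by loadings on independent factors."""
    return sum(x * y for x, y in zip(sum_loadings[a], sum_loadings[b]))


def bias_only(a, b):
    """Part of the implied correlation that is due to evaluative bias."""
    evb = FACTORS.index("EVB")
    return sum_loadings[a][evb] * sum_loadings[b][evb]


for a, b in combinations("NEOAC", 2):
    print(
        f"{a}-{b}: implied {implied_correlation(a, b):+.2f}, "
        f"bias only {bias_only(a, b):+.2f}, reported {reported[(a, b)]:+.2f}"
    )
```

If the assumptions hold, the implied values should agree with the reported correlations within rounding error of the two-decimal loadings, and the bias-only column shows how much of each correlation the evaluative bias factor produces by itself.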

The pattern of correlations reflects mostly the influence of the evaluative bias factor. As a result, the neuroticism scale is negatively correlated with the other scales and the other scales are positively correlated with each other. There is no evidence for a stronger correlation between extraversion and openness because there are no notable secondary loadings. There is also no evidence that agreeableness and conscientiousness are more strongly related to neuroticism. Thus, these results show that deYoung’s (2006) higher-order model is not consistent across different Big Five questionnaires.

Big Five Inventory

deYoung found the higher-order factors with the Big Five Inventory. Thus, it is particularly interesting to examine the secondary loadings in a measurement model with independent Big Five factors (Schimmack, 2019b).

Neuroticism items have only one secondary loading on agreeableness and one on conscientiousness and the magnitude of these loadings is small.

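Item | # | Loadings in column order N, E, O, A, C, EVB, ACQ (blank cells omitted)
Neuroticism
depressed/blue | 4 | 0.33, -0.15, 0.20, -0.48, 0.06
relaxed | 9 | -0.72, 0.23, 0.18
tense | 14 | 0.51, -0.25, 0.20
worry | 19 | 0.60, -0.08, 0.07, -0.21, 0.17
emotionally stable | 24 | -0.61, 0.27, 0.18
moody | 29 | 0.43, -0.33, 0.18
calm | 34 | -0.58, -0.04, -0.14, -0.12, 0.25, 0.20
nervous | 39 | 0.52, -0.25, 0.17
SUM |  | 0.79, -0.08, -0.01, -0.05, -0.02, -0.42, 0.05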

Four out of nine agreeableness items have secondary loadings on neuroticism, but the magnitude of these loadings is small. Four items also have loadings on conscientiousness, but one item (forgiving) has a loading opposite to the one predicted by the higher-order model.

Agreeableness
find faults w. others | 2 | 0.15, -0.42, -0.24, 0.19
helpful / unselfish | 7 | 0.44, 0.10, 0.29, 0.23
start quarrels | 12 | 0.13, 0.20, -0.50, -0.09, -0.24, 0.19
forgiving | 17 | 0.47, -0.14, 0.24, 0.19
trusting | 22 | 0.15, 0.33, 0.26, 0.20
cold and aloof | 27 | -0.19, 0.14, -0.46, -0.35, 0.17
considerate and kind | 32 | 0.04, 0.62, 0.29, 0.23
rude | 37 | 0.09, 0.12, -0.63, -0.13, -0.23, 0.18
like to cooperate | 42 | 0.15, -0.10, 0.44, 0.28, 0.22
SUM |  | -0.07, 0.00, -0.07, 0.78, 0.03, 0.44, 0.04

For conscientiousness, only two items have a secondary loading on neuroticism and two items have a secondary loading on agreeableness.

Conscientiousness
thorough job | 3 | 0.59, 0.28, 0.22
careless | 8 | -0.17, -0.51, -0.23, 0.18
reliable worker | 13 | -0.09, 0.09, 0.55, 0.30, 0.24
disorganized | 18 | 0.15, -0.59, -0.20, 0.16
lazy | 23 | -0.52, -0.45, 0.17
persevere until finished | 28 | 0.56, 0.26, 0.20
efficient | 33 | -0.09, 0.56, 0.30, 0.23
follow plans | 38 | 0.10, -0.06, 0.46, 0.26, 0.20
easily distracted | 43 | 0.19, 0.09, -0.52, -0.22, 0.17
SUM |  | -0.05, 0.00, -0.05, 0.04, 0.82, 0.42, 0.03

Overall, these results provide no support for the higher-order model that predicts correlations among all neuroticism, agreeableness, and conscientiousness items. These results are also consistent with Anusic et al.’s (2009) difficulty of identifying the alpha/stability factor in a study with the BFI-S, a shorter version of the BFI.

However, Anusic et al. (2009) did find a beta-factor with BFI-S scales. The present analysis of the BFI does not replicate this finding. Only two extraversion items have small loadings on the openness factor.

Extraversion
talkative | 1 | 0.13, 0.70, -0.07, 0.23, 0.18
reserved | 6 | -0.58, 0.09, -0.21, 0.18
full of energy | 11 | 0.34, -0.11, 0.58, 0.20
generate enthusiasm | 16 | 0.07, 0.44, 0.11, 0.50, 0.20
quiet | 21 | -0.81, 0.04, -0.21, 0.17
assertive | 26 | -0.09, 0.40, 0.14, -0.24, 0.18, 0.24, 0.19
shy and inhibited | 31 | 0.18, 0.64, -0.22, 0.17
outgoing | 36 | 0.72, 0.09, 0.35, 0.18

And only one openness item has a small loading that is opposite to the predicted direction. Extraverts are less likely to like reflecting.

Openness
original | 5 | 0.53, -0.11, 0.38, 0.21
curious | 10 | 0.41, -0.07, 0.31, 0.24
ingenious | 15 | 0.57, 0.09, 0.21
active imagination | 20 | 0.13, 0.53, -0.17, 0.27, 0.21
inventive | 25 | -0.09, 0.54, -0.10, 0.34, 0.20
value art | 30 | 0.12, 0.46, 0.09, 0.16, 0.18
like routine work | 35 | -0.28, 0.10, 0.13, -0.21, 0.17
like reflecting | 40 | -0.08, 0.58, 0.27, 0.21
few artistic interests | 41 | -0.26, -0.09, 0.15
sophisticated in art | 44 | 0.07, 0.44, -0.06, 0.10, 0.16
SUM |  | 0.04, -0.03, 0.76, -0.04, -0.05, 0.36, 0.19

In short, there is no support for the presence of a higher-order factor that produces overlap between extraversion and openness.

The pattern of correlations among the BFI scales, however, might suggest that there is an alpha factor because neuroticism, agreeableness and conscientiousness tend to be more strongly correlated with each other than with other dimensions. This shows the problem of using scales to study higher-order factors. However, there is no evidence for a higher-order factor that combines extraversion and openness as the correlation between these traits is an unremarkable r = .18.

Scale Correlations | N | E | O | A | C
Neuroticism (N)
Extraversion (E) | -0.26
Openness (O) | -0.11 | 0.18
Agreeableness (A) | -0.28 | 0.16 | 0.08
Conscientiousness (C) | -0.23 | 0.18 | 0.07 | 0.25

So, why did deYoung (2006) find evidence for higher-order factors? One possible explanation is that BFI scale correlations are not consistent across different samples. The next table shows the self-report correlations from deYoung (2006) below the diagonal and discrepancies above the diagonal. Three of the four theoretically important correlations tend to be stronger in deYoung’s (2006) data. It is therefore possible that the secondary loading pattern differs across the two datasets. It would be interesting to fit an item-level model to deYoung’s data to explore this issue further.

Scale Correlations          N      E      O      A      C
Neuroticism (N)                 0.10   0.03  -0.06  -0.08
Extraversion (E)        -0.16          0.07   0.01   0.03
Openness (O)            -0.08   0.25         -0.02   0.02
Agreeableness (A)       -0.36   0.15   0.06         -0.01
Conscientiousness (C)   -0.31   0.21   0.09   0.24

In conclusion, an analysis of the BFI also does not support the higher-order model. However, results seem to be inconsistent across different samples. While this suggests that more research is needed, it is clear that this research needs to model personality at the level of items and not with scale scores that are contaminated by evaluative bias and secondary loadings.

Conclusion

Hindsight is 20/20, and after 20 years of research on higher-order factors, a lot of this research looks silly. How could there be higher-order factors for the Big Five if the Big Five are independent factors (or components) by default? The search for higher-order factors with Big Five scales can be attributed to methodological limitations, although higher-order models with structural equation modeling have been around since the 1980s. It is rather obvious that scale scores are impure measures and that correlations among scales are influenced by secondary loadings. However, even when this fact was pointed out by Ashton et al. (2009), it was ignored. The problem is mainly due to the lack of proper training in methods. Here the problem is the use of scales as indicators of factors, when scales introduce measurement error and higher-order factors are method artifacts.

The fact that it is possible to recover independent Big Five factors from questionnaires that were designed to measure five independent dimensions says nothing about the validity of the Big Five model. To examine the validity of the Big Five as a model of the highest level in a taxonomy of personality traits, it is important to examine the relationship of the Big Five with the diverse population of personality traits. This is an important area of research that could also benefit from proper measurement models. This post merely focused on the search for higher-order factors for the Big Five and showed that searching for higher-order factors of independent factors is a futile endeavor that only leads to wild speculations that are not based on empirical evidence (Peterson, Rushton).

Even deYoung and Peterson seem to have realized that it is more important to examine the structure of personality below rather than above the Big Five (deYoung, Quilty, & Peterson, 2007). Whether 10 aspects, 16 factors (Cattell) or 30 facets (Costa & McCrae) represent another meaningful level in a hierarchical model of personality traits remains to be examined. Removing method variance and taking secondary loadings into account will be important to separate valid variance from noise. Also, factor analysis is superior to principal component analysis unless the goal is simply to describe personality with atheoretical components that capture as much variance as possible.

Correct me if you can

This blog post is essentially a scientific article without peer-review. I prefer this mode of communication over submitting manuscripts to traditional journals where a few reviewers have the power to prevent research from being published. This happened with a manuscript that Ivana Anusic and I submitted and that was killed by Colin deYoung as a reviewer. I prefer open reviews and I invite Colin to write an open review of this “article.” I am happy to be corrected and any constructive comments would be a welcome contribution to advancing personality science. Simply squashing critical work so that nobody gets to see it is not advancing science. The new way of conducting open science with open submissions and open reviews is the way to go. Of course, others are also invited to engage in the debate. So, let’s start a debate with the thesis “Higher-order factors of the Big Five do not exist.”

Personality and Self-Esteem

In the 1980s, personality psychologists agreed on the Big Five as a broad framework to describe and measure personality; that is, variation in psychological attributes across individuals.

You can think about the Big Five as a five-dimensional map. Like the two-dimensional map (or a three-dimensional globe), the Big Five are independent dimensions that create a space with coordinates that can be used to describe the vast number of psychological attributes that distinguish one person from another. One area of research in personality psychology is to correlate measures of personality attributes with Big Five measures to pinpoint their coordinates.

One important and frequently studied personality attribute is self-esteem, and dozens of studies have correlated self-esteem measures with Big Five measures. Robins, Tracy, and Trzesniewski (2001) reviewed some of these studies.

The results are robust and there is no worry about the replicability of these results. The strongest predictor of self-esteem is neuroticism vs. emotional stability. Self-esteem is located at the low end of neuroticism, that is, at the emotionally stable end. The second predictor is extraversion vs. introversion. Self-esteem is located at the higher end of extraversion. The third predictor is conscientiousness, which shows a slight positive location on the conscientious vs. careless dimension. Openness vs. closedness also shows a slight tendency towards openness. Finally, the results for agreeableness are more variable and show at least one negative correlation, but most correlations tend to be positive.

Evaluative Bias

Psychologists have a naive view of the validity of their measures. Although they sometimes compute reliability and examine convergent validity in methodological articles that are published in obscure journals like “Psychological Assessment,” they treat measures as perfectly valid in substantive articles that are published in journals like “Journal of Personality” or “Journal of Research in Personality.” Unfortunately, measurement problems can distort effect sizes and occasionally they can change the sign of a correlation.

Anusic et al. (2009) developed a measurement model for the Big Five that separates valid variance in the Big Five dimensions from rating biases. Rating biases can be content free (acquiescence) or respond to the desirability of items (halo, evaluative bias). They showed that evaluative bias can obscure the location of self-esteem in the Big Five space. Here, I revisit this question with better data that measure the Big Five with a measurement model fitted to the 44 items of the Big Five Inventory (Schimmack, 2019a).

I used the same data, which is the Canadian subsample of Gosling and colleagues’ large internet study that collects data from visitors who receive feedback about their personality. I simply added the single-item self-esteem measure to the dataset. I then fitted three different models. One model regressed the self-esteem item only on the Big Five dimensions. This model essentially replicates analyses with scale scores. I then added the method factors to the set of predictors.

                      N      E      O      A      C    EVB    ACQ
Self-Esteem M1    -0.43   0.30   0.08  -0.03   0.16
Self-Esteem M2    -0.33   0.19   0.00  -0.14   0.08   0.43   0.11

Results for the first model reproduce previous findings (see Table 1). However, results changed when the method factors were added. Most important, self-esteem is now placed on the negative side of agreeableness towards being more assertive. This makes sense given the selfless and other-focused nature of agreeableness. Agreeable people are less likely to think about themselves and may subordinate their own needs to the needs of others. In contrast, people with high self-esteem are more likely to focus on themselves. Even though this is not a strong relationship, it is noteworthy that the relationship is negative rather than positive.

The other noteworthy finding is that evaluative bias is the strongest predictor of self-esteem. There are two interpretations of this finding and it is not clear which explanation accounts for this finding.

One interpretation is that self-esteem is rooted in a trait to see everything related to the self in an overly positive way. This interpretation implies that responses to personality items are driven by the desirability of items and individuals with high self-esteem see themselves as possessing all kinds of desirable attributes that they do not have (or have to a lesser degree). They think that they are kinder, smarter, funnier, and prettier than others, when they are actually not. In this way, the evaluative bias in personality ratings is an indirect measure of self-esteem.

The other interpretation is that evaluative bias is a rating bias that influences all self-ratings, which includes self-esteem ratings. Thus, the loading of the self-esteem item on the evaluative bias factor shows simply that self-esteem ratings are influenced by evaluative bias because self-esteem is a desirable attribute.

Disentangling these two interpretations requires the use of a multi-method approach. If evaluative bias is merely a rating bias, it should not be correlated with actual life-outcomes. However, if evaluative bias reflects actual self-evaluations, it should be correlated with outcomes of high self-esteem.

Conclusion

Hopefully, this blog-post will create some awareness that personality psychology needs to move beyond the use of self-ratings in mapping the location of personality attributes in the Big Five space.

The blog post also has important implications for theories of personality development that assign value to personality dimensions (Dweck, 2008). Accordingly, the goal of personality development is to become more agreeable and conscientious and less neurotic among other things. However, I question that personality traits have intrinsic value. That is, agreeableness is not intrinsically good and low conscientiousness is not intrinsically bad. The presence of evaluative bias in personality items shows only that personality psychologists assign value to some traits and do not include items like “I am a clean-freak” in their questionnaires. Without a clear evaluation, there is no direction to personality change. Becoming more conscientious is no longer a sign of personal growth and maturation, but rather a change that may have positive or negative consequences for individuals. Although these issues can be debated, it is problematic that current models of personality development do not even question the evaluation of personality traits and treat the positive nature of some traits as a fundamental assumption that cannot be questioned. I suggest it is worthwhile to think about personality like sexual orientation or attractiveness. Although society has created strong evaluations that are hard to change, the goal should be to change these evaluations, not to change individuals to conform to these norms.

How Valid are Short Big-Five Scales?

The first measures of the Big Five used a large number of items to measure personality. This made it difficult to include personality measures in studies as the assessment of personality would take up all of the survey time. Over time, shorter scales became available. One important short Big Five measure is the BFI-S (Lang et al., 2011). This 15-item measure has been used in several nationally representative, longitudinal studies such as the German Socio-Economic Panel (Schimmack, 2019a). These results provide unique insights into the stability of personality (Schimmack, 2019b) and the relationship of personality with other constructs such as life-satisfaction (Schimmack, 2019c). Some of these results overturn textbook claims about personality. However, critics argue that these results cannot be trusted because the BFI-S is an invalid measure of personality.

Thus, it is of critical importance to evaluate the validity of the BFI-S. Here I use Gosling and colleagues’ data to examine the validity of the BFI-S. Previously, I fitted a measurement model to the full 44-item BFI (Schimmack, 2019d). It is straightforward to evaluate the validity of the BFI-S by examining the correlation of the 3-item BFI-S scale scores with the latent factors based on all 44 BFI items. For comparison purposes, I also show the correlations for the BFI scale scores. The complete results for individual items are shown in the previous blog post (Schimmack, 2019d).

The measurement model for the BFI has seven independent factors. Five factors represent the Big Five and two factors represent method factors. One factor represents acquiescence bias. The other factor represents evaluative bias that is present in all self-ratings of personality (Anusic et al., 2009). As all factors are independent, the squared coefficients can be interpreted as the amount of variance that a factor explains in a scale score.

The results show that the BFI-S scales are nearly as valid as the longer BFI scales (Table 1).

Scale    #Items      N      E      O      A      C    EVB    ACQ
N-BFI         8   0.79  -0.08  -0.01  -0.05  -0.02  -0.42   0.05
N-BFI-S       3   0.77  -0.13  -0.05   0.07  -0.04  -0.29   0.07
E-BFI         8  -0.02   0.83   0.04  -0.05   0.00   0.44   0.06
E-BFI-S       3   0.05   0.82   0.00   0.04  -0.07   0.32   0.07
O-BFI        10   0.04  -0.03   0.76  -0.04  -0.05   0.36   0.19
O-BFI-S       3   0.09   0.00   0.66  -0.04  -0.10   0.32   0.25
A-BFI         9  -0.07   0.00  -0.07   0.78   0.03   0.44   0.04
A-BFI-S       3  -0.03  -0.06   0.00   0.75   0.00   0.33   0.09
C-BFI         9  -0.05   0.00  -0.05   0.04   0.82   0.42   0.03
C-BFI-S       3  -0.09   0.00  -0.02   0.00   0.75   0.44   0.06

For example, the factor-scale correlations for neuroticism, extraversion, and agreeableness are nearly identical. The biggest difference was observed for openness with a correlation of r = .76 for the BFI-scale and r = .66 for the BFI-S scale. The only other notable systematic variance in scales is the evaluative bias influence which tends to be stronger for the longer scales with the exception of conscientiousness. In the future, measurement models with an evaluative bias factor can be used to select items with low loadings on the evaluative bias factor to reduce the influence of this bias on scale scores. Given these results, one would expect that the BFI and BFI-S produce similar results. The next analyses tested this prediction.
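The variance interpretation of these coefficients can be made concrete with a small calculation. The following is only a sketch in Python, not the code used for the analyses: it squares the Table 1 coefficients for the two openness scales and treats whatever is left as unexplained variance. The function name `variance_shares` is made up for this illustration, and the squaring is only meaningful under the model's assumption that all seven factors are independent.

```python
# Factor-scale correlations copied from Table 1 (order: N, E, O, A, C, EVB, ACQ).
FACTORS = ["N", "E", "O", "A", "C", "EVB", "ACQ"]
SCALES = {
    "O-BFI": [0.04, -0.03, 0.76, -0.04, -0.05, 0.36, 0.19],
    "O-BFI-S": [0.09, 0.00, 0.66, -0.04, -0.10, 0.32, 0.25],
}


def variance_shares(coefficients):
    """Square each coefficient to get the share of scale variance per factor.

    This is only valid when the factors are independent, as in the model here.
    """
    shares = {name: round(c ** 2, 3) for name, c in zip(FACTORS, coefficients)}
    # Whatever the seven factors do not explain is item-specific or random variance.
    shares["unexplained"] = round(1 - sum(c ** 2 for c in coefficients), 3)
    return shares


for scale, coefficients in SCALES.items():
    print(scale, variance_shares(coefficients))
```

With these inputs, the openness factor accounts for a bit less than 60 percent of the variance in the 10-item scale and a bit less than 45 percent in the 3-item scale, which is the drop from r = .76 to r = .66 expressed in variance terms.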

Gender Differences

I examined gender differences three ways. First, I examined standardized mean differences at the level of latent factors in a model with scalar invariance (Schimmack, 2019d). Second, I computed standardized mean differences with the BFI scales. Finally, I computed standardized mean differences with the BFI-S scales. Table 2 shows the results. Results for the BFI and BFI-S scales are very similar. The latent mean differences show somewhat larger differences for neuroticism and agreeableness because these mean differences are not attenuated by random measurement error. The latent means also show very small gender differences for the method factors. Thus, mean differences based on scale scores are not biased by method variance.

Table 2. Standardized Mean Differences between Men and Women

              N      E      O      A      C    EVB    ACQ
Factor     0.64   0.17  -0.18   0.31   0.15   0.09   0.16
BFI        0.45   0.14  -0.10   0.20   0.14
BFI-S      0.48   0.21  -0.03   0.18   0.12

Note. Positive values indicate higher means for women than for men.

In short, there is no evidence that using 3-item scales invalidates the study of gender differences.

Age Differences

I demonstrated measurement invariance for different age groups (Schimmack, 2019d). Thus, I used simple correlations to examine the relationship between age and the Big Five. I restricted the age range from 17 to 70. Analyses of the full dataset suggest that older respondents have higher levels of conscientiousness and agreeableness (Soto, John, Gosling, & Potter, 2011).

Table 3 shows the results. The BFI and the BFI-S both show the predicted positive relationship with conscientiousness and the effect size is practically identical. The effect size for the latent variable model is stronger because the relationship is not attenuated by random measurement error. Other relationships are weaker and also consistent across measures except for Openness. The latent variable model reveals the reason for the discrepancies. Three items (#15 ingenious, #35 like routine work, and #10 sophisticated in art) showed unique relationships with age. The art-related items showed a unique relationship with age. The latent factor does not include the unique content of these items and shows a positive relationship between openness and age. The scale scores include this content and show a weaker relationship. The positive relationship of openness with age for the latent factor is rather surprising as it is not found in nationally representative samples (Schimmack, 2019b). One possible explanation for this relationship is that older individuals who take an online personality test are more open.

              N      E      O      A      C    EVB    ACQ
Factor    -0.08  -0.02   0.18   0.12   0.33   0.01  -0.11
BFI       -0.08  -0.01   0.08   0.09   0.26
BFI-S     -0.08  -0.04  -0.02   0.08   0.25
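The attenuation argument for conscientiousness can be checked with simple arithmetic. The sketch below (Python, for illustration only) multiplies the latent correlation with age by the factor-scale correlations from Table 1. It deliberately ignores the small contributions of the other six factors, which is a simplifying assumption of this example and not part of the fitted model, and the function name `attenuated_r` is hypothetical.

```python
def attenuated_r(latent_r, scale_factor_r):
    """Expected scale-score correlation if only the intended factor carries the effect."""
    return latent_r * scale_factor_r


latent_age_r = 0.33  # conscientiousness factor with age (Factor row above)
validity = {"BFI": 0.82, "BFI-S": 0.75}  # C scale-factor correlations from Table 1
observed = {"BFI": 0.26, "BFI-S": 0.25}  # scale-score correlations with age above

for name, v in validity.items():
    expected = attenuated_r(latent_age_r, v)
    print(f"{name}: expected {expected:.2f}, observed {observed[name]:.2f}")
```

The expected values land within about .01 of the observed scale correlations, which is what one would expect if the weaker scale-score correlations mainly reflect imperfect measurement of the factor rather than a problem that is specific to the short scale.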

In sum, the most important finding is that the 3-item BFI-S conscientiousness scale shows the same relationship with age as the BFI-scale and the latent factor. Thus, the failure to find aging effects in the longitudinal SOEP data with the BFI-S cannot be attributed to the use of an invalid short measure of conscientiousness. The real scientific question is why the cross-sectional study by Soto et al. (2011) and my analysis of the longitudinal SOEP data show divergent results.

Conclusion

Science has changed now that researchers are able to communicate and discuss research findings on social media. I strongly believe that open science outside of peer-controlled journals is beneficial for the advancement of science. However, the downside of open science on social media is that it becomes more difficult to evaluate the expertise of online commentators. True experts are able to back up their claims with scientific evidence. This is what I did here. I showed that Brenton Wiernik’s comment has as much scientific validity as a Donald Trump tweet. Whatever the reason for the lack of personality change in the SOEP data turns out to be, it is not the use of the BFI-S to measure the Big Five.

Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in press article before it is actually published), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity-hacking. They also conduct some psychometric tests of popular personality measures and claim that they fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn’t meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity. Bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback (https://www.outofservice.com/bigfive/).

For the present analyses I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases. One bias is an evaluative bias and the other bias is acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow for some items to show more or less evaluative bias. Second, secondary loadings were added to allow for some items to be influenced by more than one factor. Finally, items of the same construct were allowed to covary to allow for similar wording or shared meaning (e.g., three arts items from the openness factor were allowed to covary). The final model and the complete results can be found on OSF (https://osf.io/23k8v/).

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit for all three fit indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that the model specification by Hussey and Hughes accounted for the bad fit. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes’s claim that the BFI lacks validity is based on an outdated and implausible measurement model.
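For readers who are less familiar with these fit indices, the sketch below shows how RMSEA and CFI are commonly computed from chi-square statistics. It is a generic Python illustration under stated assumptions: the chi-square values are hypothetical because the values for the BFI models are not reported here, and SEM programs differ in details (for example, some divide by N rather than N - 1 in the RMSEA).

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation for a single-group model."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


# Hypothetical chi-square values, chosen only to illustrate the arithmetic.
print(round(rmsea(chi2=1500.0, df=500, n=1001), 3))
print(round(cfi(1500.0, 500, 20500.0, 550), 3))
```

Both indices start from the excess of the chi-square over its degrees of freedom. RMSEA scales this misfit by the degrees of freedom and the sample size, whereas CFI compares it to the misfit of a baseline model without any correlations among items.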

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

Item # N E O A C EVB ACQ
Neuroticism
depressed/blue 4 0.33 -0.15 0.20 -0.48 0.06
relaxed 9 -0.72 0.23 0.18
tense 14 0.51 -0.25 0.20
worry 19 0.60 -0.08 0.07 -0.21 0.17
emotionally stable 24 -0.61 0.27 0.18
moody 29 0.43 -0.33 0.18
calm 34 -0.58 -0.04 -0.14 -0.12 0.25 0.20
nervous 39 0.52 -0.25 0.17
SUM 0.79 -0.08 -0.01 -0.05 -0.02 -0.42 0.05
Extraversion
talkative 1 0.13 0.70 -0.07 0.23 0.18
reserved 6 -0.58 0.09 -0.21 0.18
full of energy 11 0.34 -0.11 0.58 0.20
generate enthusiasm 16 0.07 0.44 0.11 0.50 0.20
quiet 21 -0.81 0.04 -0.21 0.17
assertive 26 -0.09 0.40 0.14 -0.24 0.18 0.24 0.19
shy and inhibited 31 0.18 0.64 -0.22 0.17
outgoing 36 0.72 0.09 0.35 0.18
SUM -0.02 0.83 0.04 -0.05 0.00 0.44 0.06
Openness
original 5 0.53 -0.11 0.38 0.21
curious 10 0.41 -0.07 0.31 0.24
ingenious 15 0.57 0.09 0.21
active imagination 20 0.13 0.53 -0.17 0.27 0.21
inventive 25 -0.09 0.54 -0.10 0.34 0.20
value art 30 0.12 0.46 0.09 0.16 0.18
like routine work 35 -0.28 0.10 0.13 -0.21 0.17
like reflecting 40 -0.08 0.58 0.27 0.21
few artistic interests 41 -0.26 -0.09 0.15
sophisticated in art 44 0.07 0.44 -0.06 0.10 0.16
SUM 0.04 -0.03 0.76 -0.04 -0.05 0.36 0.19
Agreeableness
find faults w. others 2 0.15 -0.42 -0.24 0.19
helpful / unselfish 7 0.44 0.10 0.29 0.23
start quarrels 12 0.13 0.20 -0.50 -0.09 -0.24 0.19
forgiving 17 0.47 -0.14 0.24 0.19
trusting 22 0.15 0.33 0.26 0.20
cold and aloof 27 -0.19 0.14 -0.46 -0.35 0.17
considerate and kind 32 0.04 0.62 0.29 0.23
rude 37 0.09 0.12 -0.63 -0.13 -0.23 0.18
like to cooperate 42 0.15 -0.10 0.44 0.28 0.22
SUM -0.07 0.00 -0.07 0.78 0.03 0.44 0.04
Conscientiousness
thorough job 3 0.59 0.28 0.22
careless 8 -0.17 -0.51 -0.23 0.18
reliable worker 13 -0.09 0.09 0.55 0.30 0.24
disorganized 18 0.15 -0.59 -0.20 0.16
lazy 23 -0.52 -0.45 0.17
persevere until finished 28 0.56 0.26 0.20
efficient 33 -0.09 0.56 0.30 0.23
follow plans 38 0.10 -0.06 0.46 0.26 0.20
easily distracted 43 0.19 0.09 -0.52 -0.22 0.17
SUM -0.05 0.00 -0.05 0.04 0.82 0.42 0.03

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have the highest loading on the primary factor. Exceptions are the items blue/depressed, full of energy, and generate enthusiasm that have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients shows the amount of explained variance by each factor. By far the largest variance component is the intended construct with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion’s share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias with correlations ranging from .36 for openness to .44 for extraversion. Although this means only roughly 13 to 19 percent of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction, Schimmack, 2019c).
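To illustrate how shared evaluative bias can produce a correlation between scale scores, the following sketch computes the model-implied correlation between the extraversion and agreeableness scales from the SUM rows in Table 1. It is only a Python illustration that relies on the independence of the seven factors, so that the implied correlation is the sum of the products of the two scales' coefficients. The helper name `implied_contributions` is made up for this example.

```python
# SUM rows from Table 1 (order: N, E, O, A, C, EVB, ACQ).
FACTORS = ["N", "E", "O", "A", "C", "EVB", "ACQ"]
E_SUM = [-0.02, 0.83, 0.04, -0.05, 0.00, 0.44, 0.06]
A_SUM = [-0.07, 0.00, -0.07, 0.78, 0.03, 0.44, 0.04]


def implied_contributions(x, y):
    """Contribution of each independent factor to the implied correlation of two scales."""
    return {factor: a * b for factor, a, b in zip(FACTORS, x, y)}


parts = implied_contributions(E_SUM, A_SUM)
total = sum(parts.values())

for factor, value in parts.items():
    print(f"{factor}: {value:+.3f}")
print(f"implied scale correlation: {total:.2f}")
```

In this example the evaluative bias term (.44 x .44) is by far the largest contribution, while the contributions of the Big Five factors are small and partly cancel each other. A correlation of this size between the two scales would therefore mostly reflect shared rating bias rather than a substantive link between extraversion and agreeableness.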

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age-groups and the two largest gender groups. They claimed that the BFI lacks measurement invariance, but this claim was based on a cunning misrepresentation of the results (Schimmack, 2019a). The claim is based on the fact that the simple-structure model does not fit in any group. However, fit did not decrease when measurement invariance was imposed on different groups. Thus, all groups showed the same structure and misfit did not increase when measurement invariance was imposed, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

Item # N E O A C EVB ACQ
Neuroticism
depressed/blue 4 .33/.30 -.18/-.11 .19/.20 -.45/-.50 .07/.05
relaxed 9 -.71/-.72 .24/.23 .19/.18
tense 14 .52/.49 -.17/-.14 .11/.13 -.27/-.32 .20/.20
worry 19 .58/.57 -.10/-.08 .05/.07 -.22/-.22 .17/.17
emotionally stable 24 -.58/-.58 .10/.06 .25/.30 .19/.17
moody 29 .41/.38 -.26/-.25 -.30/-.38 .18/.18
calm 34 -.55/-.59 -.02/-.03 .14/.13 .12/.13 -.27/-.24 .21/.19
nervous 39 .51/.49 -.21/.26 -.10/-.10 .08/.08 -.11/-.11 -.27/-.25 .18/.17
SUM .78/.77 -.09/-.08 -.01/-.01 -.07/-.05 -.02/-.02 -.42/-.46 .05/.04
Extraversion
talkative 1 .09/.11 .69/.70 -.10/-.08 .24/.24 .19/.18
reserved 6 -.55/-.60 .08/.10 .21/.22 .19/.18
full of energy 11 .33/.32 -.09/-.04 .56/.59 .21/.20
generate enthusiasm 16 .04/.03 .44/.43 .12/.13 .48/.50 .20/.20
quiet 21 -.79/-.82 .03/.04 -.22/-.21 .17/.16
assertive 26 -.08/-.10 .39/.40 .12/.14 -.23/-.25 .18/.17 .26/.24 .20/.18
shy and inhibited 31 .19/.15 .61/.66 .23/.22 .18/.17
outgoing 36 .71/.71 .10/.07 .35/.38 .18/.18
SUM -.02/-.02 .82/.82 .04/.05 -.04/-.06 .00/.00 .45/.44 .07/.06
Openness
original 5 .50/.54 -.12/-.12 .40/.39 .22/.20
curious 10 .40/.42 -.05/-.08 .32/.30 .25/.23
ingenious 15 0.00/0.00 .60/.56 .18/.16 .10/.04 .22/.20
active imagination 20 .50/.55 -.07/-.06 -.17/-.18 .29/.26 .23/.21
inventive 25 -.07/-.08 .51/.55 -.12/-.10 .37/.34 .21/.19
value art 30 .10/.03 .43/.52 .08/.07 .17/.14 .18/.19
like routine work 35 -.27/-.27 .10/.10 .09/.15 -.22/-.21 .17/.16
like reflecting 40 -.09/-.08 .58/.58 .28/.26 .22/.20
few artistic interests 41 -.25/-.29 -.10/-.09 .16/.15
sophisticated in art 44 .03/.00 .42/.49 -.08/-.08 .09/.09 .16/.16
SUM .01/-.01 -.01/-.01 .74/.78 -.05/-.05 -.03/-.06 .38/.34 .20/.19
Agreeableness
find faults w. others 2 .14/.17 -.42/-.42 -.24/-.24 .19/.19
helpful / unselfish 7 .45/.43 .09/.11 .29/.29 .23/.23
start quarrels 12 .12/.16 .23/.18 -.49/-.49 -.07/-.08 -.24/-.24 .19/.19
forgiving 17 .49/.46 -.14/-.13 .25/.24 .20/.19
trusting 22 -.14/-.16 .38/.32 .27/.25 .21/.19
cold and aloof 27 -.20/-.18 .14/.12 .44/.46 -.34/-.37 .18/.17
considerate and kind 32 .02/.01 .62/.61 .28/.30 .22/.23
rude 37 .10/.12 .12/.12 -.62/-.62 -.13/-.08 -.23/-.23 .19/.18
like to cooperate 42 .18/.11 -.09/-.10 .43/.45 .28/.29 .23/.22
SUM -.07/-.08 .00/.00 -.07/-.07 .78/.77 .03/.03 .43/.44 .04/.04
Conscientiousness
thorough job 3 .58/.59 .29/.28 .23/.22
careless 8 -0.16 -.49/-.51 .24/.23 .19/.18
reliable worker 13 -.10/-.09 .09/.10 .55/.55 .30/.31 .24/.24
disorganized 18 .13/.16 -.58/-.59 -.21/-.20 .17/.15
lazy 23 -.52/-.51 -.45/-.45 .18/.17
persevere until finished 28 .54/.58 .27/.25 .21/.19
efficient 33 -.11/-.07 .52/.58 .30/.29 .24/.23
follow plans 38 .00/.00 -.06/-.07 .45/.44 .27/.26 .21/.20
easily distracted 43 .17/.19 .07/.06 -.53/-.53 -.22/-.22 .18/.17
SUM -.05/-.05 -.01/-.01 -.05/-.06 .04/.04 .81/.82 .43/.41 .03/.03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in fit might be considered evidence that the invariance assumption is violated. Table 2 shows that it is insufficient to examine changes in global fit indices. What matters is whether the decrease in fit has any substantial meaning. Given the results in Table 2, this is not the case.

The next model imposed scalar invariance. Before presenting the results, it is helpful to know what scalar invariance implies. Take extraversion as an example. Assume that there are no notable gender differences in extraversion. However, extraversion has multiple facets that are represented by items in the BFI. One facet is assertiveness and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. It is obvious that this is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it would require further examination of the pattern of mean differences at the level of the factors and the item residuals.
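The implication of scalar invariance for the assertiveness example can be spelled out with a few lines of arithmetic. The sketch below (Python, illustration only) uses the usual expression for a model-implied item mean, intercept plus loading times factor mean. It ignores secondary loadings for simplicity, the loading of .40 is taken from Table 1, and the intercepts are hypothetical numbers chosen for this example.

```python
def implied_item_mean(intercept, loading, factor_mean):
    """Model-implied item mean: intercept plus loading times the factor mean."""
    return intercept + loading * factor_mean


LOADING = 0.40  # assertiveness loading on extraversion, taken from Table 1
FACTOR_MEAN = {"men": 0.0, "women": 0.0}  # assumption: no gender difference in extraversion

# Scalar invariance: both groups share one (hypothetical) intercept.
shared_intercept = 3.0
invariant = {
    group: implied_item_mean(shared_intercept, LOADING, mean)
    for group, mean in FACTOR_MEAN.items()
}
diff_invariant = invariant["women"] - invariant["men"]

# Relaxed model: group-specific (hypothetical) intercepts for this one item.
free_intercepts = {"men": 3.2, "women": 3.0}
relaxed = {
    group: implied_item_mean(free_intercepts[group], LOADING, FACTOR_MEAN[group])
    for group in FACTOR_MEAN
}
diff_relaxed = relaxed["women"] - relaxed["men"]

print(diff_invariant, round(diff_relaxed, 2))
```

With a shared intercept, equal factor means force equal item means. A gender difference in assertiveness can only appear once the intercept of this item is allowed to differ between groups, which is exactly what a violation of scalar invariance would indicate.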

However, imposing scalar invariance did not produce a notable decrease in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 “valuing art” with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item-level. The pattern of mean differences at the factor level is consistent with previous studies, showing higher levels of neuroticism (d = .64) and agreeableness (d = .31), although the difference in agreeableness is relatively small compared to some other studies.

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the finding with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were small. There were no notable differences in response styles.

In sum, the results show that the BFI has reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in their own supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.

Conclusion

Hussey and Hughes accused personality researchers of validity hacking. That is, they claim that researchers do not report results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection of this claim shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and shows reasonable invariance across gender and age groups. Thus, Hussey and Hughes’s claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors’ conclusions about other measures like Rosenberg’s self-esteem scale. I am still waiting for the authors to share all of their data to examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for rating biases. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures, because the model can control for shared rating biases.