Most psychologists have at least a vague understanding of the scientific method. Somewhere they probably heard about Popper and the idea that empirical data can be used to test theories. As all theories are false, these tests should at some point lead to an empirical outcome that is inconsistent with a theory. This outcome is not a failure. It is an expected outcome of good science. It also does not mean that the theory was bad. Rather it was a temporary theory that is now modified or replaced by a better theory. And so, science makes progress….
However, psychologists do not use the scientific method popperly. Null-hypothesis significance testing adds some confusion here. After all, psychologists publish over 90% successful rejections of the nil-hypothesis. Doesn’t that show they are good Popperians? The answer is no because the nil-hypothesis is not predicted by a theory. The nil-hypothesis is useful only as something to reject in order to claim that there is a predicted relationship between two variables. Thus, psychology journals are filled with over 90% reports of findings that confirm theoretical predictions. While this may look like a major success, it actually shows a major problem. Psychologists never publish results that disconfirm a theoretical prediction. As a result, there is never a need to develop better theories. Thus, a root evil that prevents psychology from being a real science is verificationism.
The need to provide evidence for, rather than against, a theory led to the use of questionable research practices. Questionable research practices are used to report results that confirm theoretical predictions. For example, researchers may simply not report results of studies that did not reject the nil-hypothesis. Other practices can help to produce significant results by inflating the risk of a false positive result. The use of QRPs explains why psychology journals have been publishing over 90% results that confirm theoretical predictions for 60 years (Sterling, 1959). Only recently, it has become more acceptable to report studies that failed to support a theoretical prediction and question the validity of a theory. However, these studies are still a small minority. Thus, psychological science suffers from confirmation bias.
Structural Equation Modelling
Multivariate, correlational studies are different from univariate experiments. In a univariate experiment, a result is either significant or not. Thus, only tampering with the evidence can produce confirmation bias. In multivariate statistics, data are analyzed with complex statistical tools that provide researchers with flexibility in their data analysis. Thus, it is not necessary to alter the data to produce confirmatory results. Sometimes it is sufficient to analyze the data in a way that confirms a theoretical prediction without showing that alternative models fit the data equally well or better.
It is also easier to combat confirmation bias in multivariate research by fitting alternative models to the same data. Model comparison also avoids the problem of significance testing, where non-significant results are considered inconclusive, while significant results are used to confirm and cement a theory. In SEM, statistical inferences work the other way around. A model with good fit (non-significant chi-square or acceptable fit) is a possible model that can explain the data, while a model with significant deviation from the data is rejected. The reason is that the significance test (or model fit) is used to test an actual theoretical model rather than the nil-hypothesis. This forces researchers to specify an actual set of predictions and subject them to an empirical test. Thus, SEM is ideally suited to test theories popperly.
Confirmation Bias in SEM Research
Although SEM is ideally suited to test competing theories against each other, psychology journals are not used to model comparisons and tend to publish SEM research in the same flawed confirmatory way as other research is conducted and reported. For example, an article in Psychological Science this year published an investigation of the structure of personality and the hypothesis that several personality traits are linked to a bio-marker (Wright et al., 2019).
Their preferred model assumes that the Big Five traits neuroticism, agreeableness, and conscientiousness are not independent, but systematically linked by a higher-order trait called alpha or stability (Digman, 1997; DeYoung, 2007). In their model, the stability factor is linked to a marker of the serotonin (5-HT) prolactin response. This model implies that all three traits are related to the biomarker as there are indirect paths from all three traits to the biomarker that are “mediated” by the stability factor (for technical reasons the path goes from stability to the biomarker, but theoretically, we would expect the relationship to go the other way from a neurological mechanism to behaviour).
Thanks to the new world of open science, the authors shared actual MPLUS outputs of their models on OSF ( https://osf.io/h5nbu/ ). All the outputs also included the covariance matrix among the predictor variables, which made it possible to fit alternative models to the data.
Another source of confirmation bias in psychology is that literature reviews fail to mention evidence that contradicts the theory that authors try to confirm. This is pervasive and by no means a specific criticism of the authors. Contrary to the claims in the article, the existence of a meta-trait of stability is actually controversial. Digman (1997) reported some SEM results that were false and could not be reproduced (cf. Anusic et al., 2009). Moreover, alpha could not be identified when the Big Five were modelled as latent factors (Anusic et al., 2009). This led me to propose that meta-traits may be an artifact of using impure Big Five scales as indicators of the Big Five. For example, if some agreeableness items have negative secondary loadings on neuroticism, the agreeableness scale is contaminated with valid variance in neuroticism. Thus, we would observe a negative correlation between neuroticism and agreeableness even across raters (e.g., self-ratings of neuroticism and informant ratings of agreeableness). Here I fitted a model with secondary loadings and independent Big Five factors to the data. I also examined the prediction that the biomarker is related to all three Big Five traits. The alternative model had acceptable fit, CFI = .976, RMSEA = .056.
The main finding of this model is that the biomarker shows only a significant relationship with conscientiousness, while the relationship with agreeableness trended in the right direction, but was not significant (p = .089) and the relationship for neuroticism was even weaker (p = .474). Aside from the question about significance, we also have to take effect sizes into account. Given the parameter estimates, the biomarker would produce very small correlations among the Big Five traits (e.g., r(A,C) = .19 * .10 = .019). Thus, even if these relationships were significant, they would not provide compelling evidence that a source of shared variance among the three traits has been identified.
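The arithmetic behind that estimate can be sketched in a few lines. Under the tracing rule for standardized variables, the correlation that a single common cause induces between two variables is the product of the two paths. The helper name `implied_correlation` is made up for illustration; the values .19 and .10 are the ones quoted above.

```python
def implied_correlation(path_to_x: float, path_to_y: float) -> float:
    """Correlation between two standardized variables that is due to one
    common cause, given the standardized path from the cause to each."""
    return path_to_x * path_to_y


# Paths quoted in the text for agreeableness and conscientiousness.
r_ac = implied_correlation(0.19, 0.10)
print(round(r_ac, 3))
```

Because the result is a product of two small numbers, even a doubling of both paths would still imply a correlation below .1.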
The next model shows that the authors’ model ignored the stronger relationship between conscientiousness and the biomarker. When this relationship is added to the model, there is no significant relationship between the stability factor and the biomarker.
Thus, the main original finding of this study was that a serotonin related bio-marker was significantly related to conscientiousness, but not significantly related to neuroticism. This finding is inconsistent with theories that link neuroticism to serotonin, and evidence that serotonin reuptake inhibitors reduce neuroticism (at least in depressed patients). However, such results are difficult to publish because a single study with a non-significant result does not provide sufficient evidence to falsify a theory. Yet fitting data to a theory only leads to confirmation bias.
The good news is that the authors were able to publish the results of an impressive study and that their data are openly available and can provide credible information for meta-analytic evaluations of structural models of personality, while the results of this study alone are inconclusive and compatible with many different theories of personality.
One way to take more advantage of these data would be to share the covariance matrix of items to model personality structure with a proper measurement model of the Big Five traits and to avoid the problem of contaminated scale scores, which is the best practice for the use of structural equation models. These models provide no evidence for Digman’s meta-traits (Schimmack, 2019a, Schimmack, 2019b).
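The contamination problem can be illustrated with a small calculation. The sketch below assumes two perfectly independent factors (N and A), standardized items, uncorrelated residuals, and scale scores that are simple sums of items. All loadings are hypothetical and the function names are invented for this example; it is not the model fitted to the published data.

```python
from math import sqrt


def scale_stats(items):
    """items: list of (loading_on_N, loading_on_A) for standardized items.
    Returns the summed loadings on N and A and the variance of the sum score,
    assuming orthogonal unit-variance factors and uncorrelated residuals."""
    sum_n = sum(n for n, a in items)
    sum_a = sum(a for n, a in items)
    residual = sum(1 - n**2 - a**2 for n, a in items)
    return sum_n, sum_a, sum_n**2 + sum_a**2 + residual


def scale_correlation(items_x, items_y):
    """Correlation between two sum scores implied by the item loadings."""
    xn, xa, var_x = scale_stats(items_x)
    yn, ya, var_y = scale_stats(items_y)
    covariance = xn * yn + xa * ya
    return covariance / sqrt(var_x * var_y)


# Hypothetical loadings: (loading on N, loading on A).
neuroticism_items = [(0.7, 0.0)] * 4
pure_agreeableness = [(0.0, 0.7)] * 4
# Two of the four agreeableness items have a negative secondary loading on N.
impure_agreeableness = [(-0.2, 0.7)] * 2 + [(0.0, 0.7)] * 2

r_pure = scale_correlation(neuroticism_items, pure_agreeableness)
r_impure = scale_correlation(neuroticism_items, impure_agreeableness)
print(round(r_pure, 3), round(r_impure, 3))
```

With these made-up numbers the pure scales are uncorrelated, while the contaminated agreeableness scale correlates about -.11 with the neuroticism scale even though the underlying factors are independent.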
In conclusion, the main point of this post is that (a) SEM can be used to test and falsify models, (b) SEM can be used to realize that data are consistent with multiple models and that better data are needed to find the better model, (c) studies of Big Five factors require a measurement model with Big Five factors and cannot rely on messy scale scores as indicators of the Big Five, and (d) personality psychologists need better training in the use of SEM.
Personality psychology is dominated by dimensional models of personality (Funder, 2019). There is a good reason for this. Most personality characteristics vary along a continuum like height rather than being categorical like eye color. Thus, a system of personality types requires some arbitrary decisions about a cutoff point. For example, a taxonomy of body types could do a median split on height and weight to assign people to the tall-heavy or the tall-light type.
The notion that there are only three personality types is puzzling. The dominant framework in personality psychology is the Big Five model that conceptualizes personality traits as five independent continuous dimensions. If we were to create personality types by splitting each dimension at the median, it would create 32 personality types, where individuals are either above or below the median on neuroticism, extraversion, openness, agreeableness, and conscientiousness. If these five dimensions were perfectly independent of each other, we would see that individuals are equally likely to be assigned to one of the 32 types. There is no obvious way to reduce these 32 types to just 3.
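The count of 32 follows from two levels on each of five dimensions. A minimal sketch that enumerates the patterns, using capitals for above-median and lowercase letters for below-median scores (the function name `all_types` is invented for this example):

```python
from itertools import product

TRAITS = "NEOAC"


def all_types(traits: str = TRAITS) -> list[str]:
    """Every high/low pattern: capital = above median, lowercase = below."""
    return ["".join(combo) for combo in product(*[(t.lower(), t) for t in traits])]


types = all_types()
print(len(types))      # number of patterns, 2 ** 5
print(1 / len(types))  # share of each type if the dimensions are independent
```

Under a median split with perfectly independent dimensions, each of the patterns would be expected to hold 1/32 of the sample.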
Figure 1. small caps = below median, capitals = above median
So, how did Robins et al. (1996) come to the conclusion that there are only three personality types? The data were Q-sorts. A Q-sort is similar to personality ratings on a series of attributes. The main difference is that the sorting task imposes a constraint on the scores that can be given to an individual. As a result, all individuals have the same overall mean across items. That is, nobody could be above average on all attributes. Data of this kind are known as ipsative data. An alternative way to obtain ipsative data would be to subtract the overall mean of ratings from individual ratings. Although the distinction between ipsative and non-ipsative data is technically important, it has no implications for the broader understanding of Robins et al.’s work. The study could also have used ratings.
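The second route to ipsative data, subtracting each person's own mean from that person's ratings, can be sketched as follows. The ratings are hypothetical and `ipsatize` is a name chosen for this illustration.

```python
from statistics import mean


def ipsatize(ratings):
    """Subtract each person's own mean from that person's ratings."""
    centered = []
    for person in ratings:
        person_mean = mean(person)
        centered.append([score - person_mean for score in person])
    return centered


# Hypothetical ratings: one row per person, one column per attribute.
ratings = [[5, 4, 3, 4], [2, 1, 1, 4]]
ipsative = ipsatize(ratings)
print(ipsative)
```

After centering, every row sums to zero, so no person can be above average on all attributes, which is the constraint a Q-sort imposes by design.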
Robins et al. then performed a factor analysis. However, this factor analysis is different from a typical factor analysis that relies on correlations among items. Rather, the data matrix is transposed and the factor analysis is run on participants. With N = 300, there are three hundred variables and factor analysis is used to reduce this set of variables to a smaller set of factors, while minimizing the loss of information.
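To make the transposition concrete, the sketch below computes the person-by-person correlation matrix that a transposed factor analysis would take as its input. It stops before the factor extraction itself, uses made-up ratings, and the function names are assumptions for illustration rather than the procedure Robins et al. actually ran.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def person_correlations(data):
    """data[p][i] is the score of person p on item i.
    Correlations are computed between persons across items,
    not between items across persons."""
    return [[pearson(p, q) for q in data] for p in data]


# Hypothetical ratings for three people on five items.
data = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],  # same profile shape as the first person
    [5, 4, 3, 2, 1],  # mirror image of the first person
]
r = person_correlations(data)
print([[round(v, 2) for v in row] for row in r])
```

With N = 300 participants this matrix would be 300 by 300, and the factors extracted from it are item profiles that groups of participants resemble to varying degrees.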
Everybody knows that the number of factors in a factor analysis is arbitrary and that a smaller number of factors implies a loss of information.
“Empirical research on personality typologies has been hampered by the lack of clear criteria for determining the number of types in a given sample. Thus, the costs and benefits of having a large number of types must be weighed against those of having relatively few types” (Robins et al., 1996).
The authors do not report Eigenvalues or other indicators of how much variance their three factor solution explained.
The three types are described in terms of the most and least descriptive items. Type 1 can be identified by high conscientiousness (“He is determined in what he does”), high extraversion (“He is energetic and full of life”), low neuroticism (reversed: “When he is under stress, he gives up and backs off”), high agreeableness (“He is open and straightforward”), and high openness (“He has a way with words”). In short, Type 1 is everybody’s dream child; a little Superman in the making.
Type 2 is characterized by high neuroticism (“He gets nervous in uncertain situations”), introversion (reversed: “He tries to be the center of attention”), low openness (reversed: “He has a way with words”), but high agreeableness (“He is considerate and thoughtful of other people”). Conscientiousness doesn’t define this type one way or the other.
Type 3 is characterized by low neuroticism (reversed: “He is calm and relaxed; easy going”), high extraversion (“He tries to be the center of attention”), low conscientiousness (reversed: “He plans things ahead; he thinks before he does something”) and low agreeableness (“He is stubborn”).
The main problem with this approach is that these personality profiles are not types. Take Profile 1 for example. While some participants’ profiles correlated highly positively with Profile 1, other participants’ profiles correlated highly negatively with Profile 1. What personality type are they? We might say that they are the opposite of Superman, but that would imply that we need another personality type for the Anti-Supermans. The problem doesn’t end here. As there are three profiles, each individual is identified by their correlations with all three profiles. Thus, we end up with eight different types depending on whether the correlations with the three profiles are positive or negative.
In short, profiles are not types. Thus, the claim that there are only three personality types is fundamentally flawed because the authors confused profiles with types. Even the claim that there are only 8 types would rest on the arbitrary choice of extracting only three factors. Four factors would have produced 16 types and five factors would have produced 32 types, just as the Big Five model predicted.
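The counting argument can be written down directly. In this sketch a person is labelled by the sign of the correlation with each profile, so k profiles yield 2 ** k labels. The function names and the example correlations are hypothetical.

```python
def sign_type(profile_correlations):
    """Label a person by the sign of the correlation with each profile."""
    return "".join("+" if r >= 0 else "-" for r in profile_correlations)


def number_of_types(n_profiles: int) -> int:
    """Number of distinct sign patterns across n_profiles profiles."""
    return 2 ** n_profiles


# Hypothetical person: resembles Profile 1, is unlike Profile 2, weakly resembles Profile 3.
print(sign_type([0.6, -0.2, 0.1]))
print([number_of_types(k) for k in (3, 4, 5)])
```

Treating a correlation of exactly zero as positive is an arbitrary choice made here only to keep the labels binary.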
Asendorpf et al. (2001) also found three profiles that they considered to be similar to those found by Robins et al. (1996). Moreover, they examined profiles in a sample of adults with a Big Five questionnaire (i.e., the NEO-FFI). Importantly, Asendorpf et al. (2001) use better terminology and refer to profiles as prototypes rather than types.
The notion of a prototype is that there are no clear defining features that determine class membership. For example, on average mammals are heavier than birds. So we can distinguish birds and mammals by their prototypical weight (how close their weight is to the average weight of a bird or mammal) rather than on the basis of a defining feature (lays eggs, has a uterus). Figure 2 shows the prototypical Big Five profile for the three groups of participants, when participants were assigned to three groups.
The problem is once more that the grouping into three groups is arbitrary. Clearly there are individuals with high scores on agreeableness and on openness, but this variation in personality was not used to create the three groups. Based on this figure, groupings are based on low N and high C, high N and low E, and low C. It is not clear what we should do with individuals who do not match any of these prototypical profiles. What type are individuals who are high in N and high in C?
In sum, a closer inspection of studies of personality types suggests that these studies failed to address the question. Searching for prototypical item-profiles is not the same thing as searching for personality types. In addition, the question may not be a good question. If personality attributes vary mostly quantitatively and if the number of personality traits is large, the number of personality types is infinite. Every individual is unique.
Are Some Personality Types More Common Than Others?
As noted above, the number of personality types that are theoretically possible is determined by the number of attributes and the levels of each attribute. If we describe personality with the Big Five and limit the levels to being above or below the median, we have 32 theoretical patterns. However, this does not mean that we actually observe all patterns. Maybe some types never occur or are at least rare. The absence of some personality types could provide some interesting insights into the structure of personality. For example, high conscientiousness might suppress neuroticism and we would see very few individuals who are high in C and high in N (Digman, 1997). However, when C is low, we could see equal numbers of individuals with high N and low N because conscientiousness only inhibits high N, while low conscientiousness does not lead to high N. It is impossible to examine such patterns with bivariate correlations (Feger, 1988).
A simple way to examine this question is to count the frequencies of personality types (Anusic & Schimmack, unpublished manuscript that was killed in peer-review). Here, I present the results of this analysis based on Sam Gosling’s large internet survey with millions of visitors who completed the BFI (John, Naumann, & Soto, 2008).
Figure 3 simply shows the relative frequencies of the 32 personality types.
Figure 4 shows the results only for US residents. The results are very similar to those for the total sample.
The most notable finding is that the types nEOAC and Neoac are more frequent than all other types. These types are evaluatively positive or negative. However, it is important to realize that these types are not real personality types. Other research has demonstrated that the evaluative dimension in self-ratings of personality is mostly a rating or a perception bias (Anusic et al., 2009). Thus, individuals with a nEOAC profile do not have a better personality. Whether they simply rate themselves (other-deception) or actually see themselves (self-deception) as better than they are is currently unknown.
The next two types with above average frequency are nEoAC and NeOac. A simple explanation for this pattern is that openness is not highly evaluative and so some people will not inflate their openness scores, while they are still responding in a desirable way on the other four traits.
The third complementary pair are the neoAC and the NEOac types. This pattern can also be explained with rating biases because some people do not consider openness and extraversion desirable; so they will only show bias on neuroticism, agreeableness and conscientiousness. These people were called “Saints” by Paulhus and John (1998).
In short, one plausible explanation of the results is that all 32 personality types that can be created by combining high and low scores on the Big Five exist. Some types are more frequent than others, but at least some of this variation is explained by rating biases rather than by actual differences in personality.
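The bias explanation can be illustrated with simulated data. The sketch below is not a reanalysis of the internet survey: it draws five independent traits from normal distributions, optionally adds one shared evaluative bias per rater that pushes every rating toward the desirable pole, applies a median split, and counts the resulting types. The sample size, the bias strength, and all names are assumptions chosen for the example.

```python
import random
from collections import Counter
from statistics import median

TRAITS = "NEOAC"
# Direction of the desirable pole: low neuroticism, high on the other four.
DESIRABLE = {"N": -1, "E": 1, "O": 1, "A": 1, "C": 1}


def simulate_type_counts(n, bias_sd, seed=1):
    """Count median-split types in simulated (not real) self-ratings."""
    rng = random.Random(seed)
    ratings = []
    for _ in range(n):
        bias = rng.gauss(0, bias_sd)  # one evaluative bias per rater
        ratings.append([rng.gauss(0, 1) + DESIRABLE[t] * bias for t in TRAITS])
    medians = [median(column) for column in zip(*ratings)]
    labels = (
        "".join(t if score > med else t.lower()
                for t, score, med in zip(TRAITS, person, medians))
        for person in ratings
    )
    return Counter(labels)


unbiased = simulate_type_counts(20000, bias_sd=0.0)
biased = simulate_type_counts(20000, bias_sd=1.0)
print(unbiased.most_common(2))
print(biased.most_common(2))
```

Without bias the 32 types occur about equally often. With a shared evaluative bias, the uniformly desirable and uniformly undesirable patterns become the most frequent ones although the simulated traits themselves are independent.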
The main contribution of this new look at personality types is to clarify some confusion about the notion of personality types. Previous researchers used the term types for prototypical personality profiles. This is unfortunate because it led to the misleading impression that there are only three personality types. You are either resilient, over-controlled, or under-controlled. In fact, even three profiles create more than three types. Moreover, the profiles are based on exploratory factor analyses of personality ratings and it is not clear why there are only three profiles. Big Five theory would predict five profiles where each profile is defined by items belonging to one of the Big Five factors. It is not clear why profile analyses yielded only three factors. One explanation could be that the item set did not capture some personality dimensions. For example, Robins et al.’s (1996) Q-sort did not seem to include many openness items.
Based on Big Five theory, one would expect 32 personality types that are about equally frequent. An analysis of a large data set showed that all 32 types exist, which is consistent with the idea that the Big Five are fairly independent dimensions that can occur in any combination. However, some types were more frequent than others. The most frequent combination was either desirable (nEOAC) or undesirable (Neoac). This finding is consistent with previous evidence that personality ratings are influenced by a general evaluative bias (Anusic et al., 2009). Additional types with higher frequencies can be attributed to variations in desirability. Openness and extraversion are not as desirable, on average, as low neuroticism and high agreeableness and conscientiousness. Thus, the patterns nEoAC and neoAC may also reflect desirability rather than actual personality structure. Multi-method studies or low evaluative items would be needed to examine this question.
Personality psychologists are frustrated that they have discovered the Big Five factors and created a scientific model of personality, but in applied settings the Myers-Briggs Type Indicator (MBTI) dominates personality assessment (Funder, 2019).
One possible reason is that the MBTI provides simple information about personality by classifying individuals into 16 types. These 16 types are defined by being high or low on four dimensions.
There is no reason why personality psychologists could not provide simplified feedback about personality using a median split on the Big Five and assigning individuals to the 32 types that can be created by the Big Five factors. For example, I would be the NEOac type. Instead of using small caps and capitals, one could also use letters for both poles of each dimension: neurotic (N) vs. stable (S), extraverted (E) vs. introverted (I), variable (V) vs. regular (R), agreeable (A) vs. dominant (D), and conscientious (C) vs. laid back (L). This would make me an NEVDL type. My son would be an SIRAC.
I see no reason why individuals would prefer Myers-Briggs types over Big Five types, given that the Big Five types are based on a well-established scientific theory. I believe the main problem in giving individuals feedback with Big Five scores is that many people do not think in terms of dimensions.
The main problem might be that we are assigning individuals to types even when their scores are close to the median and their classification is arbitrary. For example, I am not very high on E or low on C and it is not clear whether I am really an NEVDL or an NIVDC type. One possibility would be to use only scores that are one standard deviation above or below the mean or median. This would make me an N-VD- type.
To conclude, research on personality types has not made much progress for a good reason. The number of personality types depends on the number of attributes that are being considered and it is no longer an empirical question which types exist. With fairly independent dimensions all types exist and the number of types increases exponentially with the number of attributes. The Big Five are widely considered the optimal trade-off between accuracy and complexity. Thus, they provide an appealing basis for the creation of personality types and a viable alternative to the Myers-Briggs Type Indicator.
If you want to know what type you are, you can take the BFI online ( https://www.outofservice.com/bigfive/ ). It provides feedback about your personality in terms of percentiles. To create your personality type, you only have to convert the percentiles into letters.
Negative Emotionality: P < 50 = S, P > 50 = N
Extraversion: P < 50 = I, P > 50 = E
Open-Mindedness: P < 50 = R, P > 50 = V
Agreeableness: P < 50 = D, P > 50 = A
Conscientiousness: P < 50 = L, P > 50 = C
However, keep in mind that your ratings and those of the comparison group are influenced by desirability.
If you are an NIRDL, you may have a bias to rate yourself as less desirable than you actually are.
If you are an SEVAC, you may have a tendency to overrate your desirability.
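The conversion from percentiles to letters can be sketched as a small function. The letter pairs come from the conversion rules above; the `margin` argument, the function name, and the example percentiles are assumptions added for illustration. A margin of 34 percentile points approximates a cutoff of one standard deviation only if scores are roughly normally distributed (about the 16th and 84th percentiles).

```python
# Letter pairs from the conversion rules: (below the 50th percentile, above it).
LETTERS = {
    "Negative Emotionality": ("S", "N"),
    "Extraversion": ("I", "E"),
    "Open-Mindedness": ("R", "V"),
    "Agreeableness": ("D", "A"),
    "Conscientiousness": ("L", "C"),
}


def big_five_type(percentiles, margin=0.0):
    """Convert percentile feedback into a five-letter type.
    Scores within `margin` points of the 50th percentile get "-"."""
    code = []
    for trait, (low, high) in LETTERS.items():
        p = percentiles[trait]
        if p > 50 + margin:
            code.append(high)
        elif p < 50 - margin:
            code.append(low)
        else:
            code.append("-")
    return "".join(code)


# Hypothetical percentiles, invented for this example.
example = {
    "Negative Emotionality": 90,
    "Extraversion": 60,
    "Open-Mindedness": 95,
    "Agreeableness": 10,
    "Conscientiousness": 40,
}
print(big_five_type(example))             # plain split at the 50th percentile
print(big_five_type(example, margin=34))  # only clearly high or low scores get a letter
```

A percentile of exactly 50 is left as "-" because the conversion rules do not say which letter it should receive.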
Naive and more sophisticated conceptions of science assume that empirical data are used to test theories and that theories are abandoned when data do not support them. Psychological journals give the impression that psychologists are doing exactly that. Journals are filled with statistical hypothesis tests. However, hypothesis tests are not theory tests because only results that confirm a theoretical prediction (by falsifying the null-hypothesis) get published; p < .05 (Sterling, 1959). As a result, psychology journals are filled with theories that have never been properly tested. Chances are that some of these theories are false.
To move psychology towards being a science, it is time to subject theories to empirical tests and to replace theories that do not fit the data with theories that do. I have argued elsewhere already that higher-order models of personality are a bad idea with little empirical support (Schimmack, 2019a). Colin DeYoung responded to this criticism of his work (DeYoung, 2019). In this blog post, I present a new approach to the testing of structural theories of personality with confirmatory factor analysis (CFA). The advantage of CFA is that it is a flexible statistical method that can formalize a variety of competing theories. Another advantage of CFA is that it is possible to capture and remove measurement error. Finally, CFA provides fit indices that make it possible to compare models and to select models that fit the data better. Although CFA celebrates its 50th birthday this year, psychologists still have to appreciate its potential for testing personality theories (Joreskog, 1969).
What are Higher-Order Factors?
The notion of a factor has a clear meaning in psychology. A factor is a common cause that explains, at least in a statistical sense, why several variables are correlated with each other. That is, a factor represents the shared variance among several variables that is assumed to be caused by a common cause rather than by direct causation among the variables.
In traditional factor analysis, factors explain correlations among observed variables such as personality ratings. The notion of higher-order factors implies that first-order factors that explain correlations among items are correlated (i.e., not independent) and that these correlations among factors are explained by another set of factors, which are called higher-order factors.
In empirical tests of higher-order factors it has been overlooked that the Big Five factors are already higher-order factors in a hierarchy of personality traits that explain correlations among more specific personality traits like sociability, curiosity, anxiety, or impulsiveness. Instead ALL tests of higher-order models have relied on items or scales that measure the Big Five. This makes it very difficult to study the higher-order structure of personality because results will vary depending on the selection of items that are used to create Big Five scales.
A much better way to test higher-order models is to fit a hierarchical CFA model to data that represent multiple basic personality traits. A straightforward prediction of a higher-order model is that all or at least most facets that belong to a common higher order factor should be correlated with each other.
For example, Digman (1997) and DeYoung (2006) suggested that extraversion and openness are positively correlated because they are influenced by a common factor, called beta or plasticity. As extraversion is conceived as a common cause of sociability, assertiveness, and cheerfulness and openness is conceived as a common cause of being imaginative, artistic, and reflective, the model makes the straightforward prediction that sociability, assertiveness, and cheerfulness are positively correlated with being imaginative, artistic, and reflective.
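This prediction can be made explicit with the tracing rule. In a model where two traits are linked only through one higher-order factor and facets are linked only through their trait, the implied correlation between two facets is the product of the four loadings along the path. All loadings below are hypothetical and the function name is invented for the sketch.

```python
from itertools import product


def predicted_cross_correlations(loadings_x, loadings_y, trait_x_on_factor, trait_y_on_factor):
    """Facet correlations implied by a higher-order model in which two traits
    share one higher-order factor and facets have no other links."""
    trait_r = trait_x_on_factor * trait_y_on_factor  # implied trait correlation
    return {
        (fx, fy): lx * trait_r * ly
        for (fx, lx), (fy, ly) in product(loadings_x.items(), loadings_y.items())
    }


# Hypothetical standardized loadings of facets on their Big Five factor.
extraversion = {"sociability": 0.7, "assertiveness": 0.6, "cheerfulness": 0.6}
openness = {"imaginative": 0.7, "artistic": 0.6, "reflective": 0.5}

# Hypothetical loadings of extraversion and openness on plasticity.
predicted = predicted_cross_correlations(extraversion, openness, 0.5, 0.5)
for pair, r in predicted.items():
    print(pair, round(r, 3))
```

The point of the exercise is the pattern rather than the size of the numbers: every one of the nine cross-trait facet correlations should be positive, and their relative size should follow the facet loadings.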
One problem in testing structural models of personality is that personality ratings are imperfect indicators of personality. Some of the measurement error in personality ratings is random, but other sources of variance are systematic. Two sources have been reliably identified, namely acquiescence and evaluative bias (Anusic et al., 2009; Biderman et al., 2019). DeYoung (2006) also found evidence for evaluative bias in a multi-rater study. Thus, there is agreement between DeYoung and me that some of the correlations among personality ratings do not reflect the structure of personality, but rather systematic measurement error. It is necessary to control for these method factors when studying the structure of personality traits and to examine the correlation among Big Five traits because method factors distort these correlations in mono-method studies. In two previous posts, I found no evidence of higher-order factors when I fitted hierarchical models to the 30 facets of the NEO-PI-R and another instrument with 24 facets (Schimmack, 2019b, 2019c). Here I take another look at this question by examining more closely the pattern of correlations among personality facets before and after controlling for method variance.
From 2010 to 2012 I posted a personality questionnaire with 303 items on the web. Visitors were provided with feedback about their personality on the Big Five dimensions and specific personality facets. Earlier I presented a hierarchical model of these data with three items per facet (Schimmack, 2019). Subsequently, I examined the loadings of the remaining items on these facets. Here I present results for 179 items with notable loadings on one of the facets (Item.Loadings.303.xlsx; when you open the file in Excel, selected items are highlighted in green). The use of more items per facet makes the measurement model of facets more stable and ensures more stable facet correlations that are more likely to replicate across studies with different item sets. The covariance matrix for all 303 items is posted on OSF (web303.N808.cov.dat) so that the results presented below can be reproduced.
I first constructed a measurement model. The aim was not to test a structural model, but to find a measurement model that can be used to test structural models of personality. Using CFA for exploration seems to contradict its purpose, but reading the original article by Joreskog shows that this approach is entirely consistent with the way he envisioned CFA to be used. It is unclear to me who invented the idea that CFA should follow an EFA analysis. This makes little sense because EFA may not fit some data if there are hierarchical relationships or correlated residuals. So, CFA modelling has to start with a simple theoretical model that then may need to be modified to fit some data, which leads to a new model to be tested with new data.
To develop a measurement model with reasonable fit to the data, I started with a simple model where items had fixed primary loadings and no secondary loadings, while all factors were allowed to be correlated with each other. This is a simple structure model. It is well known that this model does not fit real data. I then modified the model based on modification indices that suggested (a) adding secondary loadings, (b) relaxing the constraint on a primary loading, or (c) adding correlated item residuals. This way a model with reasonable fit to the data was obtained, CFI = .775, RMSEA = .040, SRMR = .042 (M0.Measurement.Model.inp on OSF). Although CFI was below the standard criterion of .95, model fit was considered acceptable because the only source of misfit to the model would be additional small secondary loadings (< .2) or correlated residuals that have little influence on the magnitude of the facet correlations.
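For readers who want to see how two of these fit indices relate to the chi-square statistics, here is a minimal sketch based on the common textbook formulas. Software packages differ in details (for example, whether N or N - 1 appears in the RMSEA denominator), and the chi-square values, degrees of freedom, and sample size below are hypothetical; they are not taken from the model reported here.

```python
from math import sqrt


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 variant)."""
    return sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index: misfit of the model relative to the misfit
    of a baseline model in which all variables are uncorrelated."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


# Hypothetical values for illustration only.
print(round(rmsea(300, 100, 500), 3))
print(round(cfi(300, 100, 2100, 120), 3))
```

Because CFI depends on how badly the baseline model fits, it can be low when many item correlations are small, even if RMSEA suggests little misfit per degree of freedom.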
Below I present the correlations among the facets. The full correlation matrix is broken down into sections that are theoretically meaningful. The first five tables show the correlations among facets that share the same Big Five factor.
There are three main neuroticism facets: anxiety, anger/hostility, and depression. A fourth facet was originally intended to be an openness to emotions facet, but it correlated more highly with neuroticism (Schimmack, 2009c). All four facets show positive correlations with each other and most of these correlations are substantial, except for the correlation between the strong emotions and depression facets.
Results for extraversion show that all five facets are positively correlated with each other. All correlations are greater than .3, but none of the correlations are so high as to suggest that they are not distinct facets.
Openness facets are also positively correlated, but some correlations are below .2, and one correlation is only .16, namely the correlation between openness to activities and art.
The correlations among agreeableness facets are more variable and the correlation between modesty and trust is slightly negative, r = -.05. The core facet appears to be caring which shows high correlations with morality and forgiveness.
All correlations among conscientiousness facets are above .2. Self-discipline shows high correlations with competence beliefs and achievement striving.
Overall, these results are consistent with the Big Five model.
The next tables examine correlations among sets of facets belonging to two different Big Five traits. According to Digman and DeYoung’s alpha-beta model, extraversion and openness should be correlated. Consistent with this prediction, the average correlation is r = .16. For ease of interpretation all correlations above .10 are highlighted in grey, showing that most correlations are consistent with predictions. However, the value facet of openness shows lower correlations with extraversion facets. Also, the excitement seeking facet of extraversion is more strongly related to openness facets than other facets.
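As an illustration of how an average cross-domain correlation like the one above can be obtained, the sketch below takes the mean of the off-diagonal block of a facet correlation matrix. The 4 x 4 matrix is a toy example with invented values, and the simple arithmetic mean is an assumption; other choices, such as averaging Fisher-z transformed values, are possible.

```python
import numpy as np


def mean_cross_block(corr: np.ndarray, rows: list[int], cols: list[int]) -> float:
    """Mean correlation between the facets in `rows` and the facets in `cols`."""
    block = corr[np.ix_(rows, cols)]
    return float(block.mean())


# Toy facet correlation matrix (invented values):
# facets 0-1 belong to one Big Five domain, facets 2-3 to another.
corr = np.array([
    [1.00, 0.40, 0.20, 0.10],
    [0.40, 1.00, 0.25, 0.05],
    [0.20, 0.25, 1.00, 0.35],
    [0.10, 0.05, 0.35, 1.00],
])

domain_a, domain_b = [0, 1], [2, 3]
avg = mean_cross_block(corr, domain_a, domain_b)

# Share of cross-domain correlations above the .10 highlighting threshold.
share_above = float((corr[np.ix_(domain_a, domain_b)] > 0.10).mean())

print(round(avg, 2), share_above)
```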
The alpha-beta model also predicts negative correlations among neuroticism and agreeableness facets. Once more, the average correlation is consistent with this prediction, r = -.15. However, there is also variation in correlations. In particular, the anger facet is more strongly negatively correlated with agreeableness facets than other neuroticism facets.
As predicted by the alpha-beta model, neuroticism facets are negatively correlated with conscientiousness facets, average r = -.21. However, there is variation in these correlations. Anxiety is less strongly negatively correlated with conscientiousness facets than other neuroticism facets. Maybe, anxiety sometimes has similar effects as conscientiousness by motivating people to inhibit approach motivated, impulsive behaviors. In this context, it is noteworthy that I found no strong loading of impulsivity on neuroticism (Schimmack, 2019c).
The last pair are agreeableness and conscientiousness facets, which are predicted to be positively correlated. The average correlation is consistent with this prediction, r = .15.
However, there is notable variation in these correlations. A2-Morality is more strongly positively correlated with conscientiousness facets than other agreeableness facets, in particular trust and modesty, which show weak correlations with conscientiousness.
The alpha-beta model also makes predictions about other pairs of Big Five facets. As alpha and beta are conceptualized as independent factors, these correlations should be weaker than those in the previous tables and close to zero. However, this is not the case.
First, the average correlation between neuroticism and extraversion is negative and nearly as strong as the correlation between neuroticism and agreeableness, r = -.14. In particular, depression is strongly negatively related to extraversion facets.
The average correlation between extraversion and agreeableness facets is only r = .07. However, there is notable variability. Caring is more strongly related to extraversion than other agreeableness facets, especially with warmth and cheerfulness. Cheerfulness also tends to be more strongly correlated with agreeableness facets than other extraversion facets.
Extraversion and conscientiousness facets are also positively correlated, r = .15. Variation is caused by stronger correlations for the competence and self-discipline facets of conscientiousness and the activity facet of extraversion.
Openness facets are also positively correlated with agreeableness facets, r = .10. There is a trend for the O1-Imagination facet of openness to be more consistently correlated with agreeableness facets than other openness facets.
Finally, openness facets are also positively correlated with conscientiousness facets, r = .09. Most of this average correlation can be attributed to stronger positive correlations of the O4-Ideas facet with conscientiousness facets.
In sum, the Big Five facets from different Big Five factors are not independent. Not surprisingly, a model with five independent Big Five factors reduced model fit from CFI = .775, RMSEA = .040 to CFI = .729, RMSEA = .043. I then fitted a model that allowed for the Big Five factors to be correlated without imposing any structure on these correlations. This model improved fit over the model with independent dimensions, CFI = .734, RMSEA = .043.
The pattern of correlations is consistent with a general evaluative factor rather than a model with independent alpha and beta factors.
Not surprisingly, fitting the alpha-beta model to the data reduced model fit, CFI = .730, RMSEA = .043. In comparison, a model with a single evaluative bias factor had better fit, CFI = .732, RMSEA = .043.
In conclusion, the results confirm previous studies that a general evaluative dimension produces correlations among the Big Five factors. DeYoung’s (2006) multi-method study and several other multi-method studies demonstrated that this dimension is mostly rater bias because it shows no convergent validity across raters.
Facet Correlations with Method Factors
To remove the evaluative bias from correlations among facets, it is necessary to model evaluative bias at the item level. That is, all items load on an evaluative bias factor. This way the shared variance among indicators of a facet reflects only facet variance and no evaluative variance. I also included an acquiescence factor, although acquiescence has a negligible influence on facet correlations.
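The logic of an item-level bias factor can be sketched with the model-implied correlation matrix of a small toy model. The loadings below are invented standardized values, not estimates from the data, and the toy model uses equal standardized bias loadings rather than the constraint used in the actual Mplus model.

```python
import numpy as np

# Columns: facet A, facet B, evaluative bias (all values are hypothetical).
lam = np.array([
    [0.7, 0.0, 0.3],   # item 1 of facet A
    [0.7, 0.0, 0.3],   # item 2 of facet A
    [0.0, 0.6, 0.3],   # item 1 of facet B
    [0.0, 0.6, 0.3],   # item 2 of facet B
])
phi = np.eye(3)        # facet A, facet B, and bias are mutually independent

common = lam @ phi @ lam.T
theta = np.diag(1.0 - np.diag(common))   # residual variances for unit item variances
sigma = common + theta                   # model-implied item correlation matrix

within_facet = sigma[0, 1]   # shared facet variance plus shared bias variance
cross_facet = sigma[0, 2]    # shared bias variance only

print(round(within_facet, 2), round(cross_facet, 2))
```

In this toy model the two facets are independent, yet items from different facets still correlate because they share bias variance. If the bias factor were omitted, this covariance would have to be absorbed by a correlation between the facet factors.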
It is not possible to let all facets correlate freely when method factors are included in a model because this model is not identified. To allow for a maximum of theoretically important facet correlations, I freed parameters for facets that belong to the same Big Five factor, facets that are predicted to be correlated by the alpha-beta model, and additional correlations that were suggested by modification indices. Loadings on the evaluative bias factor were constrained to 1 unless modification indices suggested that items had stronger or weaker loadings on the evaluative bias factor. This model fitted the data as well as the original measurement model, CFI = .778 vs. .775, RMSEA = .040 vs. .040. Moreover, modification indices did not suggest any further correlations that could be freed to improve model fit.
The main effect of controlling for evaluative bias is that all facet correlations were reduced. However, it is particularly noteworthy to examine the correlations that are predicted by the alpha-beta model.
The average correlation for extraversion and openness facets is r = .07. This average is partially driven by stronger correlations of the excitement seeking facet with openness facets than other extraversion facets. There are only four other correlations above .10, and 9 of the 25 correlations are negative. Thus, there is little support for a notable general factor that produces positive correlations between extraversion and openness facets.
The average correlation for neuroticism and agreeableness is r = -.06. However, the pattern shows mostly strong negative correlations for the anger facet of neuroticism with agreeableness facets. In addition, there is a strong positive correlation between anxiety and morality, r = .20. This finding suggests that anxiety may also serve the function to inhibit immoral behavior.
The average correlation for neuroticism and conscientiousness is r = -.07. While there are strong negative correlations, r = -.30 for anger and deliberation, there is also a strong positive correlation, r = .22 for self-discipline and anxiety. Thus, the relationship between neuroticism and conscientiousness facets is complex.
The average correlation for agreeableness and conscientiousness facets is r = .01. Moreover, none of the correlations exceeded r = .10. This finding suggests that agreeableness and conscientiousness are independent Big Five factors, which contradicts the prediction by the alpha-beta model.
The finding also raises questions about the small but negative correlations of neuroticism with agreeableness (r = -.06) and conscientiousness (r = -.07). If these correlations were reflecting the influence of a common factor alpha that influences all three traits, one would expect a positive relationship between agreeableness and conscientiousness. Thus, these relationships may have another origin, or there is some additional negative relationship between agreeableness and conscientiousness that cancels out a potential influence of alpha.
Removing method variance also did not eliminate relationships between facets that are not predicted to be correlated by the alpha-beta model. The average correlation between neuroticism and extraversion facets is r = -.05, which is small, but not notably smaller than the predicted correlations (r = .01 to .07).
Moreover, some of these correlations are substantial. For example, excitement seeking is negatively related to anxiety (r = -.24) and warmth is negatively related to depression (r = -.22). Any structural model of personality structure needs to take these findings into account.
A Closer Examination of Extraversion and Openness
There are many ways to model the correlations among extraversion and openness facets. Here I demonstrate that the correlation between extraversion and openness depends on the modelling of secondary loadings and correlated residuals. The first model allowed for extraversion and openness to be correlated. It also allowed for all openness facets to load on extraversion and for all extraversion facets to load on openness. Residual correlations were fixed to zero. This model is essentially an EFA model.
Model fit was as good as for the baseline model, CFI = .779 vs. .778, RMSEA = .039 vs. .040. The pattern of secondary loadings showed two notable positive loadings. Excitement seeking loaded on openness and openness to activities loaded on extraversion. In this model the correlation between extraversion and openness was .08, SE = .17. Thus, the positive correlation in the model without secondary loadings was caused by not modelling the pattern of secondary loadings.
However, it is also possible to fit a model that produces a strong correlation between extraversion and openness. To do so, the secondary loadings of excitement seeking and openness to actions can be fixed to zero. This pushes other secondary loadings to be negative, which is compensated by a positive correlation between extraversion and openness. This model has the same overall fit as the previous model, both CFI = .779, both RMSEA = .039, but the correlation between extraversion and openness jumps to r = .70. The free secondary loadings are all negative.
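The reason two such models can fit equally well can be sketched with the general indeterminacy of factor models in which every facet may load on both factors. The example below is not the Mplus specification used here. It uses randomly generated, hypothetical loadings and shows that a factor correlation of .08 and one of .70 reproduce exactly the same implied covariances once the loadings are transformed accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings of 10 facets on two factors (extraversion, openness).
lam_a = rng.uniform(0.0, 0.6, size=(10, 2))


def factor_corr(r: float) -> np.ndarray:
    """2 x 2 factor correlation matrix with unit variances."""
    return np.array([[1.0, r], [r, 1.0]])


phi_a = factor_corr(0.08)
phi_b = factor_corr(0.70)

# Transformation T with T @ phi_a @ T.T == phi_b, built from Cholesky factors.
t = np.linalg.cholesky(phi_b) @ np.linalg.inv(np.linalg.cholesky(phi_a))

# Compensating change in the loadings.
lam_b = lam_a @ np.linalg.inv(t)

implied_a = lam_a @ phi_a @ lam_a.T
implied_b = lam_b @ phi_b @ lam_b.T

print(np.allclose(implied_a, implied_b))
```

Because both parameterizations imply the same covariances, fit statistics cannot distinguish them. Which factor correlation is estimated depends on the constraints that are placed on the secondary loadings.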
The main point of this analysis is to show the importance of facet correlations for structural theories of personality traits. In all previous studies, including my own, the higher-order structure was examined using Big Five scales. However, the correlation between an Extraversion Scale and an Openness Scale provides insufficient information about the relationship between the Extraversion Factor and the Openness Factor because scales always confound information about secondary loadings, residual correlations, and factor correlations.
The goal for future research is to find ways to test competing structural models. For example, the second model suggests that any interventions that increase extraversion would decrease openness to ideas, while the first model does not make this prediction.
Personality psychologists have developed and tested structural models of personality traits for nearly a century. In the 1980s, the Big Five factors were identified. The Big Five have been relatively robust in future replication attempts and emerged also in this investigation. However, there has been little progress in developing and testing hierarchical models of personality that explain what the Big Five are and how they are related to more specific personality traits called facets. There have also been attempts to find even broader personality dimensions. An influential article by Digman (1997) proposed that a factor called alpha produces correlations among neuroticism, agreeableness, and conscientiousness, while a beta factor links extraversion and openness. As demonstrated before, Digman’s results could not be reproduced and ignored evaluative bias in personality ratings (Anusic et al., 2009). Here, I show that empirical tests of higher-order models need to use a hierarchical CFA model because secondary loadings create spurious correlations among Big Five scales that distort the pattern of correlations among the Big Five factors. Based on the present results, there is no evidence for Digman’s alpha and beta factors.
Psychological science has a replication crisis. Many textbook findings, especially in social psychology, failed to replicate over the past years. The reason for these surprising replication failures is that psychologists have used questionable research practices to produce results that confirm theories rather than using statistical methods to test theories and to let theories fail if the evidence does not support them. However, falsification of theories is a sign of scientific progress and it is time to subject psychological theories to real tests.
In personality psychology, the biggest theory is Big Five theory. In short, Big Five theory postulates that variation in personality across individuals can be described with five broad personality dimensions: neuroticism, extraversion, openness, agreeableness, and conscientiousness. Textbooks also claim that the Big Five are fairly universal and can be demonstrated in different countries and with different languages.
One limitation of these studies is that they often use vague criteria to claim that the Big Five have been found or that personality traits are universal. Psychometricians have developed rigorous statistical methods to test these claims, but these methods are rarely used by personality psychologists to test Big Five theory. Some personality psychologists even claimed that these methods should not be used to test Big Five theory because they fail to support the Big Five (McCrae et al., 1996). I have argued that it is time to test Big Five theory with rigorous methods and to let the data decide whether the Big Five exist or not. Other personality psychologists have also started to subject Big Five theory to more rigorous tests (Soto & John, 2017).
Big Five Theory
Big Five theory does not postulate that there are only five dimensions of personality. Rather, it starts with the observation that many important personality traits have been identified and labeled in everyday life. There are hundreds of words that describe individual differences such as helpful, organized, friendly, anxious, curious, or thoughtful. Big Five theory postulates that these personality traits are systematically related to each other. That is, organized individuals are also more likely to be thoughtful and anxious individuals are also more likely to be irritable and sad. Big Five theory explains these correlations among personality traits with five independent factors, that is, broad personality traits that cause covariation among more specific traits that are often called facets. For example, a general disposition to experience more intense unpleasant feelings may produce correlations among the dispositions to experience more anxiety, anger, and sadness.
The main prediction of Big Five theory is that the pattern of correlations among personality traits should be similar across different samples and measurement instruments, and that factor analysis produces the same pattern of correlations.
Testing Big Five Theory
A proper psychometric test of Big Five theory requires a measurement model of the facets. If facets are not measured properly, it is impossible to examine the pattern of correlations among the facets. Thus, a proper test of Big Five theory requires fitting a hierarchical model to personality ratings. The first level of the hierarchy specifies a fairly large number of facets that are supposed to be related to one or more Big Five dimensions. The second level of the hierarchy specifies the Big Five factors so that it is possible to examine the relationships (factor loadings) of the facets on the Big Five factors. At present, very few studies have tried to test Big Five theory with hierarchical models. Soto and John (2017) tested Big Five theory with three facets for each Big Five domain and found reasonably good fit for their hierarchical models (see Figure for their model).
Although Soto and John’s (2017) article is a step in the right direction, it does not provide a thorough test of Big Five theory for several reasons. First, the model allows facets to correlate freely rather than testing the prediction that these correlations are produced by a Big Five factor. Second, models with only three indicators have zero degrees of freedom and produce perfect fit to the data. Thus, more than three facets are needed to test the prediction that Big Five factors account for the pattern of correlations among facets. Third, the model was fitted separately for each Big Five domain. Thus, there is no information about the relationship of facets to other Big Five factors. For example, the anger facet of neuroticism tends to show negative loadings on agreeableness. Whether such relationships are consistent across datasets is also important to examine.
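The point about three indicators can be checked by counting. The sketch below assumes a one-factor model for the covariance matrix with the factor variance fixed to 1, so that each indicator contributes one free loading and one free residual variance.

```python
def one_factor_df(n_indicators: int) -> int:
    """Degrees of freedom of a one-factor model with the factor variance fixed to 1."""
    # Unique variances and covariances among the indicators.
    moments = n_indicators * (n_indicators + 1) // 2
    # One loading and one residual variance per indicator.
    free_params = 2 * n_indicators
    return moments - free_params


for p in (3, 4, 5):
    print(p, one_factor_df(p))
# 3 0
# 4 2
# 5 5
```

With three indicators the model is just identified, so it reproduces the observed covariances perfectly and cannot be rejected by the data. From four indicators onward the model implies testable constraints.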
In a recent blog post, I tried to fit a Big Five model to the 30 facets of the NEO-PI-R (Schimmack, 2019a). Table 1 shows the factor loadings of the 30 facets on the Big Five factors.
The results were broadly consistent with the theoretical Big Five structure. However, some facets did not show the predicted pattern. For example, excitement seeking did not load on extraversion. Other facets had rather weak loadings. For example, the loading of impulsivity on neuroticism implies that less than 10% of the variance in impulsivity is explained by neuroticism. These results do not falsify Big Five theory by any means, but they provide a basis for further theory development and refinement of personality theory. Before any revisions of Big Five theory are made, it is important to examine the replicability of the factor structure in hierarchical measurement models.
From 2010 to 2012 I posted a personality questionnaire with 303 items on the web. Visitors were provided with feedback about their personality on the Big Five dimensions and the 30 facets. In addition to the items that were used to provide feedback, the questionnaire contained several additional items that might help to improve measurement of facets. Furthermore, the questionnaire included some items about life-satisfaction because I have an interest in the relationship between personality traits and life-satisfaction. The questionnaire also included four questions about desirable attributes, namely attractiveness, intelligence, fitness, and broad knowledge. These questions have been used before to demonstrate evaluative biases in personality ratings (Anusic et al., 2009).
Simple correlations and factor analysis were used to identify three indicators of the 30 NEO-facets. Good indicators should show moderate and similar correlations to each other. During this stage it became apparent that the item-set failed to capture the self-consciousness facet of neuroticism and the dutifulness facet of conscientiousness. Thus, these two facets could not be included in the model.
I first fitted a model that allowed the 28 facets to be correlated freely. This model evaluates the measurement of the 28 facets and provides information about the pattern of correlations among facets that Big Five theory aims to explain. This model showed very high correlations between the neuroticism facet anger and the agreeableness facet compliance (r = .8), the warmth and gregariousness facets of extraversion (r = 1), and the anxiety and vulnerability facets of neuroticism (r = .8). These correlations raise concerns about the discriminant validity of these facets and create problems in fitting a hierarchical model to the data. Thus, I dropped the vulnerability, gregariousness, and compliance items and facets from the model. The final model had 83 items, 26 facets, and factors for life-satisfaction and for desirable attributes.
The fit of the model with correlated factors was acceptable according to the RMSEA fit index, but below standard criteria for the CFI. However, examination of the modification indices showed that only freeing weak secondary loadings would further improve model fit. Thus, the model was considered adequate and the fit of this model served as a comparison standard for the hierarchical model, CFI = .867, RMSEA = .037.
The hierarchical model specified the Big Five as independent factors. In addition, it included an acquiescence factor and an evaluative bias factor. The evaluative bias factor was allowed to correlate with the desirable attribute factor (cf. Anusic et al., 2009). Life-satisfaction was regressed on the facets, but only significant facets were retained in the final model. Model fit of this model was comparable to model fit of the baseline model, CFI = .861, RMSEA = .040.
The first set of tables shows the item loadings on the primary factor, the evaluative bias factor, the acquiescence factor, and secondary loadings. Facet names are based on the NEO-PI-R.
Primary loadings of neuroticism items on their facets are generally high and secondary loadings are small (Table 1). Thus, the four neuroticism facets were clearly identified.
The measurement model for the extraversion facets also showed high primary loadings (Table 2).
For openness, one item of the values facet had a low loading, but the other two items clearly identified the facet (Table 3).
All primary loadings for the four agreeableness facets were above .4 and the facets were clearly identified (Table 4).
Finally, the conscientiousness facets were also clearly identified (Table 5).
In conclusion, the measurement model for the 24 facets showed that all facets were measured well and that the item content matches the facet labels. Thus, the basic requirement for examining the structural relationships among facets has been met.
The most important results are the loadings of the 24 facets on the Big Five factors. These results are shown in Table 6. To ease interpretation, primary loadings greater than .4 and secondary loadings greater than .2 are printed in bold. Results that are consistent with the previous study of the NEO-PI-R are highlighted with a gray background.
15 of the 24 primary loadings are above .4 and replicate results with the NEO-PI-R. This finding provides support for the Big Five model.
There are also four consistent loadings below .4, namely for the impulsivity facet of neuroticism, the openness facets feelings and actions, and the agreeableness facet of trust. This finding suggests that the model needs to be revised. Either there are measurement problems or openness is not related to emotions and actions. Problems with agreeableness have led to the creation of an alternative model with six factors (Ashton & Lee) that distinguishes two aspects of agreeableness.
There are eight facets with inconsistent loading patterns. Activity loaded with .34 on extraversion. More problematic was the high secondary loading on conscientiousness, suggesting that activity level is not clearly related to a single Big Five trait. The results for excitement seeking were particularly different. Excitement seeking did not load at all on extraversion in the NEO-PI-R model, but the loading in this sample was high.
The loading for values on openness was .45 for the NEO-PI-R and .29 in this sample. Thus, the results are more consistent than the use of the arbitrary .4 value suggests. Values are related to openness, but the relationship is weak and openness explains no more than 20% of the variance in holding liberal versus conservative values.
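The percentages above follow from squaring the loadings. The short sketch below assumes standardized loadings and a single factor as the predictor, in which case the squared loading is the proportion of facet variance explained by that factor.

```python
import math


def variance_explained(loading: float) -> float:
    """Proportion of variance explained by one factor, given a standardized loading."""
    return loading ** 2


for label, loading in [("NEO-PI-R", 0.45), ("this sample", 0.29)]:
    print(f"{label}: loading {loading:.2f} -> {variance_explained(loading):.3f}")

# Loading needed for a factor to explain 10% of the variance in a facet.
threshold_10_percent = math.sqrt(0.10)
print(round(threshold_10_percent, 3))
```

A loading of .45 corresponds to about 20% explained variance and a loading of .29 to about 8%, so the two estimates differ less in substantive terms than the .4 cutoff suggests. Any loading below about .32 implies less than 10% explained variance.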
For agreeableness, compliance was represented by the anger facet of neuroticism, which did not load above .4 on agreeableness although the loading came close (-.32). Thus, the results do not require a revision of the model based on the present data. The biggest problem was that the modesty facet did not load highly on the agreeableness factor (.21). More research is needed to examine the relationship between modesty and agreeableness.
Results for conscientiousness were fairly consistent with theory and the loading of the deliberation facet on conscientiousness was just shy of the arbitrary .4 criterion (.37). Thus, no revision to the model is needed at this point.
The consistent results provide some robust markers of the Big Five, that is facets with high primary and low secondary loadings. Neuroticism is most clearly related to high anxiety and high vulnerability (sensitivity to stress), which could not be distinguished in the present dataset. However, other negative emotions like anger and depression are also consistently related to neuroticism. This suggests that neuroticism is a general disposition to experience more unpleasant emotions. Based on structural models of affective experiences, this dimension is related to Negative Activation and tense arousal.
Extraversion is consistently related to warmth and gregariousness, which could not be separated in the present dataset, as well as to assertiveness and positive emotions. One interpretation of this finding is that extraversion reflects positive activation or energetic arousal, which is a distinct affective system that has been linked to approach motivation. Structural analyses of affect also show that tense arousal and energetic arousal are two separate dimensions. The inconsistent results for activity may reflect the fact that activity levels can also be influenced by more effortful forms of engagement that are related to conscientiousness. Extraverted activity might be more intrinsically motivated by positive emotions, whereas conscientious activity is more driven by extrinsic motivation.
The core facets of openness are fantasy/imagination, artistic interests, and intellectual engagement. The common feature of these facets is that attention is directed inward. The mind is traveling rather than being focused on the immediate situation. A better way to capture openness to feelings might be to distinguish feelings that arise from actual events from emotions that are elicited by imagined events or art.
The core features of agreeableness are straightforwardness, altruism, and tender-mindedness, which could not be separated from altruism in this dataset. The common feature is a concern about the well-being of others versus a focus on the self. Trust may not load highly on agreeableness because the focus of trust is the individual’s own well-being. Trusting others may be beneficial for the self if others are trustworthy or harmful if they are not. However, it is not necessary to trust somebody in need to help them unless helping is risky.
The core feature of conscientiousness is self-discipline. Order and achievement striving (which is not the best label for this facet) also imply that conscientious people can stay focused on things that need to be done and are not easily distracted. Conscientious people are more likely to follow a set of rules.
There are only a few consistent secondary loadings. This suggests that many secondary loadings may be artifacts of item selection or participant selection. The few consistent secondary loadings make theoretical sense. Anger loads negatively on agreeableness and this facet is probably best considered a blend of high neuroticism and low agreeableness.
Assertiveness shows a positive loading on conscientiousness. The reason may be that conscientious people have a strong motivation to follow their internal norms (a moral compass) and that they are willing to assert these norms if necessary.
The openness to feelings facet has a strong “secondary” loading on neuroticism, suggesting that this facet does not measure what it was intended to measure. It just seems to be another measure of strong feelings without specifying the valence.
Finally, deliberation is negatively related to extraversion. This finding is consistent with the speculation that extraversion is related to approach motivation. Extraverts are more likely to give in to temptations, whereas conscientious individuals are more likely to resist temptations. In this context, the results for impulsivity are also noteworthy. At least in this dataset, impulsive behaviors that were mostly related to eating were more strongly related to extraversion than to neuroticism. If neuroticism is strongly related to anxiety and avoidance behavior, it is not clear why neurotic individuals would be more impulsive. However, if extraversion reflects approach motivation, it makes more sense that extraverts are more likely to indulge.
Of course, these interpretations of the results are mere speculations. In fact, it is not even clear that the Big Five account for the correlations among facets. Some researchers have proposed that facets may directly influence each other. While this is possible, such models will have to explain the consistent pattern of correlations among facets. The present results are at least consistent with the idea that five broader dispositions explain some of these correlations.
Neuroticism is a broad disposition to experience a range of unpleasant emotions, not just anxiety.
Extraversion is a broad disposition to be actively and positively engaged.
Openness is a broad disposition to focus on internal processes (thoughts and feelings) rather than on external events and doing things.
Agreeableness is a broad disposition to be concerned with the well-being of others as opposed to being only concerned with oneself.
Conscientiousness is a broad disposition to be guided by norms and rules as opposed to being guided by momentary events or impulses.
While these broad dispositions account for some of the variance in specific personality traits, facets add information that is not captured by the Big Five. The last column in Table 1 shows the residual (i.e., unexplained) variances in the facets. While the amount of unexplained variance varies considerably across facets, the average is high (57%) and this estimate is consistent with the estimate in the NEO-PI-R (58%). Thus, the Big Five provide only a very broad and unclear impression of individuals’ personality.
Residual Correlations among Facets
Traditional factor analysis does not allow for correlations among the residual variances of indicators. CFA models that do not include these correlations have poor fit. I was only able to produce reasonable fit to the NEO-PI-R data by including correlated residuals in the model. This was also the case for this dataset. Table 7 shows the residual correlations. For ease of interpretation, residual correlations that were consistent across models are highlighted in green.
As can be seen, most of the residual correlations were inconsistent across datasets. This shows that the structure of facet correlations is not stable across these two datasets. This complicates the search for a structural model of personality because it will be necessary to uncover moderators of these inconsistencies.
While dozens of articles have correlated Big Five measures with life-satisfaction scales, few studies have examined how facets are related to life-satisfaction (Schimmack, Oishi, Furr & Funder, 2004). The most consistent finding has been that the depression facet explains unique variance above and beyond neuroticism. This was also the case in this dataset. Depression was by far the strongest predictor of life-satisfaction (-.56). The only additional significant predictor that was added to the model was the competence facet (.22). Contrary to previous studies, the cheerfulness facet did not predict unique variance in life-satisfaction judgments. More research at the facet level and with other specific personality traits below the Big Five is needed. However, the present results confirm that facets contain unique information that contributes to the prediction of outcome variables. This is not surprising given the large amount of unexplained variance in facets.
Table 8 shows the loadings of the four desirable attributes on the DA factor. Loadings are similar to those in Anusic et al. (2009). As objective correlations among these characteristics are close to zero, the shared variance can be interpreted as an evaluative bias. The desirable-attribute items were not allowed to load on the evaluative bias factor, so that the relationship between the DA factor and the evaluative bias factor could be examined at the factor level.
Unexpectedly, the DA factor correlated only modestly with the evaluative bias factor for the facet-items, r = .38. This correlation is notably weaker than the correlation reported by Anusic et al. (2009), r = .76. This raises some questions about the nature of the evaluative bias variance. It is unfortunate that we do not know more about this persistent variance in personality ratings, one-hundred years after Thorndike (1920) first reported it.
In an attempt to replicate a hierarchical model of personality, I was only partially successful in cross-validating the original model. On the positive side, 15 out of 24 distinct facets had loadings on the predicted Big Five factor. This supports the Big Five as a broad model of personality. However, I also replicated some results that question the relationship between some facets and Big Five factors. For example, impulsiveness does not seem to be a facet of neuroticism and openness to actions may not be a facet of openness. Moreover, there were many inconsistent results that suggest the structure of personality is not as robust as one might expect. More research is needed to identify moderators of facet correlations. For example, in student samples anxiety may be a positive predictor of achievement striving, whereas it may be unrelated or negatively related to achievement striving in middle or old age.
Past research with exploratory methods has also ignored that the Big Five do not explain all of the correlations among facets. However, these correlations also seem to vary across samples.
All of these results do not make for a neat and sexy JPSP article that proposes a simple model of personality structure, but the present results do suggest that one should not trust such simple models because personality structure is much more complex and unstable than these models suggest.
I can only repeat that personality research will only make progress by using proper methods that can actually test and falsify structural models. Falsification of simple and false models is the impetus for theory development. Thus, for the sake of progress, I declare that the Big Five model that has been underpinning personality research for 40 years needs to be revised. Most importantly, progress can only be made if personality measures do not already assume that the Big Five are valid. Measures like the BFI-2 (Soto & John) or the NEO-PI-R have the advantage that they capture a wide range of personality differences at the facet level. Surely there are more than 15, 24, or 30 facets. So more research about personality at the facet level is required. Personality research might also benefit from a systematic model that integrates the diverse range of personality measures that have been developed for specific research questions from attachment styles to machiavellianism. All of this research would benefit from the use of structural equation modeling because SEM can fit a diverse range of models and provides information about model fit.
Galileo had the clever idea to turn a microscope into a telescope and to point it towards the night sky. His first discovery was that Jupiter had four massive moons that are now known as the Galilean moons (Space.com).
Now imagine what would have happened if Galileo had an a priori theory that Jupiter has five moons and after looking through the telescope, Galileo decided that the telescope was faulty because he could see only four moons. Surely, there must be five moons and if the telescope doesn’t show them, it is a problem of the telescope. Astronomers made progress because they created credible methods and let empirical data drive their theories. Eventually even better telescopes discovered many more, smaller moons orbiting around Jupiter. This is scientific progress.
Alas, psychologists don’t follow in the footsteps of the natural sciences. They mainly use the scientific method to provide evidence that confirms their theories and dismiss or hide evidence that disconfirms their theories. They also show little appreciation for methodological improvements and often use methods that are outdated. As a result, psychology has made little progress in developing theories that rest on solid empirical foundations.
An example of this ill-fated approach to science is McCrae et al.’s (1996) attempt to confirm their five factor model with structural equation modeling (SEM). When they failed to find a fitting model, they decided that SEM is not an appropriate method to study personality traits because SEM didn’t confirm their theory. One might think that other personality psychologists realized this mistake. However, other personality psychologists were also motivated to find evidence for the Big Five. Personality psychologists had just recovered from an attack by social psychologists who claimed that personality traits do not even exist, and they were all too happy to rally around the Big Five as a unifying foundation for personality research. Early warnings were ignored (Block, 1995). As a result, the Big Five have become the dominant model of personality without subjecting the theory to rigorous tests and even dismissing evidence that theoretical models do not fit the data (McCrae et al., 1996). It is time to correct this and to subject Big Five theory to a proper empirical test by means of a method that can falsify bad models.
I have demonstrated that it is possible to recover five personality factors, and two method factors, from Big Five questionnaires (Schimmack, 2019a, 2019b, 2019c). These analyses were limited by the fact that the questionnaires were designed to measure the Big Five factors. A real test of Big Five theory requires demonstrating that the Big Five factors explain the covariations among a large set of personality traits. This is what McCrae et al. (1996) tried and failed to do. Here I replicate their attempt to fit a structural equation model to the 30 personality traits (facets) in Costa and McCrae’s NEO-PI-R.
In a previous analysis I was able to fit an SEM model to the 30 facet-scales of the NEO-PI-R (Schimmack, 2019d). The results only partially supported the Big Five model. However, these results are inconclusive because facet-scales are only imperfect indicators of the 30 personality traits that the facets are intended to measure. A more appropriate way to test Big Five theory is to fit a hierarchical model to the data. The first level of the hierarchy uses items as indicators of 30 facet factors. The second level in the hierarchy tries to explain the correlations among the 30 facets with the Big Five. Only structural equation modeling is able to test hierarchical measurement models. Thus, the present analyses provide the first rigorous test of the five-factor model that underlies the use of the NEO-PI-R for personality assessment.
The complete results and the MPLUS syntax can be found on OSF (https://osf.io/23k8v/). The NEO-PI-R data are from Lew Goldberg’s Eugene-Springfield community sample. They are publicly available at the Harvard Dataverse.
The NEO-PI-R has 240 items. There are two reasons why I analyzed only a subset of items. First, 240 variables produce 28,680 covariances, which is too much for a latent variable model, especially with a modest sample size of 800 participants. Second, a reflective measurement model requires that all items measure the same construct. However, it is often not possible to fit a reflective measurement model to the eight items of a NEO-facet. Thus, I selected three core-items that captured the content of a facet and that were moderately positively correlated with each other after reversing reverse-scored items. Thus, the results are based on 3 * 30 = 90 items. It has to be noted that the item-selection process was data-driven and needs to be cross-validated in a different dataset. I also provide information about the psychometric properties of the excluded items in an Appendix.
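The covariance counts above follow from the number of unique pairs among k variables, k(k - 1)/2. A minimal Python sketch of this arithmetic is shown here; the function name is made up for illustration and is not part of the original analysis.

```python
def n_covariances(k: int) -> int:
    """Number of unique off-diagonal covariances among k variables."""
    return k * (k - 1) // 2


# All 240 NEO-PI-R items versus the reduced set of 3 items for each of 30 facets.
full_item_set = n_covariances(240)
reduced_item_set = n_covariances(3 * 30)

print(full_item_set)     # 28680
print(reduced_item_set)  # 4005
```

Selecting three items per facet therefore reduces the number of covariances the model has to reproduce by roughly a factor of seven.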
The first model did not impose a structural model on the correlations among the thirty facets. In this model, all facets were allowed to correlate freely with each other. A model with only primary factor loadings had poor fit to the data. This is not surprising because it is virtually impossible to create pure items that reflect only one trait. Thus, I added secondary loadings to the model until acceptable model fit was achieved and modification indices suggested no further secondary loadings greater than .10. This model had acceptable fit, considering the use of single-items as indicators, CFI = .924, RMSEA = .025, SRMR = .035. Further improvement of fit could only be achieved by adding secondary loadings below .10, which have no practical significance. Model fit of this baseline model was used to evaluate the fit of a model with the Big Five factors as second-order factors.
To build the actual model, I started with a model with five content factors and two method factors. Item loadings on the evaluative bias factor were constrained to 1. Item loadings on the acquiescence factor were constrained to 1 or -1 depending on the scoring of the item. This model had poor fit. I then added secondary loadings. Next, I allowed for some correlations among residual variances of facet factors. Finally, I freed some loadings on the evaluative bias factor to allow for variation in desirability across items. This way, I was able to obtain a model with acceptable model fit, CFI = .926, RMSEA = .024, SRMR = .045. This model should not be interpreted as the best or final model of personality structure. Given the exploratory nature of the model, it merely serves as a baseline model for future studies of personality structure with SEM. That being said, it is also important to take effect sizes into account. Parameters with substantial loadings are likely to replicate well, especially in replication studies with similar populations.
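The 1 versus -1 constraint on the acquiescence factor can be illustrated with a small sketch. It assumes, as stated earlier, that reverse-keyed items were reversed before the analysis. The function and the item names are hypothetical and only serve to show why a general tendency to agree enters with opposite signs once items are scored.

```python
def scored_item(trait: float, acquiescence: float, reverse_keyed: bool) -> float:
    """Scored (centered) item response under a simple additive assumption."""
    # Raw response: a reverse-keyed item is worded against the trait,
    # while acquiescence pushes every raw response toward agreement.
    raw = (-trait if reverse_keyed else trait) + acquiescence
    # Scoring flips reverse-keyed items back into the trait direction,
    # which also flips the sign of the acquiescence component.
    return -raw if reverse_keyed else raw


# Hypothetical scoring key: True marks a reverse-keyed item.
scoring_key = {"item_a": False, "item_b": True, "item_c": False}

# Weight of acquiescence in each scored item (trait held at zero).
acquiescence_loadings = {
    name: scored_item(0.0, 1.0, reverse_keyed)
    for name, reverse_keyed in scoring_key.items()
}
print(acquiescence_loadings)  # {'item_a': 1.0, 'item_b': -1.0, 'item_c': 1.0}
```

Fixing these weights rather than estimating them keeps the method factor simple and tied to the scoring key.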
Table 1 shows the item-loadings for the six neuroticism facets. All primary loadings exceed .4, indicating that the three indicators of a facet measure a common construct. Loadings on the evaluative bias factors were surprisingly small and smaller than in other studies (Anusic et al., 2009; Schimmack, 2009a). It is not clear whether this is a property of the items or unique to this dataset. Consistent with other studies, the influence of acquiescence bias was weak (Rorer, 1965). Secondary loadings also tended to be small and showed no consistent pattern. These results show that the model identified the intended neuroticism facet-factors.
Table 2 shows the results for the six extraversion facets. All primary factor loadings exceed .40 and most are more substantial. Loadings on the evaluative bias factor tend to be below .20 for most items. Only a few items have secondary loadings greater than .2. Overall, this shows that the six extraversion facets are clearly identified in the measurement model.
Table 3 shows the results for Openness. Primary loadings are all above .4 and the six openness factors are clearly identified.
Table 4 shows the results for the agreeableness facets. In general, the results also show that the six factors represent the agreeableness facets. The exception is the Altruism facet, where only two items show substantial loadings. Other items also had low loadings on this factor (see Appendix). This raises some concerns about the validity of this factor. However, the high-loading items suggest that the factor represents variation in selfishness versus selflessness.
Table 5 shows the results for the conscientiousness facets. With one exception, all items have primary loadings greater than .4. The problematic item is the item “prudence and common sense” (#5) of the competence facet. However, none of the remaining five items were suitable (Appendix).
In conclusion, for most of the 30 facets it was possible to build a measurement model with three indicators. To achieve fit, the model included 76 out of 2,610 (3%) secondary loadings. Many of these secondary loadings were between .1 and .2, indicating that they have no substantial influence on the correlations of factors with each other.
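The denominator of 2,610 appears to be the number of possible secondary loadings when each of the 90 items can load on the 29 facet factors other than its own. This reading is an assumption; the short sketch below only checks that the arithmetic is consistent with the reported 3%.

```python
n_items = 3 * 30   # three indicators for each of the 30 facets
n_facets = 30

# Assumption: every item has one primary loading, so the remaining
# 29 facet factors are the possible targets of a secondary loading.
possible_secondary = n_items * (n_facets - 1)

freed_secondary = 76  # number of secondary loadings reported in the text
share = freed_secondary / possible_secondary

print(possible_secondary)       # 2610
print(round(100 * share, 1))    # 2.9
```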
Facet Loadings on Big Five Factors
Table 6 shows the loadings of the 30 facets on the Big Five factors. Broadly speaking the results provide support for the Big Five factors. 24 of the 30 facets (80%) have a loading greater than .4 on the predicted Big Five factor, and 22 of the 30 facets (73%) have the highest loading on the predicted Big Five factor. Many of the secondary loadings are small (< .3). Moreover, secondary loadings are not inconsistent with Big Five theory as facet factors can be related to more than one Big Five factor. For example, assertiveness has been related to extraversion and (low) agreeableness. However, some findings are inconsistent with McCrae et al.’s (1996) five-factor model. Some facets do not have the highest loading on the intended factor. Anger-hostility is more strongly related to low agreeableness than to neuroticism (-.50 vs. .42). Assertiveness is also more strongly related to low agreeableness than to extraversion (-.50 vs. .43). Activity is nearly equally related to extraversion and low agreeableness (-.43). Fantasy is more strongly related to low conscientiousness than to openness (-.58 vs. .40). Openness to feelings is more strongly related to neuroticism (.38) and extraversion (.54) than to openness (.23). Finally, trust is more strongly related to extraversion (.34) than to agreeableness (.28). Another problem is that some of the primary loadings are weak. The biggest problem is that excitement seeking is independent of extraversion (-.01). However, even the loadings for impulsivity (.30), vulnerability (.35), openness to feelings (.23), openness to actions (.31), and trust (.28) are low and imply that most of the variance in these facet-factors is not explained by the primary Big Five factor.
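To see why loadings around .3 imply that most facet variance is unexplained, recall that a standardized loading squared gives the share of variance explained by that factor. The sketch below applies this to the primary loadings quoted above. It assumes the reported values are standardized loadings and it ignores secondary loadings, so it is an illustration rather than a reproduction of the residual variances in Table 6.

```python
# Primary loadings on the predicted Big Five factor, as quoted in the text.
primary_loadings = {
    "excitement seeking": -0.01,
    "impulsivity": 0.30,
    "vulnerability": 0.35,
    "openness to feelings": 0.23,
    "openness to actions": 0.31,
    "trust": 0.28,
}

# Assumption: standardized loadings, so variance explained = loading squared.
explained = {facet: loading ** 2 for facet, loading in primary_loadings.items()}

for facet, share in explained.items():
    print(f"{facet}: {share:.1%} of variance explained by the primary factor")
```

Even the largest of these loadings (.35) leaves nearly 88% of the variance in the facet to other sources.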
The present results have important implications for theories of the Big Five, which differ in the interpretation of the Big Five factors. For example, there is some debate about the nature of extraversion. To make progress in this research area it is necessary to have a clear and replicable pattern of factor loadings. Given the present results, extraversion seems to be strongly related to experiences of positive emotions (cheerfulness), while the relationship with goal-driven or reward-driven behavior (action, assertiveness, excitement seeking) is weaker. This would suggest that extraversion is tied to individual differences in positive affect or energetic arousal (Watson et al., 1988). As factor loadings can be biased by measurement error, much more research with proper measurement models is needed to advance personality theory. The main contribution of this work is to show that it is possible to use SEM for this purpose.
The last column in Table 6 shows the amount of residual (unexplained) variance in the 30 facets. The average residual variance is 58%. This finding shows that the Big Five are an abstract level of describing personality, but many important differences between individuals are not captured by the Big Five. For example, measurement of the Big Five captures very little of the personality differences in Excitement Seeking or Impulsivity. Personality psychologists should therefore reconsider how they measure personality with few items. Rather than measuring only five dimensions with high reliability, it may be more important to cover a broad range of personality traits at the expense of reliability. This approach is especially recommended for studies with large samples where reliability is less of an issue.
Residual Facet Correlations
Traditional factor analysis can produce misleading results because the model does not allow for correlated residuals. When such residual correlations are present, they will distort the pattern of factor loadings; that is, two facets with a residual correlation will show higher factor loadings. The factor loadings in Table 6 do not have this problem because the model allowed for residual correlations. However, allowing for residual correlations can also be a problem because freeing different parameters can also affect the factor loadings. It is therefore crucial to examine the nature of residual correlations and to explore the robustness of factor loadings across different models. The present results are based on a model that appeared to be the best model in my explorations. These results should not be treated as a final answer to a difficult problem. Rather, they should encourage further exploration with the same and other datasets.
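The distortion described above can be made concrete with a small worked example. In a standardized one-factor model, the implied correlation between two facets is the product of their loadings plus their residual covariance. The numbers below are hypothetical and the setup is deliberately extreme: it assumes that the factor alone must reproduce the correlation between the two facets, with equal loadings. With more indicators the inflation would be smaller, but it goes in the same direction.

```python
import math

# Hypothetical values, chosen only for illustration.
true_loading = 0.50         # standardized loading of both facets on the factor
residual_covariance = 0.20  # shared variance that is not due to the factor

# Implied correlation: loading product plus residual covariance.
observed_r = true_loading * true_loading + residual_covariance

# If the residual covariance is fixed to zero, the loadings alone have to
# account for the observed correlation (equal loadings assumed).
inflated_loading = math.sqrt(observed_r)

print(round(observed_r, 2))        # 0.45
print(round(inflated_loading, 2))  # 0.67
```

In this toy case, ignoring the residual covariance raises the estimated loading from .50 to about .67, which is the kind of bias that a model with correlated residuals avoids.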
Table 7 shows the residual correlations. First appear the correlations among facets assigned to the same Big Five factor. These correlations have the strongest influence on the factor loading pattern. For example, there is a strong correlation between the warmth and gregariousness facets. Removing this correlation would increase the loadings of these two facets on the extraversion factor. In the present model, this would also produce lower fit, but in other models this might not be the case. Thus, it is unclear how central these two facets are to extraversion. The same is also true for anxiety and self-consciousness. However, here removing the residual correlation would further increase the loading of anxiety, which is already the highest loading facet. This justifies the use of anxiety as the most commonly used indicator of neuroticism.
Table 7. Residual Factor Correlations
It is also interesting to explore the substantive implications of these residual correlations. For example, warmth and gregariousness are both negatively related to self-consciousness. This suggests another factor that influences behavior in social situations (shyness/social anxiety). Thus, social anxiety would be not just high neuroticism and low extraversion, but a distinct trait that cannot be reduced to the Big Five.
Other relationships also make sense. Modesty is negatively related to competence beliefs; excitement seeking is negatively related to compliance, and positive emotions is positively related to openness to feelings (on top of the relationship between extraversion and openness to feelings).
Future research needs to replicate these relationships, but this is only possible with latent variable models. In comparison, network models rely on item levels and confound measurement error with substantial correlations, whereas exploratory factor analysis does not allow for correlated residuals (Schimmack & Grere, 2010).
Personality psychology has a proud tradition of psychometric research. The invention and application of exploratory factor analysis led to the discovery of the Big Five. However, since the 1990s, research on the structure of personality has been stagnating. Several attempts to use SEM (confirmatory factor analysis) in the 1990s failed and led to the impression that SEM is not a suitable method for personality psychologists. Even worse, some researchers even concluded that the Big Five do not exist and that factor analysis of personality items is fundamentally flawed (Borsboom, 2006). As a result, personality psychologists receive no systematic training in the most suitable statistical tool for the analysis of personality and for the testing of measurement models. At present, personality psychologists are like astronomers who have telescopes, but don’t point them to the stars. Imagine what discoveries can be made by those who dare to point SEM at personality data. I hope this post encourages young researchers to try. They have the advantage of unbelievable computational power, free software (lavaan), and open data. As they say, better late than never.
Running the model with additional items is time consuming even on my powerful computer. I will add these results when they are ready.