Category Archives: Uncategorized

The Structure of Affective Dispositions

Extraversion and Neuroticism are some of the oldest constructs in personality psychology. They were the key dimensions in Eysenck’s theory of personality that was prominent in the 1970s. Although Eysenck’s theory of Neuroticism and Extraversion failed to be supported, extraversion and neuroticism remained prominent dimensions in the Big Five model of personality that emerged in the 1980s.

In an influential article, Costa and McCrae (1980) reconceptualized Extraversion and Neuroticism as two broad affective disposition that influences positive and negative affective experiences. They found that extraversion predicted positive affect and neuroticism predicted negative affect on Bradburn’s affect measure.

A key assumption of this model is that the disposition to experience negative affects and the disposition to experience positive affects are largely independent traits. This model underlies the development of the popular Positive Affect and Negative Affect Schedule (PANAS) that is widely used to measure affective experiences over longer time periods (Watson, Tellegen, & Clark, 1988).

The independence model of PA and NA has created a heated controversy in the emotion literature (see JPSP special issue 1999). Most of the debate focussed on the structure of momentary affective experiences. However, some articles also questioned the independence of positive and negative affective dispositions. Specifically, Diener, Smith, & Fujita (1995) used a mutli-method approach to measure a variety of positive and negative affective traits. They did find separate factors for PA and NA, but these factors were strongly negatively correlated (see Zou, Schimmack, & Gere, 2013, for a conceptual replication). Findings like these suggest that the independence model is too simplistic.

There are several ways to reconcile the negative relationship between positive and negative affects with the independence model. First, it is possible that the relationship between PA and NA depends on the selection of specific affects. Whereas happiness and sadness/depression may be negatively correlated, excitement and anxiety may be independent. If the correlation varies for specific affects, it is necessary to use proper statistical methods like CFA to examine the relationship between PA and NA without the influence of variance due to specific emotions. An alternative approach would be to directly measure the valence of emotions. Few studies have used this approach to remove emotion-specific variance from studies of PA and NA.

Another possibility is that PA and NA are not as strongly aligned with Extraversion and Neuroticism as Costa and McCrae’s (1980) model suggest. In fact, Costa and McCrae also developed a model in which positive affect was merely one of several traits called facets that are related to extraversion. According to the facet model, extraversion is a broader trait that encompasses affective and non-affective dispositions. For example, extraversion is also related to behaviours in social situations (sociability, assertiveness) and situations with uncertainty (risk taking). One implication of this model is that Extraversion and Neuroticism could be independent, while PA and NA can be negatively correlated.

The relationship between Extraversion and Neuroticism has been examined in hundreds of studies that measured the Big Five. Simple correlations between Extraversion and Neuroticism scales typically show small to moderate negative correlations. This finding contradicts the assumption that E and N are independent, but this finding has often been ignored. For example, structural models that allow for correlations between the Big Five maintain that E and N are independent (DeYoung, 2015).

One explanation for the negative correlation between E and N are response styles. Extraversion items tend to be desirable, whereas Neuroticism items tend to be undesirable. Thus, socially desirable responding can produce spurious correlations between E and N measures. In support of this hypothesis, the correlation weakens and sometimes disappears in multi-rater studies (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). However, the correlation between E and N also depends on the item content. Scales that focus on sociability and anxiety tend to find weaker correlations than scales that measure E and N with a broader range of facets like the NEO-PI. Once more, this means that scale content moderates the results and that proper analyses of the relationship between the higher-order factors E and N requires a hierarchical CFA model to remove facet-specific variance from the correlation.

The aim of this blog post is to examine the structure of Extraversion and Neuroticism facets with a hierarchical CFA. A CFA model can reveal aspects of the data that a traditional EFA cannot reveal. Most importantly, it can reveal relationships between facets that are independent of the higher-order factors E and N. These residual correlations are important aspects of the relationship between traits that have been neglected in theoretical models based on EFA because EFA does not allow for these relationships.


Over the past decade, Condon and Revelle have assembled an impressive data set from over 50,000 participants who provided self-ratings of their personality for subsets of over 600 personality items that cover a broad range of personality traits at the facet level. Modern statistical methods make it possible to analyze these data with random missing data to examine the structure of all 600 personality items. The authors generously made their data openly available. I used the datasets that represent data collected between 2013 and 2014 and 2014 to 2015 ( I did not use all of the data to allow cross-validation of the results with a new sample.

Even modern computers would take too long to analyze the structure of over 600 items. For the present purpose, I focussed on items that have been shown to be valid indicators of extraversion and neuroticism facets (Schimmack, 2020a, 2020b). The actual items and their primary factor loadings are shown below in Tables 1-3.


Preliminary analyses showed problems with model identification because some Neuroticism and Extraversion scales were strongly negatively related. Specifically, E-Boldness was strongly negatively related to N-Self-Consciousness, E-Happiness was strongly negatively correlated with N-Depression, and E-Excitement Seeking was strongly negatively related to N-Fear. These findings already show that E and N are not independent domains that fit a simple structure; that is, E facets are not related to N-facets. To accommodate these preliminary findings, I created three bipolar facet factors. I then fitted a measurement model for 4 N-facets, 6 E-facets, and the three bipolar facets. The measurement model allowed for secondary loadings and correlated item residuals based on modification indices. All primary factors were allowed to correlate freely. In addition, the model included a method factor for acquiescence bias with fixed loadings depending on the scoring of items. As this model was data-driven, the results are exploratory and require cross-validation in a new sample.

Model fit of the final model met standard fit indices for overall model fit (CFI > .95, RMSEA < .06), CFI = .951, RMSEA = .006. However, standard fit indices have to be interpreted with caution for models with many missing values (Zhang & Savaley, 2019). More important, modification indices suggested no major further improvements to the measurement model by allowing additional secondary loadings. It is also known that minor misspecifications of the measurement model have relatively little influence on the theoretically important correlations among the primary factors. Thus, results are likely to be robust across different specifications of the measurement model. Table 1 shows the items and their primary factor loadings for the Neuroticism facets.

Table 2 shows the items and their primary loadings for the Extraversion facets.

Table 3 shows the items for the three bipolar Extraversion-Neuroticism factors.

Table 4 shows the correlations among the 13 Extraversion and Neuroticism facets.

All correlations among the N-facets are positive and above r = .3. All of the correlations among the E-facets are positive and only two are below .30. Two of the bipolar facets, namely Happiness-Depression and Boldness-Self-Consciousness are negatively correlated with neuroticism facets (all r < -.3). Most of the correlations with the extraversion facets are positive and above .3, but some are smaller and one is practically zero (r = -.03). Surprisingly, the Excitement Seeking versus Fear facet has very few notable relationships with neuroticism and extraversion facets. This suggests that this dimension is not central to either domains.

The correlations between extraversion facets and neuroticism facets tend to be mostly negative, but most of them are fairly small. This suggests that Extraversion and Neuroticism are largely independent and relatively weakly correlated factors. The only notable exceptions were negative correlations of anxiety with novelty seeking and liveliness.

The aim of any structural theory of personality is to explain this complex pattern of relationships in Table 4. Although a model with two factors is viable, a model with a simple structure that assigns each facet to only one factor does not fit these data. To account for the actual structure it is necessary to allow for some facets to be related to extraversion and neuroticism and to allow for additional relationship between some facets. Allowing for these additional relationships produced a model that fit the data nearly as well as the measurement model: CFI = .938 vs. .951, RMSEA = .007 vs. .006. The results are depicted in Figure 1.

Figure 1 does not show the results for the excitment-seeking vs. fear factor which was only weakly related to E and N, but strongly related to Novelty-Seeking and Boldness. To accommodate this factor, the model included direct paths from Anxiety (-.55), Anger (.33), Happiness-Depression (-.58), Novelty Seeking (.45), and liveliness (.24). The strong positive relationship with Happiness-Depression is particularly noteworthy. It could mean that depression or related negative affects like boredom motivate people to seek excitement. However, these results are preliminary and require further investigation.

The key finding is that Extraversion and Neuroticism emerge as slightly negatively correlated factors. The negative correlation in this study could be partially due to evaluative biases in self-ratings. Thus, the results are consistent with the conceptualization of Extraversion and Neuroticism as largely independent higher-order factors of personality. However, this does not mean that affective dispositions are largely independent.

The happiness factor lacked discriminant validity from the depression factor, showing a strong negative relationship between these two affective traits. Moreover, the happiness-depression factor was related to anger and anxiety because it was related to neuroticism. Thus, high levels of neuroticism not only increase NA, they also lower happiness.

The results also explain the independence of PA and NA in the PANAS scales. The PANAS scales were not developed to measure basic affects like happiness, sadness, fear and anxiety. Instead, they were created to measure affect with two independent traits. While the NA dimension closely corresponds to neuroticism, the PA dimension corresponds more closely to positive activation or positive energy than to happiness. The PANAS-PA construct of Positive Activation is more closely aligned with the liveliness factor. As shown in Figure 1, liveliness loads on Extraversion and is fairly independent of negative affects. It is only related to anxiety and anger through the small correlation between E and N. For depression, it has an additional relationship because liveliness and depression load on Extraversion. It is therefore important to make a clear conceptual distinction between Positive Affect (Happiness) and Positive Activation (Liveliness).

Figure 1 also shows a number of correlated residuals that were needed to achieve model fit. These correlated residuals are by no means arbitrary. Activity is related to being lively, presumably because energy is required to be active. Amusement is related to sociability, presumably because humor helps to establish and maintain positive relationships. Boldness is related to assertiveness because both traits require a dominant role in social relationships and groups. Anxiety is negatively related to boldness because bold behaviours are risky. Moody is related to anger and depression, presumably because mood swings can be produced by either anger or depressive episodes. Although these relationships are meaningful they are often ignored because EFA fails to show these relationships and fails to show that models without these relationships do not fit the data. The present results show that theoretical progress requires developing models that explain these relationships. In this regard, the present results merely show that these relationships exist without explaining them.

It is also noteworthy that the correlated residuals do not show a simple pattern that is postulated by some theories. Most notably, DeYoung (2015) proposed that facets are linked to Big Five factors by means of aspects. Aspects are supposed to represent shared variance among some facets that is not explained by the Big Five traits. A simple way to examine the presence of aspects is to find groups of facets that share correlated residuals. Contrary to the aspect model, most facets have either one or no correlated residuals. Sociability has two correlated residuals. It is related to amusement and dependence, but amusement and dependence are not related. Thus, there is no aspect linking these three facets. Moody is related to anger and depression, but anger and depression are unrelated. Again, this implies that there is no aspect linking these three facets. Boldness is linked to three facets. It is positively related to assertiveness, but negatively related to anxiety and dependence, but anxiety and dependence are unrelated, and assertiveness is only related to boldness. This means that there is no evidence for DeYoung’s Extraversion and Neuroticism aspects. These results are by no means inconsistent with previous findings. The aspect model was developed with EFA and EFA may separate a facet from other facets and create the illusion of aspects. This is the first test of the aspects model with CFA and it shows no support for the model.


In conclusion, the present study examined the structure of affective traits using hierarchical CFA. The results broadly confirm the Big Five model of personality. Neuroticism represents shared variance among several negative affective traits like anxiety, anger, and depression, and self-conscious emotions. Extraversion is a broader trait that includes affective and non-affective traits. The core affective traits are happiness and positive energy (liveliness). Extraversion and Neuroticism are only slightly negatively correlated and this correlation could be inflated by rating biases. Thus, it is reasonable to conceptualize them as largely independent higher-order traits. However, at the facet level, the structure is more complex and does not fit a simple structure. Some E-facets and N-facets are highly negatively correlated and could be conceptualized as opposite ends of a single trait, namely Happiness-Depression, Boldness-Self-Consciousness, and Excitement-Seeking vs. Fear. It is therefore questionable to classify Happiness, Boldness, and Excitement-Seeking under Extraversion and Depression, Self-Consciousness and Fear under Neuroticism. These traits are related to Extraversion and Neuroticism. The present results do not provide explanations for the structure of affective trait. The main contribution is to provide a description of the structure that actually represents the structure in the data. In contrast, many prominent models are overly simplistic, focus on subsets of facets, and do not fit the data. The present results integrate these models into one general model that can stimulate future research.

No Evidence for A Higher-Oder Plasticity Factor

Social psychologists have discovered confirmation bias as a powerful human trait. Rather than looking carefully for all relevant information, humans tend to prefer to look for information that confirms their existing beliefs. One major advantage of the scientific method is that it provides objective information that forces individuals to update false beliefs. As a result, evidence that disconfirms or falsifies existing beliefs is a powerful driver of science.

Unfortunately, psychologists do not use the scientific method properly. Instead of collecting data that can falsify existing theories to advance psychological theories, they have a confirmation bias in the application of the scientific evidence. Rather than updating theories in the light of disconfirming evidence, they tend to ignore disconfirming evidence. As a result, psychological science has made little progress over the past decades.

DeYoung (2015) proposed a theory of personality with a higher-order factor of plasticity. He emphasized that his cybernetic Big Five theory affords a wealth of testable hypotheses. Here I subject one of these testable hypothesis to an empirical test by fitting a hierarchical model to 15 primary traits that have been linked to extraversion and neuroticism. The plasticity hypothesis predicts that most of the correlations among these 15 traits are positive and that a model with higher-order factors of extraversion and openness shows a positive correlation between the two traits. To foreshadow the results. The correlation between E and O was close to zero and slightly negative.


In the 1980s, personality psychologists agreed on a hierarchical structure of personality with many correlated primary traits that are encoded in everyday language (e.g., helpful, talkative, curious, efficient, etc.) and five largely independent higher-order factors. These higher-order factors are known as the Big Five. The five factors reflect sensitivity to negative information (Neuroticism), positive energy / approach motivation (Extraversion), a focus on ideas (Openness), pro-social behaviours (Agreeabeleness), and a rational, deliberate way of decision making (Conscientiousness).

In 1997, Digman proposed that the Big Five factors are not independent and systematically related to each other by two even higher-order factors. He suggested that one factor produces negative correlations of neuroticism with agreeableness and conscientiousness and a positive correlation between agreeableness and conscientiousness. A positive correlation between extraversion and openness was attributed to a second factor called beta.

DeYoung (2006) changed the names of the two factors. Alpha was called stability and beta was called plasticity. To test the theory of stability and plasticity, DeYoung analyzed multi-rater data that avoid the problem of spurious correlations among Big Five scales in self-ratings (Anusic et al., 2009). The key finding was that the shared variance among raters in E-scores and the shared variance among raters in O-scores on the Big Five Inventory corelated r = .24 (.59 x .40). In contrast, the corresponding correlation for the Mini-Marker scales was considerably lower, r = .09 (.69 x .13).

This article has been cited 389 times in Web of Science. In contrast, another article by Biesanz and West that used the same methodology and found no support for the plasticity factor has been cited only 100 times (Biesanz and West, 2004). This bias in citations shows the prevalence of confirmation bias in psychology. Given the weak correlation of r = .09 with the Mini-Markers and Biesanz and West’s failure to find a plasticity factor at all, the evidence for a plasticity factor is weak at best.

Moreover, the inconsistency of results points to a methodological problem in all existing tests of a plasticity factor. The problem is that all tests relied on scale scores to test theories about factors. This is a problem because scale scores are biased by the items that were used to measure a construct. As a result, additional relationships between item-specific content can produce spurious correlations that are inconsistent across measures with different item content. A simple solution to this problem is to conduct a hierarchical factor analysis. In a hierarchical factor analysis, the Big Five are represented as the shared variance among items that are used as indicators of a Big Five factor. As far as I know, this approach has not been used to examine the correlation between the extraversion factor and the openness factor.

What are Extraversion and Openness?

Another problem for empirical tests of Plasticity is that extraversion and openness are poorly defined constructs. Most of the time, personality psychologists are satisfied with operational definitions. That is, extraversion is whatever an extraversion scale measures and openness is whatever an openness scale measures. This is a problem when the correlation between E-scales and O-scales varies across different scales.

To avoid this problem, it is necessary to define and measure extraversion and openness in a more stringent way. Short of a classic definition of these constructs in terms of defining features, it is possible to define these construct by listing prototypical exemplars. For example, core primary traits of extraversion are sociability and positive energy (lively, energetic). Thus, an extraversion factor can be defined as a factor with high loadings of sociability and positive energy. Some theories of Extraversion have established longer lists of primary factors that are related to extraversion. The NEO-PI lists six primary factors that are often called facets. A competing model called the HEXACO model lists four primary factors. After accounting for overlap, this provides a list of 7 primary factors that can be used to define extraversion. According to this definition extraversion is the shared variance between these 8 factors. The same logic applies to the definition of openness to experience. After taking overlap into account, the NEO-PI and HEXACO models suggest that openness can be defined as the shared variance among 8 primary factors.

It is noteworthy that this definition of extraversion also implies a way to test this particular theory of extraversion. If one of these 8 factors does not load on a common extraversion factor, the theory is falsified. This does not mean that extraversion does not exist. For example, if only one factor does not load on the extraversion factor, the extraversion theory can be modified to exclude this factor from the definition of extraversion.

Only after an empirically validated model of Extraversion and Openness has been established, it is possible to test the Plasticity theory. A straightforward prediction of this theory is that all primary factors of Extraversion share variance with all primary factors of Openness. Once more, rejecting this theory does not automatically imply that there is no Plasticity factor. Additional relationships between specific facets could influence the pattern of correlations. However, this would mean that Plasticity alone is insufficient to explain the relationship between E-factors and O-factors and a simplistic Plasticity theory is insufficient.


One problem for empirical tests at the facet level is that the measurement of many facets requires a lot of items and that factor analyses at the item level require many participants. One solution to this problem is to ask participants to complete only a subset of all items and to use advanced statistical methods to analyze data with planned missing values. It has also become easier to collect data from large samples using online surveys.

Over the past decade, Condon and Revelle have collected data from tenth of thousands of participants for over 600 personality items that were selected to represent several personality questionnaires including the NEO-PI and HEXACO scales. The authors generously made their data openly available. I used the datasets that represent data collected between 2013 and 2014 and 2014 to 2015 ( I did not use all of the data to allow cross-validation of the results with a new sample.

Measurement Model

The measurement model presented here is strictly exploratory. To speed up data exploration, I computed the covariances among items and analyzed the covariance matrix with a sample size of 1,000, which was the minimum number of cases for all item pairs. Each primary factor was represented by 10 items. However, the items were not validated by strict psychometric tests and CFA results showed items with correlated residuals, low primary loadings, or high secondary loadings. These items were eliminated and only the four or five items with the best psychometric properties were retained.

I was able to identify 15 of the 16 theoretically postulated primary factors. The only factor that created problems was the social self-esteem factor of Extraversion in the HEXACO model. Thus, the measurement model had 7 extraversion factors and 8 openness factors.

After completing the preliminary analyses, I fitted a proper model based on the raw data with planned missing data, which is the default option in MPLUS. The main disadvantage of this method is that it is computing intensive. It took 5 hours for this model to converge.

All items had primary loadings on the theoretically assigned factor and loadings on an acquiescence factor depending on the direct or reverse scoring of the item. In addition, some items had secondary loadings that were all below .4 and all but 2 were below .3 (see complete results on OSF, Overall model fit of this model was excellent for the RMSEA and acceptable for the CFI, RMSEA = .006, CFI = .936. However, overall model fit for data with many missing values can be misleading because fit is determined by a different formula ( (Zhang & Savaley, 2019). More important is that inspection of modification indices showed no major modifications to improve model fit. Model fit is most important to provide a comparison standard for models of the correlations among the primary factors.

Table 1 shows the items and their primary loadings for Extraversion factors. Point estimates are reported because sampling error is less than .02 and any deviations from the point estimate by two standard errors have no practical significance. The information in this table can be used to select items or to write new items with similar item content for future studies.

Table 2 shows the correlations among the 7 primary E-factors.

The key finding in Table 2 is that all correlations are positive. This finding confirms the assumption that all primary factors share variance with each other and that the correlations among the primary factors can be modeled with a higher-order factor. This also makes it possible to define Extraversion as the factor that represents the shared variance among these primary factors.

The next finding is that all primary factors have distinct variance as all correlations are significantly different from 1. However, the correlation between sociability and boldness is very high, suggesting that these two factors have little unique variance and could be merged into a single factor. Other pairs of strongly related primary factors are boldness and assertiveness and liveliness and activity level. All other correlations are below .70.

Table 3 shows the primary loadings for the openness items on the primary openness factors.

Table 4 shows the correlations among the primary O-factors.

The key finding is that all correlations are positive. This justifies the definition of Openness as a higher-order factor that represents the shared variance among these 8 factors.

The second observation is that only one pair of primary factors shows a correlation greater than .70. Namely, Inquisitive and reflective are correlated r = .77. Although it was possible to find a distinction between these factors, they were both derived from items belonging to the Inquisitive scale of the HEXACO model and the intellect scale of the NEO-PI. Thus, it would also possible to reduce the number of factors to 7.

Table 5 shows the correlations between the 7 primary E-factors and the 8 primary O-factors. If plasticity is a higher-order factor that produces shared variance between E-factors and O-factors, most of these correlations should be positive, although their magnitude should be lower than the E-E and O-O correlations in Tables 2 and 4.

– Drumroll –

Table 5 shows the results. In support of a Plasticity factor, 24 of the 56 correlations are positive and above .10, whereas only 12 correlations are below -.10. However, the pattern of correlations suggests that some O-factors are not positively related to E-factors. Specifically, fantasy and progressive attitudes tend to be negatively related with extraversion factors. In comparison, novelty seeking shows very strong and consistent positive relationships with all E-factors, suggesting that novelty seeking is related to Openness and Extraversion. To a lesser extend, this also appears to be the case for Imagination.

A Higher-Order Factor Model

To further explore the pattern of correlations in Table 5, I fitted a higher-order model with an E-factor and an O-factor. Such a simple model did not fit the data. I therefore modified the model to achieve fit that closely approximated the fit of the measurement model, while retaining interpretability. Figure 1 shows the final model. The model had an RMSEA of .007 vs. .006 for the measurement model and a CFI of .917 vs. .936 for the measurement model. Modification indices suggested no notable improvements by adding secondary loadings of primary factors on E and O or further correlated residuals among the primary factors.

The most notable finding was that the correlation between Extraversion and Openness was close to zero and negative. This finding contradicts the prediction of the plasticity model that E and O are positively correlated due to the shared influence of a common factor. For the most part, primary factors had only loadings on their theoretically predicted factor. The main exception was novelty seeking which is based on items for the NEO-PI adventurous scale. The novelty factor actually loaded more strongly on extraversion than on openness. However, even in th NEO-PI model, this factor was a hybrid of E and O, but with a stronger loading on openness. The hybrid nature of this factor does not necessarily require a change of the definition of Extraversion and Openness. It is still possible to define Extraversion as a factor that influences among other things novelty seeking and to define Openness as a factor that defines among other things novelty seeking. The remaining secondary loadings are weaker and do not require a change of the definition to accomodate them.

In conclusion, the key finding is that extraversion can be defined as the shared variance among eight basic traits and openness can be defined as the shared variance among eight basic traits, with one overlapping trait. When extraversion and openness are defined in this way, they emerged as largely independent factors. This finding is inconsistent with the plasticity model that postulates a positive correlation between extraversion and openness.

The present results are not inconsistent with previous findings. As noted in the Introduction, previous studies produced inconsistent results and the inconsistency could be attributed to the use of scales with different item content.


The present findings have relatively little implications for the measurement of personality and for the use of personality questionnaires to predict behavior. In hierarchical models all of the variance of higher-order traits is captured by lower order traits that also contain unique variance. Therefore, higher-order traits can never predict behavior better than lower order traits. Aggregating E and O to create a plasticity scale only destroys valid variation in E and O that is not shared and makes it impossible to say whether explained variance was due to E, O, or the shared variance between E and O.

The results are more important for theories of personality, especially theories about the nature and causes of personality traits, such as DeYoung’s cybernetic theory of the Big Five (DeYoung, 2015). This theory entails the assumption that E and O are related by plasticity.

Although the Big Five traits were initially assumed to be independent and, thus, the highest level of the hierarchy, they are, in fact, regularly intercorrelated such that there exist two higher order traits, or metatraits, which we have labeled Stability and Plasticity (DeYoung, 2006; DeYoung, Peterson, & Higgins, 2002; Digman, 1997; see Section 5 for explanation of these labels). Although Stability and Plasticity are positively correlated in ratings by single informants, this correlation appears to result from rater bias, as they are typically uncorrelated in multi-informant studies (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Chang, Connelly, & Geeza, 2012; DeYoung, 2006; McCrae et al., 2008). The metatraits, therefore, appear to be the highest level of the personality hierarchy, with no ‘‘general factor of personality’’ above them (Revelle & Wilt, 2013).” (p. 36).

DeYoung tried to characterize the glue that binds Extraversion and Openness together as “a cybernetic function to explore, create new goals, interpretations, and strategies (cf. Table 1, p. 42). The theory also postulates that dopaminergic systems in the brain are shared between extraversion and openness traits to provide a neurobiological explanation for the plasticity factor. He also suggests that plasticity is related to externalizing problems like delinquency and hyperactivity. However, this has never been shown by demonstrating that all or at least most facets of extraversion and neuroticism are related to these outcomes. Although future research is needed to examine this question, the present finding that E and O facets are largely independent renders it unlikely that this would be the case.

Novelty Seeking versus Plasticity

Many claims about plasticity may be valid for the adventurousness facet of the NEO-PI that corresponds to the Novelty Seeking factor in the present model. Novelty seeking is related to exploration, making new goals, and engagement in risky activities. It would not be surprising to see that it is also related to externalizing rather than internalizing problems. Novelty seeking is also related to all extraversion and openness facets. Thus, in many ways, novelty seeking has many of the characteristics attributed to plasticity. The key difference between novelty seeking and plasticity is that novelty seeking is a lower-order (facet) trait whereas plasticity is supposed to be a higher-order trait. The difference between the two is that a higher-order trait is assumed to produce shared variance among all E and O factors, whereas a lower-order trait can be related to all E and O factors without influencing the relationship between them. That is, in Figure 1 the causal arrows from E and O to novelty seeking would be reversed and the plasticity factor would produce correlations among all E and O factors. Given the lack of a correlation in the model without these factors, it is clear that there is no higher-order Plasticity factor.

This has important implications for theories of personality. It is unlikely that all E and O factors share a single dopaminergic system. Rather the focus might be directed that the lower-order trait of Novelty Seeking.


DeYoung (2015) emphasized that his cybernetic Big Five theory affords a wealth of testable hypotheses. Testable hypothesis are useful because they make it possible to falsify false predictions and to modify and improve theories of personality. One obvious prediction of the theory is that the plasticity factor produces positive correlations among primary traits related to extraversion and openness to experience. This follows directly from the notion that higher-order factors represent shared variance among lower-order factors. Here I presented the first test of this prediction and found no support for it. While a single failure is not sufficient to abandon a theory, it should be noted that the CBFT has not been subjected to many tests and that the results were inconsistent. Given the lack of strong support for the theory in the first place, the present results need to be taken seriously. I also provided a simple way to revise the theory by moving the plasticity factor of exploration from the higher-oder level to the facet level and to equate plasticity with novelty seeking.

Conflict of Interest Statement: I was the author of an article that introduced a model that also included a plasticity factor (Anusic et al., 2009). We included the plasticity factor mainly under pressure from DeYoung as a reviewer of the paper, while our focus was on the evaluative bias or halo factor. I never really believed in a beta-factor and I am very pleased with the present results. I hope that proponents of the plasticity model analyze the open data to examine whether they are influenced by unconscious biases.

The Structure of Neuroticism

The construct of neuroticism is older than psychological science. It has its roots in Freud’s theories of mental illnesses. Thanks to to influence of psychoanalysis on the thinking of psychologists, the first personality questionnaires included measures of neuroticism or anxiety, which were considered to be highly related or even identical constructs.

Eysenck’s research on personality first focussed on Neuroticism and Extraversion as the key dimensions of personality traits. He then added psychoticism as a third dimension.

In the 1980s, personality psychologists agreed on a model with five major dimensions that included neuroticism and extraversion as prominent dimensions. Psychoticism was divided into agreeableness and conscientiousness and a fifth dimension openness was added to the model.

Today, the Big Five model dominates personality psychology and many personality questionnaires focus on the measurement of the Big Five.

Despite the long history of research on neuroticism, the actual meaning of the term and the construct that is being measured by neuroticism scales is still unclear. Some researchers see neuroticism as a general disposition to experience a broad range of negative emotions. In the emotion literature, anxiety, anger, and sadness are often considered to be basic negative emotions, and the prominent NEO-PI questionnaires considers neuroticism to be a general disposition to experience these three basic emotions more intensely and frequently.

Neuroticism has also been linked to more variability in mood states, higher levels of self-consciousness and lower self-esteem.

According to this view of neuroticism, it is important to distinguish between neuroticism as a more general disposition to experience negative feelings and anxiety, which is only one of several negative feelings.

A simple model of neuroticism would assume that a general disposition to respond more strongly to negative emotions produces correlations among more specific dispositions to experience more anxiety, anger, sadness, and self-conscious emotions like embarrassment. This model implies a hierarchical structure with neuroticism as a higher-order factor of more specific negative dispositions.

In the early 2000s, Ashton and Lee published an alternative model of personality with six factors called the HEXACO model. The key difference between the Big Five model and the HEXACO model is the conceptualization of pro- and anti-social traits. While these traits are considered to be related to a single higher-order factor of agreeableness in the Big Five model, the HEXACO model distinguishes between agreeableness and honesty-humility as two distinct traits. However, this is not the only difference between the two models. Another important difference is the conceptualization of affective dispositions. The HEXACO model does not have a factor corresponding to neuroticism. Instead it has an emotionality factor. The only common trait to neuroticism and emotionality is anxiety, which is measured with similar items in Big Five questionnaires and in HEXACO questionnaires. The other three traits linked to emotionality are unique to the HEXACO model.

The four primary factors, also called facets) that are used to identify and measure emotionality are anxiety, fear, dependence, and sentimentality. Fear is distinguished from anxiety by a focus on immediate and often physical danger. In contrast, anxiety and worry tend to be elicited by thoughts about uncertain events in the future. Dependence is defined by a need for social comfort in difficult times. Sentimentality is a disposition to respond strongly to negative events that happen to other people, including fictional characters.

In a recent target article, Ashton and Lee argued that it is time to replace the Big Five model with the superior HEXACO model. A change from neuroticism to emotionality would be a dramatic shift given the prominence of neuroticism in the history of personality psychology. Here, I examine empirically how Emotionality is related to Neuroticism and whether personality psychologists should adapt the HEXACO framework to understand individual differences in affective dispositions.


A key problem in research on the structure of personality is that researchers often rely on questionnaires that were developed with a specific structure in mind. As a result, the structure is pre-determined by the selection of items and constructs. To overcome this problem, it is necessary to sample a broad and ideally representative sample of primary traits. The next problem is that motivation and attention-span of participants limits the number of items that a personality questionnaire can include. These problems have been resolved by Revelle and colleagues survey that asks participants to complete only a subset of over 600 items. Modern statistical methods can analyze datasets with planned missing data. Thus, it is possible to examine the structure of hundreds of personality items. Condon and Revelle (2018) also made these data openly available ( I am very grateful for their initiative that provides an unprecedented opportunity to examine the structure of personality.

The items were picked to represent primary factors (facets) of the HEXACO questionnaire and the NEO-PI questionnaire. In addition, the questionnaire covered neuroticism items from the EPQ and other questionnaires. The items are based on the IPIP item pool. Each primary factor is represented by 10 items. I picked the items that represent the four HEXACO Emotionality factors, anxiety, fear, dependency, sentimentality, and four of the NEO-PI Neuroticism factors, anxiety, anger, depression, and self-consciousness. The anxiety factor overlaps and is represented by mostly overlapping items. Thus, this item selection resulted in 70 items that were intended to measure 7 primary factors. I added four additional items that represented variable moods (moodiness) that were included in the BFAS and EPQ, which might form an independent factor.


The data were analyzed with confirmatory factor analysis (CFA), using the MPLUS software. CFA has several advantages over traditional factor analytic methods that have been employed by proponents of the HEXACO and the Big Five models. The main advantages are that it is possible to model hierarchical structures that represent the Big Five or HEXACO factors as higher-order factors of primary factors. A second advantage is that CFA provides information about model fit whereas traditional EFA produces solutions without evaluating model fit.

Measurement Model

A first step in establishing a measurement model was to select items with high primary loadings, low secondary loadings, and low correlated residuals. The aim was to represent each primary factor with the best four items. While four items may not be enough to create a good scale, four items are sufficient to establish a measurement model of primary factors. Limiting the number of items to four items is also advantages because computing time increases with additional items and models with missing data can take a long time to converge.

Aside from primary loadings, the model included an acquiescence factor based on the coding of items. Directed coded items had unstandardized loadings of 1 and reverse coded items had an unstandardized loading of -1. There were no secondary loadings or correlated residuals.

The model met standard criteria of model fit such as a CFI > .95 and RMSEA < .05, CFI = .954, RMSEA = .007. However, models with missing data should not be evaluated based on these fit indices because fit is determined by a different formula ( (Zhang & Savaley, 2019).  More importantly, modification indices showed no notable changes in model fit if fixed parameters were freed. Table 1 shows the items and their primary factor loadings.

Table 2 shows the correlations among the primary factors.

The first four factors are assumed to belong to the HEXACO-Emotionality factor. As expected, fear, anxiety, and dependence are moderately to highly positively correlated. Contrary to expectations, sentimentality showed low correlations especially with fear.

Factors 4 to 8 are assumed to be related to Big Five neuroticism. As expected, all of these factors are moderately to highly correlated.

In addition, the dependence factor from the HEXACO model also shows moderate to high correlations with all Big Five neuroticism factors. The fear factor also shows positive relations with the neuroticism factors, especially for self-consciousness.

With the exception of Sentimentality, all of the factors tend to be positively correlated, suggesting that they are related to a common higher-order factor.

Overall, this pattern of results provides little support for the notion that HEXACO-Emotionality is a distinct higher-order factor from Big-Five neuroticism.


The first model assumed that all factors are related to each other by means of a single higher-order factor. In addition, the model allowed for correlated residuals among the four HEXACO factors. This makes it possible to examine whether these four factors share additional variance with each other that is not explained by a general Negative Emotionality factor.

Model fit decreased compared to the measurement model which serves as a comparison standard for theoretical models, CFI: .916 vs. 954, RMSEA = .009 vs. .007.

All primary factors except sentimentality had substantial loadings on the Negative Emotionality factor. Table 3 shows the residual correlations for the four HEXACO factors.

All correlations are positive suggesting that the HEXACO Emotionality factor captures some shared variance among these four factors that is not explained by the Negative Emotionality factor. However, two of the correlations are very low indicating that there is little shared variance between sentimentality and fear or dependence and anxiety.


The second model, modeled the relationship among the HEXACO factors with a factor. Model fit decreased, CFI = .914 vs. .916, RMSEA = .010 vs. 009. Loadings on the Emotionality factor ranged from .27 to .46. Fear, anxiety, and dependence had higher loadings on the Negative Emotionality factor than on the Emotionality factor.

The main conclusion from these results is that it would be problematic to replace the Big Five model with the HEXACO model because the Emotionality factor in the HEXACO model fails to capture the nature of the broader Neuroticism factor in the Big Five model. In fact, there is little evidence for a specific Emotionality factor in this dataset.


The discrepancy between the measurement model and Model 1 suggests that there are additional relationships between some primary factors that are not explained by the general Negative Emotionality factor. Examining modification indices suggested several changes to the model. Model 3 shows the final results. This model fit the data nearly as well as the measurement model, CFI = .949 vs. 954, RMSEA = .007 vs. .007. Inspection of the Modification Indices showed no further ways to improve the model by freeing correlated residuals among primary factors. In one case, three correlated residuals were consistent and were modeled as a factor. Figure 1 shows the results.

First, the model shows notable and statistically significant effects of neuroticism on all primary factors except sentimentality. Second the correlated residuals show an interesting patterns where primary factors can be arranged in a chain. that is, depression is related to moody, moody is related to anger, anger is related to anxiety, anxiety is related to fear, fear is related to self-consciousness and dependence, self-consciousness is related to dependence and finally, dependence is related to sentimentality. This suggests the possibility that a second broader dimension might be underlying the structure of negative emotionality. Research on emotions suggests that this dimension could be activation (fear is high, depression is low) or potency (anger is high, dependence is low).This is an important avenue for future research. The key finding in Figure 1 is that the traditional Neuroticism dimension is an important broad higher-order factor that accounts for the correlations among 7 of the 8 primary factors. These results favor the Big Five model over the HEXACO model.

A Big-5 Model of the Hexaco-100 Items

In the 1980s, personality psychologists celebrated the emergence of a five-factor model as a unifying framework for personality traits. Since then, the so-called Big-5 have dominated thinking and measurement of personality.

Two decades later, Ashton and Lee proposed an alternative model with six factors. This model has come to be known as the HEXACO model.

A recent special issue in the European Journal of Personality discussed the pros and cons of these two models. The special issue did not produce a satisfactory resolution between proponents of the two models.

In theory, it should be possible to resolve this dispute with empirical data, especially given the similarities between the two models. Five of the factors are more or less similar between the two models. One factor is Neuroticism with anxiety/worry as a key marker of this higher-order trait. A second factor is Extraversion with sociability and positive energy as markers. A third factor is Openness with artistic interests as a common marker. A forth factor is conscientiousness with orderliness and planful actions as markers. The key differences between the two models is concerned with pro-social and anti-social traits. In the Big Five model, a single higher-order trait of agreeableness is assumed to produce shared variance among all of these traits (e.g., morality, kindness, modesty). The HEXACO model assumes that there are two higher-order traits. One is also called agreeableness and the other one is called honesty and humility.

As Ashton and Lee (2005) noted, the critical empirical question is how the Big Five model accounts for the traits related to the honesty-humility factor in the HEXACO model. Although the question is straightforward, empirical tests of it are not. The problem is that personality researchers often rely on observed correlations between scales and that correlations among scales depend on the item-content of scales. For example, Ashton and Lee (2005) reported that the Big-Five Mini-Marker scale of Agreeableness correlated only r = .26 with their Honesty-Humility scale. This finding is not particularly informative because correlations between scales are not equivalent to correlations between the factors that the scales are supposed to reflect. It is also not clear whether a correlation of r = .26 should be interpreted as evidence that Honesty-Humility is a separate higher-order factor at the same level as the other Big Five traits. To answer this question, it would be necessary to provide a clear definition of a higher-order factor. For example, higher-order factors should account for shared variance among several primary factors that have only low secondary loadings on other factors.

Confirmatory factor analysis (CFA) addresses some of the problems of correlational studies with scale scores. One main advantage of CFA is that models do not depend on the item selection. It is therefore possible to fit a theoretical structure to questionnaires that were developed for a different model. I therefore used CFA to see whether it is possible to fit the Big Five model to the HEXACO-100 questionnaire that was explicitly designed to measure 4 primary factors (facets) for each of the six HEXACO higher-order traits. Each primary factor was represented by four items. This leads to 4 x 4 x 6 = 96 items. After consultation with Michael Ashton, I did not include the additional four altruism items.

Measurement Model

The Big-Five or HEXACO models are higher-order models that are supposed to explain the pattern of correlations among the primary factors. In order to test these models, it is necessary to first establish a measurement model for the primary factors. Starting point for the measurement model was a model with a simple structure where each item only has a primary loading on its designated factor. For example, the anxiety item “I sometimes can’t help worrying about little things” loaded only on the anxiety factor. All 24 primary factors were allowed to correlate freely with each other.

It is well-known that few data fit a simple structure for two reasons. First, the direction of items can influence responses. This can be modeled with an acquiescence factor that codes whether an item is a direct or a reverse coded items. Second, it is difficult to write items that reflect only variation in the intended primary trait. Thus, many items are likely to have small, but statistically significant, secondary loadings on other factors. These secondary loadings need to be modeled to achieve acceptable model fit, even if they have little practical significance. Another problem is that two items of the same factor may share additional variance because they share similar wordings or item content. For example, the two items “I clean my office or home quite frequently” and the reverse coded item “People often joke with me about the messiness of my room or desk” share specific content. This shared variance between items needs to be modeled with correlated residuals to achieve acceptable model fit.

Researchers can use Modification Indices to identify secondary loadings and correlated residuals that have a strong influence on model fit. Freeing the identified parameters improves model fit and can produce a measurement model with acceptable model fit. Moreover, MI can also provide information that there are no more fixed parameters that have a strong negative effect on model fit.

After modifying the simple-structure model accordingly, I established a measurement model that had acceptable fit, RMSEA = .021, CFI = .936. Although the CFI did not reach the threshold of .950, the MI did not show any further improvements that could be made. Freeing further secondary loadings resulted in secondary loadings less than .1. Thus, I stopped at this point.

16 primary factors had primary factor loadings of .4 or higher for all items. The remaining 8 primary factors had 3 primary factor loadings of .4 or higher. Only 4 items had secondary loadings greater than .3. Thus, the measurement model confirmed the intended structure of the questionnaire.

Importantly, the measurement model was created without imposing any structure on the correlations among higher-order factors. Thus, the freeing of secondary loadings and correlated residuals did not bias the results in favor of the Big Five or HEXACO model. Rather, the fit of the measurement model can be used to evaluate the fit of theoretical models about the structure of personality.

A simplistic model that is often presented in textbooks would imply that only traits related to the same higher-order factor are correlated with each other and that all other correlations are close to zero. Table 1 shows the correlations for the HEXACO-Agreeableness (A-Gent = gentle, A-Forg = forgiving, A-Pati = patient, & A-Flex = flexible) and the HEXACO-honesty-humility (H-Gree = greed-avoidance, H-Fair = fairness, H-Mode = modest, & H-Sinc = sincere) factors.

In support of the Big Five model, all correlations are positive. This suggests that all primary factors are related to a single higher-order factor. In support of the HEXACO model, correlations among A-factors and correlations among H-factors tend to be higher than correlations of A-factors with H-factors. Three notable exceptions are highlighted in red and all of them involve modesty. Modesty is more strongly related to A-Gent and A-Flex than to H-Mode.

Table 2 shows the correlations of the A and H factors with the four neuroticism factors (N-Fear = fear, N-Anxi = anxiety, N-Depe = dependence, N-Sent = sentimental). Notable correlations greater than .2 are highlighted. For the most part, the results show that neuroticism and pro-social traits are unrelated. However, there are some specific relations among factors. Notably, all four HEXACO-A factors are negatively related to anxiety. This shows some dissociation between A and H factors. In addition, fear is positively related to fairness and negatively related to sincerity. Sentimentality is positively related to fairness and modesty. Neither the Big Five model nor the HEXACO model has explanations for these relationships.

Table 3 shows the correlation with the Extraversion factors (E-soci = Sociable, E-socb = bold, E-live = lively, E-Sses = self-esteem). There are few notable relationships between A and H factors on the one hand and E factors on the other hand. This supports the assumption of both models that pro-social traits are unrelated to extraversion traits, including being sociable.

Table 4 shows the results for the Openness factors. Once more there are few notable relationships. This is consistent with the idea that pro-social traits are fairly independent of Openness.

Table 5 shows the results for conscientiousness factors (C-Orga = organized, C-Dili = diligent, C-Perf = Perfectionistic, & C-Prud = prudent). Most of the correlations are again small, indicating that pro-sociality is independent of conscientiousness. The most notable exceptions are positive correlations of the conscientiousness factors with fairness. This suggests that fairness is related to conscientiousness.

Table 6 shows the remaining correlations among the N, E, O, and C factors.

The green triangles show correlations among the primary factors belonging to the same higher-order factor. The strong correlations confirm the selection of primary factors to be included in the HEXACO-100. Most of the remaining correlations are below .2. The grey fields show correlations greater than .2. The most notable correlations are for diligence (C-Dili), which is correlated with all E-factors. This suggests a notable secondary loading of diligence on the higher-order factor E. Another noteworthy finding is a strong correlation between self-esteem (E-Sses) and anxiety (N-anx). This is to be expected because self-esteem is known to have strong relationships with neuroticism. It is surprising, however, that self-esteem is not related to the other primary factors of neuroticism. One problem in interpreting these results is that the other neuroticism facets are unique to the HEXACO-100.

In conclusion, inspection of the correlations among the 24 primary factors shows clear evidence for 5 mostly independent factors that correspond to the Big Five factors. In addition, the correlations among the pro-social factors show a distinction between the four HEXACO-A factors and the four HEXACO-H factors. Thus, it is possible to represent the structure with 6 factors that correspond to the HEXACO model, but the higher-order A and H factors would not be independent.

A Big Five Model of the HEXACO-100

I fitted a model with five higher-order factors to examine the ability of the Big Five model to explain the structure of the HEXACO-100. Importantly, I did not alter the measurement model of the primary factors. It is clear from the previous results that a simple-structure would not fit the data. I therefore allowed for secondary loadings of primary factors on the higher-order factors. In addition, I allowed for residual correlations among primary factors. Furthermore, when several primary factors showed consistent correlated residuals, I modeled them as factors. In this way, the HEXACO-A and HEXACO-H factors could be modeled as factors that account for correlated residuals among pro-social factors. Finally, I added a halo factor to the model. The halo factor has been identified in many Big Five questionnaires and reflects the influence of item-desirability on responses.

Model fit was slightly less than model fit for the measurement model, RMSEA = .021 vs. .021, CFI = .927 vs. .936. However, inspection of MI did not suggest additional plausible ways to improve the model. Figure 1 shows the primary loadings on the Big Five factors and the two HEXACO factors, HEXACO-Agreeableness (HA) and HEXACO-Honesty-Humility.

The first notable observation is that primary factors have loadings above .5 for four of the Big Five factors. For the Agreeableness factor, all loadings were statistically significant and above .2, but four loadings were below .5. This shows that agreeableness explains less variance in some primary factors than the other Big Five factors. Thus, one question is whether the magnitude of loadings on the Big Five factors should be a criterion for model selection.

The second noteworthy observation is that the model clearly identified HEXACO-A and HEXACO-H as distinct factors. That is, the residuals of the corresponding primary factors were all positively correlated. All loadings were above .2, but several of the loadings were also below .5. Moreover, for the HEXACO-A factors the loadings on the Big5-A factor were stronger than the loadings on the HEXACO-A factor. Modesty (H-Mode) also loaded more highly on Big5-A than HH. The results for HEXACO-A are not particularly troubling because the HEXACO model does not consider this factor to be particularly different from Big5-A. Thus, the main question is whether the additional shared variance among HEXACO-H factors warrants the creation of a model with six factors. That is, does Honesty-Humility have the same status as the Big Five factors?

Alternative Model 1

The HEXACO model postulates six factors. Comparisons of the Big Five and HEXACO model tend to imply that the HEXACO factors are just as independent as the Big Five factors. However, the data show that HEXACO-A factors and HEXACO-H factors are not as independent of each other as other factors. To fit a six-factor model to the data, it would be possible to allow for a correlation between HEXACO-A and HEXACO-H. To make this model fit as well as the Big-Five model, an additional secondary loading of modesty (H-Mode) on HEXACO-A was needed, RMSEA = .22, CFI = .926. This secondary loading was low, r = .25, and is not displayed in Figure 2.

The most notable finding is a substantial correlation between Hexaco-A and Hexaco-H of r = .49. Although there are no clear criteria for practical independence, this correlation is strong and suggests that there is an important common factor that produces a positive correlation between these two factors. This makes this model rather unappealing. The main advantage of the Big Five model would be that it captures the highest level of independent factors in a hierarchy of personality traits.

Alternative Model 2

An alternative solution to represent the correlations among HEXACO-A and HEXACO-H factors is to treat HEXACO-A and HEXACO-H as independent factors and to allow for secondary loadings of HEXACO-H factors on HEXACO-A or vice versa. Based on the claim that the H-factor adds something new to the structure, I modelled secondary loadings of the primary H-factors on HEXACO-A. Fit was the same as for the first alternative model, RMSEA = .22, CFI = .927. Figure 3 shows substantial secondary loadings for three of the four H-factors, and for modesty the loading on the HEXACO-A factor is even stronger than the loading on the HEXACO-H factor.

The following table shows the loading pattern along with all secondary loadings greater than .1. Notable secondary loadings greater than .3 are highlighted in pink. Aside from the loading of some H-factors on A, there are some notable loadings of two C-factors on E. This finding is consistent with other results that high achievement motivation is related to E and C.

The last column provides information about correlated residuals (CR) in the last column. Primary factors with the same letter have a correlated residual. For example, there is a strong negative relationship between anxiety (N-anxiety) and self-esteem (E-Sses) that was apparent in the correlations among the primary factors in Table 6. This relationship could not be modeled as a negative secondary loading on neuroticism because the other neuroticism factors showed much weaker relationships with self-esteem.


In sum, the choice between the Big5 model and the HEXACO model is a relatively minor stylistic choice. The Big Five model is a broad model that predicts variance in a wide variety of primary personality factors that are often called facets. There is no evidence that the Big Five model fails to capture variation in the primary factors that are used to measure the Honesty-Humility factor of the HEXACO model. All four H-factors are related to a general agreeableness factor. Thus, it is reasonable to maintain the Big Five model as a model of the highest level in a hierarchy of personality traits and to consider the H-factor a factor that explains additional relationships among pro-social traits. However, an alternative model with Honesty-Humility as a sixth factor is also consistent with the data. This model only appears different from the Big Five model if secondary loadings are ignored. However, all H-factors had secondary loadings on agreeableness. Thus, agreeableness remains a broader trait that links all pro-social traits, while Honesty-Humility explains additional relationships among a subset of this factors. If Honesty-Humility is indeed a distinct global factor it should be possible to find primary factors that are uniquely related to this factor without notable secondary loadings on Agreeableness. If such traits exists, they would strengthen the support for the HEXACO model. On the other hand, if all traits that are related to Honesty-Humility also load on Agreeableness, it seems more appropriate to treat Honesty-Humility as a lower-level factor in the hierarchy of traits. In conclusion, these structural models did not settle the issue, but they clarify the issue. Agreeableness factors and Honesty-Humilty factors form distinct, but related clusters of primary traits. This empirical finding can be represented with a Five-Factor model with Honest-Humility as shared variance among some pro-social traits or it can be represented with six factors and secondary loadings.


A major source of confusion in research on the structure of personality is the failure to distinguish between factors and scales. Many proponents of the HEXACO model point out that the HEXACO scales, especially the Honesty-Humilty scale, explain variance in criterion variables that is not explained by Big-Five scales. It has also been observed that the advantage of the HEXACO scales depends on the Big-Five scales that are used. The reason for these findings is that scales are imperfect measures of their intended factors. They also contain information about the primary factors that were used to measure the higher-order factors. The advantage of the HEXACO-100 is that it measures 24 primary factors. There is nothing special about the Honesty-Humility factor. As Figure 1 shows, the honesty-humilty factor explains only a portion of the variance in its designated primary factors, namely .67^2 = 45% of the variance in greed-avoidance, .55^2 = 30% of the variance in fairness, .32^2 = 10% of the variance in modesty, and .41^2 = 17% of the variance in sincerity. Averaging these scales to form a Honesty-Humilty scale destroys some of this variance and inevitably lowers the ability to predict some criterion variable that is strongly related to one of these primary factors. There is also no reason why Big Five questionnaires should not include some primary factors of Honesty-Humility and the NEO-PI-3 does include modesty and fairness.

Personality psychologists need to distinguish more clearly between factors and scales. The correlation of the NEO-PI-3 agreeableness scale will be different from those with the HEXACO-A scale or the BFI2-agreeableness scale. Scale correlations are biased by the choice of items, unless items are carefully selected to maximize correlation with the latent factor. For research purposes, researchers should use latent variable models that can decompose an observed correlation into the influence of the higher-order factor and the influence of specific factors.

Personality researchers should also carefully think about the primary factors they may want to include in their studies. For example, even researchers who favor a HEXACO model may include additional measures of anger and depression to explore the contribution of affective dispositions to outcome measures. Similarly, Big Five researchers may want to supplement their Big Five questionnaires with measures of primary traits related to honesty and morality if the Big-Five measure does not capture them. A focus on the highe-order factors is only justified in studies that require short measures with a few items.


My main contribution to the search for a structural model of personality is to examine this question with a statistical tool that makes it possible to test structural models of factors. The advantage of this method is that it is possible to separate structural models of factors from the items that are used to measure factors. While scales of the same factor can differ sometimes dramatically, structural models of factors are independent of the specific items that are used to measure a factor as long as some items reflect variance in the factor. Using this approach, I showed that the Big Five and HEXACO model only differ in the way they represent covariation among some primary factors. It is incorrect to claim that Big Five models fail to represent variation in honesty or humility. It is also incorrect to assume that all pro-social traits are independent after their shared variance in agreeableness is removed. Future research needs to examine more carefully the structural relationships among primary traits that are not explained by higher-order factors. This research question has been neglected because exploratory factor analysis is unable to examine this question. I therefore urge personality researchers to adopt confirmatory factor analysis to advance research on personality structure.

A Meta-Psychological Investigation of Intelligence Research with Z-Curve.2.0

A recent article by Nuijten, van Assen, Augusteijn, Crompvoets, and Wicherts reported the results of a meta-meta-analysis of intelligence research. The authors extracted 2442 eect sizes from 131 meta-analyses. The authors made these data openly available to allow “readers to pursue other categorizations and analyses” (p. 6). In this blog post, I report the results of an analysis of their data with z-curve.2.0 (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve is a powerful statistical tool that can (a) examine the presence of publication bias and/or the use of questionable research practices, (b) provide unbiased estimate of statistical power before and after selection for significance when QRPs are present, and (c) estimate the maximum number of false positive results.

Questionable Research Practices

The term questionable research practices refers to a number of statistical practices that inflate the number of significant results in a literature (John et al., 2012). Nuijten et al. relied on the correlation between sample size and effect size to examine the presence of publication bias. Publication bias produces a negative correlation between sample size and effect size because larger effects are needed to get significance in studies with smaller samples. The method has several well-known limitations. Most important, a negative correlation is also expected if researchers use larger samples when they anticipate smaller effects, either in the form of a formal a priori power analysis or based on informal information about sample sizes in previous studies. For example, it is well-known that effect sizes in molecular genetics studies are tiny and that sample sizes are huge. Thus, a negative correlation is expected even without publication bias.

Z-curve.2.0 avoids this problem by using a different approach to detected the presence of publication bias. The approach compares the observed discovery rate (i.e., the percentage of significant results) to the expected discovery rate (i.e., the average power of studies before selection for significance). To estimate the EDR, z-curve.2.0 fits a finite mixture model to the significant results and estimates average power based on the weights of a finite number of non-centrality parameters.

I converted the reported information about sample size, effect size, and sampling error into t-values, and then converted the t-values. Extremely large t-values of 20 were fixed to a value of 20. Then t-values were converted into absolute z-scores.

Figure 1 shows a histogram of the z-scores in the critical range from 0 to 6. All z-scores greater than 6 are assumed to have a power of 1 with a significance threshold of .05 (z = 1.96).

The critical comparison of the observed discovery rate (52%) and the expected discovery rate (58%) shows no evidence of QRPs. In fact, the EDR is even higher than the ODR, but the confidence interval is wide and includes the ODR. When there is no evidence that QRPs are present, it is better to use all observed z-scores, including the credible non-significant results, to fit the finite mixture model. Figure 2 shows the results. The blue line moved to 0, indicating that all values were used for estimation.

Visual inspection shows a close match between the observed distribution of z-scores (blue line) and the predicted distribution by the finite mixture model (grey line). The observed discovery rate now closely matches the expected discovery rate of 52%. Thus, there is no evidence of publication bias in the meta-meta-analysis of effect sizes in intelligence research.

Interestingly, there is also no evidence that researchers used mild QRPs to move marginally significant results just below .05 on the other side of the significance criterion to produce just significant results. There are two possible explanation for this. On the one hand, intelligence researchers may be more honest than other psychologists. On the other hand, it is possible that meta-analyses are not representative of the focal hypothesis tests that led to publication of original research articles. A meta-analysis of focal hypothesis tests in original articles is needed to answer this question.

In conclusion, this superior analysis of the presence of bias in the intelligence literature showed no evidence of bias. In contrast, Nuijten et al. (2020) found a significant correlation between effect sizes and sample sizes which they call small study effect. The problem with this finding is that it can reveal either careful planning of sample sizes (good practices) or the use of QRPs (bad practices). Thus, their analyses does not tell us whether there is bias in the data. Z-curve.2.0 resolves this ambiguity and shows that there is no evidence of selection for significance in these data.

Statistical Power

Nuijten et al. used Cohen’s classic approach to investigate power (Cohen, 1962). Based on this approach, they concluded “we found an overall median power of 11.9% to detect a small effect,54.5% for a medium effect, and 93.9% for a large effect (corresponding to a Pearson’s r of 0.1, 0.3, and 0.5 or a Cohen’s d of 0.2, 0.5, and 0.8, respectively)”

This information merely provides information about the sample sizes in the different studies. Studies with small sample sizes have low power to detect a small effect size. As most studies had small sample sizes, the average power to detect small effects is low. However, this does not tell us anything about the actual power of studies to obtain significant results for two reasons. First, effect sizes in a meta-meta-analysis are extremely heterogeneous. Thus, not all studies are chasing small effect sizes. As a result, the power of studies is likely to be higher than the average power to detect small effect sizes. Second, the previous results showed that (a) sample sizes correlate with effect sizes and (b) there is no evidence of QRPs. This means that researchers are a priori deciding to use smaller samples to search for larger effects and larger samples to search for smaller effects. This means that formal or informal a priori power analyses ensure that small samples can have as much or more power than large samples. It is therefore not informative to conduct power analysis only based on information about sample size. Z-curve.2.0 avoids this problem and provides estimates of the actual mean power of studies. Moreover, it provides two estimates of power for two different populations of studies. One population are all studies that are conducted by intelligence researchers without selecting for significance. This estimate is the expected discovery rate. Z-curve also provides an estimate for the population of studies that produced a significant result. This population is of interest because only significant results can be used to claim a discovery; with an error rate of 5%. When there is heterogeneity in power, the mean power after selection for significance is higher than the average power before selection for significance (Brunner & Schimmack, 2020). When researchers attempt to replicate a significant results to verify that it was not a false positive result, mean power after selection for significance provides the average probability that an exact replication study will be significant. This information is valuable to evaluate the outcome of actual replication studies (cf. Schimmack, 2020).

Given the lack of publication bias, there are two ways to determine mean power before selection for significance. We can simply compute the average of significant results and we can use the estimated discovery rate. Figure 2 shows that both values are 52%. Thus, the average power of studies conducted by intelligence researchers is 52%. This is well-below the recommended level of 80%.

The picture is a bit better for studies with a significant result. Here the average power called the expected replication rate is 71% and the 95% confidence interval approaches 80%. Thus, we would expect that more than 50% of significant results in intelligence research can be replicated with a significant result in the replication study. This estimate is higher than for social psychology, where the expected replication rate is only 43%.

False Positive Psychology

The past decade has seen a number of stunning replication failures in social psychology (cf. Schimmack, 2020). This has led to a concern that most discoveries in psychology if not in all sciences are false positive results that were obtained with questionable research practices (Ioannidis, 2005 ; Simmons et al., 2011). So far, however, these concerns are based on speculations and hypothetical scenarios rather than actual data. Z-curve.2.0 makes it possible to examine this question empirically. Although it is impossible to say how many published results are in fact false positive results, it is possible to estimate the maximum number of false-positive results based on the discovery rate. (Soric, 1989). As the observed and expected discovery are identical, we can use the value of 52% as our estimate of the discovery rate. This implies that no more than 5% of the significant results are false positive results. Thus, the empirical evidence shows that most published results in intelligence research are not false positives.

Moreover, this finding implies that most non-significant results are false negatives or type-II errors. That is, the null-hypothesis is also false for non-significant results. This is not surprising because many intelligence studies are correlational and the nil-hypothesis that there is absolutely no relationship between two naturally occurring variables has a low a priori probability. This also means that intelligence researchers would benefit from specifying some minimal effect size for hypothesis testing or to focus on effect size estimation rather than hypothesis testing.


Nujiten et al. conclude that intelligence research is plagued by QRPs. “Based on our findings, we conclude that intelligence research from 1915 to 2013 shows signs that publication bias may have caused overestimated effects”. This conclusion ignores that small-sample effects are ambiguous. The superior z-curve analysis shows no evidence of publication bias. As a result, there is also no evidence that reported effect sizes are inflated.

The z-curve.2.0 analysis leads to a different conclusion. There is no evidence of publication bias, significant results have a probability of 70% to be replicated in exact replication studies and even if exact replication studies are impossible the discovery rate of 50% implies that we should expect the majority of replication attempts with the same sample sizes to be successful (Bartos & Schimmack, 2020). In replication studies with larger samples even more results should replicate. Finally, most of the non-significant results are false negative results because there are few true null-hypothesis in correlational research. A modest increase in sample sizes could easy achieve 80% power which is typically recommended.

A larger concern is the credibility of conclusions based on meta-meta-analyses. The problem is that meta-analysis focus on general main effects that are consistent across studies. In contrast, original studies may focus on unique patterns in the data that can not be subjected to meta-analysis because direct replications of these specific patterns are lacking. Future research therefore needs to code the focal hypothesis tests in intelligence articles to examine the credibility of intelligence research.

Another concern is the reliance on alpha = .05 as a significance criterion. Large genomic studies have a multiple comparison problem where 10,000 analyses can easily produce hundreds of significant results with alpha = .05. This problem is well-known and genetics studies now use much lower alpha levels to test for significance. A proper power analysis of these studies needs to use the actual alpha level rather than the standard level of .05. Z-curve is a flexible tool that can be used with different alpha levels. Therefore, I highly recommend z-curve for future meta-scientific investigations of intelligence research and other disciplines.


Bartoš, F., & Schimmack, U. (2020). z-curve.2.0: Estimating replication and discovery rates. Under review.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta- Psychology. MP.2018.874,

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22,  1359 –1366.

Sorić, B. (1989). Statistical “discoveries” and effect-size estimation.Journal of the American Statistical Association,84(406), 608-610.

The Structure of Agreeableness

In 1934, Thurstone published his groundbreaking article “The Vectors of Mind” that introduced factor analysis as an objective method to examine the structure of personality traits. His first application of his method to a list of 60 trait adjectives rated by 1,300 participants yielded five factors. It would take several more decades for personality psychologists to settle on a five-factor model of personality traits (Digman, 1990).

Although the five-factor model dominates personality, it is not the only theory of personality traits. The biggest rival of the Big Five model is the HEXACO model that postulates six factors (Ashton, Lee, Perugini et al, 2004; Lee & Ashton, 2004).

A recent special issue in the European Journal of Personality contained a target article by Ashton and Lee in favor of replacing the Big Five model with the HEXACO model and responses by numerous prominent personality psychologists in defense of the Big Five model.

The key difference between the Big Five model and the HEXACO model is the representation of pro-social (self-transcending) versus egoistic (self-enhancing) traits. Whereas the Big Five model assumes that a single general factor called agreeableness produces shared variance among pro-social traits, the HEXACO model proposes two distinct factors called Honesty-Humility and Agreeableness. While the special issue showcases the disagreement among persoality researchers, it fails to offer an empirical solution to this controversy.

I argue that the main reason for stagnation in research on the structure of personality is the reliance on Thurstone’s outdated method of factor analysis. Just like Thurstone’s multi-factor method was an important methodological contribution that replaced Spearman’s single-factor method, Jöreskog (1969) developed confirmatory factor analysis that addresses several limitations in Thurstone’s method. Neither Ashton and Lee nor any of the commentators suggested the use of CFA to test empirically whether pro-social traits are represented by one general or two broad traits.

There are several reasons why personality psychologists have resisted the use of CFA to study personality structure. Some of these reasons were explicitly states by McCrae, Zonderman, Costa, Bond, and Paunonen (1996).

1. “In this article we argue that maximum likelihood confirmatory factor analysis (CFA), as it has typically been applied in investigating personality structure, is systematically flawed” (p. 552).

2. “The CFA technique may be inappropriately applied” (p. 553)

3. “CFA techniques are best suited to the analysis of simple structure models” (p. 553)

4. “Even proponents of CFA acknowledge a long list of problems with the technique, ranging from technical difficulties in estimation of some models to the cost in time and effort involved.”

5. The major advantage claimed for CFA is its ability to provide statistical tests of the fit of empirical data to different theoretical models. Yet it has been known for years that the chi-square test on which most measures of fit are based is problematic”

6. A variety of alternative measures of goodness-of-fit have been suggested, but their interpretation and relative merits are not yet clear, and they do not yield tests of statistical

7. Data showing that chisquare tests lead to overextraction in this sense call into question
the appropriateness of those tests in both exploratory and confirmatory maximum likelihood models. For example, in the present study the largest single problem was a residual correlation between NEO-PI-R facet scales E4: Activity and C4: Achievement Striving. It would be possible to specify a correlated error term between these two scales, but the interpretation of such a term is unclear. Correlated error usually refers to a nonsubstantive source of variance. If Activity and Achievement Striving were, say, observer ratings, whereas all other variables were self-reports, it would make sense to control for this difference in method by introducing a correlated error term. But there are no obvious sources of correlated error among the NEO-PI-R
facet scales in the present study.

8, With increasing familiarity of the technique and the availability of convenient computer programs (e.g., Bentler, 1989; Joreskog & Sorbom, 1993), it is likely that many more researchers will conduct CFA analyses in the future. It is therefore essential to point out the dangers in an uncritical adoption and simplistic application of CFA techniques (cf. Breckler, 1990).

9. Structures that are known to be reliable showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure.

I may be giving McCrae et al. (1996) to much credit for the lack of CFA studies of personality structure, but their highly cited article clearly did not encourage future generations to explore personality structure with CFA. This is unfortunate because CFA has many advantages over traditional EFA.

The first advantage is the ability to compare model fit. McCrae et al. are overly concerned about the chi-square statistic that is sensitive to sample size and rewards overly complex models. Even when they published their article, other fit indices were already used to address these problems. Today researchers have even more experience with the evaluation of model fit. More important, well-established fit indices also reward parsimony and can favor a more parsimonious model with five factors over a less parsimonious model with six factors. This opens the door to head to head model comparisons of the Big Five and HEXACO model.

The second advantage of CFA is that factors are theoretically specified by researchers rather than empirically driven by patterns in the data. This means that CFA can find factors that are represented by as few as two items and factors that are represented by 10 or more items. In contrast, EFA will favor factors with many items. This means that researchers need to know the structure to represent all factors equally, which is not possible in studies that try to discover a structure and the number of factors. Different sampling from the item space may explain differences in personality structures with EFA, but is not a problem for CFA as long as factors are represented by a minimum of two items.

A third advantage of CFA is that it is possible to model hierarchical structures. Theoretically, most personality researchers agree that the Big Five or HEXACO are higher-order factors that explain only some of the variance in so-called facets or primary traits like modesty, altruism, forgiveness, or morality. However, EFA cannot represent hierarchies. This makes it necessary to examine the higher order structure of personality with scales that average several items. However, these scales are impure indicators of the primary factors and the impurities can distort the higher-order factor structure.

A fourth advantage is that CFA can also model method factors that can distort the actual structure of personality. For example, Thurstone’s results appeared to be heavily influenced by an evaluative factor. With CFA it is possible to model evaluative biases and other response styles like acquiescence bias to separate systematic method variance from the actual correlations between traits (Anusic et al., 2009).

Finally, CFA requires researchers to think hard about the structure of personality to specify a plausible model. In contrast, EFA produces a five factor or six-factor solution in less than a minute. Most of the theoretical work is then to find post-hoc explanations for the structure.

In short, most of the problems listed by McCrae et al. (1996) are not bugs, but features of CFA. The fact that they were unable to create a fitting model to their data only shows that they didn’t take advantage of these features. In contrast, I was able to fit a hierarchical model with method factors to Costa and McCrae’s NEO-PI-R questionnaire (Schimmack, 2019). The results mostly confirmed the Big Five model, but some facets did not have primary loadings on the predicted factor.

Here I am using hierarchical CFA to examine the structure of pro-social and anti-social traits. The Big Five model predicts that all traits are related to one general factor that is commonly called Agreeableness. The HEXACO model predicts that there are two relatively independent factors that are called Agreeableness and Honestiy-Humility (Ashton & Lee, 2005).


The data were collected by Crowe, Lynam, and Miller (2017). 1205 participants provided self-ratings on 104 items that were selected from various questionnaires to measure Big-5 agreeableness or HEXACO-agreeableness and honesty and humility. The data were analyzed with EFA. In the recent debate about the structure of personality, Lynam, Crowe, Vize, and Miller (2020) pointed to the finding of a general factor to argue that there is “Little Evidence That Honesty-Humility Lives Outside of FFM Agreeableness” (p. 530). They also noted that “at no point in the hierarchy did a separate Honesty-Humility factor emerge” (p. 530).

In their response, Ashton and Lee criticize the item-selection and argue that the authors did not sample enough items that reflect HEXACO-Agreeableness. “Now, the Crowe et al. variable set did contain a substantial proportion of items that should represent good markers of Honesty-
Humility, but it was sharply lacking in items that represent some of the best markers of HEXACO Agreeableness: for example, one major omission was item content related to
even temper versus anger-proneness, which was represented only by three items of the Patience facet of HEXACO-PI-R Agreeableness” (p. 565). They also are concerned about oversampling of other facets. “The Crowe et al. “Compassion” factor is one of the
all-time great examples of a ‘bloated specific.’ (p. 565). These are valid concerns for analyses with EFA that allow researchers to influence the factor structure by undersampling or oversampling items. However, CFA is not influenced by the number of items that reflect a specific facet. Even two items are sufficient to create a measurement model of a specific factor and to examine whether these two factors are fairly independent or substantially correlated. Thus, it is a novel contribution to examine the structure of pro-social and anti-social traits using CFA.

Exploratory Analyses

Before I present the CFA results, I also used EFA as implemented in MPLUS to examine the structure of the 104 items. Given the predicted hierarchical structure, it is obvious that neither a one-factor nor a two-factor solution should fit the data. The main purpose of an EFA analysis would be to explore the number of primary factors / facets that is represented in the 104 items. To answer this question, it is inappropriate to rely on the scree test that is overly influenced by item-selection or on the chi-square test that leads to over-extraction. Instead, factor solutions can be compared with standard fit indices from CFA such as the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), the Akaike Information Criterion (AIC), The Bayesian Information Criterion (BIC), and the sample-size adjusted BIC (SSA-BIC).

Although all indices take parsimony into account, three of the indices favor the most complex structure with 20 factors. BIC favors parsimony the most and settles for 12 factors as the optimal number. The sample-size adjusted BIC favors 16 factors. Evidently, the number of primary factors is uncertain. The reason is that many larger primary factors may be split into highly correlated more specific factors that are called nuances. The structure of primary factors can be explored with CFA analyses of primary factors. Thus, I settled for the 12 factor solution to start the CFA analyses.

Another way to determine the number of primary factors is to look at the scales that were represented in the items. There were 9 HEXACO scales: forgiving (A), gentle (A), flexible (A), patient (A), modest (H), fair (H), greed avoidant (H), sincere (H), and altruistic. In addition, there were the Big Five facets empathetic, trusting, straightforward, modest, compassionate, and polite. Some of these facets overlap with HEXACO facets, suggesting that the 12 factor solution may reflect the full content of the HEXACO and Big Five facets.

Exploration of the Primary Factors with CFA

Before I start, it is important to point out a major misconception about CFA. The term confirmatory has mislead researchers to assume that CFA should only be used to confirm a theoretically expected structure. Any post-hoc modifications of a model to fit actual data would then be a questionable research practice. This is a misconception that is based on Joreskog’s unfortunate decision to label his statistical method confirmatory. Joreskog’s actual description of his method that few researchers have read makes it clear that CFA can be used for exploration.

“We shall give examples of how a preliminary interpretation of the factors can be successively modified to determine a final solution that is acceptable from the point of view of both goodness of fit and psychological interpretation. It is highly desirable that a hypothesis that has been generated in this way should subsequently be confirmed or disproved by obtaining new data and subjecting these to a confirmatory analysis.” (p. 183).

There is nothing different from using CFA to explore data than to run a multiple regression analysis or an exploratory ANOVA. Data exploration is an important part of science. It is only questionable when exploratory analyses are falsely presented as confirmatory. For example, I could pretend that I came up with an elaborate theory of agreeableness and present the final model as theoretically predicted. This is known as HARKing (Kerr, 1998). However exploration is needed to generate a model that is worthwhile testing in a confirmatory study. As nobody has examined the structure of agreeableness with CFA, the first attempt to fit a model is naturally exploratory and can only serve as a starting point for the development of a structural model of agreeableness.

Items were considered to be candidate items of a factor if they loaded at least .3 on a factor. This may seem a low level, but it is unrealistic to expect high loadings of single items on a single factor.

The next step was to fit a simple structure to the items of each factor. When possible this model also included an acquiescence factor that coded direct versus reverse scored items. Typically, this model did not fit the data well. The next step was to look for correlated residuals that show shared variance between items. These items violate the assumption of local independence. That is, the only reason for the correlation between items should be the primary factor. Items can show additional shared variance for a number of reasons such as similar wording or shared specific content (nuances). When many items were available, one of the items with correlated residuals was deleted. Another criterion for item selection was the magnitude of primary loadings. A third criterion aimed to get a balanced number of direct and reverse scored items when this was possible.

Out of the 12 factors, 9 factors were interpretable and matched one of the a priori facets. The 9 primary factors were allowed to correlate freely with each other. The model had acceptable overall fit, CFI = .954, RMSEA = .030. Table 2 shows the factors and the items with their source and primary loadings.

The factor analysis only failed to distinguish clearly between the gentle and flexible facets of HEXACO-agreeableness. Thus, HEXACO agreeableness is represented by three rather than four factors. More important is that the four Honesty-Humility facets of the HEXACO model, namely Greed-Avoidance (F11), Sincerity(F9), Modest (F7), and Morality (F5) were clearly identified. Thus, it is possible to examine the relationship of Honesty-Humilty to Big-Five agreeableness with these data. Importantly, CFA is not affected by the number of indicators. Three items with good primary loadings are sufficient to identify a factor.

Table 3 shows the pattern of correlations among the 9 factors. The aim of a structural model of agreeableness is to explain this pattern of correlations. However, visual inspection of these correlations alone can provide some valuable insights about the structure of agreeableness. Strong evidence for a two-factor model would be provided by high correlations among the Honesty-Humility facets, high correlations among the Agreeableness facets, and low correlations between Honesty-Humulity and Agreeableness facets (Campbell & Fiske, 1959).

The pattern of correlations is only partially consistent with the two-factor structure of the HEXACO model. All of the correlations among the Honesty-Humility facets and all of the correlations among the Agreeableness facets are above .38. However, 6 out of the 20 cross-trait correlations are also above .38. Moreover all of the correlations are positive, suggesting that Honesty-Humility and Agreeableness are not independent.

For the sake of comparability, I also computed scales corresponding to the nine factors. Table 3 shows the correlations for scales that could be used for a two-step hierarchical analysis by means of EFA. In general correlations in Table 3 are weaker than correlations in Table 2 because correlations among scale scores are attenuated by random measurement error. The pattern of correlations remains the same, but there are more cross-trait correlations that exceed the lowest same-trait correlation of .31.

The Structure of Agreeableness

A traditional EFA has severe limitations in examining the structure of correlations in Table 3. One major limitation is that structural relations have to be due to higher-order factors. The other limitation is that 9 variables can only identify a small number of factors. These problems are typically overlooked because traditionally EFA results are not examined for goodness of fit. However, an EFA analysis with fit indices in MPLUS shows that the two-factor model does not meet the standard fit criteria of .95 for CFI and .06 for RMSEA, CFI = .939, RMSEA = .092. The two factors clearly did correspond to the Hexaco-Humility and Agreeableness factors, but even with secondary loadings, the model fails to fully account for the pattern of correlations. Moreover, the two factors were correlated r = .41, suggesting that they are not independent of each other.

Figure 1 shows a CFA model that is based on the factor correlations in Table 2. This model does not fit the data as well, CFI = .950, RMSEA = .031, as the simple measurement model with correlated factors, CFI = .954, RMSEA = .030, but I was unable to find a plausible model with better fit. I encourage others to do so. On the other hand, the model fit the data much better than the two-factor EFA model, CFI = .939, RMSEA = 092.

The model does show the expected separation of Honesty-Humility and Agreeableness facets, but the structure is more complex. First morality has nearly equal loadings on the Honesity-Humility and the Agreeableness facet. The markers of Honesty-Humility are the other three facets, modest, manipulative (reversed), and materialistic (reversed). I suggest that the common element of these facets is self-enhancement. The loading of morality suggests that individuals who are highly motivated to self-enhance are more likely to engage in immoral behaviors to do so.

Four of the remaining factors have direct loadings on the agreeableness factor. Considerate (caring/altruistic) has the highest loading, but all four have substantial loadings. The aggressive factor has the weakest relationship with agreeableness. One reason is that it is also related to neuroticism in the Big Five model. In this model, aggressiveness is linked to agreeableness indirectly by consideration and self-enhancement, suggesting that individuals who do not care about others (low consideration) and who care about themselves (self-enhancement) are more likely to aggress against others.

In addition, there were several correlated residuals between some facets. Trusting and forgiving shared unique variance. Maybe forgiveness is more likely to occur when individuals trust people that they had good intentions and are not going to repeat their transgression again in the future. Aggression showed shared variance with modesty and morality. One explanation could be that modestly is related to low assertiveness and that assertive people are more likely to use aggression. Morality may relate to aggression because it is typically considered immoral to harm others. Morality was also related to manipulativeness. Here the connection is rather obvious because manipulating people is immoral.

The model in Figure 1 should not be considered the ultimate solution to the controversy about the structure of pro-social and anti-social behaviors. To the contrary. The model should be considered the first structural model that actually fits the data. In contrast, previous results based on EFA produced models that approximated the structure, but never fit the actual data. Future research should test alternative models and these models should be evaluated in terms of model fit and theoretical plausibility (Joreskog, 1969).

Do these results answer the question whether there are five or six higher-order factors of personality? The answer is no. Consistent with the Five Factor model, the Honesty-Humility or Self-Enhancement factor is not independent of agreeableness. It is therefore reasonable to think about Honesty-Humility as subordinate to Agreeableness in a hierarchical model of traits. Personally, I favor this interpretation of the results. However, proponents of the HEXACO model may argue that the correlation between the agreeableness factor and the honesty-humility factor is low enough to make honesty-humility a separate factor. Moreover, it was not possible to control for evaluative bias (halo) variance in this model, and halo bias may have inflated the correlation between the two factors. On the other hand, if correlations of .4 are considered low enough to split Big-Five factors, it is possible that closer inspection of other Big Five domains can also be split into distinct, yet positively correlated factors. The main appeal of the Big Five model is that the five factors are fairly independent after controlling for evaluative bias variance. Moreover, many facets have loadings of .4 or even lower on the Big Five factor. It is therefore noteworthy that all correlations among the 9 factors were positive and suggest that a general factor produces covariation among them. The common factor can also be clearly interpreted in terms of the focus on self-interest versus other’s interests or needs to guide behaviors. Individuals high in agreeableness take others’s needs and feelings into account, whereas those low in agreeableness are guided strongly by self-interest. The split into two factors may be due to the fact that importance of self and other are not always in conflict with each other. Especially individuals low in self-enhancement may still differ in terms of their pro-social behaviours.


The main contribution of this blog post is to show the importance of testing model fit in investigations of the structure of personality traits. While it may seem self-evident that a theoretical model should fit the data, personality psychologists have failed to test model fit or ignored feedback that their models do not fit the data. Not surprisingly, personality psychologists continue to argue over models because it is easy to propose models, if they do not have to fit actual data. If structural research wants to be an empirical science, it has to subject models to empirical tests that can falsify models that do not fit the data.

I empirically showed that a simple two-factor model does not fit the data. At the same time, I showed that a model with a general agreeableness factor and several independent facets also does not fit the data. Thus, neither of the dominant models fits the data. At the same time, the data are consistent with the idea of a general factor underlying pro-social and anti-social behaviors, while the relationship among facets remains to be explored in more detail. Future research needs to control for evaluative bias variance and examine how the structure of agreeableness is embedded in a larger structural model of personality.


Ashton, M. C., Lee, K., Perugini, M., Szarota, P., de Vries, R. E., Di Blas, et al.
(2004). A six-factor structure of personality-descriptive adjectives: Solutions
from psycholexical studies in seven languages. Journal of Personality and Social
Psychology, 86, 356–366

Lee, K., & Ashton, M. C. (2004). Psychometric properties of the HEXACO Personality
Inventory. Multivariate Behavioral Research, 39, 329–358.

Cross-Cultural Comparisons of Personality: Beware of Method Factors

Ulrich Schimmack
Shigehiro Oishi


Personality ratings on a 25-item Big Five measures by two national samples (US, Japan) were analyzed with an item-level measurement model that separates method factors (acquiescence, halo bias) and trait factors. Results reveal a strong influence of halo bias on US responses that distort cultural comparisons in personality. After correcting for halo bias, Japanese were more conscientious, extraverted, open to experience and less neurotic and agreeable. The results support cultural differences in positive illusions and raises questions about the validity of studies that rely on scale means to examine cultural differences in personality.


Cultural stereotypes imply cross-cultural differences in personality traits. However, cross-cultural studies of personality do not support the validity of these cultural stereotypes (Terracciano et al., 2005). Whenever two measures produce divergent results, it is necessary to examine the sources of these discrepancies. One obvious reason could be that cultural stereotypes are simply wrong. It is also possible that scientific studies of personalty across culture produce misleading results (Perugini & Richetin, 2007). One problem for empirical studies of cross-cultural differences in personality is that cultural differences tend to be small. Culture explains at most 10% of the variance and often the percentages are much smaller. For example, McCrae et al. (2010) found that culture explained only 1.5% of the variance in agreeableness ratings. As some of this variance is method variance, the variance due to actual differences in agreeableness is likely to be less than 1%. With small amounts of valid variance, method factors can have a strong influence on the pattern of mean differences across cultures.

One methodological problem in cross-cultural studies of personality is that personalty measures are developed with a focus on the correlation of items with each other within a population. The item means are not relevant with the exception that items should avoid floor or ceiling effects. However, cross-cultural comparisons rely on differences in the item means. As item means have not been subjected to psychometric evaluations, it is possible that item means lack construct validity. Take “working hard” as an example. How hard people work could be influenced by culture. For example, in poor cultures people have to work harder to make a living. The item “working hard” may correctly reflect variation in conscientiousness within poor cultures and within rich cultures, but the differences between cultures would reflect environmental conditions rather than conscientiousness. As a result, it is necessary to demonstrate that cultural differences in item means are valid measures of cultural differences in personality.

Unfortunately, obtaining data from a large sample of nations is difficult and sample sizes are often rather small. For example, McCrae et al. (2010) examined convergent validity of Big Five scores with 18 nations. The only significant evidence of convergent validity was obtained for neuroticism, r = .44, and extraversion, r = .45. Openness and agreeableness even produced small negative correlations, r = -.27, r = -.05, respectively. The largest cross-cultural studies of personality had 36 overlapping nations (Allik et al., 2017; Schmitt et al., 2007). The highest convergent validity was r = .4 for extraversion and conscientiousness. Low convergent validity, r = .2, was observed for neuroticism and agreeableness, and the convergent validity for openness was 0 (Schimmack, 2020). These results show the difficulty of measuring personality across cultures and the lack of validated measures of cultures’ personality profiles.

Method Factors in Personality Measurement

It is well-known that self-ratings of personality are influenced by method factors. One factor is a stylistic factor in the use of response formats known as acquiescence bias (Cronbach, 1942, 1965). The other factor reflects individual differences in responding to the evaluative meaning of items known as halo bias (Thorndike, 1920). Both method factors can distort cross-cultural comparisons. For example, national stereotypes suggest that Japanese individuals are more conscientious than US American individuals, but mean scores of conscientiousness in cross-cultural studies do not confirm this stereotype (Oishi & Roth, 2009). Both method factors may artificially lower Japan’s mean score because Japanese respondents are less likely to use extreme scores (Min, Cortina, & Miller, 2016) and Asians are less likely to inflate their scores on desirable traits (Kim, Schimmack, & Oishi, 2012). In this article, we used structural equation modeling to separate method variance from trait variance to distinguish cultural differences in response tendencies from cultural differences in personality traits.

Convenience Samples versus National Samples

Another problem for empirical studies of national differences is that psychologists often rely on convenience samples. The problem with convenience samples is that personality can change with age and that there are regional differences in personality within nations (). For example, a sample of students at New York University may differ dramatically from a student sample at Mississippi State University or Iowa State University. Although regional differences tend to be small, national differences are also small. Thus, small regional differences can bias national comparisons. To avoid these biases it is preferable to compare national samples that cover all regions of a nation and a broad age range.

Modeling Approach

The purpose of our study is to advance research on cultural differences in personality by comparing a Japanese and a US national sample that completed the same Big Five personality questionnaire using a measurement model that distinguishes personality factors and method factors. The measurement model is an improved version of Anusic et al.’s (2009) halo-alpha-beta model (Schimmack, 2019). The model is essentially a tri-factor model.

Figure 1

That is, each item loads on three factor, namely (a) a primary loading on one of the Big Five factors, (b) a loading on an acquiescence bias factor, and (c) a loading on the evaluative bias/halo factor. As Big Five measures typically do not show a simple structure, the model also can include secondary loadings on other Big Five factors. This measurement model has been successfully fitted to several Big Five questionnaires (Schimmack, 2019). This is the first time, the model is applied to a multiple-group model to compare measurement models for US and Japanese samples.

We first fitted a very restrictive model that assumed invariance across the two factors. Given the lack of psychometric cross-cultural comparisons, we expected that this model would not have acceptable fit. We then modified the model to allow for cultural differences in some primary factor loadings, secondary factor loadings, and item intercepts. This step makes our work exploratory. However, we believe that this exploratory work is needed as a first step towards psychometrically sound measurement of cultural differences.


Participants (N = 952 Japanese, 891 US) were recruited by Nikkei Research Inc. and its U.S. affiliate using a national probabilistic sampling method based on gender and age. The mean age was 44. The data have been used before to compare the influence of personality on life-satisfaction judgments, but without comparing mean levels in personality and life-satisfaction (Kim, Schimmack, Oishi, & Tsutsui, 2018).


The Big Five items were taken from the International Personality Item Pool (Goldberg et al., 2006). There were five items for each of the Big Five dimensions (Table 1).


We first fitted a model without mean structure to the data. A model with strict invariance for the two samples did not have acceptable fit using RMSEA < .06 and CFI > .95 as criterion values, RMSEA = .064, CFI = .834. However, CFI values should not be expected to reach .95 in models with single-item indicators (Anusic et al., 2009). Therefore, the focus is on RMSEA. We first examined modification indices (MI) of primary loadings. We used MI > 30 as a criterion to free parameters to avoid overfitting the model. We found seven primary loadings that would improve model fit considerably (n4, e3, a1, a2, a3, a4, c4). Freeing these parameter improved the model (RMSEA = .060, CFI = .857). We next examined loadings on the halo factor because it is likely that some items differ in their connotative meaning across languages. However, we found only two notable MIs (o1, c4). Freeing these parameters improved model fit (RMSEA = .057, CFI = .871). We identified six secondary loadings that differed notably across cultures. One was a secondary loading on neuroticism (e4) and four were secondary loadings on agreeableness (n5, e1, e3, o4), and one was a secondary loading on conscientiousness (n3). Freeing these parameters improved model fit (RMSEA = .052, CFI = .894). We were satisfied with this measurement model and continued with the means model. The first model fixed the item intercepts and factor means to be identical. This model had worse fit than the model without a means structure (RMSEA = .070, CFI = .803). The biggest MI was observed for the mean of the halo factor. Allowing for mean differences in halo improved model fit considerably (RMSEA = .060, CFI = .849). MIs next suggested to allow for mean differences in extraversion and agreeableness. We next allowed for mean differences in the other factors. This further improved model fit (RMSEA = .058, CFI = .864), but not as much. MIs suggested seven items with different item intercepts (n1, n5, e3, o3, a5, c3 c5). Relaxing these parameters improved model fit close to the level for the model without a mean structure (RMSEA = .053, CFI = .888).

Table 1 shows the primary loadings and the loadings on the halo factor for the 25 items.

Table 1

The results show very similar primary loadings for most items. This means that factors have similar meaning in the two samples and that it is possible to compare the two cultures. Nevertheless, there are some differences that could bias comparisons based on item-sum-scores. The item “feeling comfortable around people” loads much more strongly on the extraversion factor in the US than in Japan. The agreeableness items “insult people” and “sympathize with others’ feelings” also load more strongly in the US than in Japan. Finally, “making a mess of things” is a conscientiousness item in the US, but not in Japan. The fact that item loadings are more consistent with the theoretical structure can be attributed to the development of the items in the US.

A novel and important finding is that most loadings on the halo factor are also very similar across nations. For example, the item “have excellent ideas” shows a high loading for the US and Japan. This finding contradicts the idea that evaluative biases are culture-specific (Church et al., 2014). The only notable difference is the item “make a mess of things” that has no notable loading on the halo factor in Japan. Even in English, the meaning of this item is ambiguous and future studies should replace this item with a better item. The correlation between the halo loadings for the two samples is high, r = .96.

Table 2 shows the item means and the item intercepts of the model.

Table 2

The item means of the US sample are strongly correlated with the loadings on the halo factor, r = .81. This is a robust finding in Western samples. More desirable items are endorsed more. The reason could be that individuals actually act in desirable ways most of the time and that halo bias influences item means. Surprisingly, there is no notable correlation between item means and loadings on the halo factor for the Japanese sample, r = .08. This pattern of results suggests that US means are much more strongly influenced by halo bias than Japanese means. Further evidence is provided by inspecting the mean differences. For desirable items (low N, high E, O, A, & C) US means are always higher than Japanese’ means. For undesirable items, the US means are always lower than Japanese’ means, except for the item “stay in the background” where the means are identical. The difference scores are also positively correlated with the halo loadings, r = .90. In conclusion, there is strong evidence that halo bias distorts the comparison of personality in these two samples.

The item intercepts show cultural differences in items after taking cultural differences in halo and the other factors into account. Notable differences were observed from some items. Even after controlling for halo and extraversion, US respondents report higher levels of being comfortable around people than Japanese. This difference fits cultural stereotypes. After correcting for halo bias, Japanese now score higher on getting chores done right away than Americans. This also fits cultural stereotypes. However, Americans still report paying more attention to detail than Japanese, which is inconsistent with cultural stereotypes. Extensive validation research is needed to examine whether these results reflect actual cultural differences in personality and behaviours.

Figure 2 shows the mean differences on the Big Five factors and the two bias factors.

Figure 2

Figure 2 shows a very large difference in halo bias. The difference is so large that it seems implausible. Maybe the model is overcorrecting, which would bias the mean differences for the actual traits in the opposite direction. There is little evidence of cultural differences in acquiescence bias. One open question is whether the strong halo effect is entirely due to evaluative biases. It is also possible that a modesty bias plays a role because modesty implies less extreme responses to desirable items and less extreme responses to undesirable items. To separate the two, it would be necessary to include frequent and infrequent behaviours that are not evaluative.

The most interesting result for the Big Five factors is that the Japanese sample scores higher in conscientiousness than the US sample after halo bias is removed. This reverses the mean differences in this sample and previous studies that show higher conscientiousness for US than Japanese samples (). The present results suggest that halo bias masks the actual difference in conscientiousness. However, other results are more surprising. In particular, the present results suggest that Japanese people are more extraverted than Americans. This contradicts cultural stereotypes and previous studies. The problem is that cultural stereotypes could be wrong and that previous studies did not control for halo bias. More research with actual behaviours and less evaluative items is needed to draw strong conclusions about personality differences between cultures.


It has been known for 100 years that self-ratings of personality are biased by connotative meaning. At least in North America it is common to see a strong correlation between the desirability of items and the means of self-ratings. There is also consistent evidence that Americans rate themselves in a more desirable manner than the average American (). However, this does not mean that Americans are seeing themselves as better than everybody else. In fact, self-ratings tend to be slightly less favorable than ratings of friends or family members (), indicating a general evaluative biases to rate oneself and close others favorably.

Given the pervasiveness of evaluative biases in personality ratings it is surprising that halo bias has received so little attention in cross-cultural studies of personality. One reason could be the lack of a good method to measure and remove halo variance from personality ratings. Despite early attempts to detect socially desirable responding, lie scales have shown little validity as bias measures (ref). The problem is that manifest scores on lie scales contain as much valid personality variance as bias variance. Thus, correcting for scores on these scales literally throws out the baby (valid variance) with the bathwater (bias variance). Structural equation modeling (SEM) solves this problem by spitting observed variances into unobserved or latent variances. However, personality psychologists have been reluctant to take advantage of SEM because item models require large samples and theoretical models were too simplistic and produced bad fit. Informed by multi-rater studies that emerged in the 1990s, we developed a measurement model of the Big Five that separates personality variance from evaluative bias variance (Anusic, et al., 2009; Schimmack, Kim, & 2012; Schimmack, 2019). Here we applied this model for the first time to cross-cultural data to examine whether cultures differ in halo bias. The result suggest that halo bias has a strong influence on personality ratings in the US, but not in Japan. The differences in halo bias distort comparisons on the actual personality traits. While raw scores suggest that Japanese people are less conscientious than Americans, the corrected factor means suggest the opposite. Japanese participants also appeared to be less neurotic, more extraverted and open to experiences, which was a surprising result. Correcting for halo bias did not change the cultural differences in agreeableness. Americans were more agreeable than Japanese with and without correction for halo bias. Our results do not provide a conclusive answer about cultural differences in personality, but they shed a new light on several questions in personality research.

Cultural Differences in Self-enhancement

One unresolved question in personality psychology is whether positive biases in self-perceptions also known as self-enhancement are unique to American or Western cultures or whether they are a universal phenomenon (Church et al., 2016). One problem are different approaches to the measurement of self-enhancement. The most widely used method are social comparisons where individuals compare themselves to an average person. These studies tend to show a persistent better-than-average effect in all cultures (ref). However, this finding does not imply that halo biases are equally strong in all cultures. Brown and Kobayashi (2002) found better-than-average effects in the US and Japan, but Japanese ratings of the self and others were less favorable than those in the US. Kim et al. (2012) explain this pattern with a general norm to be positive in North America that influences ratings of the self as well as ratings of others. Our results are consistent with this view and suggests that self-enhancement is not a universal tendency. More research with other cultures is needed to examine which cultural factors moderate halo biases.

Rating Biases or Self-Perception Biases

An open question is whether halo biases are mere rating biases or reflect distorted self-perceptions. One model suggests that participants are well aware of their true personality, but merely present themselves in a more positive light to others. Another model suggests that individuals truly believe that their personality is more desirable than it actually is. It is not easy to distinguish between these two models empirically. zzz

Halo Bias and the Reference Group Effect

In an influential article, Heine et al. (2002) criticized cross-cultural comparisons in personality ratings as invalid. The main argument was that respondents adjust the response categories to cultural norms. This adjustment was called the reference group effect. For example, the item “insult people” is not answered based on the frequency of insults or a comparison of the frequency of insults to other behaviours. Rather it is answered in comparison to the typical frequency of insults in a particular culture. The main prediction made by the reference group effect is that responses in all cultures should cluster around the mid-point of a Likert-scale that represents the typical frequency of insults. As a result, cultures could differ dramatically in the actual frequency of insults, while means on the subjective rating scales are identical.

The present results are inconsistent with a simple reference group effect. Specifically, the US sample showed notable variation in item means that was related to item desirability. As a result, undesirable items like “insult people” had a much lower mean, M = 1.83, than the mid-point of the scale (3), and desirable items “have excellent ideas” had a higher mean (M = 3.73) than the midpoint of the scale. This finding suggests that halo bias rather than a reference group effect threatens the validity of cross-cultural comparisons.

Reference group effects may play a bigger role in Japan. Here item means were not related to item desirabilty and clustered more closely around the mid-point of the scale. The highest mean was 3.56 for worry and the lowest mean was 2.45 for feeling comfortable around people. However, other evidence contradicts this hypothesis. After removing effects of halo and the other personality factors, item intercepts were still highly correlated across the two national samples, r = .91. This finding is inconsistent with culture-specific reference groups that would not produce consistent item intercepts.

Our results also provide a new explanation for the low conscientiousness of Japanese samples. A reference group effect would not predict a significantly lower level of conscientiousness. However, a stronger halo effect in the US explains this finding because conscientiousness is typically assessed with desirable items. Our results are also consistent with the finding that self-esteem and self-enhancement are more pronounced in the US than in Japan (Heine & Buchtel, 2009). These aforementioned biases inflate conscientiousness scores in the US. After removing this bias, Japanese rate themselves as more conscientious than US Americans.

Limitations and Future Directions

We echo previous calls for validation of personality scores of nations (Heine & Buchtel, 2009). The current results are inconsistent across questionnaires and even the low level of convergent validity may be inflated by cultural differences in response styles. Future studies should try to measure personality with items that minimize social desirability and use response formats that avoid the use of reference groups (e.g., frequency estimates). Moreover, results based on ratings should be validated with objective indicators of behaviours.

Future research also needs to take advantage of developments in psychological measurement and use models that can identify and control for response artifacts. The present model shows the ability of separating evaluative biases or halo variance from actual personality variance. Future studies should use this model to compare a larger number of nations.

The main limitation of our study is the relatively small number of items. The larger the number of items, the easier it is to distinguish item-specific variance, method variance, and trait variance. The measure also did not properly take into account that the Big Five are higher-order factors of more basic traits called facets. Measures like the BFI-2 or the NEO-PI3 should be used to study cultural differences at the facet level, which often shows unique influences of culture that are different from effects on the Big Five (Schimmack, 2020).

We conclude with a statement of scientific humility. The present results should not be taken as clear evidence about cultural differences in personality. Our article is merely a little step towards the goal of measuring personality differences across cultures. One obstacle in revealing such differences is that national differences appear to be relatively small compared to the variation in personality within nations. One possible explanation for this is that variation in personality is caused more by biological than cultural factors. For example, twin studies suggest that 40% of the variance in personality traits is caused by genetic variation within a population, whereas cross-cultural studies suggest that at most 10% of the variance is caused by cultural influences on population means. Thus, while uncovering cultural variation in personality is of great scientific interest, evidence of cultural differences between nations should not be used to stereotype individuals from different nations. Finally, it is important to distinguish between personality traits that are captured by Big Five traits and other personality attributes like attitudes, values, or goals that may be more strongly influenced by culture. The key novel contribution of this article is to demonstrate that cultural differences in response styles exists and distort national comparisons of personality with simple scale means. Future studies need to take response styles into account.


Cronbach, L. J. (1942). Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 33(6), 401–415.

Heine, S. J., & Buchtel, E. E. (2009). Personality: The universal and the culturally specific. Annual Review of Psychology, 60, 369–394.

Perugini, M., & Richetin, J. (2007). In the land of the blind, the one-eyed man is king. European Journal of Personality, 21(8), 977–981.

Schimmack, U. (2020). Personality science: The science of human diversity. TopHat, 978-1-77412-253-2.

Terracciano, A. et al. (2005). National character does not reflect mean personality
trait levels in 49 cultures. Science, 310, 96–100.

JPSP:PPID = Journal of Pseudo-Scientific Psychology: Pushing Paradigms – Ignoring Data


Ulrich Orth, Angus Clark, Brent Donnellan, Richard W. Robins (DOI: 10.1037/pspp0000358) present 10 studies that show the cross-lagged panel model (CLPM) does not fit the data. This does not stop them from interpreting a statistical artifact of the CLPM as evidence for their vulnerability model of depression. Here I explain in great detail why the CLPM does not fit the data and why it creates an artifactual cross-lagged path from self-esteem to depression. It is sad that the authors, reviewers, and editors were blind to the simple truth that a bad-fitting model should be rejected and that it is unscientific to interpret parameters of models with bad fit. Ignorance of basic scientific principles in a high-profile article reveals poor training and understanding of the scientific method among psychologists. If psychology wants to gain respect and credibility, it needs to take scientific principles more seriously.


Psychology is in a crisis. Researchers are trained within narrow paradigms, methods, and theories that populate small islands of researchers. The aim is to grow the island and to become a leading and popular island. This competition between islands is rewarded by an incentive structure that imposes the reward structure of capitalism on science. The winner gets to dominate the top journals that are mistaken as outlets of quality. However, just like Coke is not superior to Pepsi (sorry Coke fans), the winner is not better than the losers. They are just market leaders for some time. No progress is being made because the dominant theories and practices are never challenged and replaced with superior ones. Even the past decade that has focused on replication failures has changed little in the way research is conducted and rewarded. Quantity of production is rewarded, even if the products fail to meet basic quality standards as long as naive consumers of researchers are happy.

This post is about the lack of training in the analysis of longitudinal data with a panel structure. A panel study essentially repeats the measurement of one or several attributes several times. Nine years of undergradute and graduate training leave most psychologists without any training how to analyze these data. This explains why the cross-lagged panel model (CLPM) was criticized four decades ago (Rogosa, 1980), but researchers continue to use it with the naive assumption that it is a plausible model to analyze panel data. Critical articles are simply ignored. This is the preferred way of dealing with criticism by psychologists. Here, I provide a detailed critique of CLPM using Orth et al.’s data ( and simulations.

Step 1: Examine your data

Psychologists are not trained to examine correlation matrices for patterns. They are trained to submit their data to pre-specified (cookie-cutter) models and hope that the data fit the model. Even if the model does not fit, results are interpreted because researchers are not trained in modifying cookie cutter models to explore reasons for bad fit. To understand why a model does not fit the data, it is useful to inspect the actual pattern of correlations.

To illustrate the benefits of visual inspection of the actual data, I am using the data from the Berkeley Longitudinal Study (BLS), which is the first dataset listed in Orth et al.’s (2020) table that lists 10 datasets.

To ease interpretation, I break up the correlation table into three components, namely (a) correlations among self-esteem measures (se1-se4 with se1-se4), correlations among depression measures (de1-de4 with de1-de4), and correlations of self-esteem measures with depression measures (se1-se4 with de1-de4);

Table 1

Table 1 shows the correlation matrix for the four repeated measurements of self-esteem. The most important information in this table is how much the magnitude of the correlations decreases along the diagonals that represent different time lags. For example, the lag-1 correlations are .76, .79, and .74, which approximately average to a value of .76. The lag-2 correlations are .65 and .69, which averages to .67. The lag-3 correlation is .60.

The first observation is that correlations are getting weaker as the time-lag gets longer. This is what we would expect from a model that assumes self-esteem actually changes over time, rather than just fluctuating around a fixed set-point. The latter model implies that retest correlations remain the same over different time lags. So, we do have evidence that self-esteem changes over time, as predicted by the cross-lagged panel model.

The next question is how much retest correlations decrease with increasing time lags. The difference from lag-1 to lag-2 is .74 – .67 = .07. The difference from lag-2 to lag-3 is .67 – .60, which is also .07. This shows no leveling off of the decrease in these data. It is possible that the next wave would produce a lag-4 correlation of .53, which would be .07 lower than then lag-3 correlation. However, a difference of .07 is not very different from 0, which would imply that change asymptotes at .60. The data are simply insufficient to provide strong information about this.

The third observation is that the lag-2 correlation is much stronger than the square of the lag-1 correlations, .67 > .74^2 = .55. Similarly, the lag-3 correlation is stronger than the product of the lag-1 and lag-2 correlations, .60 > .74 * .67 = .50 This means that a simple autoregressive model with observed variables does not fit the data. However, this is exactly the model of Orth et al.’s CLPM.

It is easy to examine the fit of this part of the CLPM model, by fitting an autoregressive model to the self-esteem panel data.

se2-se4 PON se1-3 ! This command regresses each measure on the previous measure (n on n-1).
! There is one thing I learned from Orth et al., and it was the PON command of MPLUS

Table 2

Table 2 shows the fit of the autoregressive model. While CFI meets the conventional threshold of .95 (higher is better), RMSEA shows terrible fit of the model (.06 or lower are considered acceptable). This is a problem for cookie-cutter researchers who think CLPM is a generic model that fits all data. Here we see that the model makes unrealistic assumptions and we already know what the problem is based on our inspection of the correlation table. The model predicts more change than the data actually show. We are therefore in a good position to reject the CLPM as a viable model for these data. This is actually a positive outcome. The biggest problem in correlational research are data that fit all kinds of models. Here we have data that actually disconfirm some models. Progress can be made, but only if we are willing to abandon the CLPM.

Now let’s take a look at the depression data, following the same steps as for the self-esteem data.

Table 3

The average lag-1 correlation is .43. The average lag-2 correlaiton is .45, and the lag-3 correlation is .4. These results are problematic for an autoregressive model because the lag-2 correlation is not even lower than the lag-1 correlation.

Once more it is hard to tell, whether retest-correlations are approaching an asymptote. In this case, the lag-2 minus lag-1 difference is -.02 and the lag-3 minus lag-2 difference is .05.

Finally, it is clear that the autoregressive model with manifest variables overestimates change. The lag-2 correlation is stronger than the square of the lag-1 correlations, .45 > .43^2 = .18, and the lag-3 correlation is stronger than the lag-1 * lag-2 correlation, .40 > .43*.45 = .19.

Given these results, it is not surprising that the autoregressive model fits the data even less than for the self-esteem measures (Table 4).

de2-de4 PON de1-de3 ! regress each depression measure on the previous one.

Talble 4

Even the CFI value is now in the toilet and the RMSEA value is totally unacceptable. Thus, the basic model of stability and change implemented in CLPM is inconsistent with the data. Nobody should proceed to build a more complex, bivariate model if the univariate models are inconsistent with the data. The only reason why psychologists do so all the time is that they do not think about CLPM as a model. They think CLPM is like a t-test that can be fitted to any panel data without thinking. No wonder psychology is not making any progress.

Step 2: Find a Model That Fits the Data

The second step may seem uncontroversial. If one model does not fit the data, there is probably another model that does fit the data and this model has a higher chance of being the model that reflects the causal processes that produced the data. However, psychologists have an uncanny ability to mess up even the simplest steps in data analysis. They have convinced themselves that it is wrong to fit models to data. The model has to come first so that the results can be presented as confirming a theory. However, what is the theoretical rational of the CLPM? It is not motivated by any theory of development, stability, or change. It is as atheoretical as any other model. It only has the advantage that it became popular on an island of psychology and now people use it without being questioned about it. Convention and conformity are not pillars of science.

There are many alternative models to CLPM that can be tried. One model is 60 years old and was introduced by Heise (1969). It is also an autoregressive model, but it also allows for occassion specific variance. That is, some factors may temporarily change individuals’ self-esteem or depression without any lasting effects on future measurements. This is a particularly appealing idea for a symptom checklist of depression that asks about depressive symptoms in the past four weeks. Maybe somebody’s cat died or it was a midterm period and depressive symptoms were present for a brief period, but these factors have no influence on depressive symptoms a year later.

I first fitted Heise’s model to the self-esteem data.

sse1 BY se1@1;
sse2 BY se2@1;
sse3 BY se3@1;
sse4 BY se4@1;
sse2-sse4 PON sse1-sse3 (stability);
se1-se4 (se_osv) ! occasion specific variance in self-esteem

Model fit for this model is perfect. Even the chi-square test is not significant (which in SEM is a good thing, because it means the model closely fits the data).

Model results show that there is significant occasion specific variance. After taking this variance into account the stability of the variance that is not occassion-specific, called state variance by Heise, is around r = .9 from one occasion to the next.

Fit for the depression data is also perfect.

There is even more occasion specific variance in depressive symptoms, but the non-occasion-specific variance is even more stable as the non-occasion-specific variance in self-esteem.

These results make perfect sense if we think about the way self-esteem and depression are measured. Self-esteem is measured with a trait measure of how individuals see themselves in general, ignoring ups and downs and temporary shifts in self-esteem. In contrast, depression is assessed with questions about a specific time period and respondents are supposed to focus on their current ups and downs. Their general disposition should be reflected in these judgments only to the extent that it influences their actual symptoms in the past weeks. These episodic measures are expected to have more occasion specific variance if they are valid. These results show that participants are responding to the different questions in different ways.

In conclusion, model fit and the results favor Heise’s model over the cookie-cutter CLPM.

Step 3: Putting the two autoregressive models together

Let’s first examine the correlations of self-esteem measures with depression measures.

The first observation is that the same-occasion correlations are stronger (more negative) than the cross-occasion correlations. This suggests that occasion specific variance in self-esteem is correlated with occasion specific variance in depression.

The second observation is that the lagged self-esteem to depression correlations (e.g., se1 with de2) do not become weaker (less negative) with increasing time lag, lag-1 r = -.36, lag-2 r = -.32, lag-3 r = .33.

The third observation is that the lagged depression to self-esteem correlations (e.g., de1 with se2) do not decrease from lag-1 to lag-2, although they do become weaker from lag-2 to lag-3, lag-1 r = -.44, lag-2 r = -.45, lag-3 r = -.35.

The fourth observation is that the lagged self-esteem to depression correlations (se1 with de2) are weaker than the lagged depression to self-esteem (de1 with se2) correlations . This pattern is expected because self-esteem is more stable than depressive symptoms. As illustrated in the Figure below, the path from de1-se4 is stronger than the path form se1 to de4 because the path from se1 to se4 is stronger than the path from de1 to de4.

Regression analysis or structural equation modeling is needed to examine whether there are any additional lagged effects of self-esteem on depressive symptoms. However, a strong cross-lagged path from se1 to de4 would produce a stronger correlation of se1 and de4, if stability were equal or if the effect is strong. So, a stronger lagged self-esteem to depression correlation than a lagged depression to self-esteem correlation would imply a cross-lagged effect from self-esteem to depression, but the reverse pattern is inconclusive because self-esteem is more stable.

Like Orth et al. (2020) I found that Heise’s model did not converge. However, unlike Orth et al. I did not conclude from this finding that the CLPM model is preferable. After all, it does not fit the data. Model convergence is sometimes simply a problem of default starting values that work for most models but not for all models. In this case, the high stability of self-esteem produced a problem with default starting values. Just setting this starting value to 1 solved the convergence problem and produced a well-fitting result.

The model results show no negative lagged prediction of depression from self-esteem. In fact, a small positive relationship emerged, but it was not statistically significant.

It is instructive to compare these results with the CLPM results. The CLPM model is nested in the Heise model. The only difference is that the occasion-specific variances of depression and self-esteem are fixed to zero. As these parameters were constrained across occasions, this model has two fewer parameters and the model df increase from 24 to 26. Model fit decreased in the more parsimonious model. However, the overall fit is not terrible, although RMSEA should be below .06 [Interestingly, the CFI value changed from a value over .95 to a value .94 when I estimated the model with MPLUS8.2, whereas Orth et al. used MPLUS8]. This shows the problem of relying on overall fit to endorse models. Overall fit is often good with longitudinal data because all models predict weaker correlations over longer time intervals. The direct model comparison shows that the Heise model is the better model.

In the CLPM model, self-esteem is a negative lagged predictor of depression. This is the key finding that Orth and colleagues have been using to support the vulnerability model of depression (low self-esteem leads to depression).

Why does the CLPM model produce negative lagged effects of self-esteem on depression. The reason is that the model underestimates the long-term stability of depression from time 1 to time 3 and time 4. To compensate for this it can use self-esteem that is more stable and then link self-esteem at time 2 with depression at time 3 (.745 * -.191) and self-esteem at time 3 with depression at time 4 (.742 * .739 * -.190). But even this is not sufficient to compensate for the misprediction of depression over time. Hence, the worse fit of the model. This can be seen by examining the model reproduced correlation matrix in the MPLUS Tech1 output.

Even with the additional cross-lagged path, the model predicts only a correlation of r = .157 from de1 to de4, while the observed correlation was r = .403. This discrepancy merely confirms what the univariate models showed. A model without occasion-specific variances underestimates long-term stability.

Interem Conclusion

Closer inspection of Orth et al.’s data shows that the CLPM does not fit the data. This is not surprising because it is well-known that the cross-lagged panel model often underestimates long-term stability. Even Orth has published univariate analyses of self-esteem that show a simple autoregressive model does not fit the data (Kuster & Orth, 2013). Here I showed that using the wrong model of stability creates statistical artifacts in the estimation of cross-lagged path coefficients. The only empirical support for the vulnerability model of depression is a statistical artifact.

Replication Study

I picked the My Work and I (MWI) dataset for a replication study. I picked it because it used the same measures and had a relatively large sample size (N = 663). However, the study is not an exact or direct replication of the previous one. One important difference is that measurements were repeated every two months rather than every year. The length of the time interval can influence the pattern of correlations.

There are two notable differences in the correlation table. First, the correlations increase with each measurement from .782 for se1 with se2 to .871 from se4 to se5. This suggests a response artifact, such as a stereotypic response styles that inflates consistency over time. This is more likely to happen for shorter intervals. Second, the difference between correlations with different lags are much smaller. They were .07 in the previous study. Here the differences are .02 to .03. This means there is hardly any autoregressive structure, suggesting that a trait model may fit the data better.

The pattern for depression is also different from the previous study. First, the correlations are stronger, which makes sense, because the retest interval is shorter. Somebody who suffers from depressive symptoms is more likely to still suffer two months later than a year later.

There is a clearer autoregressive structure for depression and no sign of stereotypic responding. The reason could be that depression was assessed with a symptom checklist that asks about the previous four weeks. As this question covers a new time period each time, participants may avoid stereotypic responding.

The depression-self-esteem correlations also become stronger (more negative) over time from r = -.538 to r = -.675. This means that a model with constrained coefficients may not fit the data.

The higher stability of depression explains why there is no longer a consistent pattern of stronger lagged depression to self-esteem correlations (de1 with se2) above the diagonal than self-esteem to depression correlations (se1 with de2) below the diagonal. Five correlations are stronger one way and five correlations are stronger the other way.

For self-esteem, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .170, CFI = .920). Allowing for occasion-specific variance improved fit and fit was excellent (RMSEA = .002, CFI = .999). For depression, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .113, CFI = .918). The model with occasion-specific variance fit better and had excellent fit (RMSEA = .029, CFI = .995). These results replicate the previous results and show that CLPM does not fit because it underestimates stability of self-esteem and depression.

The CLPM model also had bad fit in the original article (RMSEA = .105, CFI = .932). In comparison, the model with occasion specific variances had much better fit (RMSEA = .038, CFI = .991). Interestingly, this model did show a small, but statistically significant path from self-esteem to depression (effect size r = -.08). This raises the possibility that the vulnerability effect may exist for shorter time intervals of a few months, but not for longer time intervals of a year or more. However, Orth et al. do not consider this possibility. Rather, they try to justify the use of the CLPM to analyze panel data even though the model does not fit.


Orth et al. note “fit values were lowest for the CLPM” (p. 21) with a footnote that recognizes the problem of the CLPM, “As discussed in the Introduction, the CLPM underestimates the long-term stability of constructs, and this issue leads to misfit as the number of waves increases” (p. 63).

Orth et al. also note correctly that the cross-lagged effect of self-esteem on depression emerges more consistently with the CLPM model. By now it is clear why this is the case. It emerges consistently because it is a statistical artifact produced by the underestimation of stability in depression in the CLPM model. However, Orth et al.’s belief in the vulnerability effect is so strong that they are unable to come to a rational conclusion. Instead they propose that the CLPM model, despite its bad fit, shows something meaningful.

We argue that precisely because the prospective effects tested in the CLPM are also based on between-person variance, it may answer questions that cannot be assessed with models that focus on within-person effects. For example, consider the possible effects of warm parenting on children’s self-esteem (Krauss, Orth, & Robins, 2019): A cross-lagged effect in the CLPM would indicate that children raised by warm parents would be more likely to develop high self-esteem than children raised by less warm parents. A cross-lagged effect in the RI-CLPM would indicate that children who experience more parental warmth than usual at a particular time point will show a subsequent increase in self-esteem at the next time point, whereas children who experience less parental warmth than usual at a particular time point will show a subsequent drop in self-esteem at the next time point

Orth et al. then point out correctly that the CLPM is nested in other models and makes more restrictive assumptions about the absence of occasion specific variance or trait variance, but they convince themselves that this is not a problem.

As was evident also in the present analyses, the fit of the CLPM is typically not as good as the fit of the RI-CLPM (Hamaker et al., 2015; Masselink, Van Roekel, Hankin, et al., 2018). It is important to note that the CLPM is nested in the RI-CLPM (for further information about how the models examined in this research are nested, see Usami, Murayama, et al., 2019). That is, the CLPM is a special case of the RI-CLPM, where the variances of the two random intercept factors and the covariance between the random intercept factors are constrained to zero (thus, the CLPM has three additional degrees of freedom). Consequently, with increasing sample size, the RI-CLPM necessarily fits significantly better than the CLPM (MacCallum, Browne, & Cai, 2006). However, does this mean that the RI-CLPM should be preferred in model selection? Given that the two models differ in their conceptual meaning (see the discussion on between- and within-person effects above), we believe that the decision between the CLPM and RI-CLPM should not be based on model fit, but rather on theoretical considerations.

As shown here, the bad fit of CLPM is not an unfair punishment of a parsimonious model. The bad fit reveals that the model fails to model stability correctly. To disregard bad fit and to favor the more parsimonious model even if it doesn’t fit makes no sense. By the same logic, a model without cross-lagged paths would be more parsimonious than a model with cross-lagged paths and we could reject the vulnerability model simply because it is not parsimonious. For example, when I fitted the model with occasion specific variances and without cross-lagged paths, model fit was better than model fit of the CLPM (RMSEA = .041 vs. RMSEA = .109) and only slightly worse than model fit of the model with occasion specific variance and cross-lagged paths (RMSEA = .040).

It is incomprehensible to methodologists that anybody would try to argue in favor of a model that does not fit the data. If a model consistently produces bad fit, it is not a proper model of the data and has to be rejected. To prefer a model because it produces a consistent artifact that fits theoretical preferences is not science.

Replication II

Although the first replication mostly confirmed the results of the first study, one notable difference was the presence of statistically significant cross-lagged effects in the second study. There are a variety of explanations for this inconsistency. The lack of an effect in the first study could be a type-II error. The presence of an effect in the first replication study could be a type-I errror. Finally, the difference in time intervals could be a moderator.

I picked the Your Personality (YP) dataset because it was the only dataset that used the same measures as the previous two studies. The time interval was 6 months, which is in the middle of the other two intervals. This made it interesting to see whether results would be more consistent with the 2-month or the 1-year intervals.

For self-esteem, the autoregressive model with occasion specific variance had a good fit to the data (RMSEA = .016, CFI = .999). Constraining the occasion specific variance to zero reduced model fit considerably (RMSEA = .160, CFI = .912). Results for depression were unexpected. The model with occasion specific variance showed non-significant and slightly negative residuals for the state variances. This finding implies that there are no detectable changes in depression over time and that depression scores only have a stable trait and occasion specific variance. Thus, I fixed the autoregressive parameters to 1 and the residual state variances to zero. This model is equivalent to a model that specifies a trait factor. Even this model had barely acceptable fit (RMSEA = .062, CFI = .962). Fit could be increased by relaxing the constraints on the occasion specific variance (RMSEA = .060, CFI = .978). However, a simple trait model fit the data even better (RMSEA = .000, CFI = 1.000). The lack of an autoregressive structure makes it implausible that there are cross-lagged effects on depression. If there is no new state variance, self-esteem cannot be a predictor of new state variance.

The presence of a trait factor for depression suggests that there could also be a trait factor for self-esteem and that some of the correlations between self-esteem and depression are due to correlated traits. Therefore I added a trait factor to the measurement model of self-esteem. This model had good fit (RMSEA = .043, .993) and fit was superior to the CLPM (RMSEA = .123, CFI = .883). The model showed no significant cross-lagged effect from self-esteem to depression and the parameter estimate was positive rather than negative, .07. This finding is not surprising given the lack of decreasing correlations over time for depression.

Replication III

The last openly shared datasets are from the California Families Project (CFP). I first examined the children’s data (CFP-C) because Orth et al. (2020) reported a significant vulnerability effect with the RI-CLPM.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .108, CFI = .908). Even the model with occasion-specific variance had poor fit (RMSEA = .091, CFI = .945). In contrast, a model with a trait factor and without occasion specific variance had good fit (RMSEA = .023, CFI = .997). This finding suggests that it is necessary to include a stable trait factor to model stability of self-esteem correctly in this dataset.

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .104, CFI = .878). Even the model with occasion-specific variance had poor fit (RMSEA = .103, CFI = .897). Adding a trait factor produced a model with acceptable fit (RMSEA = .051, CFI = .983).

The trait-state model fit the data well (RMSEA = .989, CFI = .032) and much better than the CLPM (RMSEA = .079, CFI = .914). The autoregressive effect of self-esteem on depression was not significant, and only have the size of the effect size in the RI-CLPM ( -.05 vs. -.09). The difference is due to the constraint on the trait factor. Relaxing these constraints improves model fit and the vulnerability effect becomes non-significant.

Replication IV

The last dataset is based on the mothers’ self-reports in the California Families Project (CFP-M).

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .139, CFI = .885). The model with occasion specific variance improved fit (RMSEA = .049, CFI = .988). However, the trait-state model had even better fit (RMSEA = .046, CFI = .993).

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .127, CFI = .880). The model with occasion-specific variance had excellent fit (RMSEA = .000, CFI = 1.000). The trait-state model also had excellent fit (RMSEA = .000, CFI = 1.000).

The CLPM had bad fit to the data (RMSEA = .092, CFI = .913). The Heise model improved fit (RMSEA = .038, CFI = .987). The trait-state model had even better fit (RMSEA = .031, CFI = .992). The cross-lagged effect of self-esteem on depression was negative, but small and not significant, -.05 (95%CI = -.13 to .02).

Simulation Study 1

The first simulation demonstrates that a cross-lagged effect emerges when the CLPM is fitted to data with a trait factor and one of the constructs has more trait variance which produces more stability over time.

I simulated 64% trait variance and 36% occasion-specific variance for self-esteem.

I simulated 36% trait variance and 64% occasion-specific variance for depression.

The correlation between the two trait factors was r = -.7. This produced manifest correlations of r = -.71*sqrt(.36)*sqrt(.64) = -.7 * .6 * .8 = -.34.

For self-esteem the autoregressive model without occasion specific variance had bad fit (). For depression, the autoregressive model without occasion specific variance had bad fit. The CLPM model also had bad fit (RMSEA = .141, CFI = .820). Although the simulation did not include cross-lagged paths, the CLPM showed a significant cross-lagged effect from self-esteem to depression (-.25) and a weaker cross-lagged effect from depression to self-esteem (-.14).

Needless to say, the trait-state model had perfect fit to the data and showed cross-lagged path coefficients of zero.

This simulation shows that CLPM produces artificial cross-lagged effects because it underestimates long-term stability. This problem is well-known, but Orth et al. (2020) deliberately ignore it when they interpret cross-lagged parameters in CLPM with bad fit.

Simulation Study 2

The second simulation shows that a model with a significant cross-lagged path can fit the data, if this path is actually present in the data. The cross-lagged effect was specified as a moderate effect with b = .3. Inspection of the correlation matrix shows the expected pattern that cross-lagged correlations from se to de (se1 with de2) are stronger than cross-lagged correlations from de to se (se2 with de1). The differences are strongest for lag-1.

The model with the cross-lagged paths had perfect fit (RMSEA = .000, CFI = 1.000). The model without cross-lagged paths had worse fit and RMSEA was above .06 (RMSEA = .073, CFI = .968).


The publication of Orth et al.’s (2020 article in JPSP is an embarrassment for the PPID section of JPSP. The authors did not make an innocent mistake. Their own analyses showed across 10 datasets that CLPM does not fit their data. One would expect that a team of researchers would be able to draw the correct conclusion from this finding. However, the power of motivated reasoning is strong. Rather than admitting that the vulnerability model of depression is based on a statistical artifact, the authors try to rationalize why the model with bad fit should not be rejected.

The authors write “the CLPM findings suggest that individual differences in self-esteem predict changes in individual differences in depression, consistent with the vulnerability model” (p. 39).

This conclusions is blatantly false. A finding in a model with bad fit should never be interpreted. After all, the purpose of fitting models to data and to examine model fit is to falsify models that are inconsistent with the data. However, psychologists have been brainwashed into thinking that the purpose of data analysis is only to confirm theoretical predictions and to ignore evidence that is inconsistent with theoretical models. It is therefore not a surprise that psychology has a theory crisis. Theories are nothing more than hunches that guided first explorations and are never challenged. Every discovery in psychology is considered to be true. This does not stop psychologists from developing and supporting contradictory models, which results in an every growing number of theories and confusion. It is like evolution without a selection mechanism. No wonder psychology is making little progress.

Numerous critics of psychology have pointed out that nil-hypothesis testing can be blamed for the lack of development because null-results are ambiguous. However, this excuse cannot be used here. Structural equation modeling is different from null-hypothesis testing because significant results like a high Chi-square value and derived fit indices provide clear and unambiguous evidence that a model does not fit the data. To ignore this evidence and to interpret parameters in these models is unscientific. The fact that authors, reviewers, and editors were willing to publish these unscientific claims in the top journal of personality psychology shows how poorly methods and statistics are understood by applied researchers. To gain respect and credibility, personality psychologists need to respect the scientific method.

Personality Science: The Science of Human Diversity

I wrote a textbook about personality psychology. The textbook is an e-textbook with online engagement for students. I am going to pilot the textbook this fall with my students and revise it with some additional chapters in 2021.

The book also provides an up-to-date review of the empirical literature. The content is freely accessible through a demo version of the course.

Please provide comments, corrections, additional references, etc. in the comments section or email me directly at

A review of “Low self-esteem prospectively predicts depression in adolescence and young adulthood”

In 2007, I was asked to review a ms. about the relationship between self-esteem and depression. The authors used a cross-lagged panel model to examine “prospective prediction” which is a code word for causal claims in a non-experimental study. The problem is that the cross-lagged model is fundamentally flawed because it ignores stable traits and underestimates stability. To compensate for this flaw, it uses cross-lagged paths which leads to false and inflated cross-lagged effects, especially from the more stable to the less stable construct.

I wrote a long and detailed review that was ignored by the editor and the authors and the flawed cross-lagged panel model was published (Orth, Robins, & Roberts, 2008). The article served as the basis for several follow up articles (Orth, Robins, Meier, & Conger, 2016; Rieger, Göllner, Trautwein, & Roberts, 2016; Orth, Robins, Widaman, & Conger, 2014; Orth & Robins, 2013; Sowislo & Orth, 2013; Kuster, Orth, Meier, 2012; Orth, Robins, Trzesniewski, Maes, & Schmitt, 2009; Orth, Robins, & Meier, 2009) and the main author continues to push the flawed cross-lagged panel model (Orth, Clark, Donnellan, & Robins, 2020), although he himself published a model with a trait factor to explain stability in self-esteem (Kuster & Orth, 2013). It is scientifically unjustified to omit this trait factor from bivariate models that relate self-esteem to depression, if ample evidence shows that a trait factor underlies stability of self-esteem (Kuster & Orth, 2013). So, an entire literature is based on a statistical artifact that has been well known four four decades (Rogosa, 1980).

I just found my old review while looking into a folder called “file drawer” and I thought I share it here. It just shows how peer-review doesn’t serve the purpose of quality control and that ambition often trumps the search for truth.

Review – Dec / 3 / 2017

This article tackles an important question: What is the causal relation between depression and self-esteem? As always, at the most abstract level there are three answers to this question. Self-esteem causes (low) depression. Depression causes (low) self-esteem. The correlation is due to a third unobserved variable. To complicate matters, these causal models are not mutually exclusive. It is possible that all three causal models contribute to the observed correlations between self-esteem and depression.

The authors hope to test causal models by means of longitudinal studies, and their empirical data are better than data from many previous studies to examine this question. However, the statistical analyses have some shortcomings that may lead to false inferences about causality.

The first important question is the definition of depression and self-esteem. Depression and self-esteem can be measured in different ways. Self-esteem measures can measure state self-esteem or self-esteem in general. Similarly, depression measures can ask about depressive symptoms over a short time interval (a few weeks) or dispositional depression. The nature of the measure will have a strong impact on the observed retest correlations, even after taking random measurement error into account.

In the empirical studies, self-esteem was measured with a questionnaire that asks about general tendencies (Rosenberg’s self-esteem scale). In contrast, depression was assessed by asking about symptoms within the preceding seven days (CES-D).  Surprisingly, Study 1 shows no differences in the retest correlations of depression and self-esteem. Less surprising is the fact that in the absence of different stabilities, cross-lagged effects are small and hardly different from each other, whereas Study 2 shows differences in stability and asymmetrical patterns of cross-lagged coefficients. This pattern of results suggests that the cross-lagged coefficients are driven by the stability of the measures (see Rogosa, 1980, for an excellent discussion of cross-lagged panel studies).

The good news is that the authors’ data are suitable to test alternative models. One important alternative model would be a model that postulates two latent dispositions for depression and self-esteem (not a single common factor). The latent disposition would produce stability in depression and self-esteem over time. The lower retest correlations of depression would be due to more situational fluctuation of depressive symptoms. The model could allow for a correlation between the latent trait factors of depression and self-esteem. Based on Watson’s model, one would predict a very strong negative correlation between the two trait factors (but less than -1), while situational fluctuation of depression could be relatively weakly related to fluctuation in self-esteem.

The main difference between the cross-lagged model and the trait model concerns the pattern of correlations across different retest intervals. The cross-lagged model predicts a simplex structure (i.e., the magnitude of correlations decreases with increasing retest intervals). In contrast, the trait model predicts that retest correlations are unrelated to the length of the retest interval. With adequate statistical power, it is therefore possible to test these models against each other. With more complex statistical methods it is even possible to test a hybrid model that allows for all three causal effects (Kenny & Zautra, 1995).

The present manuscript simply presents one model with adequate model fit. However, model fit is largely influenced by the measurement model. The measurement model fits the data well because it is based on parcels (i.e., parcels are made to be parallel indicators of a construct and are bound to fit well). Therefore, the fit indices are insensitive to the specification of the longitudinal pattern of correlations. To illustrate, global fit is based on the fit to a correlation matrix with 276 parameters (3 indicators * 2 constructs * 4 waves = 24 indicators , 24 * 23 / 2 = 276 correlations). At the latent level, there are only 28 parameters (2 constructs * 4 waves = 8 latent factors, 8 * 7 / 2 = 28 parameters). The cross-lagged model constrains only 12 of these parameters (12 / 276 < 5%). Thus, the fit of the causal model should be evaluated in terms of the relative fit of the measurement model to the structural model. Table 2 shows the relevant information. Surprisingly, it shows only a difference of 6 degrees of freedom between Model 2 and 3, where I would have expected 12 degrees of freedom difference (?). More important, with six degrees of freedom, the chi-square difference is quite large 59. Although the qui-square test may be overly sensitive, it would be important to know why the model fit is not better. My guess is that the model underestimates long-term stability due to the failure to include a trait component. The same test for Study 2 suggests a better fit of the cross-lagged model in Study 2. However, even a good fit does not indicate that the model is correct. A trait model may fit the data as well or even better.

Regarding Study 1, the authors commit the common fallacy to interpret null-effects as evidence for the lack of a significant effect. Even if in Study 1, self-esteem was a significant (p < .05) lagged predictor of depression, and depression was not a significant (p > .05) lagged predictor of self-esteem, it is incorrect to conclude that self-esteem has an effect, but depression does not have an effect. Indeed, given the small magnitude of the two effects (-.04 vs -.10 in Figure 1) it is likely that these effects are not significantly different from each other (it is good practice in SEM studies to report confidence intervals, which would make it easier to interpret the results).

The limitation section does acknowledge that “the study designs do not allow for strong conclusions regarding the causal influence of self-esteem on depression” However, without more detail and explicit discussion of alternative models, the importance of this disclaimer in the fine print is lost to most readers unfamiliar with structural equation modeling, and the statement seems to contradict the conclusions drawn in the abstract and causal interpretations of the results in the discussion (e.g., Future research should seek to identify the mediating processes of the effect of self esteem on depression).

I have no theoretical reasons to favor any causal model. I am simply trying to point out that alternative models are plausible and likely to fit the data as well as those presented in the manuscript. At a minimum a revision should acknowledge this, and present the actual empirical data (correlation tables) to allow other researchers to test alternative models.