Category Archives: Uncategorized

The Structure of Neuroticism

The construct of neuroticism is older than psychological science. It has its roots in Freud’s theories of mental illnesses. Thanks to to influence of psychoanalysis on the thinking of psychologists, the first personality questionnaires included measures of neuroticism or anxiety, which were considered to be highly related or even identical constructs.

Eysenck’s research on personality first focussed on Neuroticism and Extraversion as the key dimensions of personality traits. He then added psychoticism as a third dimension.

In the 1980s, personality psychologists agreed on a model with five major dimensions that included neuroticism and extraversion as prominent dimensions. Psychoticism was divided into agreeableness and conscientiousness and a fifth dimension openness was added to the model.

Today, the Big Five model dominates personality psychology and many personality questionnaires focus on the measurement of the Big Five.

Despite the long history of research on neuroticism, the actual meaning of the term and the construct that is being measured by neuroticism scales is still unclear. Some researchers see neuroticism as a general disposition to experience a broad range of negative emotions. In the emotion literature, anxiety, anger, and sadness are often considered to be basic negative emotions, and the prominent NEO-PI questionnaires considers neuroticism to be a general disposition to experience these three basic emotions more intensely and frequently.

Neuroticism has also been linked to more variability in mood states, higher levels of self-consciousness and lower self-esteem.

According to this view of neuroticism, it is important to distinguish between neuroticism as a more general disposition to experience negative feelings and anxiety, which is only one of several negative feelings.

A simple model of neuroticism would assume that a general disposition to respond more strongly to negative emotions produces correlations among more specific dispositions to experience more anxiety, anger, sadness, and self-conscious emotions like embarrassment. This model implies a hierarchical structure with neuroticism as a higher-order factor of more specific negative dispositions.

In the early 2000s, Ashton and Lee published an alternative model of personality with six factors called the HEXACO model. The key difference between the Big Five model and the HEXACO model is the conceptualization of pro- and anti-social traits. While these traits are considered to be related to a single higher-order factor of agreeableness in the Big Five model, the HEXACO model distinguishes between agreeableness and honesty-humility as two distinct traits. However, this is not the only difference between the two models. Another important difference is the conceptualization of affective dispositions. The HEXACO model does not have a factor corresponding to neuroticism. Instead it has an emotionality factor. The only common trait to neuroticism and emotionality is anxiety, which is measured with similar items in Big Five questionnaires and in HEXACO questionnaires. The other three traits linked to emotionality are unique to the HEXACO model.

The four primary factors, also called facets) that are used to identify and measure emotionality are anxiety, fear, dependence, and sentimentality. Fear is distinguished from anxiety by a focus on immediate and often physical danger. In contrast, anxiety and worry tend to be elicited by thoughts about uncertain events in the future. Dependence is defined by a need for social comfort in difficult times. Sentimentality is a disposition to respond strongly to negative events that happen to other people, including fictional characters.

In a recent target article, Ashton and Lee argued that it is time to replace the Big Five model with the superior HEXACO model. A change from neuroticism to emotionality would be a dramatic shift given the prominence of neuroticism in the history of personality psychology. Here, I examine empirically how Emotionality is related to Neuroticism and whether personality psychologists should adapt the HEXACO framework to understand individual differences in affective dispositions.


A key problem in research on the structure of personality is that researchers often rely on questionnaires that were developed with a specific structure in mind. As a result, the structure is pre-determined by the selection of items and constructs. To overcome this problem, it is necessary to sample a broad and ideally representative sample of primary traits. The next problem is that motivation and attention-span of participants limits the number of items that a personality questionnaire can include. These problems have been resolved by Revelle and colleagues survey that asks participants to complete only a subset of over 600 items. Modern statistical methods can analyze datasets with planned missing data. Thus, it is possible to examine the structure of hundreds of personality items. Condon and Revelle (2018) also made these data openly available ( I am very grateful for their initiative that provides an unprecedented opportunity to examine the structure of personality.

The items were picked to represent primary factors (facets) of the HEXACO questionnaire and the NEO-PI questionnaire. In addition, the questionnaire covered neuroticism items from the EPQ and other questionnaires. The items are based on the IPIP item pool. Each primary factor is represented by 10 items. I picked the items that represent the four HEXACO Emotionality factors, anxiety, fear, dependency, sentimentality, and four of the NEO-PI Neuroticism factors, anxiety, anger, depression, and self-consciousness. The anxiety factor overlaps and is represented by mostly overlapping items. Thus, this item selection resulted in 70 items that were intended to measure 7 primary factors. I added four additional items that represented variable moods (moodiness) that were included in the BFAS and EPQ, which might form an independent factor.


The data were analyzed with confirmatory factor analysis (CFA), using the MPLUS software. CFA has several advantages over traditional factor analytic methods that have been employed by proponents of the HEXACO and the Big Five models. The main advantages are that it is possible to model hierarchical structures that represent the Big Five or HEXACO factors as higher-order factors of primary factors. A second advantage is that CFA provides information about model fit whereas traditional EFA produces solutions without evaluating model fit.

Measurement Model

A first step in establishing a measurement model was to select items with high primary loadings, low secondary loadings, and low correlated residuals. The aim was to represent each primary factor with the best four items. While four items may not be enough to create a good scale, four items are sufficient to establish a measurement model of primary factors. Limiting the number of items to four items is also advantages because computing time increases with additional items and models with missing data can take a long time to converge.

Aside from primary loadings, the model included an acquiescence factor based on the coding of items. Directed coded items had unstandardized loadings of 1 and reverse coded items had an unstandardized loading of -1. There were no secondary loadings or correlated residuals.

The model met standard criteria of model fit such as a CFI > .95 and RMSEA < .05, CFI = .954, RMSEA = .007. However, models with missing data should not be evaluated based on these fit indices because fit is determined by a different formula ( (Zhang & Savaley, 2019).  More importantly, modification indices showed no notable changes in model fit if fixed parameters were freed. Table 1 shows the items and their primary factor loadings.

Table 2 shows the correlations among the primary factors.

The first four factors are assumed to belong to the HEXACO-Emotionality factor. As expected, fear, anxiety, and dependence are moderately to highly positively correlated. Contrary to expectations, sentimentality showed low correlations especially with fear.

Factors 4 to 8 are assumed to be related to Big Five neuroticism. As expected, all of these factors are moderately to highly correlated.

In addition, the dependence factor from the HEXACO model also shows moderate to high correlations with all Big Five neuroticism factors. The fear factor also shows positive relations with the neuroticism factors, especially for self-consciousness.

With the exception of Sentimentality, all of the factors tend to be positively correlated, suggesting that they are related to a common higher-order factor.

Overall, this pattern of results provides little support for the notion that HEXACO-Emotionality is a distinct higher-order factor from Big-Five neuroticism.


The first model assumed that all factors are related to each other by means of a single higher-order factor. In addition, the model allowed for correlated residuals among the four HEXACO factors. This makes it possible to examine whether these four factors share additional variance with each other that is not explained by a general Negative Emotionality factor.

Model fit decreased compared to the measurement model which serves as a comparison standard for theoretical models, CFI: .916 vs. 954, RMSEA = .009 vs. .007.

All primary factors except sentimentality had substantial loadings on the Negative Emotionality factor. Table 3 shows the residual correlations for the four HEXACO factors.

All correlations are positive suggesting that the HEXACO Emotionality factor captures some shared variance among these four factors that is not explained by the Negative Emotionality factor. However, two of the correlations are very low indicating that there is little shared variance between sentimentality and fear or dependence and anxiety.


The second model, modeled the relationship among the HEXACO factors with a factor. Model fit decreased, CFI = .914 vs. .916, RMSEA = .010 vs. 009. Loadings on the Emotionality factor ranged from .27 to .46. Fear, anxiety, and dependence had higher loadings on the Negative Emotionality factor than on the Emotionality factor.

The main conclusion from these results is that it would be problematic to replace the Big Five model with the HEXACO model because the Emotionality factor in the HEXACO model fails to capture the nature of the broader Neuroticism factor in the Big Five model. In fact, there is little evidence for a specific Emotionality factor in this dataset.


The discrepancy between the measurement model and Model 1 suggests that there are additional relationships between some primary factors that are not explained by the general Negative Emotionality factor. Examining modification indices suggested several changes to the model. Model 3 shows the final results. This model fit the data nearly as well as the measurement model, CFI = .949 vs. 954, RMSEA = .007 vs. .007. Inspection of the Modification Indices showed no further ways to improve the model by freeing correlated residuals among primary factors. In one case, three correlated residuals were consistent and were modeled as a factor. Figure 1 shows the results.

First, the model shows notable and statistically significant effects of neuroticism on all primary factors except sentimentality. Second the correlated residuals show an interesting patterns where primary factors can be arranged in a chain. that is, depression is related to moody, moody is related to anger, anger is related to anxiety, anxiety is related to fear, fear is related to self-consciousness and dependence, self-consciousness is related to dependence and finally, dependence is related to sentimentality. This suggests the possibility that a second broader dimension might be underlying the structure of negative emotionality. Research on emotions suggests that this dimension could be activation (fear is high, depression is low) or potency (anger is high, dependence is low).This is an important avenue for future research. The key finding in Figure 1 is that the traditional Neuroticism dimension is an important broad higher-order factor that accounts for the correlations among 7 of the 8 primary factors. These results favor the Big Five model over the HEXACO model.

A Big-5 Model of the Hexaco-100 Items

In the 1980s, personality psychologists celebrated the emergence of a five-factor model as a unifying framework for personality traits. Since then, the so-called Big-5 have dominated thinking and measurement of personality.

Two decades later, Ashton and Lee proposed an alternative model with six factors. This model has come to be known as the HEXACO model.

A recent special issue in the European Journal of Personality discussed the pros and cons of these two models. The special issue did not produce a satisfactory resolution between proponents of the two models.

In theory, it should be possible to resolve this dispute with empirical data, especially given the similarities between the two models. Five of the factors are more or less similar between the two models. One factor is Neuroticism with anxiety/worry as a key marker of this higher-order trait. A second factor is Extraversion with sociability and positive energy as markers. A third factor is Openness with artistic interests as a common marker. A forth factor is conscientiousness with orderliness and planful actions as markers. The key differences between the two models is concerned with pro-social and anti-social traits. In the Big Five model, a single higher-order trait of agreeableness is assumed to produce shared variance among all of these traits (e.g., morality, kindness, modesty). The HEXACO model assumes that there are two higher-order traits. One is also called agreeableness and the other one is called honesty and humility.

As Ashton and Lee (2005) noted, the critical empirical question is how the Big Five model accounts for the traits related to the honesty-humility factor in the HEXACO model. Although the question is straightforward, empirical tests of it are not. The problem is that personality researchers often rely on observed correlations between scales and that correlations among scales depend on the item-content of scales. For example, Ashton and Lee (2005) reported that the Big-Five Mini-Marker scale of Agreeableness correlated only r = .26 with their Honesty-Humility scale. This finding is not particularly informative because correlations between scales are not equivalent to correlations between the factors that the scales are supposed to reflect. It is also not clear whether a correlation of r = .26 should be interpreted as evidence that Honesty-Humility is a separate higher-order factor at the same level as the other Big Five traits. To answer this question, it would be necessary to provide a clear definition of a higher-order factor. For example, higher-order factors should account for shared variance among several primary factors that have only low secondary loadings on other factors.

Confirmatory factor analysis (CFA) addresses some of the problems of correlational studies with scale scores. One main advantage of CFA is that models do not depend on the item selection. It is therefore possible to fit a theoretical structure to questionnaires that were developed for a different model. I therefore used CFA to see whether it is possible to fit the Big Five model to the HEXACO-100 questionnaire that was explicitly designed to measure 4 primary factors (facets) for each of the six HEXACO higher-order traits. Each primary factor was represented by four items. This leads to 4 x 4 x 6 = 96 items. After consultation with Michael Ashton, I did not include the additional four altruism items.

Measurement Model

The Big-Five or HEXACO models are higher-order models that are supposed to explain the pattern of correlations among the primary factors. In order to test these models, it is necessary to first establish a measurement model for the primary factors. Starting point for the measurement model was a model with a simple structure where each item only has a primary loading on its designated factor. For example, the anxiety item “I sometimes can’t help worrying about little things” loaded only on the anxiety factor. All 24 primary factors were allowed to correlate freely with each other.

It is well-known that few data fit a simple structure for two reasons. First, the direction of items can influence responses. This can be modeled with an acquiescence factor that codes whether an item is a direct or a reverse coded items. Second, it is difficult to write items that reflect only variation in the intended primary trait. Thus, many items are likely to have small, but statistically significant, secondary loadings on other factors. These secondary loadings need to be modeled to achieve acceptable model fit, even if they have little practical significance. Another problem is that two items of the same factor may share additional variance because they share similar wordings or item content. For example, the two items “I clean my office or home quite frequently” and the reverse coded item “People often joke with me about the messiness of my room or desk” share specific content. This shared variance between items needs to be modeled with correlated residuals to achieve acceptable model fit.

Researchers can use Modification Indices to identify secondary loadings and correlated residuals that have a strong influence on model fit. Freeing the identified parameters improves model fit and can produce a measurement model with acceptable model fit. Moreover, MI can also provide information that there are no more fixed parameters that have a strong negative effect on model fit.

After modifying the simple-structure model accordingly, I established a measurement model that had acceptable fit, RMSEA = .021, CFI = .936. Although the CFI did not reach the threshold of .950, the MI did not show any further improvements that could be made. Freeing further secondary loadings resulted in secondary loadings less than .1. Thus, I stopped at this point.

16 primary factors had primary factor loadings of .4 or higher for all items. The remaining 8 primary factors had 3 primary factor loadings of .4 or higher. Only 4 items had secondary loadings greater than .3. Thus, the measurement model confirmed the intended structure of the questionnaire.

Importantly, the measurement model was created without imposing any structure on the correlations among higher-order factors. Thus, the freeing of secondary loadings and correlated residuals did not bias the results in favor of the Big Five or HEXACO model. Rather, the fit of the measurement model can be used to evaluate the fit of theoretical models about the structure of personality.

A simplistic model that is often presented in textbooks would imply that only traits related to the same higher-order factor are correlated with each other and that all other correlations are close to zero. Table 1 shows the correlations for the HEXACO-Agreeableness (A-Gent = gentle, A-Forg = forgiving, A-Pati = patient, & A-Flex = flexible) and the HEXACO-honesty-humility (H-Gree = greed-avoidance, H-Fair = fairness, H-Mode = modest, & H-Sinc = sincere) factors.

In support of the Big Five model, all correlations are positive. This suggests that all primary factors are related to a single higher-order factor. In support of the HEXACO model, correlations among A-factors and correlations among H-factors tend to be higher than correlations of A-factors with H-factors. Three notable exceptions are highlighted in red and all of them involve modesty. Modesty is more strongly related to A-Gent and A-Flex than to H-Mode.

Table 2 shows the correlations of the A and H factors with the four neuroticism factors (N-Fear = fear, N-Anxi = anxiety, N-Depe = dependence, N-Sent = sentimental). Notable correlations greater than .2 are highlighted. For the most part, the results show that neuroticism and pro-social traits are unrelated. However, there are some specific relations among factors. Notably, all four HEXACO-A factors are negatively related to anxiety. This shows some dissociation between A and H factors. In addition, fear is positively related to fairness and negatively related to sincerity. Sentimentality is positively related to fairness and modesty. Neither the Big Five model nor the HEXACO model has explanations for these relationships.

Table 3 shows the correlation with the Extraversion factors (E-soci = Sociable, E-socb = bold, E-live = lively, E-Sses = self-esteem). There are few notable relationships between A and H factors on the one hand and E factors on the other hand. This supports the assumption of both models that pro-social traits are unrelated to extraversion traits, including being sociable.

Table 4 shows the results for the Openness factors. Once more there are few notable relationships. This is consistent with the idea that pro-social traits are fairly independent of Openness.

Table 5 shows the results for conscientiousness factors (C-Orga = organized, C-Dili = diligent, C-Perf = Perfectionistic, & C-Prud = prudent). Most of the correlations are again small, indicating that pro-sociality is independent of conscientiousness. The most notable exceptions are positive correlations of the conscientiousness factors with fairness. This suggests that fairness is related to conscientiousness.

Table 6 shows the remaining correlations among the N, E, O, and C factors.

The green triangles show correlations among the primary factors belonging to the same higher-order factor. The strong correlations confirm the selection of primary factors to be included in the HEXACO-100. Most of the remaining correlations are below .2. The grey fields show correlations greater than .2. The most notable correlations are for diligence (C-Dili), which is correlated with all E-factors. This suggests a notable secondary loading of diligence on the higher-order factor E. Another noteworthy finding is a strong correlation between self-esteem (E-Sses) and anxiety (N-anx). This is to be expected because self-esteem is known to have strong relationships with neuroticism. It is surprising, however, that self-esteem is not related to the other primary factors of neuroticism. One problem in interpreting these results is that the other neuroticism facets are unique to the HEXACO-100.

In conclusion, inspection of the correlations among the 24 primary factors shows clear evidence for 5 mostly independent factors that correspond to the Big Five factors. In addition, the correlations among the pro-social factors show a distinction between the four HEXACO-A factors and the four HEXACO-H factors. Thus, it is possible to represent the structure with 6 factors that correspond to the HEXACO model, but the higher-order A and H factors would not be independent.

A Big Five Model of the HEXACO-100

I fitted a model with five higher-order factors to examine the ability of the Big Five model to explain the structure of the HEXACO-100. Importantly, I did not alter the measurement model of the primary factors. It is clear from the previous results that a simple-structure would not fit the data. I therefore allowed for secondary loadings of primary factors on the higher-order factors. In addition, I allowed for residual correlations among primary factors. Furthermore, when several primary factors showed consistent correlated residuals, I modeled them as factors. In this way, the HEXACO-A and HEXACO-H factors could be modeled as factors that account for correlated residuals among pro-social factors. Finally, I added a halo factor to the model. The halo factor has been identified in many Big Five questionnaires and reflects the influence of item-desirability on responses.

Model fit was slightly less than model fit for the measurement model, RMSEA = .021 vs. .021, CFI = .927 vs. .936. However, inspection of MI did not suggest additional plausible ways to improve the model. Figure 1 shows the primary loadings on the Big Five factors and the two HEXACO factors, HEXACO-Agreeableness (HA) and HEXACO-Honesty-Humility.

The first notable observation is that primary factors have loadings above .5 for four of the Big Five factors. For the Agreeableness factor, all loadings were statistically significant and above .2, but four loadings were below .5. This shows that agreeableness explains less variance in some primary factors than the other Big Five factors. Thus, one question is whether the magnitude of loadings on the Big Five factors should be a criterion for model selection.

The second noteworthy observation is that the model clearly identified HEXACO-A and HEXACO-H as distinct factors. That is, the residuals of the corresponding primary factors were all positively correlated. All loadings were above .2, but several of the loadings were also below .5. Moreover, for the HEXACO-A factors the loadings on the Big5-A factor were stronger than the loadings on the HEXACO-A factor. Modesty (H-Mode) also loaded more highly on Big5-A than HH. The results for HEXACO-A are not particularly troubling because the HEXACO model does not consider this factor to be particularly different from Big5-A. Thus, the main question is whether the additional shared variance among HEXACO-H factors warrants the creation of a model with six factors. That is, does Honesty-Humility have the same status as the Big Five factors?

Alternative Model 1

The HEXACO model postulates six factors. Comparisons of the Big Five and HEXACO model tend to imply that the HEXACO factors are just as independent as the Big Five factors. However, the data show that HEXACO-A factors and HEXACO-H factors are not as independent of each other as other factors. To fit a six-factor model to the data, it would be possible to allow for a correlation between HEXACO-A and HEXACO-H. To make this model fit as well as the Big-Five model, an additional secondary loading of modesty (H-Mode) on HEXACO-A was needed, RMSEA = .22, CFI = .926. This secondary loading was low, r = .25, and is not displayed in Figure 2.

The most notable finding is a substantial correlation between Hexaco-A and Hexaco-H of r = .49. Although there are no clear criteria for practical independence, this correlation is strong and suggests that there is an important common factor that produces a positive correlation between these two factors. This makes this model rather unappealing. The main advantage of the Big Five model would be that it captures the highest level of independent factors in a hierarchy of personality traits.

Alternative Model 2

An alternative solution to represent the correlations among HEXACO-A and HEXACO-H factors is to treat HEXACO-A and HEXACO-H as independent factors and to allow for secondary loadings of HEXACO-H factors on HEXACO-A or vice versa. Based on the claim that the H-factor adds something new to the structure, I modelled secondary loadings of the primary H-factors on HEXACO-A. Fit was the same as for the first alternative model, RMSEA = .22, CFI = .927. Figure 3 shows substantial secondary loadings for three of the four H-factors, and for modesty the loading on the HEXACO-A factor is even stronger than the loading on the HEXACO-H factor.

The following table shows the loading pattern along with all secondary loadings greater than .1. Notable secondary loadings greater than .3 are highlighted in pink. Aside from the loading of some H-factors on A, there are some notable loadings of two C-factors on E. This finding is consistent with other results that high achievement motivation is related to E and C.

The last column provides information about correlated residuals (CR) in the last column. Primary factors with the same letter have a correlated residual. For example, there is a strong negative relationship between anxiety (N-anxiety) and self-esteem (E-Sses) that was apparent in the correlations among the primary factors in Table 6. This relationship could not be modeled as a negative secondary loading on neuroticism because the other neuroticism factors showed much weaker relationships with self-esteem.


In sum, the choice between the Big5 model and the HEXACO model is a relatively minor stylistic choice. The Big Five model is a broad model that predicts variance in a wide variety of primary personality factors that are often called facets. There is no evidence that the Big Five model fails to capture variation in the primary factors that are used to measure the Honesty-Humility factor of the HEXACO model. All four H-factors are related to a general agreeableness factor. Thus, it is reasonable to maintain the Big Five model as a model of the highest level in a hierarchy of personality traits and to consider the H-factor a factor that explains additional relationships among pro-social traits. However, an alternative model with Honesty-Humility as a sixth factor is also consistent with the data. This model only appears different from the Big Five model if secondary loadings are ignored. However, all H-factors had secondary loadings on agreeableness. Thus, agreeableness remains a broader trait that links all pro-social traits, while Honesty-Humility explains additional relationships among a subset of this factors. If Honesty-Humility is indeed a distinct global factor it should be possible to find primary factors that are uniquely related to this factor without notable secondary loadings on Agreeableness. If such traits exists, they would strengthen the support for the HEXACO model. On the other hand, if all traits that are related to Honesty-Humility also load on Agreeableness, it seems more appropriate to treat Honesty-Humility as a lower-level factor in the hierarchy of traits. In conclusion, these structural models did not settle the issue, but they clarify the issue. Agreeableness factors and Honesty-Humilty factors form distinct, but related clusters of primary traits. This empirical finding can be represented with a Five-Factor model with Honest-Humility as shared variance among some pro-social traits or it can be represented with six factors and secondary loadings.


A major source of confusion in research on the structure of personality is the failure to distinguish between factors and scales. Many proponents of the HEXACO model point out that the HEXACO scales, especially the Honesty-Humilty scale, explain variance in criterion variables that is not explained by Big-Five scales. It has also been observed that the advantage of the HEXACO scales depends on the Big-Five scales that are used. The reason for these findings is that scales are imperfect measures of their intended factors. They also contain information about the primary factors that were used to measure the higher-order factors. The advantage of the HEXACO-100 is that it measures 24 primary factors. There is nothing special about the Honesty-Humility factor. As Figure 1 shows, the honesty-humilty factor explains only a portion of the variance in its designated primary factors, namely .67^2 = 45% of the variance in greed-avoidance, .55^2 = 30% of the variance in fairness, .32^2 = 10% of the variance in modesty, and .41^2 = 17% of the variance in sincerity. Averaging these scales to form a Honesty-Humilty scale destroys some of this variance and inevitably lowers the ability to predict some criterion variable that is strongly related to one of these primary factors. There is also no reason why Big Five questionnaires should not include some primary factors of Honesty-Humility and the NEO-PI-3 does include modesty and fairness.

Personality psychologists need to distinguish more clearly between factors and scales. The correlation of the NEO-PI-3 agreeableness scale will be different from those with the HEXACO-A scale or the BFI2-agreeableness scale. Scale correlations are biased by the choice of items, unless items are carefully selected to maximize correlation with the latent factor. For research purposes, researchers should use latent variable models that can decompose an observed correlation into the influence of the higher-order factor and the influence of specific factors.

Personality researchers should also carefully think about the primary factors they may want to include in their studies. For example, even researchers who favor a HEXACO model may include additional measures of anger and depression to explore the contribution of affective dispositions to outcome measures. Similarly, Big Five researchers may want to supplement their Big Five questionnaires with measures of primary traits related to honesty and morality if the Big-Five measure does not capture them. A focus on the highe-order factors is only justified in studies that require short measures with a few items.


My main contribution to the search for a structural model of personality is to examine this question with a statistical tool that makes it possible to test structural models of factors. The advantage of this method is that it is possible to separate structural models of factors from the items that are used to measure factors. While scales of the same factor can differ sometimes dramatically, structural models of factors are independent of the specific items that are used to measure a factor as long as some items reflect variance in the factor. Using this approach, I showed that the Big Five and HEXACO model only differ in the way they represent covariation among some primary factors. It is incorrect to claim that Big Five models fail to represent variation in honesty or humility. It is also incorrect to assume that all pro-social traits are independent after their shared variance in agreeableness is removed. Future research needs to examine more carefully the structural relationships among primary traits that are not explained by higher-order factors. This research question has been neglected because exploratory factor analysis is unable to examine this question. I therefore urge personality researchers to adopt confirmatory factor analysis to advance research on personality structure.

A Meta-Psychological Investigation of Intelligence Research with Z-Curve.2.0

A recent article by Nuijten, van Assen, Augusteijn, Crompvoets, and Wicherts reported the results of a meta-meta-analysis of intelligence research. The authors extracted 2442 eect sizes from 131 meta-analyses. The authors made these data openly available to allow “readers to pursue other categorizations and analyses” (p. 6). In this blog post, I report the results of an analysis of their data with z-curve.2.0 (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve is a powerful statistical tool that can (a) examine the presence of publication bias and/or the use of questionable research practices, (b) provide unbiased estimate of statistical power before and after selection for significance when QRPs are present, and (c) estimate the maximum number of false positive results.

Questionable Research Practices

The term questionable research practices refers to a number of statistical practices that inflate the number of significant results in a literature (John et al., 2012). Nuijten et al. relied on the correlation between sample size and effect size to examine the presence of publication bias. Publication bias produces a negative correlation between sample size and effect size because larger effects are needed to get significance in studies with smaller samples. The method has several well-known limitations. Most important, a negative correlation is also expected if researchers use larger samples when they anticipate smaller effects, either in the form of a formal a priori power analysis or based on informal information about sample sizes in previous studies. For example, it is well-known that effect sizes in molecular genetics studies are tiny and that sample sizes are huge. Thus, a negative correlation is expected even without publication bias.

Z-curve.2.0 avoids this problem by using a different approach to detected the presence of publication bias. The approach compares the observed discovery rate (i.e., the percentage of significant results) to the expected discovery rate (i.e., the average power of studies before selection for significance). To estimate the EDR, z-curve.2.0 fits a finite mixture model to the significant results and estimates average power based on the weights of a finite number of non-centrality parameters.

I converted the reported information about sample size, effect size, and sampling error into t-values, and then converted the t-values. Extremely large t-values of 20 were fixed to a value of 20. Then t-values were converted into absolute z-scores.

Figure 1 shows a histogram of the z-scores in the critical range from 0 to 6. All z-scores greater than 6 are assumed to have a power of 1 with a significance threshold of .05 (z = 1.96).

The critical comparison of the observed discovery rate (52%) and the expected discovery rate (58%) shows no evidence of QRPs. In fact, the EDR is even higher than the ODR, but the confidence interval is wide and includes the ODR. When there is no evidence that QRPs are present, it is better to use all observed z-scores, including the credible non-significant results, to fit the finite mixture model. Figure 2 shows the results. The blue line moved to 0, indicating that all values were used for estimation.

Visual inspection shows a close match between the observed distribution of z-scores (blue line) and the predicted distribution by the finite mixture model (grey line). The observed discovery rate now closely matches the expected discovery rate of 52%. Thus, there is no evidence of publication bias in the meta-meta-analysis of effect sizes in intelligence research.

Interestingly, there is also no evidence that researchers used mild QRPs to move marginally significant results just below .05 on the other side of the significance criterion to produce just significant results. There are two possible explanation for this. On the one hand, intelligence researchers may be more honest than other psychologists. On the other hand, it is possible that meta-analyses are not representative of the focal hypothesis tests that led to publication of original research articles. A meta-analysis of focal hypothesis tests in original articles is needed to answer this question.

In conclusion, this superior analysis of the presence of bias in the intelligence literature showed no evidence of bias. In contrast, Nuijten et al. (2020) found a significant correlation between effect sizes and sample sizes which they call small study effect. The problem with this finding is that it can reveal either careful planning of sample sizes (good practices) or the use of QRPs (bad practices). Thus, their analyses does not tell us whether there is bias in the data. Z-curve.2.0 resolves this ambiguity and shows that there is no evidence of selection for significance in these data.

Statistical Power

Nuijten et al. used Cohen’s classic approach to investigate power (Cohen, 1962). Based on this approach, they concluded “we found an overall median power of 11.9% to detect a small effect,54.5% for a medium effect, and 93.9% for a large effect (corresponding to a Pearson’s r of 0.1, 0.3, and 0.5 or a Cohen’s d of 0.2, 0.5, and 0.8, respectively)”

This information merely provides information about the sample sizes in the different studies. Studies with small sample sizes have low power to detect a small effect size. As most studies had small sample sizes, the average power to detect small effects is low. However, this does not tell us anything about the actual power of studies to obtain significant results for two reasons. First, effect sizes in a meta-meta-analysis are extremely heterogeneous. Thus, not all studies are chasing small effect sizes. As a result, the power of studies is likely to be higher than the average power to detect small effect sizes. Second, the previous results showed that (a) sample sizes correlate with effect sizes and (b) there is no evidence of QRPs. This means that researchers are a priori deciding to use smaller samples to search for larger effects and larger samples to search for smaller effects. This means that formal or informal a priori power analyses ensure that small samples can have as much or more power than large samples. It is therefore not informative to conduct power analysis only based on information about sample size. Z-curve.2.0 avoids this problem and provides estimates of the actual mean power of studies. Moreover, it provides two estimates of power for two different populations of studies. One population are all studies that are conducted by intelligence researchers without selecting for significance. This estimate is the expected discovery rate. Z-curve also provides an estimate for the population of studies that produced a significant result. This population is of interest because only significant results can be used to claim a discovery; with an error rate of 5%. When there is heterogeneity in power, the mean power after selection for significance is higher than the average power before selection for significance (Brunner & Schimmack, 2020). When researchers attempt to replicate a significant results to verify that it was not a false positive result, mean power after selection for significance provides the average probability that an exact replication study will be significant. This information is valuable to evaluate the outcome of actual replication studies (cf. Schimmack, 2020).

Given the lack of publication bias, there are two ways to determine mean power before selection for significance. We can simply compute the average of significant results and we can use the estimated discovery rate. Figure 2 shows that both values are 52%. Thus, the average power of studies conducted by intelligence researchers is 52%. This is well-below the recommended level of 80%.

The picture is a bit better for studies with a significant result. Here the average power called the expected replication rate is 71% and the 95% confidence interval approaches 80%. Thus, we would expect that more than 50% of significant results in intelligence research can be replicated with a significant result in the replication study. This estimate is higher than for social psychology, where the expected replication rate is only 43%.

False Positive Psychology

The past decade has seen a number of stunning replication failures in social psychology (cf. Schimmack, 2020). This has led to a concern that most discoveries in psychology if not in all sciences are false positive results that were obtained with questionable research practices (Ioannidis, 2005 ; Simmons et al., 2011). So far, however, these concerns are based on speculations and hypothetical scenarios rather than actual data. Z-curve.2.0 makes it possible to examine this question empirically. Although it is impossible to say how many published results are in fact false positive results, it is possible to estimate the maximum number of false-positive results based on the discovery rate. (Soric, 1989). As the observed and expected discovery are identical, we can use the value of 52% as our estimate of the discovery rate. This implies that no more than 5% of the significant results are false positive results. Thus, the empirical evidence shows that most published results in intelligence research are not false positives.

Moreover, this finding implies that most non-significant results are false negatives or type-II errors. That is, the null-hypothesis is also false for non-significant results. This is not surprising because many intelligence studies are correlational and the nil-hypothesis that there is absolutely no relationship between two naturally occurring variables has a low a priori probability. This also means that intelligence researchers would benefit from specifying some minimal effect size for hypothesis testing or to focus on effect size estimation rather than hypothesis testing.


Nujiten et al. conclude that intelligence research is plagued by QRPs. “Based on our findings, we conclude that intelligence research from 1915 to 2013 shows signs that publication bias may have caused overestimated effects”. This conclusion ignores that small-sample effects are ambiguous. The superior z-curve analysis shows no evidence of publication bias. As a result, there is also no evidence that reported effect sizes are inflated.

The z-curve.2.0 analysis leads to a different conclusion. There is no evidence of publication bias, significant results have a probability of 70% to be replicated in exact replication studies and even if exact replication studies are impossible the discovery rate of 50% implies that we should expect the majority of replication attempts with the same sample sizes to be successful (Bartos & Schimmack, 2020). In replication studies with larger samples even more results should replicate. Finally, most of the non-significant results are false negative results because there are few true null-hypothesis in correlational research. A modest increase in sample sizes could easy achieve 80% power which is typically recommended.

A larger concern is the credibility of conclusions based on meta-meta-analyses. The problem is that meta-analysis focus on general main effects that are consistent across studies. In contrast, original studies may focus on unique patterns in the data that can not be subjected to meta-analysis because direct replications of these specific patterns are lacking. Future research therefore needs to code the focal hypothesis tests in intelligence articles to examine the credibility of intelligence research.

Another concern is the reliance on alpha = .05 as a significance criterion. Large genomic studies have a multiple comparison problem where 10,000 analyses can easily produce hundreds of significant results with alpha = .05. This problem is well-known and genetics studies now use much lower alpha levels to test for significance. A proper power analysis of these studies needs to use the actual alpha level rather than the standard level of .05. Z-curve is a flexible tool that can be used with different alpha levels. Therefore, I highly recommend z-curve for future meta-scientific investigations of intelligence research and other disciplines.


Bartoš, F., & Schimmack, U. (2020). z-curve.2.0: Estimating replication and discovery rates. Under review.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta- Psychology. MP.2018.874,

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22,  1359 –1366.

Sorić, B. (1989). Statistical “discoveries” and effect-size estimation.Journal of the American Statistical Association,84(406), 608-610.

The Structure of Agreeableness

In 1934, Thurstone published his groundbreaking article “The Vectors of Mind” that introduced factor analysis as an objective method to examine the structure of personality traits. His first application of his method to a list of 60 trait adjectives rated by 1,300 participants yielded five factors. It would take several more decades for personality psychologists to settle on a five-factor model of personality traits (Digman, 1990).

Although the five-factor model dominates personality, it is not the only theory of personality traits. The biggest rival of the Big Five model is the HEXACO model that postulates six factors (Ashton, Lee, Perugini et al, 2004; Lee & Ashton, 2004).

A recent special issue in the European Journal of Personality contained a target article by Ashton and Lee in favor of replacing the Big Five model with the HEXACO model and responses by numerous prominent personality psychologists in defense of the Big Five model.

The key difference between the Big Five model and the HEXACO model is the representation of pro-social (self-transcending) versus egoistic (self-enhancing) traits. Whereas the Big Five model assumes that a single general factor called agreeableness produces shared variance among pro-social traits, the HEXACO model proposes two distinct factors called Honesty-Humility and Agreeableness. While the special issue showcases the disagreement among persoality researchers, it fails to offer an empirical solution to this controversy.

I argue that the main reason for stagnation in research on the structure of personality is the reliance on Thurstone’s outdated method of factor analysis. Just like Thurstone’s multi-factor method was an important methodological contribution that replaced Spearman’s single-factor method, Jöreskog (1969) developed confirmatory factor analysis that addresses several limitations in Thurstone’s method. Neither Ashton and Lee nor any of the commentators suggested the use of CFA to test empirically whether pro-social traits are represented by one general or two broad traits.

There are several reasons why personality psychologists have resisted the use of CFA to study personality structure. Some of these reasons were explicitly states by McCrae, Zonderman, Costa, Bond, and Paunonen (1996).

1. “In this article we argue that maximum likelihood confirmatory factor analysis (CFA), as it has typically been applied in investigating personality structure, is systematically flawed” (p. 552).

2. “The CFA technique may be inappropriately applied” (p. 553)

3. “CFA techniques are best suited to the analysis of simple structure models” (p. 553)

4. “Even proponents of CFA acknowledge a long list of problems with the technique, ranging from technical difficulties in estimation of some models to the cost in time and effort involved.”

5. The major advantage claimed for CFA is its ability to provide statistical tests of the fit of empirical data to different theoretical models. Yet it has been known for years that the chi-square test on which most measures of fit are based is problematic”

6. A variety of alternative measures of goodness-of-fit have been suggested, but their interpretation and relative merits are not yet clear, and they do not yield tests of statistical

7. Data showing that chisquare tests lead to overextraction in this sense call into question
the appropriateness of those tests in both exploratory and confirmatory maximum likelihood models. For example, in the present study the largest single problem was a residual correlation between NEO-PI-R facet scales E4: Activity and C4: Achievement Striving. It would be possible to specify a correlated error term between these two scales, but the interpretation of such a term is unclear. Correlated error usually refers to a nonsubstantive source of variance. If Activity and Achievement Striving were, say, observer ratings, whereas all other variables were self-reports, it would make sense to control for this difference in method by introducing a correlated error term. But there are no obvious sources of correlated error among the NEO-PI-R
facet scales in the present study.

8, With increasing familiarity of the technique and the availability of convenient computer programs (e.g., Bentler, 1989; Joreskog & Sorbom, 1993), it is likely that many more researchers will conduct CFA analyses in the future. It is therefore essential to point out the dangers in an uncritical adoption and simplistic application of CFA techniques (cf. Breckler, 1990).

9. Structures that are known to be reliable showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure.

I may be giving McCrae et al. (1996) to much credit for the lack of CFA studies of personality structure, but their highly cited article clearly did not encourage future generations to explore personality structure with CFA. This is unfortunate because CFA has many advantages over traditional EFA.

The first advantage is the ability to compare model fit. McCrae et al. are overly concerned about the chi-square statistic that is sensitive to sample size and rewards overly complex models. Even when they published their article, other fit indices were already used to address these problems. Today researchers have even more experience with the evaluation of model fit. More important, well-established fit indices also reward parsimony and can favor a more parsimonious model with five factors over a less parsimonious model with six factors. This opens the door to head to head model comparisons of the Big Five and HEXACO model.

The second advantage of CFA is that factors are theoretically specified by researchers rather than empirically driven by patterns in the data. This means that CFA can find factors that are represented by as few as two items and factors that are represented by 10 or more items. In contrast, EFA will favor factors with many items. This means that researchers need to know the structure to represent all factors equally, which is not possible in studies that try to discover a structure and the number of factors. Different sampling from the item space may explain differences in personality structures with EFA, but is not a problem for CFA as long as factors are represented by a minimum of two items.

A third advantage of CFA is that it is possible to model hierarchical structures. Theoretically, most personality researchers agree that the Big Five or HEXACO are higher-order factors that explain only some of the variance in so-called facets or primary traits like modesty, altruism, forgiveness, or morality. However, EFA cannot represent hierarchies. This makes it necessary to examine the higher order structure of personality with scales that average several items. However, these scales are impure indicators of the primary factors and the impurities can distort the higher-order factor structure.

A fourth advantage is that CFA can also model method factors that can distort the actual structure of personality. For example, Thurstone’s results appeared to be heavily influenced by an evaluative factor. With CFA it is possible to model evaluative biases and other response styles like acquiescence bias to separate systematic method variance from the actual correlations between traits (Anusic et al., 2009).

Finally, CFA requires researchers to think hard about the structure of personality to specify a plausible model. In contrast, EFA produces a five factor or six-factor solution in less than a minute. Most of the theoretical work is then to find post-hoc explanations for the structure.

In short, most of the problems listed by McCrae et al. (1996) are not bugs, but features of CFA. The fact that they were unable to create a fitting model to their data only shows that they didn’t take advantage of these features. In contrast, I was able to fit a hierarchical model with method factors to Costa and McCrae’s NEO-PI-R questionnaire (Schimmack, 2019). The results mostly confirmed the Big Five model, but some facets did not have primary loadings on the predicted factor.

Here I am using hierarchical CFA to examine the structure of pro-social and anti-social traits. The Big Five model predicts that all traits are related to one general factor that is commonly called Agreeableness. The HEXACO model predicts that there are two relatively independent factors that are called Agreeableness and Honestiy-Humility (Ashton & Lee, 2005).


The data were collected by Crowe, Lynam, and Miller (2017). 1205 participants provided self-ratings on 104 items that were selected from various questionnaires to measure Big-5 agreeableness or HEXACO-agreeableness and honesty and humility. The data were analyzed with EFA. In the recent debate about the structure of personality, Lynam, Crowe, Vize, and Miller (2020) pointed to the finding of a general factor to argue that there is “Little Evidence That Honesty-Humility Lives Outside of FFM Agreeableness” (p. 530). They also noted that “at no point in the hierarchy did a separate Honesty-Humility factor emerge” (p. 530).

In their response, Ashton and Lee criticize the item-selection and argue that the authors did not sample enough items that reflect HEXACO-Agreeableness. “Now, the Crowe et al. variable set did contain a substantial proportion of items that should represent good markers of Honesty-
Humility, but it was sharply lacking in items that represent some of the best markers of HEXACO Agreeableness: for example, one major omission was item content related to
even temper versus anger-proneness, which was represented only by three items of the Patience facet of HEXACO-PI-R Agreeableness” (p. 565). They also are concerned about oversampling of other facets. “The Crowe et al. “Compassion” factor is one of the
all-time great examples of a ‘bloated specific.’ (p. 565). These are valid concerns for analyses with EFA that allow researchers to influence the factor structure by undersampling or oversampling items. However, CFA is not influenced by the number of items that reflect a specific facet. Even two items are sufficient to create a measurement model of a specific factor and to examine whether these two factors are fairly independent or substantially correlated. Thus, it is a novel contribution to examine the structure of pro-social and anti-social traits using CFA.

Exploratory Analyses

Before I present the CFA results, I also used EFA as implemented in MPLUS to examine the structure of the 104 items. Given the predicted hierarchical structure, it is obvious that neither a one-factor nor a two-factor solution should fit the data. The main purpose of an EFA analysis would be to explore the number of primary factors / facets that is represented in the 104 items. To answer this question, it is inappropriate to rely on the scree test that is overly influenced by item-selection or on the chi-square test that leads to over-extraction. Instead, factor solutions can be compared with standard fit indices from CFA such as the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), the Akaike Information Criterion (AIC), The Bayesian Information Criterion (BIC), and the sample-size adjusted BIC (SSA-BIC).

Although all indices take parsimony into account, three of the indices favor the most complex structure with 20 factors. BIC favors parsimony the most and settles for 12 factors as the optimal number. The sample-size adjusted BIC favors 16 factors. Evidently, the number of primary factors is uncertain. The reason is that many larger primary factors may be split into highly correlated more specific factors that are called nuances. The structure of primary factors can be explored with CFA analyses of primary factors. Thus, I settled for the 12 factor solution to start the CFA analyses.

Another way to determine the number of primary factors is to look at the scales that were represented in the items. There were 9 HEXACO scales: forgiving (A), gentle (A), flexible (A), patient (A), modest (H), fair (H), greed avoidant (H), sincere (H), and altruistic. In addition, there were the Big Five facets empathetic, trusting, straightforward, modest, compassionate, and polite. Some of these facets overlap with HEXACO facets, suggesting that the 12 factor solution may reflect the full content of the HEXACO and Big Five facets.

Exploration of the Primary Factors with CFA

Before I start, it is important to point out a major misconception about CFA. The term confirmatory has mislead researchers to assume that CFA should only be used to confirm a theoretically expected structure. Any post-hoc modifications of a model to fit actual data would then be a questionable research practice. This is a misconception that is based on Joreskog’s unfortunate decision to label his statistical method confirmatory. Joreskog’s actual description of his method that few researchers have read makes it clear that CFA can be used for exploration.

“We shall give examples of how a preliminary interpretation of the factors can be successively modified to determine a final solution that is acceptable from the point of view of both goodness of fit and psychological interpretation. It is highly desirable that a hypothesis that has been generated in this way should subsequently be confirmed or disproved by obtaining new data and subjecting these to a confirmatory analysis.” (p. 183).

There is nothing different from using CFA to explore data than to run a multiple regression analysis or an exploratory ANOVA. Data exploration is an important part of science. It is only questionable when exploratory analyses are falsely presented as confirmatory. For example, I could pretend that I came up with an elaborate theory of agreeableness and present the final model as theoretically predicted. This is known as HARKing (Kerr, 1998). However exploration is needed to generate a model that is worthwhile testing in a confirmatory study. As nobody has examined the structure of agreeableness with CFA, the first attempt to fit a model is naturally exploratory and can only serve as a starting point for the development of a structural model of agreeableness.

Items were considered to be candidate items of a factor if they loaded at least .3 on a factor. This may seem a low level, but it is unrealistic to expect high loadings of single items on a single factor.

The next step was to fit a simple structure to the items of each factor. When possible this model also included an acquiescence factor that coded direct versus reverse scored items. Typically, this model did not fit the data well. The next step was to look for correlated residuals that show shared variance between items. These items violate the assumption of local independence. That is, the only reason for the correlation between items should be the primary factor. Items can show additional shared variance for a number of reasons such as similar wording or shared specific content (nuances). When many items were available, one of the items with correlated residuals was deleted. Another criterion for item selection was the magnitude of primary loadings. A third criterion aimed to get a balanced number of direct and reverse scored items when this was possible.

Out of the 12 factors, 9 factors were interpretable and matched one of the a priori facets. The 9 primary factors were allowed to correlate freely with each other. The model had acceptable overall fit, CFI = .954, RMSEA = .030. Table 2 shows the factors and the items with their source and primary loadings.

The factor analysis only failed to distinguish clearly between the gentle and flexible facets of HEXACO-agreeableness. Thus, HEXACO agreeableness is represented by three rather than four factors. More important is that the four Honesty-Humility facets of the HEXACO model, namely Greed-Avoidance (F11), Sincerity(F9), Modest (F7), and Morality (F5) were clearly identified. Thus, it is possible to examine the relationship of Honesty-Humilty to Big-Five agreeableness with these data. Importantly, CFA is not affected by the number of indicators. Three items with good primary loadings are sufficient to identify a factor.

Table 3 shows the pattern of correlations among the 9 factors. The aim of a structural model of agreeableness is to explain this pattern of correlations. However, visual inspection of these correlations alone can provide some valuable insights about the structure of agreeableness. Strong evidence for a two-factor model would be provided by high correlations among the Honesty-Humility facets, high correlations among the Agreeableness facets, and low correlations between Honesty-Humulity and Agreeableness facets (Campbell & Fiske, 1959).

The pattern of correlations is only partially consistent with the two-factor structure of the HEXACO model. All of the correlations among the Honesty-Humility facets and all of the correlations among the Agreeableness facets are above .38. However, 6 out of the 20 cross-trait correlations are also above .38. Moreover all of the correlations are positive, suggesting that Honesty-Humility and Agreeableness are not independent.

For the sake of comparability, I also computed scales corresponding to the nine factors. Table 3 shows the correlations for scales that could be used for a two-step hierarchical analysis by means of EFA. In general correlations in Table 3 are weaker than correlations in Table 2 because correlations among scale scores are attenuated by random measurement error. The pattern of correlations remains the same, but there are more cross-trait correlations that exceed the lowest same-trait correlation of .31.

The Structure of Agreeableness

A traditional EFA has severe limitations in examining the structure of correlations in Table 3. One major limitation is that structural relations have to be due to higher-order factors. The other limitation is that 9 variables can only identify a small number of factors. These problems are typically overlooked because traditionally EFA results are not examined for goodness of fit. However, an EFA analysis with fit indices in MPLUS shows that the two-factor model does not meet the standard fit criteria of .95 for CFI and .06 for RMSEA, CFI = .939, RMSEA = .092. The two factors clearly did correspond to the Hexaco-Humility and Agreeableness factors, but even with secondary loadings, the model fails to fully account for the pattern of correlations. Moreover, the two factors were correlated r = .41, suggesting that they are not independent of each other.

Figure 1 shows a CFA model that is based on the factor correlations in Table 2. This model does not fit the data as well, CFI = .950, RMSEA = .031, as the simple measurement model with correlated factors, CFI = .954, RMSEA = .030, but I was unable to find a plausible model with better fit. I encourage others to do so. On the other hand, the model fit the data much better than the two-factor EFA model, CFI = .939, RMSEA = 092.

The model does show the expected separation of Honesty-Humility and Agreeableness facets, but the structure is more complex. First morality has nearly equal loadings on the Honesity-Humility and the Agreeableness facet. The markers of Honesty-Humility are the other three facets, modest, manipulative (reversed), and materialistic (reversed). I suggest that the common element of these facets is self-enhancement. The loading of morality suggests that individuals who are highly motivated to self-enhance are more likely to engage in immoral behaviors to do so.

Four of the remaining factors have direct loadings on the agreeableness factor. Considerate (caring/altruistic) has the highest loading, but all four have substantial loadings. The aggressive factor has the weakest relationship with agreeableness. One reason is that it is also related to neuroticism in the Big Five model. In this model, aggressiveness is linked to agreeableness indirectly by consideration and self-enhancement, suggesting that individuals who do not care about others (low consideration) and who care about themselves (self-enhancement) are more likely to aggress against others.

In addition, there were several correlated residuals between some facets. Trusting and forgiving shared unique variance. Maybe forgiveness is more likely to occur when individuals trust people that they had good intentions and are not going to repeat their transgression again in the future. Aggression showed shared variance with modesty and morality. One explanation could be that modestly is related to low assertiveness and that assertive people are more likely to use aggression. Morality may relate to aggression because it is typically considered immoral to harm others. Morality was also related to manipulativeness. Here the connection is rather obvious because manipulating people is immoral.

The model in Figure 1 should not be considered the ultimate solution to the controversy about the structure of pro-social and anti-social behaviors. To the contrary. The model should be considered the first structural model that actually fits the data. In contrast, previous results based on EFA produced models that approximated the structure, but never fit the actual data. Future research should test alternative models and these models should be evaluated in terms of model fit and theoretical plausibility (Joreskog, 1969).

Do these results answer the question whether there are five or six higher-order factors of personality? The answer is no. Consistent with the Five Factor model, the Honesty-Humility or Self-Enhancement factor is not independent of agreeableness. It is therefore reasonable to think about Honesty-Humility as subordinate to Agreeableness in a hierarchical model of traits. Personally, I favor this interpretation of the results. However, proponents of the HEXACO model may argue that the correlation between the agreeableness factor and the honesty-humility factor is low enough to make honesty-humility a separate factor. Moreover, it was not possible to control for evaluative bias (halo) variance in this model, and halo bias may have inflated the correlation between the two factors. On the other hand, if correlations of .4 are considered low enough to split Big-Five factors, it is possible that closer inspection of other Big Five domains can also be split into distinct, yet positively correlated factors. The main appeal of the Big Five model is that the five factors are fairly independent after controlling for evaluative bias variance. Moreover, many facets have loadings of .4 or even lower on the Big Five factor. It is therefore noteworthy that all correlations among the 9 factors were positive and suggest that a general factor produces covariation among them. The common factor can also be clearly interpreted in terms of the focus on self-interest versus other’s interests or needs to guide behaviors. Individuals high in agreeableness take others’s needs and feelings into account, whereas those low in agreeableness are guided strongly by self-interest. The split into two factors may be due to the fact that importance of self and other are not always in conflict with each other. Especially individuals low in self-enhancement may still differ in terms of their pro-social behaviours.


The main contribution of this blog post is to show the importance of testing model fit in investigations of the structure of personality traits. While it may seem self-evident that a theoretical model should fit the data, personality psychologists have failed to test model fit or ignored feedback that their models do not fit the data. Not surprisingly, personality psychologists continue to argue over models because it is easy to propose models, if they do not have to fit actual data. If structural research wants to be an empirical science, it has to subject models to empirical tests that can falsify models that do not fit the data.

I empirically showed that a simple two-factor model does not fit the data. At the same time, I showed that a model with a general agreeableness factor and several independent facets also does not fit the data. Thus, neither of the dominant models fits the data. At the same time, the data are consistent with the idea of a general factor underlying pro-social and anti-social behaviors, while the relationship among facets remains to be explored in more detail. Future research needs to control for evaluative bias variance and examine how the structure of agreeableness is embedded in a larger structural model of personality.


Ashton, M. C., Lee, K., Perugini, M., Szarota, P., de Vries, R. E., Di Blas, et al.
(2004). A six-factor structure of personality-descriptive adjectives: Solutions
from psycholexical studies in seven languages. Journal of Personality and Social
Psychology, 86, 356–366

Lee, K., & Ashton, M. C. (2004). Psychometric properties of the HEXACO Personality
Inventory. Multivariate Behavioral Research, 39, 329–358.

Cross-Cultural Comparisons of Personality: Beware of Method Factors

Ulrich Schimmack
Shigehiro Oishi


Personality ratings on a 25-item Big Five measures by two national samples (US, Japan) were analyzed with an item-level measurement model that separates method factors (acquiescence, halo bias) and trait factors. Results reveal a strong influence of halo bias on US responses that distort cultural comparisons in personality. After correcting for halo bias, Japanese were more conscientious, extraverted, open to experience and less neurotic and agreeable. The results support cultural differences in positive illusions and raises questions about the validity of studies that rely on scale means to examine cultural differences in personality.


Cultural stereotypes imply cross-cultural differences in personality traits. However, cross-cultural studies of personality do not support the validity of these cultural stereotypes (Terracciano et al., 2005). Whenever two measures produce divergent results, it is necessary to examine the sources of these discrepancies. One obvious reason could be that cultural stereotypes are simply wrong. It is also possible that scientific studies of personalty across culture produce misleading results (Perugini & Richetin, 2007). One problem for empirical studies of cross-cultural differences in personality is that cultural differences tend to be small. Culture explains at most 10% of the variance and often the percentages are much smaller. For example, McCrae et al. (2010) found that culture explained only 1.5% of the variance in agreeableness ratings. As some of this variance is method variance, the variance due to actual differences in agreeableness is likely to be less than 1%. With small amounts of valid variance, method factors can have a strong influence on the pattern of mean differences across cultures.

One methodological problem in cross-cultural studies of personality is that personalty measures are developed with a focus on the correlation of items with each other within a population. The item means are not relevant with the exception that items should avoid floor or ceiling effects. However, cross-cultural comparisons rely on differences in the item means. As item means have not been subjected to psychometric evaluations, it is possible that item means lack construct validity. Take “working hard” as an example. How hard people work could be influenced by culture. For example, in poor cultures people have to work harder to make a living. The item “working hard” may correctly reflect variation in conscientiousness within poor cultures and within rich cultures, but the differences between cultures would reflect environmental conditions rather than conscientiousness. As a result, it is necessary to demonstrate that cultural differences in item means are valid measures of cultural differences in personality.

Unfortunately, obtaining data from a large sample of nations is difficult and sample sizes are often rather small. For example, McCrae et al. (2010) examined convergent validity of Big Five scores with 18 nations. The only significant evidence of convergent validity was obtained for neuroticism, r = .44, and extraversion, r = .45. Openness and agreeableness even produced small negative correlations, r = -.27, r = -.05, respectively. The largest cross-cultural studies of personality had 36 overlapping nations (Allik et al., 2017; Schmitt et al., 2007). The highest convergent validity was r = .4 for extraversion and conscientiousness. Low convergent validity, r = .2, was observed for neuroticism and agreeableness, and the convergent validity for openness was 0 (Schimmack, 2020). These results show the difficulty of measuring personality across cultures and the lack of validated measures of cultures’ personality profiles.

Method Factors in Personality Measurement

It is well-known that self-ratings of personality are influenced by method factors. One factor is a stylistic factor in the use of response formats known as acquiescence bias (Cronbach, 1942, 1965). The other factor reflects individual differences in responding to the evaluative meaning of items known as halo bias (Thorndike, 1920). Both method factors can distort cross-cultural comparisons. For example, national stereotypes suggest that Japanese individuals are more conscientious than US American individuals, but mean scores of conscientiousness in cross-cultural studies do not confirm this stereotype (Oishi & Roth, 2009). Both method factors may artificially lower Japan’s mean score because Japanese respondents are less likely to use extreme scores (Min, Cortina, & Miller, 2016) and Asians are less likely to inflate their scores on desirable traits (Kim, Schimmack, & Oishi, 2012). In this article, we used structural equation modeling to separate method variance from trait variance to distinguish cultural differences in response tendencies from cultural differences in personality traits.

Convenience Samples versus National Samples

Another problem for empirical studies of national differences is that psychologists often rely on convenience samples. The problem with convenience samples is that personality can change with age and that there are regional differences in personality within nations (). For example, a sample of students at New York University may differ dramatically from a student sample at Mississippi State University or Iowa State University. Although regional differences tend to be small, national differences are also small. Thus, small regional differences can bias national comparisons. To avoid these biases it is preferable to compare national samples that cover all regions of a nation and a broad age range.

Modeling Approach

The purpose of our study is to advance research on cultural differences in personality by comparing a Japanese and a US national sample that completed the same Big Five personality questionnaire using a measurement model that distinguishes personality factors and method factors. The measurement model is an improved version of Anusic et al.’s (2009) halo-alpha-beta model (Schimmack, 2019). The model is essentially a tri-factor model.

Figure 1

That is, each item loads on three factor, namely (a) a primary loading on one of the Big Five factors, (b) a loading on an acquiescence bias factor, and (c) a loading on the evaluative bias/halo factor. As Big Five measures typically do not show a simple structure, the model also can include secondary loadings on other Big Five factors. This measurement model has been successfully fitted to several Big Five questionnaires (Schimmack, 2019). This is the first time, the model is applied to a multiple-group model to compare measurement models for US and Japanese samples.

We first fitted a very restrictive model that assumed invariance across the two factors. Given the lack of psychometric cross-cultural comparisons, we expected that this model would not have acceptable fit. We then modified the model to allow for cultural differences in some primary factor loadings, secondary factor loadings, and item intercepts. This step makes our work exploratory. However, we believe that this exploratory work is needed as a first step towards psychometrically sound measurement of cultural differences.


Participants (N = 952 Japanese, 891 US) were recruited by Nikkei Research Inc. and its U.S. affiliate using a national probabilistic sampling method based on gender and age. The mean age was 44. The data have been used before to compare the influence of personality on life-satisfaction judgments, but without comparing mean levels in personality and life-satisfaction (Kim, Schimmack, Oishi, & Tsutsui, 2018).


The Big Five items were taken from the International Personality Item Pool (Goldberg et al., 2006). There were five items for each of the Big Five dimensions (Table 1).


We first fitted a model without mean structure to the data. A model with strict invariance for the two samples did not have acceptable fit using RMSEA < .06 and CFI > .95 as criterion values, RMSEA = .064, CFI = .834. However, CFI values should not be expected to reach .95 in models with single-item indicators (Anusic et al., 2009). Therefore, the focus is on RMSEA. We first examined modification indices (MI) of primary loadings. We used MI > 30 as a criterion to free parameters to avoid overfitting the model. We found seven primary loadings that would improve model fit considerably (n4, e3, a1, a2, a3, a4, c4). Freeing these parameter improved the model (RMSEA = .060, CFI = .857). We next examined loadings on the halo factor because it is likely that some items differ in their connotative meaning across languages. However, we found only two notable MIs (o1, c4). Freeing these parameters improved model fit (RMSEA = .057, CFI = .871). We identified six secondary loadings that differed notably across cultures. One was a secondary loading on neuroticism (e4) and four were secondary loadings on agreeableness (n5, e1, e3, o4), and one was a secondary loading on conscientiousness (n3). Freeing these parameters improved model fit (RMSEA = .052, CFI = .894). We were satisfied with this measurement model and continued with the means model. The first model fixed the item intercepts and factor means to be identical. This model had worse fit than the model without a means structure (RMSEA = .070, CFI = .803). The biggest MI was observed for the mean of the halo factor. Allowing for mean differences in halo improved model fit considerably (RMSEA = .060, CFI = .849). MIs next suggested to allow for mean differences in extraversion and agreeableness. We next allowed for mean differences in the other factors. This further improved model fit (RMSEA = .058, CFI = .864), but not as much. MIs suggested seven items with different item intercepts (n1, n5, e3, o3, a5, c3 c5). Relaxing these parameters improved model fit close to the level for the model without a mean structure (RMSEA = .053, CFI = .888).

Table 1 shows the primary loadings and the loadings on the halo factor for the 25 items.

Table 1

The results show very similar primary loadings for most items. This means that factors have similar meaning in the two samples and that it is possible to compare the two cultures. Nevertheless, there are some differences that could bias comparisons based on item-sum-scores. The item “feeling comfortable around people” loads much more strongly on the extraversion factor in the US than in Japan. The agreeableness items “insult people” and “sympathize with others’ feelings” also load more strongly in the US than in Japan. Finally, “making a mess of things” is a conscientiousness item in the US, but not in Japan. The fact that item loadings are more consistent with the theoretical structure can be attributed to the development of the items in the US.

A novel and important finding is that most loadings on the halo factor are also very similar across nations. For example, the item “have excellent ideas” shows a high loading for the US and Japan. This finding contradicts the idea that evaluative biases are culture-specific (Church et al., 2014). The only notable difference is the item “make a mess of things” that has no notable loading on the halo factor in Japan. Even in English, the meaning of this item is ambiguous and future studies should replace this item with a better item. The correlation between the halo loadings for the two samples is high, r = .96.

Table 2 shows the item means and the item intercepts of the model.

Table 2

The item means of the US sample are strongly correlated with the loadings on the halo factor, r = .81. This is a robust finding in Western samples. More desirable items are endorsed more. The reason could be that individuals actually act in desirable ways most of the time and that halo bias influences item means. Surprisingly, there is no notable correlation between item means and loadings on the halo factor for the Japanese sample, r = .08. This pattern of results suggests that US means are much more strongly influenced by halo bias than Japanese means. Further evidence is provided by inspecting the mean differences. For desirable items (low N, high E, O, A, & C) US means are always higher than Japanese’ means. For undesirable items, the US means are always lower than Japanese’ means, except for the item “stay in the background” where the means are identical. The difference scores are also positively correlated with the halo loadings, r = .90. In conclusion, there is strong evidence that halo bias distorts the comparison of personality in these two samples.

The item intercepts show cultural differences in items after taking cultural differences in halo and the other factors into account. Notable differences were observed from some items. Even after controlling for halo and extraversion, US respondents report higher levels of being comfortable around people than Japanese. This difference fits cultural stereotypes. After correcting for halo bias, Japanese now score higher on getting chores done right away than Americans. This also fits cultural stereotypes. However, Americans still report paying more attention to detail than Japanese, which is inconsistent with cultural stereotypes. Extensive validation research is needed to examine whether these results reflect actual cultural differences in personality and behaviours.

Figure 2 shows the mean differences on the Big Five factors and the two bias factors.

Figure 2

Figure 2 shows a very large difference in halo bias. The difference is so large that it seems implausible. Maybe the model is overcorrecting, which would bias the mean differences for the actual traits in the opposite direction. There is little evidence of cultural differences in acquiescence bias. One open question is whether the strong halo effect is entirely due to evaluative biases. It is also possible that a modesty bias plays a role because modesty implies less extreme responses to desirable items and less extreme responses to undesirable items. To separate the two, it would be necessary to include frequent and infrequent behaviours that are not evaluative.

The most interesting result for the Big Five factors is that the Japanese sample scores higher in conscientiousness than the US sample after halo bias is removed. This reverses the mean differences in this sample and previous studies that show higher conscientiousness for US than Japanese samples (). The present results suggest that halo bias masks the actual difference in conscientiousness. However, other results are more surprising. In particular, the present results suggest that Japanese people are more extraverted than Americans. This contradicts cultural stereotypes and previous studies. The problem is that cultural stereotypes could be wrong and that previous studies did not control for halo bias. More research with actual behaviours and less evaluative items is needed to draw strong conclusions about personality differences between cultures.


It has been known for 100 years that self-ratings of personality are biased by connotative meaning. At least in North America it is common to see a strong correlation between the desirability of items and the means of self-ratings. There is also consistent evidence that Americans rate themselves in a more desirable manner than the average American (). However, this does not mean that Americans are seeing themselves as better than everybody else. In fact, self-ratings tend to be slightly less favorable than ratings of friends or family members (), indicating a general evaluative biases to rate oneself and close others favorably.

Given the pervasiveness of evaluative biases in personality ratings it is surprising that halo bias has received so little attention in cross-cultural studies of personality. One reason could be the lack of a good method to measure and remove halo variance from personality ratings. Despite early attempts to detect socially desirable responding, lie scales have shown little validity as bias measures (ref). The problem is that manifest scores on lie scales contain as much valid personality variance as bias variance. Thus, correcting for scores on these scales literally throws out the baby (valid variance) with the bathwater (bias variance). Structural equation modeling (SEM) solves this problem by spitting observed variances into unobserved or latent variances. However, personality psychologists have been reluctant to take advantage of SEM because item models require large samples and theoretical models were too simplistic and produced bad fit. Informed by multi-rater studies that emerged in the 1990s, we developed a measurement model of the Big Five that separates personality variance from evaluative bias variance (Anusic, et al., 2009; Schimmack, Kim, & 2012; Schimmack, 2019). Here we applied this model for the first time to cross-cultural data to examine whether cultures differ in halo bias. The result suggest that halo bias has a strong influence on personality ratings in the US, but not in Japan. The differences in halo bias distort comparisons on the actual personality traits. While raw scores suggest that Japanese people are less conscientious than Americans, the corrected factor means suggest the opposite. Japanese participants also appeared to be less neurotic, more extraverted and open to experiences, which was a surprising result. Correcting for halo bias did not change the cultural differences in agreeableness. Americans were more agreeable than Japanese with and without correction for halo bias. Our results do not provide a conclusive answer about cultural differences in personality, but they shed a new light on several questions in personality research.

Cultural Differences in Self-enhancement

One unresolved question in personality psychology is whether positive biases in self-perceptions also known as self-enhancement are unique to American or Western cultures or whether they are a universal phenomenon (Church et al., 2016). One problem are different approaches to the measurement of self-enhancement. The most widely used method are social comparisons where individuals compare themselves to an average person. These studies tend to show a persistent better-than-average effect in all cultures (ref). However, this finding does not imply that halo biases are equally strong in all cultures. Brown and Kobayashi (2002) found better-than-average effects in the US and Japan, but Japanese ratings of the self and others were less favorable than those in the US. Kim et al. (2012) explain this pattern with a general norm to be positive in North America that influences ratings of the self as well as ratings of others. Our results are consistent with this view and suggests that self-enhancement is not a universal tendency. More research with other cultures is needed to examine which cultural factors moderate halo biases.

Rating Biases or Self-Perception Biases

An open question is whether halo biases are mere rating biases or reflect distorted self-perceptions. One model suggests that participants are well aware of their true personality, but merely present themselves in a more positive light to others. Another model suggests that individuals truly believe that their personality is more desirable than it actually is. It is not easy to distinguish between these two models empirically. zzz

Halo Bias and the Reference Group Effect

In an influential article, Heine et al. (2002) criticized cross-cultural comparisons in personality ratings as invalid. The main argument was that respondents adjust the response categories to cultural norms. This adjustment was called the reference group effect. For example, the item “insult people” is not answered based on the frequency of insults or a comparison of the frequency of insults to other behaviours. Rather it is answered in comparison to the typical frequency of insults in a particular culture. The main prediction made by the reference group effect is that responses in all cultures should cluster around the mid-point of a Likert-scale that represents the typical frequency of insults. As a result, cultures could differ dramatically in the actual frequency of insults, while means on the subjective rating scales are identical.

The present results are inconsistent with a simple reference group effect. Specifically, the US sample showed notable variation in item means that was related to item desirability. As a result, undesirable items like “insult people” had a much lower mean, M = 1.83, than the mid-point of the scale (3), and desirable items “have excellent ideas” had a higher mean (M = 3.73) than the midpoint of the scale. This finding suggests that halo bias rather than a reference group effect threatens the validity of cross-cultural comparisons.

Reference group effects may play a bigger role in Japan. Here item means were not related to item desirabilty and clustered more closely around the mid-point of the scale. The highest mean was 3.56 for worry and the lowest mean was 2.45 for feeling comfortable around people. However, other evidence contradicts this hypothesis. After removing effects of halo and the other personality factors, item intercepts were still highly correlated across the two national samples, r = .91. This finding is inconsistent with culture-specific reference groups that would not produce consistent item intercepts.

Our results also provide a new explanation for the low conscientiousness of Japanese samples. A reference group effect would not predict a significantly lower level of conscientiousness. However, a stronger halo effect in the US explains this finding because conscientiousness is typically assessed with desirable items. Our results are also consistent with the finding that self-esteem and self-enhancement are more pronounced in the US than in Japan (Heine & Buchtel, 2009). These aforementioned biases inflate conscientiousness scores in the US. After removing this bias, Japanese rate themselves as more conscientious than US Americans.

Limitations and Future Directions

We echo previous calls for validation of personality scores of nations (Heine & Buchtel, 2009). The current results are inconsistent across questionnaires and even the low level of convergent validity may be inflated by cultural differences in response styles. Future studies should try to measure personality with items that minimize social desirability and use response formats that avoid the use of reference groups (e.g., frequency estimates). Moreover, results based on ratings should be validated with objective indicators of behaviours.

Future research also needs to take advantage of developments in psychological measurement and use models that can identify and control for response artifacts. The present model shows the ability of separating evaluative biases or halo variance from actual personality variance. Future studies should use this model to compare a larger number of nations.

The main limitation of our study is the relatively small number of items. The larger the number of items, the easier it is to distinguish item-specific variance, method variance, and trait variance. The measure also did not properly take into account that the Big Five are higher-order factors of more basic traits called facets. Measures like the BFI-2 or the NEO-PI3 should be used to study cultural differences at the facet level, which often shows unique influences of culture that are different from effects on the Big Five (Schimmack, 2020).

We conclude with a statement of scientific humility. The present results should not be taken as clear evidence about cultural differences in personality. Our article is merely a little step towards the goal of measuring personality differences across cultures. One obstacle in revealing such differences is that national differences appear to be relatively small compared to the variation in personality within nations. One possible explanation for this is that variation in personality is caused more by biological than cultural factors. For example, twin studies suggest that 40% of the variance in personality traits is caused by genetic variation within a population, whereas cross-cultural studies suggest that at most 10% of the variance is caused by cultural influences on population means. Thus, while uncovering cultural variation in personality is of great scientific interest, evidence of cultural differences between nations should not be used to stereotype individuals from different nations. Finally, it is important to distinguish between personality traits that are captured by Big Five traits and other personality attributes like attitudes, values, or goals that may be more strongly influenced by culture. The key novel contribution of this article is to demonstrate that cultural differences in response styles exists and distort national comparisons of personality with simple scale means. Future studies need to take response styles into account.


Cronbach, L. J. (1942). Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 33(6), 401–415.

Heine, S. J., & Buchtel, E. E. (2009). Personality: The universal and the culturally specific. Annual Review of Psychology, 60, 369–394.

Perugini, M., & Richetin, J. (2007). In the land of the blind, the one-eyed man is king. European Journal of Personality, 21(8), 977–981.

Schimmack, U. (2020). Personality science: The science of human diversity. TopHat, 978-1-77412-253-2.

Terracciano, A. et al. (2005). National character does not reflect mean personality
trait levels in 49 cultures. Science, 310, 96–100.

JPSP:PPID = Journal of Pseudo-Scientific Psychology: Pushing Paradigms – Ignoring Data


Ulrich Orth, Angus Clark, Brent Donnellan, Richard W. Robins (DOI: 10.1037/pspp0000358) present 10 studies that show the cross-lagged panel model (CLPM) does not fit the data. This does not stop them from interpreting a statistical artifact of the CLPM as evidence for their vulnerability model of depression. Here I explain in great detail why the CLPM does not fit the data and why it creates an artifactual cross-lagged path from self-esteem to depression. It is sad that the authors, reviewers, and editors were blind to the simple truth that a bad-fitting model should be rejected and that it is unscientific to interpret parameters of models with bad fit. Ignorance of basic scientific principles in a high-profile article reveals poor training and understanding of the scientific method among psychologists. If psychology wants to gain respect and credibility, it needs to take scientific principles more seriously.


Psychology is in a crisis. Researchers are trained within narrow paradigms, methods, and theories that populate small islands of researchers. The aim is to grow the island and to become a leading and popular island. This competition between islands is rewarded by an incentive structure that imposes the reward structure of capitalism on science. The winner gets to dominate the top journals that are mistaken as outlets of quality. However, just like Coke is not superior to Pepsi (sorry Coke fans), the winner is not better than the losers. They are just market leaders for some time. No progress is being made because the dominant theories and practices are never challenged and replaced with superior ones. Even the past decade that has focused on replication failures has changed little in the way research is conducted and rewarded. Quantity of production is rewarded, even if the products fail to meet basic quality standards as long as naive consumers of researchers are happy.

This post is about the lack of training in the analysis of longitudinal data with a panel structure. A panel study essentially repeats the measurement of one or several attributes several times. Nine years of undergradute and graduate training leave most psychologists without any training how to analyze these data. This explains why the cross-lagged panel model (CLPM) was criticized four decades ago (Rogosa, 1980), but researchers continue to use it with the naive assumption that it is a plausible model to analyze panel data. Critical articles are simply ignored. This is the preferred way of dealing with criticism by psychologists. Here, I provide a detailed critique of CLPM using Orth et al.’s data ( and simulations.

Step 1: Examine your data

Psychologists are not trained to examine correlation matrices for patterns. They are trained to submit their data to pre-specified (cookie-cutter) models and hope that the data fit the model. Even if the model does not fit, results are interpreted because researchers are not trained in modifying cookie cutter models to explore reasons for bad fit. To understand why a model does not fit the data, it is useful to inspect the actual pattern of correlations.

To illustrate the benefits of visual inspection of the actual data, I am using the data from the Berkeley Longitudinal Study (BLS), which is the first dataset listed in Orth et al.’s (2020) table that lists 10 datasets.

To ease interpretation, I break up the correlation table into three components, namely (a) correlations among self-esteem measures (se1-se4 with se1-se4), correlations among depression measures (de1-de4 with de1-de4), and correlations of self-esteem measures with depression measures (se1-se4 with de1-de4);

Table 1

Table 1 shows the correlation matrix for the four repeated measurements of self-esteem. The most important information in this table is how much the magnitude of the correlations decreases along the diagonals that represent different time lags. For example, the lag-1 correlations are .76, .79, and .74, which approximately average to a value of .76. The lag-2 correlations are .65 and .69, which averages to .67. The lag-3 correlation is .60.

The first observation is that correlations are getting weaker as the time-lag gets longer. This is what we would expect from a model that assumes self-esteem actually changes over time, rather than just fluctuating around a fixed set-point. The latter model implies that retest correlations remain the same over different time lags. So, we do have evidence that self-esteem changes over time, as predicted by the cross-lagged panel model.

The next question is how much retest correlations decrease with increasing time lags. The difference from lag-1 to lag-2 is .74 – .67 = .07. The difference from lag-2 to lag-3 is .67 – .60, which is also .07. This shows no leveling off of the decrease in these data. It is possible that the next wave would produce a lag-4 correlation of .53, which would be .07 lower than then lag-3 correlation. However, a difference of .07 is not very different from 0, which would imply that change asymptotes at .60. The data are simply insufficient to provide strong information about this.

The third observation is that the lag-2 correlation is much stronger than the square of the lag-1 correlations, .67 > .74^2 = .55. Similarly, the lag-3 correlation is stronger than the product of the lag-1 and lag-2 correlations, .60 > .74 * .67 = .50 This means that a simple autoregressive model with observed variables does not fit the data. However, this is exactly the model of Orth et al.’s CLPM.

It is easy to examine the fit of this part of the CLPM model, by fitting an autoregressive model to the self-esteem panel data.

se2-se4 PON se1-3 ! This command regresses each measure on the previous measure (n on n-1).
! There is one thing I learned from Orth et al., and it was the PON command of MPLUS

Table 2

Table 2 shows the fit of the autoregressive model. While CFI meets the conventional threshold of .95 (higher is better), RMSEA shows terrible fit of the model (.06 or lower are considered acceptable). This is a problem for cookie-cutter researchers who think CLPM is a generic model that fits all data. Here we see that the model makes unrealistic assumptions and we already know what the problem is based on our inspection of the correlation table. The model predicts more change than the data actually show. We are therefore in a good position to reject the CLPM as a viable model for these data. This is actually a positive outcome. The biggest problem in correlational research are data that fit all kinds of models. Here we have data that actually disconfirm some models. Progress can be made, but only if we are willing to abandon the CLPM.

Now let’s take a look at the depression data, following the same steps as for the self-esteem data.

Table 3

The average lag-1 correlation is .43. The average lag-2 correlaiton is .45, and the lag-3 correlation is .4. These results are problematic for an autoregressive model because the lag-2 correlation is not even lower than the lag-1 correlation.

Once more it is hard to tell, whether retest-correlations are approaching an asymptote. In this case, the lag-2 minus lag-1 difference is -.02 and the lag-3 minus lag-2 difference is .05.

Finally, it is clear that the autoregressive model with manifest variables overestimates change. The lag-2 correlation is stronger than the square of the lag-1 correlations, .45 > .43^2 = .18, and the lag-3 correlation is stronger than the lag-1 * lag-2 correlation, .40 > .43*.45 = .19.

Given these results, it is not surprising that the autoregressive model fits the data even less than for the self-esteem measures (Table 4).

de2-de4 PON de1-de3 ! regress each depression measure on the previous one.

Talble 4

Even the CFI value is now in the toilet and the RMSEA value is totally unacceptable. Thus, the basic model of stability and change implemented in CLPM is inconsistent with the data. Nobody should proceed to build a more complex, bivariate model if the univariate models are inconsistent with the data. The only reason why psychologists do so all the time is that they do not think about CLPM as a model. They think CLPM is like a t-test that can be fitted to any panel data without thinking. No wonder psychology is not making any progress.

Step 2: Find a Model That Fits the Data

The second step may seem uncontroversial. If one model does not fit the data, there is probably another model that does fit the data and this model has a higher chance of being the model that reflects the causal processes that produced the data. However, psychologists have an uncanny ability to mess up even the simplest steps in data analysis. They have convinced themselves that it is wrong to fit models to data. The model has to come first so that the results can be presented as confirming a theory. However, what is the theoretical rational of the CLPM? It is not motivated by any theory of development, stability, or change. It is as atheoretical as any other model. It only has the advantage that it became popular on an island of psychology and now people use it without being questioned about it. Convention and conformity are not pillars of science.

There are many alternative models to CLPM that can be tried. One model is 60 years old and was introduced by Heise (1969). It is also an autoregressive model, but it also allows for occassion specific variance. That is, some factors may temporarily change individuals’ self-esteem or depression without any lasting effects on future measurements. This is a particularly appealing idea for a symptom checklist of depression that asks about depressive symptoms in the past four weeks. Maybe somebody’s cat died or it was a midterm period and depressive symptoms were present for a brief period, but these factors have no influence on depressive symptoms a year later.

I first fitted Heise’s model to the self-esteem data.

sse1 BY se1@1;
sse2 BY se2@1;
sse3 BY se3@1;
sse4 BY se4@1;
sse2-sse4 PON sse1-sse3 (stability);
se1-se4 (se_osv) ! occasion specific variance in self-esteem

Model fit for this model is perfect. Even the chi-square test is not significant (which in SEM is a good thing, because it means the model closely fits the data).

Model results show that there is significant occasion specific variance. After taking this variance into account the stability of the variance that is not occassion-specific, called state variance by Heise, is around r = .9 from one occasion to the next.

Fit for the depression data is also perfect.

There is even more occasion specific variance in depressive symptoms, but the non-occasion-specific variance is even more stable as the non-occasion-specific variance in self-esteem.

These results make perfect sense if we think about the way self-esteem and depression are measured. Self-esteem is measured with a trait measure of how individuals see themselves in general, ignoring ups and downs and temporary shifts in self-esteem. In contrast, depression is assessed with questions about a specific time period and respondents are supposed to focus on their current ups and downs. Their general disposition should be reflected in these judgments only to the extent that it influences their actual symptoms in the past weeks. These episodic measures are expected to have more occasion specific variance if they are valid. These results show that participants are responding to the different questions in different ways.

In conclusion, model fit and the results favor Heise’s model over the cookie-cutter CLPM.

Step 3: Putting the two autoregressive models together

Let’s first examine the correlations of self-esteem measures with depression measures.

The first observation is that the same-occasion correlations are stronger (more negative) than the cross-occasion correlations. This suggests that occasion specific variance in self-esteem is correlated with occasion specific variance in depression.

The second observation is that the lagged self-esteem to depression correlations (e.g., se1 with de2) do not become weaker (less negative) with increasing time lag, lag-1 r = -.36, lag-2 r = -.32, lag-3 r = .33.

The third observation is that the lagged depression to self-esteem correlations (e.g., de1 with se2) do not decrease from lag-1 to lag-2, although they do become weaker from lag-2 to lag-3, lag-1 r = -.44, lag-2 r = -.45, lag-3 r = -.35.

The fourth observation is that the lagged self-esteem to depression correlations (se1 with de2) are weaker than the lagged depression to self-esteem (de1 with se2) correlations . This pattern is expected because self-esteem is more stable than depressive symptoms. As illustrated in the Figure below, the path from de1-se4 is stronger than the path form se1 to de4 because the path from se1 to se4 is stronger than the path from de1 to de4.

Regression analysis or structural equation modeling is needed to examine whether there are any additional lagged effects of self-esteem on depressive symptoms. However, a strong cross-lagged path from se1 to de4 would produce a stronger correlation of se1 and de4, if stability were equal or if the effect is strong. So, a stronger lagged self-esteem to depression correlation than a lagged depression to self-esteem correlation would imply a cross-lagged effect from self-esteem to depression, but the reverse pattern is inconclusive because self-esteem is more stable.

Like Orth et al. (2020) I found that Heise’s model did not converge. However, unlike Orth et al. I did not conclude from this finding that the CLPM model is preferable. After all, it does not fit the data. Model convergence is sometimes simply a problem of default starting values that work for most models but not for all models. In this case, the high stability of self-esteem produced a problem with default starting values. Just setting this starting value to 1 solved the convergence problem and produced a well-fitting result.

The model results show no negative lagged prediction of depression from self-esteem. In fact, a small positive relationship emerged, but it was not statistically significant.

It is instructive to compare these results with the CLPM results. The CLPM model is nested in the Heise model. The only difference is that the occasion-specific variances of depression and self-esteem are fixed to zero. As these parameters were constrained across occasions, this model has two fewer parameters and the model df increase from 24 to 26. Model fit decreased in the more parsimonious model. However, the overall fit is not terrible, although RMSEA should be below .06 [Interestingly, the CFI value changed from a value over .95 to a value .94 when I estimated the model with MPLUS8.2, whereas Orth et al. used MPLUS8]. This shows the problem of relying on overall fit to endorse models. Overall fit is often good with longitudinal data because all models predict weaker correlations over longer time intervals. The direct model comparison shows that the Heise model is the better model.

In the CLPM model, self-esteem is a negative lagged predictor of depression. This is the key finding that Orth and colleagues have been using to support the vulnerability model of depression (low self-esteem leads to depression).

Why does the CLPM model produce negative lagged effects of self-esteem on depression. The reason is that the model underestimates the long-term stability of depression from time 1 to time 3 and time 4. To compensate for this it can use self-esteem that is more stable and then link self-esteem at time 2 with depression at time 3 (.745 * -.191) and self-esteem at time 3 with depression at time 4 (.742 * .739 * -.190). But even this is not sufficient to compensate for the misprediction of depression over time. Hence, the worse fit of the model. This can be seen by examining the model reproduced correlation matrix in the MPLUS Tech1 output.

Even with the additional cross-lagged path, the model predicts only a correlation of r = .157 from de1 to de4, while the observed correlation was r = .403. This discrepancy merely confirms what the univariate models showed. A model without occasion-specific variances underestimates long-term stability.

Interem Conclusion

Closer inspection of Orth et al.’s data shows that the CLPM does not fit the data. This is not surprising because it is well-known that the cross-lagged panel model often underestimates long-term stability. Even Orth has published univariate analyses of self-esteem that show a simple autoregressive model does not fit the data (Kuster & Orth, 2013). Here I showed that using the wrong model of stability creates statistical artifacts in the estimation of cross-lagged path coefficients. The only empirical support for the vulnerability model of depression is a statistical artifact.

Replication Study

I picked the My Work and I (MWI) dataset for a replication study. I picked it because it used the same measures and had a relatively large sample size (N = 663). However, the study is not an exact or direct replication of the previous one. One important difference is that measurements were repeated every two months rather than every year. The length of the time interval can influence the pattern of correlations.

There are two notable differences in the correlation table. First, the correlations increase with each measurement from .782 for se1 with se2 to .871 from se4 to se5. This suggests a response artifact, such as a stereotypic response styles that inflates consistency over time. This is more likely to happen for shorter intervals. Second, the difference between correlations with different lags are much smaller. They were .07 in the previous study. Here the differences are .02 to .03. This means there is hardly any autoregressive structure, suggesting that a trait model may fit the data better.

The pattern for depression is also different from the previous study. First, the correlations are stronger, which makes sense, because the retest interval is shorter. Somebody who suffers from depressive symptoms is more likely to still suffer two months later than a year later.

There is a clearer autoregressive structure for depression and no sign of stereotypic responding. The reason could be that depression was assessed with a symptom checklist that asks about the previous four weeks. As this question covers a new time period each time, participants may avoid stereotypic responding.

The depression-self-esteem correlations also become stronger (more negative) over time from r = -.538 to r = -.675. This means that a model with constrained coefficients may not fit the data.

The higher stability of depression explains why there is no longer a consistent pattern of stronger lagged depression to self-esteem correlations (de1 with se2) above the diagonal than self-esteem to depression correlations (se1 with de2) below the diagonal. Five correlations are stronger one way and five correlations are stronger the other way.

For self-esteem, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .170, CFI = .920). Allowing for occasion-specific variance improved fit and fit was excellent (RMSEA = .002, CFI = .999). For depression, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .113, CFI = .918). The model with occasion-specific variance fit better and had excellent fit (RMSEA = .029, CFI = .995). These results replicate the previous results and show that CLPM does not fit because it underestimates stability of self-esteem and depression.

The CLPM model also had bad fit in the original article (RMSEA = .105, CFI = .932). In comparison, the model with occasion specific variances had much better fit (RMSEA = .038, CFI = .991). Interestingly, this model did show a small, but statistically significant path from self-esteem to depression (effect size r = -.08). This raises the possibility that the vulnerability effect may exist for shorter time intervals of a few months, but not for longer time intervals of a year or more. However, Orth et al. do not consider this possibility. Rather, they try to justify the use of the CLPM to analyze panel data even though the model does not fit.


Orth et al. note “fit values were lowest for the CLPM” (p. 21) with a footnote that recognizes the problem of the CLPM, “As discussed in the Introduction, the CLPM underestimates the long-term stability of constructs, and this issue leads to misfit as the number of waves increases” (p. 63).

Orth et al. also note correctly that the cross-lagged effect of self-esteem on depression emerges more consistently with the CLPM model. By now it is clear why this is the case. It emerges consistently because it is a statistical artifact produced by the underestimation of stability in depression in the CLPM model. However, Orth et al.’s belief in the vulnerability effect is so strong that they are unable to come to a rational conclusion. Instead they propose that the CLPM model, despite its bad fit, shows something meaningful.

We argue that precisely because the prospective effects tested in the CLPM are also based on between-person variance, it may answer questions that cannot be assessed with models that focus on within-person effects. For example, consider the possible effects of warm parenting on children’s self-esteem (Krauss, Orth, & Robins, 2019): A cross-lagged effect in the CLPM would indicate that children raised by warm parents would be more likely to develop high self-esteem than children raised by less warm parents. A cross-lagged effect in the RI-CLPM would indicate that children who experience more parental warmth than usual at a particular time point will show a subsequent increase in self-esteem at the next time point, whereas children who experience less parental warmth than usual at a particular time point will show a subsequent drop in self-esteem at the next time point

Orth et al. then point out correctly that the CLPM is nested in other models and makes more restrictive assumptions about the absence of occasion specific variance or trait variance, but they convince themselves that this is not a problem.

As was evident also in the present analyses, the fit of the CLPM is typically not as good as the fit of the RI-CLPM (Hamaker et al., 2015; Masselink, Van Roekel, Hankin, et al., 2018). It is important to note that the CLPM is nested in the RI-CLPM (for further information about how the models examined in this research are nested, see Usami, Murayama, et al., 2019). That is, the CLPM is a special case of the RI-CLPM, where the variances of the two random intercept factors and the covariance between the random intercept factors are constrained to zero (thus, the CLPM has three additional degrees of freedom). Consequently, with increasing sample size, the RI-CLPM necessarily fits significantly better than the CLPM (MacCallum, Browne, & Cai, 2006). However, does this mean that the RI-CLPM should be preferred in model selection? Given that the two models differ in their conceptual meaning (see the discussion on between- and within-person effects above), we believe that the decision between the CLPM and RI-CLPM should not be based on model fit, but rather on theoretical considerations.

As shown here, the bad fit of CLPM is not an unfair punishment of a parsimonious model. The bad fit reveals that the model fails to model stability correctly. To disregard bad fit and to favor the more parsimonious model even if it doesn’t fit makes no sense. By the same logic, a model without cross-lagged paths would be more parsimonious than a model with cross-lagged paths and we could reject the vulnerability model simply because it is not parsimonious. For example, when I fitted the model with occasion specific variances and without cross-lagged paths, model fit was better than model fit of the CLPM (RMSEA = .041 vs. RMSEA = .109) and only slightly worse than model fit of the model with occasion specific variance and cross-lagged paths (RMSEA = .040).

It is incomprehensible to methodologists that anybody would try to argue in favor of a model that does not fit the data. If a model consistently produces bad fit, it is not a proper model of the data and has to be rejected. To prefer a model because it produces a consistent artifact that fits theoretical preferences is not science.

Replication II

Although the first replication mostly confirmed the results of the first study, one notable difference was the presence of statistically significant cross-lagged effects in the second study. There are a variety of explanations for this inconsistency. The lack of an effect in the first study could be a type-II error. The presence of an effect in the first replication study could be a type-I errror. Finally, the difference in time intervals could be a moderator.

I picked the Your Personality (YP) dataset because it was the only dataset that used the same measures as the previous two studies. The time interval was 6 months, which is in the middle of the other two intervals. This made it interesting to see whether results would be more consistent with the 2-month or the 1-year intervals.

For self-esteem, the autoregressive model with occasion specific variance had a good fit to the data (RMSEA = .016, CFI = .999). Constraining the occasion specific variance to zero reduced model fit considerably (RMSEA = .160, CFI = .912). Results for depression were unexpected. The model with occasion specific variance showed non-significant and slightly negative residuals for the state variances. This finding implies that there are no detectable changes in depression over time and that depression scores only have a stable trait and occasion specific variance. Thus, I fixed the autoregressive parameters to 1 and the residual state variances to zero. This model is equivalent to a model that specifies a trait factor. Even this model had barely acceptable fit (RMSEA = .062, CFI = .962). Fit could be increased by relaxing the constraints on the occasion specific variance (RMSEA = .060, CFI = .978). However, a simple trait model fit the data even better (RMSEA = .000, CFI = 1.000). The lack of an autoregressive structure makes it implausible that there are cross-lagged effects on depression. If there is no new state variance, self-esteem cannot be a predictor of new state variance.

The presence of a trait factor for depression suggests that there could also be a trait factor for self-esteem and that some of the correlations between self-esteem and depression are due to correlated traits. Therefore I added a trait factor to the measurement model of self-esteem. This model had good fit (RMSEA = .043, .993) and fit was superior to the CLPM (RMSEA = .123, CFI = .883). The model showed no significant cross-lagged effect from self-esteem to depression and the parameter estimate was positive rather than negative, .07. This finding is not surprising given the lack of decreasing correlations over time for depression.

Replication III

The last openly shared datasets are from the California Families Project (CFP). I first examined the children’s data (CFP-C) because Orth et al. (2020) reported a significant vulnerability effect with the RI-CLPM.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .108, CFI = .908). Even the model with occasion-specific variance had poor fit (RMSEA = .091, CFI = .945). In contrast, a model with a trait factor and without occasion specific variance had good fit (RMSEA = .023, CFI = .997). This finding suggests that it is necessary to include a stable trait factor to model stability of self-esteem correctly in this dataset.

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .104, CFI = .878). Even the model with occasion-specific variance had poor fit (RMSEA = .103, CFI = .897). Adding a trait factor produced a model with acceptable fit (RMSEA = .051, CFI = .983).

The trait-state model fit the data well (RMSEA = .989, CFI = .032) and much better than the CLPM (RMSEA = .079, CFI = .914). The autoregressive effect of self-esteem on depression was not significant, and only have the size of the effect size in the RI-CLPM ( -.05 vs. -.09). The difference is due to the constraint on the trait factor. Relaxing these constraints improves model fit and the vulnerability effect becomes non-significant.

Replication IV

The last dataset is based on the mothers’ self-reports in the California Families Project (CFP-M).

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .139, CFI = .885). The model with occasion specific variance improved fit (RMSEA = .049, CFI = .988). However, the trait-state model had even better fit (RMSEA = .046, CFI = .993).

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .127, CFI = .880). The model with occasion-specific variance had excellent fit (RMSEA = .000, CFI = 1.000). The trait-state model also had excellent fit (RMSEA = .000, CFI = 1.000).

The CLPM had bad fit to the data (RMSEA = .092, CFI = .913). The Heise model improved fit (RMSEA = .038, CFI = .987). The trait-state model had even better fit (RMSEA = .031, CFI = .992). The cross-lagged effect of self-esteem on depression was negative, but small and not significant, -.05 (95%CI = -.13 to .02).

Simulation Study 1

The first simulation demonstrates that a cross-lagged effect emerges when the CLPM is fitted to data with a trait factor and one of the constructs has more trait variance which produces more stability over time.

I simulated 64% trait variance and 36% occasion-specific variance for self-esteem.

I simulated 36% trait variance and 64% occasion-specific variance for depression.

The correlation between the two trait factors was r = -.7. This produced manifest correlations of r = -.71*sqrt(.36)*sqrt(.64) = -.7 * .6 * .8 = -.34.

For self-esteem the autoregressive model without occasion specific variance had bad fit (). For depression, the autoregressive model without occasion specific variance had bad fit. The CLPM model also had bad fit (RMSEA = .141, CFI = .820). Although the simulation did not include cross-lagged paths, the CLPM showed a significant cross-lagged effect from self-esteem to depression (-.25) and a weaker cross-lagged effect from depression to self-esteem (-.14).

Needless to say, the trait-state model had perfect fit to the data and showed cross-lagged path coefficients of zero.

This simulation shows that CLPM produces artificial cross-lagged effects because it underestimates long-term stability. This problem is well-known, but Orth et al. (2020) deliberately ignore it when they interpret cross-lagged parameters in CLPM with bad fit.

Simulation Study 2

The second simulation shows that a model with a significant cross-lagged path can fit the data, if this path is actually present in the data. The cross-lagged effect was specified as a moderate effect with b = .3. Inspection of the correlation matrix shows the expected pattern that cross-lagged correlations from se to de (se1 with de2) are stronger than cross-lagged correlations from de to se (se2 with de1). The differences are strongest for lag-1.

The model with the cross-lagged paths had perfect fit (RMSEA = .000, CFI = 1.000). The model without cross-lagged paths had worse fit and RMSEA was above .06 (RMSEA = .073, CFI = .968).


The publication of Orth et al.’s (2020 article in JPSP is an embarrassment for the PPID section of JPSP. The authors did not make an innocent mistake. Their own analyses showed across 10 datasets that CLPM does not fit their data. One would expect that a team of researchers would be able to draw the correct conclusion from this finding. However, the power of motivated reasoning is strong. Rather than admitting that the vulnerability model of depression is based on a statistical artifact, the authors try to rationalize why the model with bad fit should not be rejected.

The authors write “the CLPM findings suggest that individual differences in self-esteem predict changes in individual differences in depression, consistent with the vulnerability model” (p. 39).

This conclusions is blatantly false. A finding in a model with bad fit should never be interpreted. After all, the purpose of fitting models to data and to examine model fit is to falsify models that are inconsistent with the data. However, psychologists have been brainwashed into thinking that the purpose of data analysis is only to confirm theoretical predictions and to ignore evidence that is inconsistent with theoretical models. It is therefore not a surprise that psychology has a theory crisis. Theories are nothing more than hunches that guided first explorations and are never challenged. Every discovery in psychology is considered to be true. This does not stop psychologists from developing and supporting contradictory models, which results in an every growing number of theories and confusion. It is like evolution without a selection mechanism. No wonder psychology is making little progress.

Numerous critics of psychology have pointed out that nil-hypothesis testing can be blamed for the lack of development because null-results are ambiguous. However, this excuse cannot be used here. Structural equation modeling is different from null-hypothesis testing because significant results like a high Chi-square value and derived fit indices provide clear and unambiguous evidence that a model does not fit the data. To ignore this evidence and to interpret parameters in these models is unscientific. The fact that authors, reviewers, and editors were willing to publish these unscientific claims in the top journal of personality psychology shows how poorly methods and statistics are understood by applied researchers. To gain respect and credibility, personality psychologists need to respect the scientific method.

Personality Science: The Science of Human Diversity

I wrote a textbook about personality psychology. The textbook is an e-textbook with online engagement for students. I am going to pilot the textbook this fall with my students and revise it with some additional chapters in 2021.

The book also provides an up-to-date review of the empirical literature. The content is freely accessible through a demo version of the course.

Please provide comments, corrections, additional references, etc. in the comments section or email me directly at

A review of “Low self-esteem prospectively predicts depression in adolescence and young adulthood”

In 2007, I was asked to review a ms. about the relationship between self-esteem and depression. The authors used a cross-lagged panel model to examine “prospective prediction” which is a code word for causal claims in a non-experimental study. The problem is that the cross-lagged model is fundamentally flawed because it ignores stable traits and underestimates stability. To compensate for this flaw, it uses cross-lagged paths which leads to false and inflated cross-lagged effects, especially from the more stable to the less stable construct.

I wrote a long and detailed review that was ignored by the editor and the authors and the flawed cross-lagged panel model was published (Orth, Robins, & Roberts, 2008). The article served as the basis for several follow up articles (Orth, Robins, Meier, & Conger, 2016; Rieger, Göllner, Trautwein, & Roberts, 2016; Orth, Robins, Widaman, & Conger, 2014; Orth & Robins, 2013; Sowislo & Orth, 2013; Kuster, Orth, Meier, 2012; Orth, Robins, Trzesniewski, Maes, & Schmitt, 2009; Orth, Robins, & Meier, 2009) and the main author continues to push the flawed cross-lagged panel model (Orth, Clark, Donnellan, & Robins, 2020), although he himself published a model with a trait factor to explain stability in self-esteem (Kuster & Orth, 2013). It is scientifically unjustified to omit this trait factor from bivariate models that relate self-esteem to depression, if ample evidence shows that a trait factor underlies stability of self-esteem (Kuster & Orth, 2013). So, an entire literature is based on a statistical artifact that has been well known four four decades (Rogosa, 1980).

I just found my old review while looking into a folder called “file drawer” and I thought I share it here. It just shows how peer-review doesn’t serve the purpose of quality control and that ambition often trumps the search for truth.

Review – Dec / 3 / 2017

This article tackles an important question: What is the causal relation between depression and self-esteem? As always, at the most abstract level there are three answers to this question. Self-esteem causes (low) depression. Depression causes (low) self-esteem. The correlation is due to a third unobserved variable. To complicate matters, these causal models are not mutually exclusive. It is possible that all three causal models contribute to the observed correlations between self-esteem and depression.

The authors hope to test causal models by means of longitudinal studies, and their empirical data are better than data from many previous studies to examine this question. However, the statistical analyses have some shortcomings that may lead to false inferences about causality.

The first important question is the definition of depression and self-esteem. Depression and self-esteem can be measured in different ways. Self-esteem measures can measure state self-esteem or self-esteem in general. Similarly, depression measures can ask about depressive symptoms over a short time interval (a few weeks) or dispositional depression. The nature of the measure will have a strong impact on the observed retest correlations, even after taking random measurement error into account.

In the empirical studies, self-esteem was measured with a questionnaire that asks about general tendencies (Rosenberg’s self-esteem scale). In contrast, depression was assessed by asking about symptoms within the preceding seven days (CES-D).  Surprisingly, Study 1 shows no differences in the retest correlations of depression and self-esteem. Less surprising is the fact that in the absence of different stabilities, cross-lagged effects are small and hardly different from each other, whereas Study 2 shows differences in stability and asymmetrical patterns of cross-lagged coefficients. This pattern of results suggests that the cross-lagged coefficients are driven by the stability of the measures (see Rogosa, 1980, for an excellent discussion of cross-lagged panel studies).

The good news is that the authors’ data are suitable to test alternative models. One important alternative model would be a model that postulates two latent dispositions for depression and self-esteem (not a single common factor). The latent disposition would produce stability in depression and self-esteem over time. The lower retest correlations of depression would be due to more situational fluctuation of depressive symptoms. The model could allow for a correlation between the latent trait factors of depression and self-esteem. Based on Watson’s model, one would predict a very strong negative correlation between the two trait factors (but less than -1), while situational fluctuation of depression could be relatively weakly related to fluctuation in self-esteem.

The main difference between the cross-lagged model and the trait model concerns the pattern of correlations across different retest intervals. The cross-lagged model predicts a simplex structure (i.e., the magnitude of correlations decreases with increasing retest intervals). In contrast, the trait model predicts that retest correlations are unrelated to the length of the retest interval. With adequate statistical power, it is therefore possible to test these models against each other. With more complex statistical methods it is even possible to test a hybrid model that allows for all three causal effects (Kenny & Zautra, 1995).

The present manuscript simply presents one model with adequate model fit. However, model fit is largely influenced by the measurement model. The measurement model fits the data well because it is based on parcels (i.e., parcels are made to be parallel indicators of a construct and are bound to fit well). Therefore, the fit indices are insensitive to the specification of the longitudinal pattern of correlations. To illustrate, global fit is based on the fit to a correlation matrix with 276 parameters (3 indicators * 2 constructs * 4 waves = 24 indicators , 24 * 23 / 2 = 276 correlations). At the latent level, there are only 28 parameters (2 constructs * 4 waves = 8 latent factors, 8 * 7 / 2 = 28 parameters). The cross-lagged model constrains only 12 of these parameters (12 / 276 < 5%). Thus, the fit of the causal model should be evaluated in terms of the relative fit of the measurement model to the structural model. Table 2 shows the relevant information. Surprisingly, it shows only a difference of 6 degrees of freedom between Model 2 and 3, where I would have expected 12 degrees of freedom difference (?). More important, with six degrees of freedom, the chi-square difference is quite large 59. Although the qui-square test may be overly sensitive, it would be important to know why the model fit is not better. My guess is that the model underestimates long-term stability due to the failure to include a trait component. The same test for Study 2 suggests a better fit of the cross-lagged model in Study 2. However, even a good fit does not indicate that the model is correct. A trait model may fit the data as well or even better.

Regarding Study 1, the authors commit the common fallacy to interpret null-effects as evidence for the lack of a significant effect. Even if in Study 1, self-esteem was a significant (p < .05) lagged predictor of depression, and depression was not a significant (p > .05) lagged predictor of self-esteem, it is incorrect to conclude that self-esteem has an effect, but depression does not have an effect. Indeed, given the small magnitude of the two effects (-.04 vs -.10 in Figure 1) it is likely that these effects are not significantly different from each other (it is good practice in SEM studies to report confidence intervals, which would make it easier to interpret the results).

The limitation section does acknowledge that “the study designs do not allow for strong conclusions regarding the causal influence of self-esteem on depression” However, without more detail and explicit discussion of alternative models, the importance of this disclaimer in the fine print is lost to most readers unfamiliar with structural equation modeling, and the statement seems to contradict the conclusions drawn in the abstract and causal interpretations of the results in the discussion (e.g., Future research should seek to identify the mediating processes of the effect of self esteem on depression).

I have no theoretical reasons to favor any causal model. I am simply trying to point out that alternative models are plausible and likely to fit the data as well as those presented in the manuscript. At a minimum a revision should acknowledge this, and present the actual empirical data (correlation tables) to allow other researchers to test alternative models.

Why do men report higher levels of self-esteem than woman?

Self-esteem is one of the most popular constructs in personality/social psychology. The common approach to study self-esteem is to give participants a questionnaire with one or more questions (items). To study gender differences, the scores of multiple items are added up or averaged separately for men and women, and then subtracted from each other. If this difference score is not zero, the data show a gender difference. Of course, the difference will never be exactly zero. So, it is impossible to confirm the nil-hypothesis that men and women are exactly the same. A more interesting question is whether gender differences in self-esteem are fairly consistent across different samples and how large gender differences, on average, are. To answer this question, psychologists conduct meta-analyses. A meta-analysis combines findings from small samples into one large sample.

The first comprehensive meta-analysis of self-esteem reported a small difference between men and women, with men reporting slightly higher levels of self-esteem than women (Kling et al., 1999). What does a small difference look like. First, imagine that you have to predict whether 50 men and 50 women are above or below the average (median in self-esteem, but the only information that you have is their gender. If there was no difference between men and women, you have no reliable information about gender and you might just flip a coin and have a 50% chance of guessing correctly. However, given the information that men are slightly more likely to be above average in self-esteem, you guess above-average for men and below average for women. This blatant stereotype helps you to be correct 54% of the time, but you are still incorrect in your guesses 46% of the time.

Another way to get a sense of the magnitude of the effect size is to compare it to well-known, large gender differences. One of the largest gender differences that is also easy to measure is height. Men are 1.33 standard deviations taller than women, while the difference in self-esteem ratings is only 0.21 standard deviations. This means the difference in self-esteem is only 15% of the difference in height.

A more recent meta-analysis found an even smaller difference of d = .11 (Zuckerman & Hall, 2016). A small difference increases the probability that gender differences in self-esteem ratings may be even smaller or even in the opposite direction in some populations. That is, while the difference in height is so large that it can be observed in all human populations, the difference in self-esteem is so small that it may not be universally observed.

Another problem with small effects is that they are more susceptible to the influence of systematic measurement error. Unfortunately, psychologists rarely examine the influence of measurement error on their measures. Thus, this possibility has not been explored.

Another problem is that psychologists tend to rely on convenience samples, which makes it difficult to generalize findings to the general population. For example, psychology undergraduate samples select for specific personality traits that may make male or female psychology students less representative of their respective gender.

It is therefore problematic to draw premature conclusions about gender differences in self-esteem on the basis of meta-analyses of self-esteem ratings in convenience samples.

What Explains Gender Differences in Self-Esteem Ratings?

The most common explanations for gender differences in self-esteem are gender roles (Zuckerman & Hall, 2016) or biological differences (Schmitt et al, 2016). However, there are few direct empirical tests of these hypotheses. Even biologically oriented researchers recognize that self-esteem is influenced by many different factors, including environmental ones. It is therefore unlikely that biological sex differences have a direct influence on self-esteem. A more plausible model would assume that gender differences in self-esteem are mediated by a trait that shows stronger gender differences and that predicts self-esteem. The same holds for social theories. It seems unlikely that women rely on gender stereotypes to evaluate their self-esteem. It is more plausible that they rely on attributes that show gender differences. For example, Western societies have different beauty standards for men and women and women tend to have lower self-esteem in ratings of their attractiveness (Gentile et al., 2009). Thus, a logical next step is to test mediation models. Surprisingly, few studies have explored well-known predictors of self-esteem as potential mediators of gender differences in self-esteem.

Personality Traits and Self-Esteem

Since the 1980s, thousands of studies have measured personality from the perspective of the Five Factor Model. The Big Five capture variation in negative emotionality (Neuroticism), positive energy (Extraversion), curiosity and creativity (Openness), cooperation and empathy (Agreeableness), and goal-striving and impulse-control (Conscientiousness). Given the popularity of self-esteem and the Big Five in personality research, many studies have examined the relationship between the Big Five and self-esteem, while other studies have examined gender differences in the Big Five traits.

Studies of gender differences show the biggest and most consistent differences for neuroticism and agreeableness. Women tend to score higher on both dimensions than men. The results for the Big Five and self-esteem are more complicated. Simple correlations show that higher self-esteem is associated with lower Neuroticism and higher Extraversion, Openness, Agreeableness, and Conscientiousness (Robins et al., 2001). The problem is that Big Five measures have a denotative and an evaluative component. Being neurotic does not only mean to respond more strongly with negative emotions; it also is undesirable. Using structural equation model, Anusic et al. (2009) separated the denotative and evaluative component and found that self-esteem was strongly related to the evaluative component of personality ratings. This evaluative factor in personality ratings was first discovered by Thorndike (1920) one-hundred years ago. The finding that self-esteem is related to overly positive self-ratings of personality is also consistent with a large literature on self-enhancement. Individuals with high self-esteem tend to exaggerate their positive qualities ().

Interestingly, there are very few published studies of gender differences in self-enhancement. One possible explanation for this is that there is only a weak relationship between gender and self-enhancement. The rational is that gender is routinely measured and that many studies of self-enhancement could have examined gender differences. It is also well known that psychologists are biased against null-findings. Thus, ample data without publications suggest that there is no strong relationship. However, a few studies have found stronger self-enhancement for men than for women. For example, one study showed that men overestimate their intelligence more than women (von Stumm et al., 2011). There is also evidence that self-enhancement and halo predict biases in intelligence ratings (Anusic, et al., 2009). However, it is not clear whether gender differences are related to halo or are specific to ratings of intelligence.

In short, a review of the literature on gender and personality and personality and self-esteem suggests three potential mediators of the gender differences in self-esteem. Men may report higher levels of self-esteem because they are lower in neuroticism, lower in agreeableness, or higher in self-enhancement.

Empirical Test of the Mediation Model

I used data from the Gosling–Potter Internet Personality Project (Gosling, Vazire, Srivastava,
& John, 2004
). Participants were visitors of a website who were interested in taking a personality test and receiving feedback about their personality. The advantage of this sampling approach is that it creates a very large dataset with millions of participants. The disadvantage is that men and women who visited this sight might differ in personality traits or self-esteem. The questionnaire included a single-item measure of self-esteem. This item shows the typical gender difference in self-esteem (Bleidorn et al., 2016).

To separate descriptive factors of the Big Five from evaluative bias and acquiescence bias, I fitted a measurement model to the 44-item Big Five Inventory. I demonstrated earlier that this measurement model has good fit for Canadian participants (Schimmack, 2019). To test the mediation model, I added gender and self-esteem to the model. In this study, gender was measured with a simple dichotomous male vs. female question.

Gender was a predictor of all 7 factors (Big Five + Halo + Acquiescence). Exploratory analysis examined whether gender had unique relationships with specific BFI items. These relationships could be due to unique relationships of gender with specific personality traits called facets. However, few notable relationships were observed. Self-esteem was predicted by all seven personality traits and gender. However, openness to experience showed weak relationships with self-esteem. To stabilize the model, this path was fixed to zero.

I fitted the model to data from several nations. I selected nations with (a) a large number of complete data (N = 10,000), familiarity with English as a first or common second language (e.g., India = yes, Japan = no), while trying to sample a diverse range of cultures because gender differences in self-esteem tend to vary across cultures (Bleidorn et al., 2016; Zuckerman & Hall, 2016). I fitted the model to samples from four nations: US, Netherlands, India, and Philippines with N = 10,000 for each nation. Table 1 shows the results.

The first two rows show the fit of the Canadian model to the other four nations. Fit is a bit lower for Asian samples, but still acceptable.

The results for sex differences in the Big Five are a bit surprising. Although all four samples show the typical gender difference in neuroticism, the effect sizes are relatively small. For agreeableness, the gender differences in the two Asian samples are negligible. This raises some concerns about the conclusion that gender differences in personality traits are universal and possibly due to evolved genetic differences (Schmitt et al, 2016). The most interesting novel finding is that there are no notable gender differences in self-enhancement. This also implies that self-enhancement cannot mediate gender differences in self-esteem.

The strongest predictor of self-esteem is self-enhancement. Effect sizes range from d = .27 in the Netherlands to d = .45 in the Philippines. The second strongest predictor is neuroticism. As neuroticism also shows consistent gender differences, neuroticism partially mediates the effect of gender on self-esteem. Although weak, agreeableness is a consistent negative predictor of self-esteem. This replicates Anusic et al.’s (2009) finding that the sign of the relationship reverses when halo bias in agreeableness ratings is removed from measures of agreeableness.

The total effects show the gender differences in the four samples. Consistent with meta-analysis the gender differences in self-esteem are weak with effect sizes ranging from d = .05 to d = .15. Personality explains some of this relationship. The unexplained direct effect of gender is very small.


A large literature and several meta-analysis have documented small, but consistent gender differences in self-ratings of self-esteem. Few studies have examined whether these differences are mere rating biases or tested causal models of these gender differences. This article addressed these questions by examining seven potential mediators; the Big Five traits as well as halo bias and acquiescence bias.

The results replicated previous findings that gender differences in self-esteem are small, d < .2. They also showed that neuroticism is a partial mediator of gender differences in self-esteem. Women tend to be more sensitive to negative information and this disposition predicts lower self-esteem. It makes sense that a general tendency to focus on negative information also extends to evaluations of the self. Women appear to be more self-critical than men. A second mediator was agreeableness. Women tend to be more agreeable and agreeable people tend to have lower self-esteem. However, this relationship was only observed in Western nations and not in Asian nations. This cultural difference explains why gender differences in self-esteem tend to be stronger in Western than in Asian cultures. Finally, a general evaluative bias in self-ratings of personality was the strongest predictor of self-esteem, but showed no notable gender differences. Gender also still had a very small relationship with self-esteem after accounting for personalty mediators.

Overall, these results are more consistent with models that emphasize similarities between men and women (Men and Women are from Earth) than models that emphasize gender differences (Women are from Venus and Men are from Mars). Even if evolutionary theories of gender differences are valid, they explain only a small amount of the variance in personality traits and self-esteem. As one evolutionary psychologists put it “it is undeniably true that men and women are more similar than different genetically, physically and psychologically” (p. 52). The results also undermine claims that women internalize negative stereotypes about them and have notably lower self-esteem as a result. Given the small effect sizes, it is surprising how much empirical and theoretical attention gender differences in self-esteem have received. One reason is that psychologists often ignore effect sizes and only care about the direction of an effect. Given the small effect size of gender on self-esteem, it seems more fruitful to examine factors that produce variation in self-esteem for men and women.

Lies, Damn Lies, and Experiments on Attitude Ratings

Ten years ago, social psychology had a life-time opportunity to realize that most of their research is bullshit. Their esteemed colleague Daryl Bem published a hoax article about extrasensory perception in their esteemed Journal of Personality and Social Psychology. The editors felt compelled to write a soul searching editorial about research practices in their field that could produce such nonsense results. However, 10 years later social psychologists continue to use the same questionable practices to publish bullshit results in JPSP. Moreover, they are willfully ignorant of any criticism of their field that is producing mostly pseudo-scientific garbage. Just now, Wegener and Petty, two social psychologists at Ohio State University wrote an article that downplays the importance of replication failures in social psychology. At the same time, they publish a JPSP article that shows they haven’t learned anything from 10 years of discussion about research practices in psychology. I treat the first author as an innocent victim who is being trained in the dark art of research practices that have given us social priming, ego-depletion, and time-reversed sexual arousal.

The authors report seven studies. We don’t know how many other studies were run. The seven studies are standard experiments with one or two (2 x 2) experimental manipulations between subjects. The studies are quick online studies with Mturk samples. The main goal was to show that some experimental manipulations influence some ratings that are supposed to measure attitudes. Any causal effect on these measures is interpreted as a change in attitudes.

The problem for the author is that their experimental manipulations have small effects on the attitude measures. So, individually studies 1-6 would not show any effects. At no point did they consider this a problem and increase sample sizes. However, they were able to fix the problem by combining studies that were similar enough into one dataset. his was also done by Bem to produce significant results for time-reversed causality. It is not a good practice, but that doesn’t bother editors and reviewers at the top journal of social psychology. After all, they all do not know how to do science.

So, let’s forget about the questionable studies 1-6 and focus on the preregistered replication study with 555 Mturk workers (Study 7). The authors analyze their data with a mediation model and find statistically significant indirect effects. The problem with this approach is that mediation no longer has the internal validity of an experiment. Spurious relationships between mediators and the DV can inflate these indirect effects. So, it is also important to demonstrate that there is an effect by showing that the manipulation changed the DV (Baron & Kenny, 1986). The authors do not report this analysis. The authors also do not provide information about standardized effect sizes to evaluate the practical significance of their manipulation. However, the authors did provide covariance matrices in a supplement and I was able to run the analyses to get this information.

Here are the results.

The main effect for the bias manipulation is d = -.04, p = .38, 95%CI = -.12, .05

The main effect for the untrustworthiness manipulation is d = .01, p = .75, 95%CI = -.07, .10.

Both effects are not significant. Moreover, the effect size is so small and thanks to the large sample size the confidence intervals are so narrow that we can reject the hypothesis that the manipulations have at least a small effect, d = .2.

So, here we see the total failure of social psychology to understand what they are doing and their inability to make a real contribution to the understanding of attitudes and attitude change. This didn’t stop Rich Petty from co-authoring an article about psychology’s contribution to addressing the Covid-19 pandemic. Now, it would be unfair to blame 150,000 deaths on social psychology, but it is a fact that 40 years of trivial experiments have done little to help us change attitudes like attitudes towards wearing masks in the real world.

I can only warn young, idealistic students to consider social psychology as a career path. I speak form experience. I was a young idealistic student eager to learn about social psychology in the 1990s. If I could go back in time, I would have done something else with my life. In 2010, I thought social psychology might actually change for the better, but in 2020 it is clear that most psychologists want to continue with their trivial experiments that tell us nothing about social behaviour. If you just can’t help it and want to study social phenomena I recommend personality psychology or other social sciences.