Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 




  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. It also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presents the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index


The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
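The formula above can be sketched in a few lines of Python. This is only an illustration: it assumes that inflation is computed as the success rate (proportion of significant results) minus the median observed power, that test results have already been converted into z-scores, and that alpha = .05 (two-tailed). The function names and the z-scores are hypothetical.

```python
from statistics import NormalDist, median

STD_NORMAL = NormalDist()  # standard normal distribution


def observed_power(z, alpha=0.05):
    """Two-tailed power implied by an observed z-score."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z = abs(z)
    # Probability of landing beyond either critical value if the true
    # effect equals the observed one.
    return (1 - STD_NORMAL.cdf(crit - z)) + STD_NORMAL.cdf(-crit - z)


def r_index(z_scores, alpha=0.05):
    """R-Index = median observed power - inflation (assumed definition)."""
    crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    median_power = median(observed_power(z, alpha) for z in z_scores)
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power - inflation


# Hypothetical z-scores from five studies, all just significant.
z_scores = [2.10, 2.25, 2.40, 1.99, 2.60]
print(round(r_index(z_scores), 3))
```

When every reported result is significant but the z-scores cluster just above 1.96, the median observed power is far below the 100% success rate, so the R-Index is low.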


4.  The Test of Insufficient Variance (TIVA)


The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
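A minimal sketch of the test as described above, using only the Python standard library. It assumes that the test statistic is (k - 1) times the sample variance of the z-scores, compared against a chi-square distribution with k - 1 degrees of freedom. The helper names and the z-scores are hypothetical, and the chi-square CDF is computed with a simple series that is adequate for the small values that matter here.

```python
import math
from statistics import variance


def chi2_cdf(x, df):
    """P(X <= x) for a chi-square variable, computed from the series for
    the regularized lower incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= t / (a + n)
        total += term
        n += 1
    return total * math.exp(-t + a * math.log(t) - math.lgamma(a))


def tiva(z_scores):
    """Return the variance of the z-scores and the left-tailed p-value."""
    k = len(z_scores)
    var_z = variance(z_scores)   # sample variance, k - 1 in the denominator
    stat = (k - 1) * var_z       # the expected variance is 1
    return var_z, chi2_cdf(stat, k - 1)


# Hypothetical z-scores that all sit just above the significance threshold.
var_z, p_left = tiva([2.0, 2.1, 2.2, 2.05, 2.3])
print(round(var_z, 4), p_left)
```

A small left-tailed p-value indicates that the z-scores vary less than sampling error alone would produce.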

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”  The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. According to this theory, the stereotype that men are better at math than women is threatening to women, and the threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing


Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
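The dependence of a Bayes-Factor on the alternative hypothesis can be illustrated with a deliberately simplified sketch. It swaps in a normal prior on the effect size, d ~ Normal(0, tau²), which is an assumption made here for convenience and not the prior used in the analyses discussed in the post. It also treats the observed effect size as normally distributed around the true effect. All numbers are hypothetical.

```python
from statistics import NormalDist


def bf01_normal_prior(d_hat, se, tau):
    """Bayes-Factor for H0 (d = 0) over H1 (d ~ Normal(0, tau^2)).

    Under H0 the estimate is Normal(0, se^2); under H1 its marginal
    distribution is Normal(0, se^2 + tau^2).
    """
    like_h0 = NormalDist(0.0, se).pdf(d_hat)
    like_h1 = NormalDist(0.0, (se ** 2 + tau ** 2) ** 0.5).pdf(d_hat)
    return like_h0 / like_h1


# Hypothetical study: observed d = .2 with 100 participants per group,
# so the standard error of d is roughly sqrt(2 / 100).
d_hat, se = 0.2, (2 / 100) ** 0.5

wide = bf01_normal_prior(d_hat, se, tau=1.0)    # expects large effects
narrow = bf01_normal_prior(d_hat, se, tau=0.2)  # expects small effects
print(round(wide, 2), round(narrow, 2))
```

With the wide prior the same data favor the null-hypothesis (a ratio above 1), whereas with the narrow prior they do not. This is the sense in which support for the null can reflect an unrealistic alternative and not the absence of an effect.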

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

The Structure of Neuroticism

The construct of neuroticism is older than psychological science. It has its roots in Freud’s theories of mental illnesses. Thanks to the influence of psychoanalysis on the thinking of psychologists, the first personality questionnaires included measures of neuroticism or anxiety, which were considered to be highly related or even identical constructs.

Eysenck’s research on personality first focussed on Neuroticism and Extraversion as the key dimensions of personality traits. He then added psychoticism as a third dimension.

In the 1980s, personality psychologists agreed on a model with five major dimensions that included neuroticism and extraversion as prominent dimensions. Psychoticism was divided into agreeableness and conscientiousness, and a fifth dimension, openness, was added to the model.

Today, the Big Five model dominates personality psychology and many personality questionnaires focus on the measurement of the Big Five.

Despite the long history of research on neuroticism, the actual meaning of the term and the construct that is being measured by neuroticism scales is still unclear. Some researchers see neuroticism as a general disposition to experience a broad range of negative emotions. In the emotion literature, anxiety, anger, and sadness are often considered to be basic negative emotions, and the prominent NEO-PI questionnaire considers neuroticism to be a general disposition to experience these three basic emotions more intensely and frequently.

Neuroticism has also been linked to more variability in mood states, higher levels of self-consciousness and lower self-esteem.

According to this view of neuroticism, it is important to distinguish between neuroticism as a more general disposition to experience negative feelings and anxiety, which is only one of several negative feelings.

A simple model of neuroticism would assume that a general disposition to respond more strongly to negative emotions produces correlations among more specific dispositions to experience more anxiety, anger, sadness, and self-conscious emotions like embarrassment. This model implies a hierarchical structure with neuroticism as a higher-order factor of more specific negative dispositions.

In the early 2000s, Ashton and Lee published an alternative model of personality with six factors called the HEXACO model. The key difference between the Big Five model and the HEXACO model is the conceptualization of pro- and anti-social traits. While these traits are considered to be related to a single higher-order factor of agreeableness in the Big Five model, the HEXACO model distinguishes between agreeableness and honesty-humility as two distinct traits. However, this is not the only difference between the two models. Another important difference is the conceptualization of affective dispositions. The HEXACO model does not have a factor corresponding to neuroticism. Instead it has an emotionality factor. The only common trait to neuroticism and emotionality is anxiety, which is measured with similar items in Big Five questionnaires and in HEXACO questionnaires. The other three traits linked to emotionality are unique to the HEXACO model.

The four primary factors (also called facets) that are used to identify and measure emotionality are anxiety, fear, dependence, and sentimentality. Fear is distinguished from anxiety by a focus on immediate and often physical danger. In contrast, anxiety and worry tend to be elicited by thoughts about uncertain events in the future. Dependence is defined by a need for social comfort in difficult times. Sentimentality is a disposition to respond strongly to negative events that happen to other people, including fictional characters.

In a recent target article, Ashton and Lee argued that it is time to replace the Big Five model with the superior HEXACO model. A change from neuroticism to emotionality would be a dramatic shift given the prominence of neuroticism in the history of personality psychology. Here, I examine empirically how Emotionality is related to Neuroticism and whether personality psychologists should adopt the HEXACO framework to understand individual differences in affective dispositions.


A key problem in research on the structure of personality is that researchers often rely on questionnaires that were developed with a specific structure in mind. As a result, the structure is pre-determined by the selection of items and constructs. To overcome this problem, it is necessary to sample a broad and ideally representative sample of primary traits. The next problem is that the motivation and attention span of participants limit the number of items that a personality questionnaire can include. These problems have been resolved by Revelle and colleagues’ survey, which asks participants to complete only a subset of over 600 items. Modern statistical methods can analyze datasets with planned missing data. Thus, it is possible to examine the structure of hundreds of personality items. Condon and Revelle (2018) also made these data openly available (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/SD7SVE). I am very grateful for their initiative that provides an unprecedented opportunity to examine the structure of personality.

The items were picked to represent primary factors (facets) of the HEXACO questionnaire and the NEO-PI questionnaire. In addition, the questionnaire covered neuroticism items from the EPQ and other questionnaires. The items are based on the IPIP item pool. Each primary factor is represented by 10 items. I picked the items that represent the four HEXACO Emotionality factors (anxiety, fear, dependence, and sentimentality) and four of the NEO-PI Neuroticism factors (anxiety, anger, depression, and self-consciousness). The anxiety factor is common to both models and is represented by mostly overlapping items. Thus, this item selection resulted in 70 items that were intended to measure 7 primary factors. I added four additional items that represented variable moods (moodiness) that were included in the BFAS and EPQ, which might form an independent factor.


The data were analyzed with confirmatory factor analysis (CFA), using the MPLUS software. CFA has several advantages over traditional factor analytic methods that have been employed by proponents of the HEXACO and the Big Five models. The main advantage is that it is possible to model hierarchical structures that represent the Big Five or HEXACO factors as higher-order factors of primary factors. A second advantage is that CFA provides information about model fit whereas traditional EFA produces solutions without evaluating model fit.

Measurement Model

A first step in establishing a measurement model was to select items with high primary loadings, low secondary loadings, and low correlated residuals. The aim was to represent each primary factor with the best four items. While four items may not be enough to create a good scale, four items are sufficient to establish a measurement model of primary factors. Limiting the number of items to four is also advantageous because computing time increases with additional items and models with missing data can take a long time to converge.

Aside from primary loadings, the model included an acquiescence factor based on the coding of items. Directly coded items had unstandardized loadings of 1 and reverse-coded items had unstandardized loadings of -1. There were no secondary loadings or correlated residuals.

The model met standard criteria of model fit such as a CFI > .95 and RMSEA < .05, CFI = .954, RMSEA = .007. However, models with missing data should not be evaluated based on these fit indices because fit is determined by a different formula (Zhang & Savalei, 2019).  More importantly, modification indices showed no notable changes in model fit if fixed parameters were freed. Table 1 shows the items and their primary factor loadings.

Table 2 shows the correlations among the primary factors.

The first four factors are assumed to belong to the HEXACO-Emotionality factor. As expected, fear, anxiety, and dependence are moderately to highly positively correlated. Contrary to expectations, sentimentality showed low correlations especially with fear.

Factors 4 to 8 are assumed to be related to Big Five neuroticism. As expected, all of these factors are moderately to highly correlated.

In addition, the dependence factor from the HEXACO model also shows moderate to high correlations with all Big Five neuroticism factors. The fear factor also shows positive relations with the neuroticism factors, especially for self-consciousness.

With the exception of Sentimentality, all of the factors tend to be positively correlated, suggesting that they are related to a common higher-order factor.

Overall, this pattern of results provides little support for the notion that HEXACO-Emotionality is a distinct higher-order factor from Big-Five neuroticism.


The first model assumed that all factors are related to each other by means of a single higher-order factor. In addition, the model allowed for correlated residuals among the four HEXACO factors. This makes it possible to examine whether these four factors share additional variance with each other that is not explained by a general Negative Emotionality factor.
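The logic of this model can be sketched as follows. If a single higher-order factor were the only source of covariation, the correlation between two primary factors would equal the product of their standardized loadings on that factor. Any covariation left over is what the correlated residuals pick up. The loadings and observed correlations below are hypothetical and are not the estimates from this analysis.

```python
from itertools import combinations


def implied_correlations(loadings):
    """Correlations implied by one higher-order factor: r_ij = l_i * l_j."""
    return {(a, b): loadings[a] * loadings[b]
            for a, b in combinations(loadings, 2)}


def unexplained(observed, loadings):
    """Observed correlation minus the correlation implied by the factor."""
    implied = implied_correlations(loadings)
    return {pair: observed[pair] - implied[pair] for pair in observed}


# Hypothetical standardized loadings on a general Negative Emotionality factor.
loadings = {"fear": 0.5, "anxiety": 0.7, "dependence": 0.6, "sentimentality": 0.1}

# Hypothetical observed correlations between two pairs of primary factors.
observed = {("fear", "anxiety"): 0.45, ("dependence", "sentimentality"): 0.30}

for pair, gap in unexplained(observed, loadings).items():
    print(pair, round(gap, 2))
```

A positive gap means that the two primary factors share variance beyond the general factor, which is the pattern that a specific Emotionality factor would have to explain.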

Model fit decreased compared to the measurement model, which serves as a comparison standard for theoretical models, CFI = .916 vs. .954, RMSEA = .009 vs. .007.

All primary factors except sentimentality had substantial loadings on the Negative Emotionality factor. Table 3 shows the residual correlations for the four HEXACO factors.

All correlations are positive, suggesting that the HEXACO Emotionality factor captures some shared variance among these four factors that is not explained by the Negative Emotionality factor. However, two of the correlations are very low, indicating that there is little shared variance between sentimentality and fear or between dependence and anxiety.


The second model represented the relationships among the four HEXACO factors with a common Emotionality factor instead of correlated residuals. Model fit decreased, CFI = .914 vs. .916, RMSEA = .010 vs. .009. Loadings on the Emotionality factor ranged from .27 to .46. Fear, anxiety, and dependence had higher loadings on the Negative Emotionality factor than on the Emotionality factor.

The main conclusion from these results is that it would be problematic to replace the Big Five model with the HEXACO model because the Emotionality factor in the HEXACO model fails to capture the nature of the broader Neuroticism factor in the Big Five model. In fact, there is little evidence for a specific Emotionality factor in this dataset.


The discrepancy between the measurement model and Model 1 suggests that there are additional relationships between some primary factors that are not explained by the general Negative Emotionality factor. Examining modification indices suggested several changes to the model. Model 3 shows the final results. This model fit the data nearly as well as the measurement model, CFI = .949 vs. .954, RMSEA = .007 vs. .007. Inspection of the modification indices showed no further ways to improve the model by freeing correlated residuals among primary factors. In one case, three correlated residuals were consistent and were modeled as a factor. Figure 1 shows the results.

First, the model shows notable and statistically significant effects of neuroticism on all primary factors except sentimentality. Second, the correlated residuals show an interesting pattern in which the primary factors can be arranged in a chain. That is, depression is related to moodiness, moodiness is related to anger, anger is related to anxiety, anxiety is related to fear, fear is related to self-consciousness and dependence, self-consciousness is related to dependence, and finally, dependence is related to sentimentality. This suggests the possibility that a second broader dimension might be underlying the structure of negative emotionality. Research on emotions suggests that this dimension could be activation (fear is high, depression is low) or potency (anger is high, dependence is low). This is an important avenue for future research. The key finding in Figure 1 is that the traditional Neuroticism dimension is an important broad higher-order factor that accounts for the correlations among 7 of the 8 primary factors. These results favor the Big Five model over the HEXACO model.

A Big-5 Model of the Hexaco-100 Items

In the 1980s, personality psychologists celebrated the emergence of a five-factor model as a unifying framework for personality traits. Since then, the so-called Big-5 have dominated thinking and measurement of personality.

Two decades later, Ashton and Lee proposed an alternative model with six factors. This model has come to be known as the HEXACO model.

A recent special issue in the European Journal of Personality discussed the pros and cons of these two models. The special issue did not produce a satisfactory resolution between proponents of the two models.

In theory, it should be possible to resolve this dispute with empirical data, especially given the similarities between the two models. Five of the factors are more or less similar between the two models. One factor is Neuroticism with anxiety/worry as a key marker of this higher-order trait. A second factor is Extraversion with sociability and positive energy as markers. A third factor is Openness with artistic interests as a common marker. A fourth factor is Conscientiousness with orderliness and planful actions as markers. The key difference between the two models concerns pro-social and anti-social traits. In the Big Five model, a single higher-order trait of agreeableness is assumed to produce shared variance among all of these traits (e.g., morality, kindness, modesty). The HEXACO model assumes that there are two higher-order traits. One is also called agreeableness and the other one is called honesty-humility.

As Ashton and Lee (2005) noted, the critical empirical question is how the Big Five model accounts for the traits related to the honesty-humility factor in the HEXACO model. Although the question is straightforward, empirical tests of it are not. The problem is that personality researchers often rely on observed correlations between scales and that correlations among scales depend on the item-content of scales. For example, Ashton and Lee (2005) reported that the Big-Five Mini-Marker scale of Agreeableness correlated only r = .26 with their Honesty-Humility scale. This finding is not particularly informative because correlations between scales are not equivalent to correlations between the factors that the scales are supposed to reflect. It is also not clear whether a correlation of r = .26 should be interpreted as evidence that Honesty-Humility is a separate higher-order factor at the same level as the other Big Five traits. To answer this question, it would be necessary to provide a clear definition of a higher-order factor. For example, higher-order factors should account for shared variance among several primary factors that have only low secondary loadings on other factors.

Confirmatory factor analysis (CFA) addresses some of the problems of correlational studies with scale scores. One main advantage of CFA is that models do not depend on the item selection. It is therefore possible to fit a theoretical structure to questionnaires that were developed for a different model. I therefore used CFA to see whether it is possible to fit the Big Five model to the HEXACO-100 questionnaire that was explicitly designed to measure 4 primary factors (facets) for each of the six HEXACO higher-order traits. Each primary factor was represented by four items. This leads to 4 x 4 x 6 = 96 items. After consultation with Michael Ashton, I did not include the additional four altruism items.

Measurement Model

The Big-Five or HEXACO models are higher-order models that are supposed to explain the pattern of correlations among the primary factors. In order to test these models, it is necessary to first establish a measurement model for the primary factors. The starting point for the measurement model was a model with a simple structure where each item only has a primary loading on its designated factor. For example, the anxiety item “I sometimes can’t help worrying about little things” loaded only on the anxiety factor. All 24 primary factors were allowed to correlate freely with each other.

It is well-known that few data fit a simple structure, for two reasons. First, the direction of items can influence responses. This can be modeled with an acquiescence factor that codes whether an item is a direct or a reverse-coded item. Second, it is difficult to write items that reflect only variation in the intended primary trait. Thus, many items are likely to have small, but statistically significant, secondary loadings on other factors. These secondary loadings need to be modeled to achieve acceptable model fit, even if they have little practical significance. Another problem is that two items of the same factor may share additional variance because they share similar wordings or item content. For example, the item “I clean my office or home quite frequently” and the reverse-coded item “People often joke with me about the messiness of my room or desk” share specific content. This shared variance between items needs to be modeled with correlated residuals to achieve acceptable model fit.

Researchers can use Modification Indices to identify secondary loadings and correlated residuals that have a strong influence on model fit. Freeing the identified parameters improves model fit and can produce a measurement model with acceptable model fit. Moreover, MI can also provide information that there are no more fixed parameters that have a strong negative effect on model fit.

After modifying the simple-structure model accordingly, I established a measurement model that had acceptable fit, RMSEA = .021, CFI = .936. Although the CFI did not reach the threshold of .950, the MI did not show any further improvements that could be made. Freeing further secondary loadings resulted in secondary loadings less than .1. Thus, I stopped at this point.
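For readers unfamiliar with these fit indices, the sketch below shows the conventional complete-data formulas for RMSEA and CFI, computed from the chi-square values of the fitted model and of a baseline (independence) model. The chi-square values, degrees of freedom, and sample size are hypothetical. Note that some programs use N in place of N - 1 in the RMSEA formula, and that fit indices for data with missing values may be computed differently.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 version)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


# Hypothetical values for illustration only.
print(round(rmsea(chi2=150.0, df=100, n=500), 3))
print(round(cfi(chi2_model=150.0, df_model=100, chi2_base=1100.0, df_base=120), 3))
```

Both indices are driven by the excess of chi-square over its degrees of freedom. This is why a model can have a very small RMSEA in a large sample while its CFI stays below the conventional .950 threshold.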

16 primary factors had primary factor loadings of .4 or higher for all items. The remaining 8 primary factors had 3 primary factor loadings of .4 or higher. Only 4 items had secondary loadings greater than .3. Thus, the measurement model confirmed the intended structure of the questionnaire.

Importantly, the measurement model was created without imposing any higher-order structure on the correlations among the primary factors. Thus, the freeing of secondary loadings and correlated residuals did not bias the results in favor of the Big Five or HEXACO model. Rather, the fit of the measurement model can be used to evaluate the fit of theoretical models about the structure of personality.

A simplistic model that is often presented in textbooks would imply that only traits related to the same higher-order factor are correlated with each other and that all other correlations are close to zero. Table 1 shows the correlations for the HEXACO-Agreeableness (A-Gent = gentle, A-Forg = forgiving, A-Pati = patient, & A-Flex = flexible) and the HEXACO-honesty-humility (H-Gree = greed-avoidance, H-Fair = fairness, H-Mode = modest, & H-Sinc = sincere) factors.

In support of the Big Five model, all correlations are positive. This suggests that all primary factors are related to a single higher-order factor. In support of the HEXACO model, correlations among A-factors and correlations among H-factors tend to be higher than correlations of A-factors with H-factors. Three notable exceptions are highlighted in red and all of them involve modesty. Modesty is more strongly related to A-Gent and A-Flex than to H-Mode.

Table 2 shows the correlations of the A and H factors with the four neuroticism factors (N-Fear = fear, N-Anxi = anxiety, N-Depe = dependence, N-Sent = sentimental). Notable correlations greater than .2 are highlighted. For the most part, the results show that neuroticism and pro-social traits are unrelated. However, there are some specific relations among factors. Notably, all four HEXACO-A factors are negatively related to anxiety. This shows some dissociation between A and H factors. In addition, fear is positively related to fairness and negatively related to sincerity. Sentimentality is positively related to fairness and modesty. Neither the Big Five model nor the HEXACO model has explanations for these relationships.

Table 3 shows the correlation with the Extraversion factors (E-soci = Sociable, E-socb = bold, E-live = lively, E-Sses = self-esteem). There are few notable relationships between A and H factors on the one hand and E factors on the other hand. This supports the assumption of both models that pro-social traits are unrelated to extraversion traits, including being sociable.

Table 4 shows the results for the Openness factors. Once more there are few notable relationships. This is consistent with the idea that pro-social traits are fairly independent of Openness.

Table 5 shows the results for conscientiousness factors (C-Orga = organized, C-Dili = diligent, C-Perf = Perfectionistic, & C-Prud = prudent). Most of the correlations are again small, indicating that pro-sociality is independent of conscientiousness. The most notable exceptions are positive correlations of the conscientiousness factors with fairness. This suggests that fairness is related to conscientiousness.

Table 6 shows the remaining correlations among the N, E, O, and C factors.

The green triangles show correlations among the primary factors belonging to the same higher-order factor. The strong correlations confirm the selection of primary factors to be included in the HEXACO-100. Most of the remaining correlations are below .2. The grey fields show correlations greater than .2. The most notable correlations are for diligence (C-Dili), which is correlated with all E-factors. This suggests a notable secondary loading of diligence on the higher-order factor E. Another noteworthy finding is a strong correlation between self-esteem (E-Sses) and anxiety (N-Anxi). This is to be expected because self-esteem is known to have strong relationships with neuroticism. It is surprising, however, that self-esteem is not related to the other primary factors of neuroticism. One problem in interpreting these results is that the other neuroticism facets are unique to the HEXACO-100.

In conclusion, inspection of the correlations among the 24 primary factors shows clear evidence for 5 mostly independent factors that correspond to the Big Five factors. In addition, the correlations among the pro-social factors show a distinction between the four HEXACO-A factors and the four HEXACO-H factors. Thus, it is possible to represent the structure with 6 factors that correspond to the HEXACO model, but the higher-order A and H factors would not be independent.

A Big Five Model of the HEXACO-100

I fitted a model with five higher-order factors to examine the ability of the Big Five model to explain the structure of the HEXACO-100. Importantly, I did not alter the measurement model of the primary factors. It is clear from the previous results that a simple-structure would not fit the data. I therefore allowed for secondary loadings of primary factors on the higher-order factors. In addition, I allowed for residual correlations among primary factors. Furthermore, when several primary factors showed consistent correlated residuals, I modeled them as factors. In this way, the HEXACO-A and HEXACO-H factors could be modeled as factors that account for correlated residuals among pro-social factors. Finally, I added a halo factor to the model. The halo factor has been identified in many Big Five questionnaires and reflects the influence of item-desirability on responses.

Model fit was slightly lower than the fit of the measurement model, RMSEA = .021 vs. .021, CFI = .927 vs. .936. However, inspection of MI did not suggest additional plausible ways to improve the model. Figure 1 shows the primary loadings on the Big Five factors and the two HEXACO factors, HEXACO-Agreeableness (HA) and HEXACO-Honesty-Humility (HH).

The first notable observation is that primary factors have loadings above .5 for four of the Big Five factors. For the Agreeableness factor, all loadings were statistically significant and above .2, but four loadings were below .5. This shows that agreeableness explains less variance in some primary factors than the other Big Five factors. Thus, one question is whether the magnitude of loadings on the Big Five factors should be a criterion for model selection.

The second noteworthy observation is that the model clearly identified HEXACO-A and HEXACO-H as distinct factors. That is, the residuals of the corresponding primary factors were all positively correlated. All loadings were above .2, but several of the loadings were also below .5. Moreover, for the HEXACO-A factors the loadings on the Big5-A factor were stronger than the loadings on the HEXACO-A factor. Modesty (H-Mode) also loaded more highly on Big5-A than HH. The results for HEXACO-A are not particularly troubling because the HEXACO model does not consider this factor to be particularly different from Big5-A. Thus, the main question is whether the additional shared variance among HEXACO-H factors warrants the creation of a model with six factors. That is, does Honesty-Humility have the same status as the Big Five factors?

Alternative Model 1

The HEXACO model postulates six factors. Comparisons of the Big Five and HEXACO model tend to imply that the HEXACO factors are just as independent as the Big Five factors. However, the data show that HEXACO-A factors and HEXACO-H factors are not as independent of each other as other factors. To fit a six-factor model to the data, it would be possible to allow for a correlation between HEXACO-A and HEXACO-H. To make this model fit as well as the Big-Five model, an additional secondary loading of modesty (H-Mode) on HEXACO-A was needed, RMSEA = .022, CFI = .926. This secondary loading was low, r = .25, and is not displayed in Figure 2.

The most notable finding is a substantial correlation between HEXACO-A and HEXACO-H of r = .49. Although there are no clear criteria for practical independence, this correlation is strong and suggests that an important common factor produces a positive correlation between these two factors. This makes this model rather unappealing. The main advantage of the Big Five model would be that it captures the highest level of independent factors in a hierarchy of personality traits.

Alternative Model 2

An alternative solution to represent the correlations among HEXACO-A and HEXACO-H factors is to treat HEXACO-A and HEXACO-H as independent factors and to allow for secondary loadings of HEXACO-H factors on HEXACO-A or vice versa. Based on the claim that the H-factor adds something new to the structure, I modelled secondary loadings of the primary H-factors on HEXACO-A. Fit was the same as for the first alternative model, RMSEA = .022, CFI = .927. Figure 3 shows substantial secondary loadings for three of the four H-factors, and for modesty the loading on the HEXACO-A factor is even stronger than the loading on the HEXACO-H factor.

The following table shows the loading pattern along with all secondary loadings greater than .1. Notable secondary loadings greater than .3 are highlighted in pink. Aside from the loading of some H-factors on A, there are some notable loadings of two C-factors on E. This finding is consistent with other results that high achievement motivation is related to E and C.

The last column provides information about correlated residuals (CR). Primary factors with the same letter have a correlated residual. For example, there is a strong negative relationship between anxiety (N-anxiety) and self-esteem (E-Sses) that was apparent in the correlations among the primary factors in Table 6. This relationship could not be modeled as a negative secondary loading on neuroticism because the other neuroticism factors showed much weaker relationships with self-esteem.


In sum, the choice between the Big5 model and the HEXACO model is a relatively minor stylistic choice. The Big Five model is a broad model that predicts variance in a wide variety of primary personality factors that are often called facets. There is no evidence that the Big Five model fails to capture variation in the primary factors that are used to measure the Honesty-Humility factor of the HEXACO model. All four H-factors are related to a general agreeableness factor. Thus, it is reasonable to maintain the Big Five model as a model of the highest level in a hierarchy of personality traits and to consider the H-factor a factor that explains additional relationships among pro-social traits. However, an alternative model with Honesty-Humility as a sixth factor is also consistent with the data. This model only appears different from the Big Five model if secondary loadings are ignored. However, all H-factors had secondary loadings on agreeableness. Thus, agreeableness remains a broader trait that links all pro-social traits, while Honesty-Humility explains additional relationships among a subset of these factors. If Honesty-Humility is indeed a distinct global factor, it should be possible to find primary factors that are uniquely related to this factor without notable secondary loadings on Agreeableness. If such traits exist, they would strengthen the support for the HEXACO model. On the other hand, if all traits that are related to Honesty-Humility also load on Agreeableness, it seems more appropriate to treat Honesty-Humility as a lower-level factor in the hierarchy of traits. In conclusion, these structural models did not settle the issue, but they clarify it. Agreeableness factors and Honesty-Humility factors form distinct, but related clusters of primary traits. This empirical finding can be represented with a Five-Factor model with Honesty-Humility as shared variance among some pro-social traits, or it can be represented with six factors and secondary loadings.


A major source of confusion in research on the structure of personality is the failure to distinguish between factors and scales. Many proponents of the HEXACO model point out that the HEXACO scales, especially the Honesty-Humility scale, explain variance in criterion variables that is not explained by Big-Five scales. It has also been observed that the advantage of the HEXACO scales depends on the Big-Five scales that are used. The reason for these findings is that scales are imperfect measures of their intended factors. They also contain information about the primary factors that were used to measure the higher-order factors. The advantage of the HEXACO-100 is that it measures 24 primary factors. There is nothing special about the Honesty-Humility factor. As Figure 1 shows, the Honesty-Humility factor explains only a portion of the variance in its designated primary factors, namely .67^2 = 45% of the variance in greed-avoidance, .55^2 = 30% of the variance in fairness, .32^2 = 10% of the variance in modesty, and .41^2 = 17% of the variance in sincerity. Averaging these scales to form an Honesty-Humility scale destroys some of this variance and inevitably lowers the ability to predict a criterion variable that is strongly related to one of these primary factors. There is also no reason why Big Five questionnaires should not include some primary factors of Honesty-Humility, and the NEO-PI-3 does include modesty and fairness.
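The arithmetic behind these percentages can be checked with a few lines of Python. This is only an illustrative sketch: the loadings are the four values quoted above, while the dictionary and function names are made up for the example.

```python
# Standardized loadings of the four primary factors on Honesty-Humility,
# as quoted in the text (Figure 1).
loadings = {
    "greed-avoidance": 0.67,
    "fairness": 0.55,
    "modesty": 0.32,
    "sincerity": 0.41,
}


def variance_explained(loading):
    """Share of variance explained by the factor: the squared standardized loading."""
    return loading ** 2


explained = {name: variance_explained(value) for name, value in loadings.items()}

for name, share in explained.items():
    # The remainder is variance that is unique to the primary factor
    # (or explained by other factors) and is lost when scales are averaged.
    print(f"{name}: {share:.0%} explained by H, {1 - share:.0%} not explained by H")
```

Even for the strongest indicator, more than half of the variance is not shared with the higher-order factor, which is why a criterion that depends on one specific primary factor can be predicted better from that primary factor than from the averaged scale.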

Personality psychologists need to distinguish more clearly between factors and scales. Correlations with the NEO-PI-3 agreeableness scale will differ from those with the HEXACO-A scale or the BFI2-agreeableness scale. Scale correlations are biased by the choice of items, unless items are carefully selected to maximize the correlation with the latent factor. For research purposes, researchers should use latent variable models that can decompose an observed correlation into the influence of the higher-order factor and the influence of specific factors.

Personality researchers should also think carefully about the primary factors they may want to include in their studies. For example, even researchers who favor a HEXACO model may include additional measures of anger and depression to explore the contribution of affective dispositions to outcome measures. Similarly, Big Five researchers may want to supplement their Big Five questionnaires with measures of primary traits related to honesty and morality if the Big-Five measure does not capture them. A focus on the higher-order factors is only justified in studies that require short measures with a few items.


My main contribution to the search for a structural model of personality is to examine this question with a statistical tool that makes it possible to test structural models of factors. The advantage of this method is that it is possible to separate structural models of factors from the items that are used to measure factors. While scales of the same factor can differ sometimes dramatically, structural models of factors are independent of the specific items that are used to measure a factor as long as some items reflect variance in the factor. Using this approach, I showed that the Big Five and HEXACO model only differ in the way they represent covariation among some primary factors. It is incorrect to claim that Big Five models fail to represent variation in honesty or humility. It is also incorrect to assume that all pro-social traits are independent after their shared variance in agreeableness is removed. Future research needs to examine more carefully the structural relationships among primary traits that are not explained by higher-order factors. This research question has been neglected because exploratory factor analysis is unable to examine this question. I therefore urge personality researchers to adopt confirmatory factor analysis to advance research on personality structure.

A Meta-Psychological Investigation of Intelligence Research with Z-Curve.2.0

A recent article by Nuijten, van Assen, Augusteijn, Crompvoets, and Wicherts reported the results of a meta-meta-analysis of intelligence research. The authors extracted 2442 effect sizes from 131 meta-analyses. The authors made these data openly available to allow “readers to pursue other categorizations and analyses” (p. 6). In this blog post, I report the results of an analysis of their data with z-curve.2.0 (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve is a powerful statistical tool that can (a) examine the presence of publication bias and/or the use of questionable research practices, (b) provide unbiased estimates of statistical power before and after selection for significance when QRPs are present, and (c) estimate the maximum number of false positive results.

Questionable Research Practices

The term questionable research practices refers to a number of statistical practices that inflate the number of significant results in a literature (John et al., 2012). Nuijten et al. relied on the correlation between sample size and effect size to examine the presence of publication bias. Publication bias produces a negative correlation between sample size and effect size because larger effects are needed to get significance in studies with smaller samples. The method has several well-known limitations. Most important, a negative correlation is also expected if researchers use larger samples when they anticipate smaller effects, either in the form of a formal a priori power analysis or based on informal information about sample sizes in previous studies. For example, it is well-known that effect sizes in molecular genetics studies are tiny and that sample sizes are huge. Thus, a negative correlation is expected even without publication bias.
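Why selection for significance ties effect size to sample size can be illustrated with the smallest correlation that reaches p < .05 (two-sided) at a given sample size. The sketch below uses the Fisher z approximation rather than the exact t-test, so the values are approximate, and the function name is invented for the illustration.

```python
from math import sqrt, tanh
from statistics import NormalDist


def min_significant_r(n, alpha=0.05):
    """Approximate smallest |r| that is significant (two-sided) with n participants.

    Uses the Fisher z approximation: atanh(r) has a standard error of
    1 / sqrt(n - 3), so significance requires atanh(r) * sqrt(n - 3) > z_crit.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))


for n in (20, 50, 100, 500, 5000):
    print(f"N = {n:>4}: |r| must exceed about {min_significant_r(n):.3f}")
```

If only significant results are published, a study with N = 20 can only appear in the literature with an observed correlation of roughly .44 or larger, whereas a study with N = 500 can appear with a correlation below .10. That alone produces a negative correlation between sample size and published effect size.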

Z-curve.2.0 avoids this problem by using a different approach to detect the presence of publication bias. The approach compares the observed discovery rate (ODR; i.e., the percentage of significant results) to the expected discovery rate (EDR; i.e., the average power of studies before selection for significance). To estimate the EDR, z-curve.2.0 fits a finite mixture model to the significant results and estimates average power based on the weights of a finite number of non-centrality parameters.
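The observed discovery rate is the simpler of the two quantities and can be computed directly from the z-scores. The following sketch shows only that step; the z-scores are hypothetical values chosen for illustration, and estimating the expected discovery rate requires fitting the mixture model, which is not attempted here.

```python
from statistics import NormalDist


def observed_discovery_rate(z_scores, alpha=0.05):
    """Proportion of absolute z-scores that exceed the two-sided critical value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    significant = [z for z in z_scores if abs(z) > z_crit]
    return len(significant) / len(z_scores)


# Hypothetical absolute z-scores, not taken from the intelligence data.
example_z = [0.4, 1.1, 1.7, 2.1, 2.6, 3.3, 0.9, 4.8]
odr = observed_discovery_rate(example_z)
print(f"ODR = {odr:.0%}")
```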

I converted the reported information about sample size, effect size, and sampling error into t-values. Extremely large t-values greater than 20 were fixed to a value of 20. Then the t-values were converted into absolute z-scores.

Figure 1 shows a histogram of the z-scores in the critical range from 0 to 6. All z-scores greater than 6 are assumed to have a power of 1 with a significance threshold of .05 (z = 1.96).
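The assumption about z-scores above 6 can be made concrete with the power function of a two-sided z-test. This is a generic textbook formula, not code from the z-curve package, and it treats the value 6 as the true non-centrality parameter rather than as an observed z-score.

```python
from statistics import NormalDist


def power_two_sided(ncp, alpha=0.05):
    """Power of a two-sided z-test when the test statistic is N(ncp, 1)."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of landing above +z_crit plus probability of landing below -z_crit.
    return (1 - norm.cdf(z_crit - ncp)) + norm.cdf(-z_crit - ncp)


for ncp in (0, 1.96, 2.8, 4, 6):
    print(f"non-centrality {ncp}: power = {power_two_sided(ncp):.5f}")
```

With a non-centrality of 0 the function returns alpha, with 1.96 it returns about 50%, and with 6 it is indistinguishable from 1 for practical purposes.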

The critical comparison of the observed discovery rate (52%) and the expected discovery rate (58%) shows no evidence of QRPs. In fact, the EDR is even higher than the ODR, but the confidence interval is wide and includes the ODR. When there is no evidence that QRPs are present, it is better to use all observed z-scores, including the credible non-significant results, to fit the finite mixture model. Figure 2 shows the results. The blue line moved to 0, indicating that all values were used for estimation.

Visual inspection shows a close match between the observed distribution of z-scores (blue line) and the predicted distribution by the finite mixture model (grey line). The observed discovery rate now closely matches the expected discovery rate of 52%. Thus, there is no evidence of publication bias in the meta-meta-analysis of effect sizes in intelligence research.

Interestingly, there is also no evidence that researchers used mild QRPs to move marginally significant results across the significance criterion to produce just significant results (p-values just below .05). There are two possible explanations for this. On the one hand, intelligence researchers may be more honest than other psychologists. On the other hand, it is possible that meta-analyses are not representative of the focal hypothesis tests that led to publication of original research articles. A meta-analysis of focal hypothesis tests in original articles is needed to answer this question.

In conclusion, this superior analysis of the presence of bias in the intelligence literature showed no evidence of bias. In contrast, Nuijten et al. (2020) found a significant correlation between effect sizes and sample sizes, which they call a small-study effect. The problem with this finding is that it can reveal either careful planning of sample sizes (good practices) or the use of QRPs (bad practices). Thus, their analysis does not tell us whether there is bias in the data. Z-curve.2.0 resolves this ambiguity and shows that there is no evidence of selection for significance in these data.

Statistical Power

Nuijten et al. used Cohen’s classic approach to investigate power (Cohen, 1962). Based on this approach, they concluded “we found an overall median power of 11.9% to detect a small effect, 54.5% for a medium effect, and 93.9% for a large effect (corresponding to a Pearson’s r of 0.1, 0.3, and 0.5 or a Cohen’s d of 0.2, 0.5, and 0.8, respectively)”.

This information merely describes the sample sizes in the different studies. Studies with small sample sizes have low power to detect a small effect size. As most studies had small sample sizes, the average power to detect small effects is low. However, this does not tell us anything about the actual power of studies to obtain significant results for two reasons. First, effect sizes in a meta-meta-analysis are extremely heterogeneous. Thus, not all studies are chasing small effect sizes. As a result, the power of studies is likely to be higher than the average power to detect small effect sizes. Second, the previous results showed that (a) sample sizes correlate with effect sizes and (b) there is no evidence of QRPs. This means that researchers are a priori deciding to use smaller samples to search for larger effects and larger samples to search for smaller effects. As a result, formal or informal a priori power analyses ensure that small samples can have as much or more power than large samples. It is therefore not informative to conduct power analysis only based on information about sample size. Z-curve.2.0 avoids this problem and provides estimates of the actual mean power of studies. Moreover, it provides two estimates of power for two different populations of studies. One population consists of all studies that are conducted by intelligence researchers without selecting for significance. This estimate is the expected discovery rate. Z-curve also provides an estimate for the population of studies that produced a significant result. This population is of interest because only significant results can be used to claim a discovery with an error rate of 5%. When there is heterogeneity in power, the mean power after selection for significance is higher than the average power before selection for significance (Brunner & Schimmack, 2020). When researchers attempt to replicate a significant result to verify that it was not a false positive result, mean power after selection for significance provides the average probability that an exact replication study will be significant. This information is valuable to evaluate the outcome of actual replication studies (cf. Schimmack, 2020).
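The effect of selection on mean power can be illustrated with a toy example. A study with power p produces a significant result with probability p, so studies with high power are overrepresented among significant results; the mean power after selection is therefore a power-weighted average. The two power values below are hypothetical and the function names are invented for the sketch.

```python
def mean_power_before_selection(powers):
    """Unweighted mean power of all studies that were conducted."""
    return sum(powers) / len(powers)


def mean_power_after_selection(powers):
    """Mean power of the studies that produced a significant result.

    Each study is selected with probability equal to its power, so the
    mean after selection weights each power value by itself.
    """
    return sum(p * p for p in powers) / sum(powers)


# Hypothetical population: half of the studies have 20% power, half have 80%.
powers = [0.2, 0.8]
before = mean_power_before_selection(powers)
after = mean_power_after_selection(powers)
print(f"before selection: {before:.0%}, after selection: {after:.0%}")
```

In this example mean power rises from 50% before selection to 68% after selection. When all studies have the same power, the two values are identical, which is why heterogeneity matters.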

Given the lack of publication bias, there are two ways to determine mean power before selection for significance. We can simply compute the percentage of significant results (the observed discovery rate), or we can use the expected discovery rate. Figure 2 shows that both values are 52%. Thus, the average power of studies conducted by intelligence researchers is 52%. This is well below the recommended level of 80%.

The picture is a bit better for studies with a significant result. Here the average power, called the expected replication rate, is 71%, and the 95% confidence interval approaches 80%. Thus, we would expect that more than 50% of significant results in intelligence research can be replicated with a significant result in the replication study. This estimate is higher than for social psychology, where the expected replication rate is only 43%.

False Positive Psychology

The past decade has seen a number of stunning replication failures in social psychology (cf. Schimmack, 2020). This has led to a concern that most discoveries in psychology, if not in all sciences, are false positive results that were obtained with questionable research practices (Ioannidis, 2005; Simmons et al., 2011). So far, however, these concerns are based on speculations and hypothetical scenarios rather than actual data. Z-curve.2.0 makes it possible to examine this question empirically. Although it is impossible to say how many published results are in fact false positive results, it is possible to estimate the maximum number of false-positive results based on the discovery rate (Soric, 1989). As the observed and expected discovery rates are identical, we can use the value of 52% as our estimate of the discovery rate. This implies that no more than 5% of the significant results are false positive results. Thus, the empirical evidence shows that most published results in intelligence research are not false positives.
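The 5% bound follows from Soric's (1989) formula for the maximum false discovery rate, which depends only on the discovery rate and alpha. The sketch below implements that formula; the function name is made up for the illustration.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Upper bound on the proportion of false positives among significant results.

    The bound assumes that all true effects are detected with 100% power,
    which maximizes the share of discoveries that can come from true nulls.
    """
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


for dr in (0.05, 0.10, 0.25, 0.52, 0.80):
    print(f"discovery rate {dr:.0%}: max FDR = {soric_max_fdr(dr):.1%}")
```

With a discovery rate of 52% and alpha = .05 the bound is about 4.9%. When the discovery rate equals alpha (5%), the bound is 100%, because every significant result could then be a false positive.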

Moreover, this finding implies that most non-significant results are false negatives or type-II errors. That is, the null-hypothesis is also false for non-significant results. This is not surprising because many intelligence studies are correlational and the nil-hypothesis that there is absolutely no relationship between two naturally occurring variables has a low a priori probability. This also means that intelligence researchers would benefit from specifying some minimal effect size for hypothesis testing or from focusing on effect size estimation rather than hypothesis testing.


Nuijten et al. conclude that intelligence research is plagued by QRPs. “Based on our findings, we conclude that intelligence research from 1915 to 2013 shows signs that publication bias may have caused overestimated effects”. This conclusion ignores that small-sample effects are ambiguous. The superior z-curve analysis shows no evidence of publication bias. As a result, there is also no evidence that reported effect sizes are inflated.

The z-curve.2.0 analysis leads to a different conclusion. There is no evidence of publication bias, significant results have a probability of 70% to be replicated in exact replication studies, and even if exact replication studies are impossible, the discovery rate of 50% implies that we should expect the majority of replication attempts with the same sample sizes to be successful (Bartos & Schimmack, 2020). In replication studies with larger samples even more results should replicate. Finally, most of the non-significant results are false negative results because there are few true null-hypotheses in correlational research. A modest increase in sample sizes could easily achieve the 80% power that is typically recommended.

A larger concern is the credibility of conclusions based on meta-meta-analyses. The problem is that meta-analyses focus on general main effects that are consistent across studies. In contrast, original studies may focus on unique patterns in the data that cannot be subjected to meta-analysis because direct replications of these specific patterns are lacking. Future research therefore needs to code the focal hypothesis tests in intelligence articles to examine the credibility of intelligence research.

Another concern is the reliance on alpha = .05 as a significance criterion. Large genomic studies have a multiple comparison problem where 10,000 analyses can easily produce hundreds of significant results with alpha = .05. This problem is well-known and genetics studies now use much lower alpha levels to test for significance. A proper power analysis of these studies needs to use the actual alpha level rather than the standard level of .05. Z-curve is a flexible tool that can be used with different alpha levels. Therefore, I highly recommend z-curve for future meta-scientific investigations of intelligence research and other disciplines.
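The size of the multiple comparison problem is easy to quantify. The sketch below computes the number of significant results expected by chance alone for 10,000 tests, and the z-score threshold that a lower alpha implies. The Bonferroni correction is used here only as one familiar example of a lower alpha level; it is an assumption of this illustration, not a description of what any particular genetics study does.

```python
from statistics import NormalDist

n_tests = 10_000
alpha = 0.05

# Expected number of significant results if every null hypothesis were true.
expected_false_positives = n_tests * alpha

# Illustrative stricter criterion: Bonferroni-corrected alpha.
bonferroni_alpha = alpha / n_tests

# Two-sided critical z-scores for the standard and the corrected criterion.
z_crit_standard = NormalDist().inv_cdf(1 - alpha / 2)
z_crit_corrected = NormalDist().inv_cdf(1 - bonferroni_alpha / 2)

print(f"expected false positives at alpha = .05: {expected_false_positives:.0f}")
print(f"critical z at alpha = .05: {z_crit_standard:.2f}")
print(f"critical z at alpha = {bonferroni_alpha:.0e}: {z_crit_corrected:.2f}")
```

A power analysis that keeps the criterion at z = 1.96 when the study actually requires a z-score above 4.5 will substantially overestimate power.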


Bartoš, F., & Schimmack, U. (2020). z-curve.2.0: Estimating replication and discovery rates. Under review.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology. MP.2018.874, https://doi.org/10.15626/MP.2018.874

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. http://dx.doi.org/10.1037/h0045186

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124. http://dx.doi.org/10.1371/journal.pmed.0020124

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication. https://doi.org/10.1037/cap0000246

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632

Sorić, B. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association, 84(406), 608–610.

The Structure of Agreeableness

In 1934, Thurstone published his groundbreaking article “The Vectors of Mind” that introduced factor analysis as an objective method to examine the structure of personality traits. His first application of his method to a list of 60 trait adjectives rated by 1,300 participants yielded five factors. It would take several more decades for personality psychologists to settle on a five-factor model of personality traits (Digman, 1990).

Although the five-factor model dominates personality research, it is not the only theory of personality traits. The biggest rival of the Big Five model is the HEXACO model that postulates six factors (Ashton, Lee, Perugini et al., 2004; Lee & Ashton, 2004).

A recent special issue in the European Journal of Personality contained a target article by Ashton and Lee in favor of replacing the Big Five model with the HEXACO model and responses by numerous prominent personality psychologists in defense of the Big Five model.

The key difference between the Big Five model and the HEXACO model is the representation of pro-social (self-transcending) versus egoistic (self-enhancing) traits. Whereas the Big Five model assumes that a single general factor called agreeableness produces shared variance among pro-social traits, the HEXACO model proposes two distinct factors called Honesty-Humility and Agreeableness. While the special issue showcases the disagreement among personality researchers, it fails to offer an empirical solution to this controversy.

I argue that the main reason for stagnation in research on the structure of personality is the reliance on Thurstone’s outdated method of factor analysis. Just like Thurstone’s multi-factor method was an important methodological contribution that replaced Spearman’s single-factor method, Jöreskog (1969) developed confirmatory factor analysis that addresses several limitations in Thurstone’s method. Neither Ashton and Lee nor any of the commentators suggested the use of CFA to test empirically whether pro-social traits are represented by one general or two broad traits.

There are several reasons why personality psychologists have resisted the use of CFA to study personality structure. Some of these reasons were explicitly stated by McCrae, Zonderman, Costa, Bond, and Paunonen (1996).

1. “In this article we argue that maximum likelihood confirmatory factor analysis (CFA), as it has typically been applied in investigating personality structure, is systematically flawed” (p. 552).

2. “The CFA technique may be inappropriately applied” (p. 553)

3. “CFA techniques are best suited to the analysis of simple structure models” (p. 553)

4. “Even proponents of CFA acknowledge a long list of problems with the technique, ranging from technical difficulties in estimation of some models to the cost in time and effort involved.”

5. “The major advantage claimed for CFA is its ability to provide statistical tests of the fit of empirical data to different theoretical models. Yet it has been known for years that the chi-square test on which most measures of fit are based is problematic”

6. A variety of alternative measures of goodness-of-fit have been suggested, but their interpretation and relative merits are not yet clear, and they do not yield tests of statistical

7. Data showing that chi-square tests lead to overextraction in this sense call into question the appropriateness of those tests in both exploratory and confirmatory maximum likelihood models. For example, in the present study the largest single problem was a residual correlation between NEO-PI-R facet scales E4: Activity and C4: Achievement Striving. It would be possible to specify a correlated error term between these two scales, but the interpretation of such a term is unclear. Correlated error usually refers to a nonsubstantive source of variance. If Activity and Achievement Striving were, say, observer ratings, whereas all other variables were self-reports, it would make sense to control for this difference in method by introducing a correlated error term. But there are no obvious sources of correlated error among the NEO-PI-R facet scales in the present study.

8. With increasing familiarity of the technique and the availability of convenient computer programs (e.g., Bentler, 1989; Joreskog & Sorbom, 1993), it is likely that many more researchers will conduct CFA analyses in the future. It is therefore essential to point out the dangers in an uncritical adoption and simplistic application of CFA techniques (cf. Breckler, 1990).

9. Structures that are known to be reliable showed poor fits when evaluated by CFA techniques. We believe this points to serious problems with CFA itself when used to examine personality structure.

I may be giving McCrae et al. (1996) too much credit for the lack of CFA studies of personality structure, but their highly cited article clearly did not encourage future generations to explore personality structure with CFA. This is unfortunate because CFA has many advantages over traditional EFA.

The first advantage is the ability to compare model fit. McCrae et al. are overly concerned about the chi-square statistic that is sensitive to sample size and rewards overly complex models. Even when they published their article, other fit indices were already used to address these problems. Today researchers have even more experience with the evaluation of model fit. More important, well-established fit indices also reward parsimony and can favor a more parsimonious model with five factors over a less parsimonious model with six factors. This opens the door to head-to-head model comparisons of the Big Five and HEXACO model.

The second advantage of CFA is that factors are theoretically specified by researchers rather than empirically driven by patterns in the data. This means that CFA can find factors that are represented by as few as two items and factors that are represented by 10 or more items. In contrast, EFA will favor factors with many items. This means that researchers need to know the structure to represent all factors equally, which is not possible in studies that try to discover a structure and the number of factors. Different sampling from the item space may explain differences in personality structures with EFA, but is not a problem for CFA as long as factors are represented by a minimum of two items.

A third advantage of CFA is that it is possible to model hierarchical structures. Theoretically, most personality researchers agree that the Big Five or HEXACO are higher-order factors that explain only some of the variance in so-called facets or primary traits like modesty, altruism, forgiveness, or morality. However, EFA cannot represent hierarchies. This makes it necessary to examine the higher order structure of personality with scales that average several items. However, these scales are impure indicators of the primary factors and the impurities can distort the higher-order factor structure.

A fourth advantage is that CFA can also model method factors that can distort the actual structure of personality. For example, Thurstone’s results appeared to be heavily influenced by an evaluative factor. With CFA it is possible to model evaluative biases and other response styles like acquiescence bias to separate systematic method variance from the actual correlations between traits (Anusic et al., 2009).

Finally, CFA requires researchers to think hard about the structure of personality to specify a plausible model. In contrast, EFA produces a five factor or six-factor solution in less than a minute. Most of the theoretical work is then to find post-hoc explanations for the structure.

In short, most of the problems listed by McCrae et al. (1996) are not bugs, but features of CFA. The fact that they were unable to create a fitting model to their data only shows that they didn’t take advantage of these features. In contrast, I was able to fit a hierarchical model with method factors to Costa and McCrae’s NEO-PI-R questionnaire (Schimmack, 2019). The results mostly confirmed the Big Five model, but some facets did not have primary loadings on the predicted factor.

Here I am using hierarchical CFA to examine the structure of pro-social and anti-social traits. The Big Five model predicts that all traits are related to one general factor that is commonly called Agreeableness. The HEXACO model predicts that there are two relatively independent factors that are called Agreeableness and Honesty-Humility (Ashton & Lee, 2005).


The data were collected by Crowe, Lynam, and Miller (2017). 1205 participants provided self-ratings on 104 items that were selected from various questionnaires to measure Big-5 agreeableness or HEXACO-agreeableness and honesty and humility. The data were analyzed with EFA. In the recent debate about the structure of personality, Lynam, Crowe, Vize, and Miller (2020) pointed to the finding of a general factor to argue that there is “Little Evidence That Honesty-Humility Lives Outside of FFM Agreeableness” (p. 530). They also noted that “at no point in the hierarchy did a separate Honesty-Humility factor emerge” (p. 530).

In their response, Ashton and Lee criticize the item-selection and argue that the authors did not sample enough items that reflect HEXACO-Agreeableness. “Now, the Crowe et al. variable set did contain a substantial proportion of items that should represent good markers of Honesty-Humility, but it was sharply lacking in items that represent some of the best markers of HEXACO Agreeableness: for example, one major omission was item content related to even temper versus anger-proneness, which was represented only by three items of the Patience facet of HEXACO-PI-R Agreeableness” (p. 565). They also are concerned about oversampling of other facets. “The Crowe et al. “Compassion” factor is one of the all-time great examples of a ‘bloated specific.’” (p. 565). These are valid concerns for analyses with EFA that allow researchers to influence the factor structure by undersampling or oversampling items. However, CFA is not influenced by the number of items that reflect a specific facet. Even two items are sufficient to create a measurement model of a specific factor and to examine whether these two factors are fairly independent or substantially correlated. Thus, it is a novel contribution to examine the structure of pro-social and anti-social traits using CFA.

Exploratory Analyses

Before presenting the CFA results, I used EFA as implemented in MPLUS to examine the structure of the 104 items. Given the predicted hierarchical structure, it is obvious that neither a one-factor nor a two-factor solution should fit the data. The main purpose of an EFA analysis would be to explore the number of primary factors / facets that is represented in the 104 items. To answer this question, it is inappropriate to rely on the scree test that is overly influenced by item-selection or on the chi-square test that leads to over-extraction. Instead, factor solutions can be compared with standard fit indices from CFA such as the Comparative Fit Index (CFI), the Root Mean Square Error of Approximation (RMSEA), the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the sample-size adjusted BIC (SSA-BIC).

Although all indices take parsimony into account, three of the indices favor the most complex structure with 20 factors. BIC favors parsimony the most and settles for 12 factors as the optimal number. The sample-size adjusted BIC favors 16 factors. Evidently, the number of primary factors is uncertain. The reason is that many larger primary factors may be split into highly correlated more specific factors that are called nuances. The structure of primary factors can be explored with CFA analyses of primary factors. Thus, I settled for the 12 factor solution to start the CFA analyses.

Another way to determine the number of primary factors is to look at the scales that were represented in the items. There were 9 HEXACO scales: forgiving (A), gentle (A), flexible (A), patient (A), modest (H), fair (H), greed avoidant (H), sincere (H), and altruistic. In addition, there were the Big Five facets empathetic, trusting, straightforward, modest, compassionate, and polite. Some of these facets overlap with HEXACO facets, suggesting that the 12 factor solution may reflect the full content of the HEXACO and Big Five facets.

Exploration of the Primary Factors with CFA

Before I start, it is important to point out a major misconception about CFA. The term confirmatory has misled researchers to assume that CFA should only be used to confirm a theoretically expected structure. Any post-hoc modifications of a model to fit actual data would then be a questionable research practice. This is a misconception that is based on Joreskog’s unfortunate decision to label his statistical method confirmatory. Joreskog’s actual description of his method, which few researchers have read, makes it clear that CFA can be used for exploration.

“We shall give examples of how a preliminary interpretation of the factors can be successively modified to determine a final solution that is acceptable from the point of view of both goodness of fit and psychological interpretation. It is highly desirable that a hypothesis that has been generated in this way should subsequently be confirmed or disproved by obtaining new data and subjecting these to a confirmatory analysis.” (p. 183).

Using CFA to explore data is no different from running a multiple regression analysis or an exploratory ANOVA. Data exploration is an important part of science. It is only questionable when exploratory analyses are falsely presented as confirmatory. For example, I could pretend that I came up with an elaborate theory of agreeableness and present the final model as theoretically predicted. This is known as HARKing (Kerr, 1998). However, exploration is needed to generate a model that is worth testing in a confirmatory study. As nobody has examined the structure of agreeableness with CFA, the first attempt to fit a model is naturally exploratory and can only serve as a starting point for the development of a structural model of agreeableness.

Items were considered to be candidate items of a factor if they loaded at least .3 on a factor. This may seem a low level, but it is unrealistic to expect high loadings of single items on a single factor.

The next step was to fit a simple structure to the items of each factor. When possible this model also included an acquiescence factor that coded direct versus reverse scored items. Typically, this model did not fit the data well. The next step was to look for correlated residuals that show shared variance between items. These items violate the assumption of local independence. That is, the only reason for the correlation between items should be the primary factor. Items can show additional shared variance for a number of reasons such as similar wording or shared specific content (nuances). When many items were available, one of the items with correlated residuals was deleted. Another criterion for item selection was the magnitude of primary loadings. A third criterion aimed to get a balanced number of direct and reverse scored items when this was possible.

Out of the 12 factors, 9 factors were interpretable and matched one of the a priori facets. The 9 primary factors were allowed to correlate freely with each other. The model had acceptable overall fit, CFI = .954, RMSEA = .030. Table 2 shows the factors and the items with their source and primary loadings.

The factor analysis only failed to distinguish clearly between the gentle and flexible facets of HEXACO-agreeableness. Thus, HEXACO agreeableness is represented by three rather than four factors. More important is that the four Honesty-Humility facets of the HEXACO model, namely Greed-Avoidance (F11), Sincerity (F9), Modest (F7), and Morality (F5) were clearly identified. Thus, it is possible to examine the relationship of Honesty-Humility to Big-Five agreeableness with these data. Importantly, CFA is not affected by the number of indicators. Three items with good primary loadings are sufficient to identify a factor.

Table 3 shows the pattern of correlations among the 9 factors. The aim of a structural model of agreeableness is to explain this pattern of correlations. However, visual inspection of these correlations alone can provide some valuable insights about the structure of agreeableness. Strong evidence for a two-factor model would be provided by high correlations among the Honesty-Humility facets, high correlations among the Agreeableness facets, and low correlations between Honesty-Humility and Agreeableness facets (Campbell & Fiske, 1959).

The pattern of correlations is only partially consistent with the two-factor structure of the HEXACO model. All of the correlations among the Honesty-Humility facets and all of the correlations among the Agreeableness facets are above .38. However, 6 out of the 20 cross-trait correlations are also above .38. Moreover, all of the correlations are positive, suggesting that Honesty-Humility and Agreeableness are not independent.

For the sake of comparability, I also computed scales corresponding to the nine factors. Table 3 shows the correlations for scales that could be used for a two-step hierarchical analysis by means of EFA. In general correlations in Table 3 are weaker than correlations in Table 2 because correlations among scale scores are attenuated by random measurement error. The pattern of correlations remains the same, but there are more cross-trait correlations that exceed the lowest same-trait correlation of .31.
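The attenuation mentioned above can be illustrated with Spearman's classic formula, in which the observed correlation between two scale scores equals the correlation between the underlying factors multiplied by the square root of the product of the two reliabilities. The following is a minimal sketch; the factor correlation and the reliabilities are hypothetical values chosen for illustration and are not taken from the tables.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Spearman's attenuation formula: observed r = true r * sqrt(rel_x * rel_y)."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Hypothetical inputs (assumptions for illustration, not values from the post).
factor_r = 0.50        # correlation between two latent factors
rel_a = 0.70           # reliability of the first scale score
rel_b = 0.70           # reliability of the second scale score

scale_r = attenuated_correlation(factor_r, rel_a, rel_b)
print(f"factor correlation = {factor_r:.2f}, scale correlation = {scale_r:.2f}")
```

With these assumed values, a factor correlation of .50 shrinks to a scale correlation of .35, which is why scale-level correlations are expected to be weaker than factor-level correlations.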

The Structure of Agreeableness

A traditional EFA has severe limitations in examining the structure of correlations in Table 3. One major limitation is that structural relations have to be due to higher-order factors. The other limitation is that 9 variables can only identify a small number of factors. These problems are typically overlooked because traditionally EFA results are not examined for goodness of fit. However, an EFA analysis with fit indices in MPLUS shows that the two-factor model does not meet the standard fit criteria of .95 for CFI and .06 for RMSEA, CFI = .939, RMSEA = .092. The two factors clearly did correspond to the HEXACO Honesty-Humility and Agreeableness factors, but even with secondary loadings, the model fails to fully account for the pattern of correlations. Moreover, the two factors were correlated r = .41, suggesting that they are not independent of each other.

Figure 1 shows a CFA model that is based on the factor correlations in Table 2. This model does not fit the data as well, CFI = .950, RMSEA = .031, as the simple measurement model with correlated factors, CFI = .954, RMSEA = .030, but I was unable to find a plausible model with better fit. I encourage others to do so. On the other hand, the model fit the data much better than the two-factor EFA model, CFI = .939, RMSEA = .092.

The model does show the expected separation of Honesty-Humility and Agreeableness facets, but the structure is more complex. First, morality has nearly equal loadings on the Honesty-Humility and the Agreeableness factors. The markers of Honesty-Humility are the other three facets, modest, manipulative (reversed), and materialistic (reversed). I suggest that the common element of these facets is self-enhancement. The loading of morality suggests that individuals who are highly motivated to self-enhance are more likely to engage in immoral behaviors to do so.

Four of the remaining factors have direct loadings on the agreeableness factor. Considerate (caring/altruistic) has the highest loading, but all four have substantial loadings. The aggressive factor has the weakest relationship with agreeableness. One reason is that it is also related to neuroticism in the Big Five model. In this model, aggressiveness is linked to agreeableness indirectly by consideration and self-enhancement, suggesting that individuals who do not care about others (low consideration) and who care about themselves (self-enhancement) are more likely to aggress against others.

In addition, there were several correlated residuals between some facets. Trusting and forgiving shared unique variance. Maybe forgiveness is more likely to occur when individuals trust that other people had good intentions and are not going to repeat their transgression in the future. Aggression showed shared variance with modesty and morality. One explanation could be that modesty is related to low assertiveness and that assertive people are more likely to use aggression. Morality may relate to aggression because it is typically considered immoral to harm others. Morality was also related to manipulativeness. Here the connection is rather obvious because manipulating people is immoral.

The model in Figure 1 should not be considered the ultimate solution to the controversy about the structure of pro-social and anti-social behaviors. To the contrary. The model should be considered the first structural model that actually fits the data. In contrast, previous results based on EFA produced models that approximated the structure, but never fit the actual data. Future research should test alternative models and these models should be evaluated in terms of model fit and theoretical plausibility (Joreskog, 1969).

Do these results answer the question whether there are five or six higher-order factors of personality? The answer is no. Consistent with the Five Factor model, the Honesty-Humility or Self-Enhancement factor is not independent of agreeableness. It is therefore reasonable to think about Honesty-Humility as subordinate to Agreeableness in a hierarchical model of traits. Personally, I favor this interpretation of the results. However, proponents of the HEXACO model may argue that the correlation between the agreeableness factor and the honesty-humility factor is low enough to make honesty-humility a separate factor. Moreover, it was not possible to control for evaluative bias (halo) variance in this model, and halo bias may have inflated the correlation between the two factors. On the other hand, if correlations of .4 are considered low enough to split Big-Five factors, it is possible that, on closer inspection, other Big Five domains can also be split into distinct, yet positively correlated factors. The main appeal of the Big Five model is that the five factors are fairly independent after controlling for evaluative bias variance. Moreover, many facets have loadings of .4 or even lower on the Big Five factor. It is therefore noteworthy that all correlations among the 9 factors were positive and suggest that a general factor produces covariation among them. The common factor can also be clearly interpreted in terms of the focus on self-interest versus others’ interests or needs to guide behaviors. Individuals high in agreeableness take others’ needs and feelings into account, whereas those low in agreeableness are guided strongly by self-interest. The split into two factors may be due to the fact that importance of self and other are not always in conflict with each other. Especially individuals low in self-enhancement may still differ in terms of their pro-social behaviours.


The main contribution of this blog post is to show the importance of testing model fit in investigations of the structure of personality traits. While it may seem self-evident that a theoretical model should fit the data, personality psychologists have failed to test model fit or ignored feedback that their models do not fit the data. Not surprisingly, personality psychologists continue to argue over models because it is easy to propose models, if they do not have to fit actual data. If structural research wants to be an empirical science, it has to subject models to empirical tests that can falsify models that do not fit the data.

I empirically showed that a simple two-factor model does not fit the data. At the same time, I showed that a model with a general agreeableness factor and several independent facets also does not fit the data. Thus, neither of the dominant models fits the data. At the same time, the data are consistent with the idea of a general factor underlying pro-social and anti-social behaviors, while the relationship among facets remains to be explored in more detail. Future research needs to control for evaluative bias variance and examine how the structure of agreeableness is embedded in a larger structural model of personality.


Ashton, M. C., Lee, K., Perugini, M., Szarota, P., de Vries, R. E., Di Blas, et al. (2004). A six-factor structure of personality-descriptive adjectives: Solutions from psycholexical studies in seven languages. Journal of Personality and Social Psychology, 86, 356–366.

Lee, K., & Ashton, M. C. (2004). Psychometric properties of the HEXACO Personality Inventory. Multivariate Behavioral Research, 39, 329–358.

US police are 12 times more likely to draw a gun in encounters with unarmed Black versus White civilians

It is a well-known fact among criminologists and other social scientists that Black US citizens are killed by police in disproportionate numbers. That is, relative to the percentage in the US population, Black civilians are killed 2 to 3 times more often than White civilians. This is about the only solid fact in the social sciences on racial disparities in lethal use of force.

Researchers vehemently disagree about the causes of this disparity. Some suggest that it is at least partially caused by racial biases in policing and the decision to use lethal force. Others argue that it is explained by the fact that police are more likely to use lethal force with violent criminals and that Black citizens are more likely to be violent criminals. Some of this disagreement can be explained by different ways to look at the same statistics and confusion about the meaning of the results.

A study by Wheeler, Philips, Worrall, and Bishopp (2017) illustrates the problem of poor communication of results. The authors report the results of an important study that provides much needed information about the frequency of use of force. The researchers had access to nearly 2,000 incidents where an officer from the Dallas Police Department drew a weapon. In about 10% of these incidents, officers fired at least one shot (207 out of 1909). They also had information about the ethnicity of the civilian involved. Their abstract states a clear conclusion. “African Americans are less likely than Whites to be shot” (p. 49). The discussion section elaborates on this main finding. “Contrary to the national implicit bias narrative, our analysis found that African Americans were less likely to be shot than White subjects” (p. 65).

The next paragraph highlights that the issue is more complex.

“It cannot be overemphasized that the addition of don’t shoot control cases to police shooting cases dramatically alters the findings. With a simple census comparison (see Results and Discussion and Conclusion), African Americans were overrepresented in the shootings compared to Whites and Latinos. Similarly, when only examining shooting incidents (see first column of Table 4 and accompanying narrative), of those shot, African Americans had a higher probability of being unarmed compared to White suspects. However, by incorporating control cases in which officers did not shoot, we reached completely opposite inferences, namely, that African Americans have a lower probability of being shot relative to Whites.”

This paragraph is followed by a reaffirmation that “neither analysis hints at racial bias against African Americans” (p. 66).

The authors then point out that their conclusion in the abstract is limited to a very narrow definition of racial bias.

“As previously mentioned, an important limitation of the study is the fact that such an analysis is only relevant to officer decision-making after they have drawn their firearm” (p. 66).

This restrictive definition of racial bias explains why the authors’ main conclusion results in a paradox. On the one hand, there is strong and clear evidence that more Black US citizens die at the hand of police than White US citizens. On the other hand, the authors claim that there is no racial bias against Black US citizens in the decision to shoot. This leaves open the question of how there can be racial disparity in deaths without racial bias in shots fired. The answer is simple. Officers are much more likely to draw a weapon in encounters with Black civilians. This information is provided in Table 2.

Officers drew a weapon in 1082 (57%) encounters with Black civilians compared to 273 (14%) encounters with White civilians, a disparity of 4:1. The abstract ignores this fact and focuses on the conditional probability that shots are fired when a gun is drawn (9% vs. 12%), a disparity of 1:1.3. Given the much larger racial disparity in decisions to draw a gun versus to shoot when a gun is drawn, Black civilians are actually shot disproportionally more than White civilians (100 vs. 34) at a ratio of 3:1. This is consistent with national statistics that show a 2-3:1 racial disparity in lethal use of force.
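The arithmetic behind these ratios can be checked with a few lines of code. The sketch below uses only the counts reported above; the variable names are illustrative. It also shows that the disparity in shootings decomposes into the disparity in drawing a gun multiplied by the ratio of the conditional probabilities of shooting once a gun is drawn.

```python
# Counts reported in the post for the Dallas data (Wheeler et al., 2017).
draws = {"Black": 1082, "White": 273}   # incidents in which a weapon was drawn
shots = {"Black": 100, "White": 34}     # incidents in which shots were fired

# Disparity in the decision to draw a gun (about 4:1).
draw_ratio = draws["Black"] / draws["White"]

# Conditional probability of shooting once a gun is drawn (about 9% vs. 12%).
p_shoot = {group: shots[group] / draws[group] for group in draws}
conditional_ratio = p_shoot["Black"] / p_shoot["White"]

# Disparity in shots fired (about 3:1) = draw disparity * conditional ratio.
shot_ratio = shots["Black"] / shots["White"]

print(f"draw disparity:        {draw_ratio:.2f} : 1")
print(f"P(shoot | gun drawn):  Black {p_shoot['Black']:.1%}, White {p_shoot['White']:.1%}")
print(f"shooting disparity:    {shot_ratio:.2f} : 1")
print(f"decomposition check:   {draw_ratio * conditional_ratio:.2f} : 1")
```

The decomposition makes the point of the paragraph explicit: a conditional ratio slightly below 1 cannot offset a draw disparity of about 4:1.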

It is absolutely misleading to conclude from these data that there is no racial bias in policing or the use of force and to suggest that these results are inconsistent with the idea that racial biases in policing lead to a disproportionate number of unarmed Black civilians being killed by police.

Even more relevant information is contained in Table 4, which shows incidents in which the civilian was unarmed.

Officers drew their weapons on unarmed Black civilians in 239 incidents compared to 38 incidents for unarmed White civilians, a disparity of 6:1. As a result, unarmed Black civilians are also much more likely to be shot by police than unarmed White civilians (22 vs. 5), a 4:1 ratio.

To fully understand the extent of racial disparities in the drawing of a weapon and shots being fired, it is necessary to take the ethnic composition of Dallas into account. Wikipedia suggests that the ratio is 2:1 for Whites (50% White, 25% Black). Thus, the racial disparity in police officers drawing a gun on an unarmed civilian is 12:1 and the racial disparity of shooting an unarmed civilian is 8:1.
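The population adjustment can be sketched as follows. The counts and the rough population shares are the ones given in the text; the function name is hypothetical. Computing the ratios without rounding the intermediate step gives slightly larger values than the rounded 12:1 and 8:1 figures.

```python
# Counts for incidents with unarmed civilians, as reported in the post.
unarmed_draws = {"Black": 239, "White": 38}
unarmed_shots = {"Black": 22, "White": 5}

# Rough population shares for Dallas cited in the post (50% White, 25% Black).
population_share = {"Black": 0.25, "White": 0.50}

def per_capita_ratio(counts):
    """Black:White ratio of incident counts after dividing by population share."""
    black_rate = counts["Black"] / population_share["Black"]
    white_rate = counts["White"] / population_share["White"]
    return black_rate / white_rate

print(f"gun drawn on unarmed civilian: {per_capita_ratio(unarmed_draws):.1f} : 1")
print(f"unarmed civilian shot:         {per_capita_ratio(unarmed_shots):.1f} : 1")
```

The unrounded values are about 12.6:1 and 8.8:1, so the figures in the text are, if anything, conservative.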

In conclusion, Wheeler et al.’s analyses suggest that racial bias in decisions to shoot when a gun is drawn is unlikely to explain racial disparities in lethal use of force. However, their data also suggest that racial biases in the decision to draw a gun on Black civilians may very well contribute to the disproportionate killing of unarmed civilians. The authors’ racial bias is revealed by their emphasis on the decision to shoot after drawing a gun, while ignoring the large racial disparity in the decision to draw a gun in the first place.

Political bias in the social sciences is a major problem. In an increasingly polarized political world, especially in the United States, scientists should try to unite a country by creating a body of solid empirical facts that only the fringe extremists and the willfully ignorant continue to ignore. Abuse of science to produce misleading false claims only fuels the division and gives extremists false facts to cement their ideology. It is time to take a look at systemic racism in criminology to ensure credibility, especially with Black civilians. Distrust in institutions like the police or science will only fuel the division and endanger lives of all colors. It is therefore extremely unfortunate that Wheeler et al. explicitly use their article to discredit valid concerns by the BlackLivesMatter movement about racial disparities in policing.

The tragic and avoidable death of Atatiana Jefferson in neighboring Fort Worth is only one example that shows how Wheeler et al.’s conclusions disregard evidence of racial bias in lethal use of force. A young, poorly trained officer drew a gun with deadly consequences. Called for a wellness-check (!!!) in a Black neighborhood, the White officer entered the dark backyard unannounced. The female victim heard some noise in the backyard, got her legal gun, and went to the window to examine the situation. Spooked, the police officer fired at the window and killed the homeowner. In Wheeler’s statistics, this incident would be coded as the decision to shoot after drawing a gun with an armed Black civilian. The real question is what he was thinking to search a dark backyard with his gun drawn.


These all-too-common incidents are not only tragic for the victims and their relatives. They are also likely to have dramatic consequences for police officers. In this case, the officer was indicted for murder.


The goal of social science should be to analyze the causes of deadly encounters between police and civilians with officers or civilians as victims to create interventions that reduce the 1000 deaths a year in these encounters. Wheeler et al.’s (2017) data and tables provide a valuable piece of information. Their conclusions do not. Future research should focus on factors that determine the drawing of a weapon, especially when civilians are unarmed. All too often, these incidents end with a dead Black body on the pavement.

Kahneman talks to Mischel about Traits and Self-Control

I found this video on YouTube (Christan G.) with little information about the source of the discussion. I think it is a valuable historic document and I am reposting it here because I am afraid that it may be deleted from YouTube and be lost.


Kahneman “We are all Mischelians.”

Kahneman “You [Mischel] showed convincingly that traits do not exist but you also provided the most convincing evidence for stable traits [when children who delay eating a marshmallow become good students who do not drink and smoke].”

Here is Mischel’s answer to a question I always wanted him to answer. In short, self-control is not a trait. It is a skill.

The Dunning-Kruger Effect Explained

“These responses to our work have also furnished us moments of delicious irony, in that each critique makes the basic claim that our account of the data displays an incompetence that we somehow were ignorant of.” (Dunning, 2011, p. 247).

In 1999, Kruger and Dunning published an influential article. With 2258 citations in WebofScience it ranks #28 in citations for articles in the Journal of Personality and Social Psychology. The main contributions of the article were (a) to demonstrate that overestimation of performance is not equally distributed across different levels of performance, and (b) to provide a theory that explains why low-performers are especially prone to overestimate their performance. The finding that low-performers overestimate their performance, while high-performers are more accurate or even underestimate their performance has been dubbed the Dunning-Kruger effect (DKE). It is one of the few effects in social psychology that is named in honor of the researchers who discovered it.

The effect is robust and has been replicated in hundreds of studies (Khalid, 2016; Pennycook et al., 2017). Interestingly, it can even be observed with judgments about physical attributes like attractiveness (Greitemeyer, 2020).

While there is general consensus that the DKE is a robust phenomenon, researchers disagree about the explanation for the DKE. Kruger and Dunning (1999) proposed a meta-cognitive theory. Accordingly, individuals have no introspective access to their shortcomings. For example, a student who picked one answer from a set of options in a multiple-choice test thinks that they picked the most reasonable option. After all, they would have picked a different option if they had considered another option as more reasonable. Students only become aware that they picked the wrong option when they are given feedback about their performance. As a result, they are overly confident that they picked the right answer before they are given feedback. This mistake will occur more frequently for low-performers than for high performers. It cannot occur for students who ace their exam (i.e., get all answers correct). The only mistake top-performers could make is to doubt their right answers and underestimate their performance. Thus, lack of insight into mistakes coupled with a high frequency of mistakes leads to higher overconfidence among low-performers.

In contrast, critics of the meta-cognitive theory have argued that the DKE is a statistical necessity. As long as individuals are not perfectly aware of their actual performance, low-performers are bound to overestimate their performance and high-performers are bound to underestimate their performance (Ackermann et al., 2002; Gignac & Zajenkowski, 2020; Krueger & Mueller, 2002). This account has been called regression to the mean. Unfortunately, this label has produced a lot of confusion because the statistical phenomenon of regression to the mean is poorly understood by many psychologists.
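The statistical-necessity argument can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a reanalysis of any published data: performance and self-estimates are simulated as standardized scores with a chosen correlation r, there is no general better-than-average effect, and the function name and all parameter values are hypothetical.

```python
import random
import statistics

def quartile_gaps(r, n=20000, seed=1):
    """Mean (estimate - performance) in the bottom and top performance quartiles.

    Performance x and self-estimate y are simulated as standard normal scores
    that correlate r. All settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)                               # actual performance
        y = r * x + (1 - r**2) ** 0.5 * rng.gauss(0, 1)   # self-estimate
        pairs.append((x, y))
    pairs.sort()                                          # order by performance
    q = n // 4
    bottom = statistics.mean(y - x for x, y in pairs[:q])
    top = statistics.mean(y - x for x, y in pairs[-q:])
    return bottom, top

for r in (0.3, 0.6, 1.0):
    bottom, top = quartile_gaps(r)
    print(f"r = {r:.1f}: bottom quartile gap = {bottom:+.2f}, top quartile gap = {top:+.2f}")
```

In this setup the bottom quartile overestimates and the top quartile underestimates whenever r is below 1, and both gaps vanish when self-estimates track performance perfectly.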

Misunderstanding of Regression to the Mean

Wikipedia explains that “in statistics, regression toward the mean (or regression to the mean) is the phenomenon that arises if a sample point of a random variable is extreme (nearly an outlier), a future point will be closer to the mean or average on further measurements.”

Wikipedia also provides a familiar example.

Consider a simple example: a class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student’s score would be a realization of one of a set of independent and identically distributed random variables, with an expected mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one selects only the top scoring 10% of the students and gives them a second test on which they again choose randomly on all items, the mean score would again be expected to be close to 50. Thus the mean of these students would “regress” all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of their score on the second test is 50.

This is probably the context that most psychologists have in mind when they think about regression to the mean. The same measurement procedure is repeated twice. In this scenario, students who performed lower the first time are likely to increase their performance the second time and students who performed well the first time are bound to decrease in their performance the second time. How much students regress towards the mean depends on the influence of their actual abilities on performance on the two tests. The more strongly the two tests are correlated, the less regression to the mean occurs. In the extreme case, where performance is fully determined by ability, the retest correlation is 1 and there is no regression to the mean because there are no residuals (i.e., deviations of individuals between their two performances).
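The repeated-measurement scenario can be sketched in a short simulation. The assumptions are loud ones: each test score is simulated as ability plus independent random error, the top 10% are selected on the first test, and the function name and all numbers are hypothetical choices for illustration.

```python
import random
import statistics

def top_group_means(error_sd, n=20000, seed=1):
    """Mean score of the top 10% on test 1, on test 1 and again on test 2.

    Each score is ability plus independent error with standard deviation
    error_sd. All settings are illustrative assumptions.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    test1 = [a + rng.gauss(0, error_sd) for a in ability]
    test2 = [a + rng.gauss(0, error_sd) for a in ability]
    cutoff = sorted(test1)[int(0.9 * n)]                  # 90th percentile of test 1
    top = [i for i in range(n) if test1[i] >= cutoff]
    return (statistics.mean(test1[i] for i in top),
            statistics.mean(test2[i] for i in top))

for error_sd in (0.0, 0.5, 1.0):
    first, second = top_group_means(error_sd)
    print(f"error sd = {error_sd:.1f}: test 1 mean = {first:.2f}, test 2 mean = {second:.2f}")
```

With no error (a retest correlation of 1) the selected group scores exactly the same on both tests; the more error, the lower the retest correlation and the further the group falls back toward the overall mean of zero on the second test.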

The focus on the specific example of repeated measurements created a lot of confusion in the DKE literature. It probably started with Krueger and Mueller’s critique. First, they emphasize statistical regression and even provide a formula that shows a deterministic relationship between a predictor variable x and the discrepancies between the predictor variable and criterion, r(x,x-y) that is bound to be negative. It follows that low performers are bound to have larger positive deviations. However, they then proceed to discuss reliability of the performance measures.

Thus far, we have assumed that actual percentiles are perfectly reliable measures of ability. As in any psychometric test, however, the present test scores involved both true variance and error variance (Feldt & Brennan, 1989). With repeated testing, high and low test scores regress toward the group average, and the magnitude of these regression effects is proportional to the size of the error variance and the extremity of the initial score (Campbell & Kenny, 1999). In the Kruger and Dunning (1999) paradigm, unreliable actual percentiles mean that the poorest performers are not as deficient as they seem and that the highest performers are not as able as they seem.

This passage implies that regression to the mean plays a different role in the DKE. Performance on any particular test is not only a function of ability, but also (random) situational factors. This means that performance scores are biased estimates of ability. Low performers’ scores are more likely to be biased in a negative direction than high performers. If performance judgments are based on self-knowledge of ability, the comparison of judgments with performance scores is biased and may show an illusory DKE. To address this problem, Krueger and Mueller propose to estimate the reliability of test scores and to correct for the bias introduced by unreliability.

In the following 18 years, it has been neglected that Krueger and Mueller made two independent arguments against the meta-cognitive theory. One minor problem is unreliability in the performance measure as a measure of ability. The major problem is that the DKE is a statistical necessity that applies to difference scores.

Unreliability in the Performance Measure Does not Explain the DKE

Kruger and Dunning (2002) responded to Krueger and Mueller’s (2002) critique. Their response focused exclusively on the problem of unreliable performance measures.

“They found that correcting for test unreliability reduces or eliminates the apparent asymmetry in calibration between top and bottom performers” (p. 189).

They then conflate statistical regression and unreliability when they ask “Does regression explain the results?”

“The central point of Krueger and Mueller’s (2002) critique is that a regression artifact, coupled with a general BTA effect, can explain the results of Kruger and Dunning (1999). As they noted, all psychometric tests involve error variance, thus “with repeated testing, high and low test scores regress toward the group average, and the magnitude of these regression effects is proportional to the size of the error variance and the extremity of the initial score” (Krueger & Mueller, 2002, p. 184). They go on to point out that “in the Kruger and Dunning (1999) paradigm, unreliable actual percentiles mean that the poorest performers are not as deficient as they seem and that the highest performers are not as able as they seem” (p. 184). Although we agree that test unreliability can contribute to the apparent miscalibration of top and bottom performers, it cannot fully explain this miscalibration” (p. 189)

This argument has convinced many researchers in this area that the key problem is unreliability in the performance measure and that this issue can be addressed empirically by controlling for unreliability. Doing so typically does not remove the DKE (Ehrlinger, Johnson, Banner, Dunning, & Kruger, 2008).
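A short calculation suggests why correcting for unreliability leaves most of the pattern in place. The sketch below assumes that self-estimates and test scores are standardized and bivariate normal. The observed correlation of .45 and the test reliability of .80 are hypothetical values, not estimates from any of the cited studies, and `bottom_quartile_gap` is a helper name introduced here.

```r
# Expected overestimation in the bottom quartile (z-score units) when
# self-estimates and test scores are bivariate normal with correlation r
bottom_quartile_gap = function(r) {
  mean_obj = -dnorm(qnorm(.25)) / .25   # mean test score of the bottom 25%
  mean_sub = r * mean_obj               # expected self-estimate regresses toward 0
  mean_sub - mean_obj                   # positive values = overestimation
}
r_observed = .45                            # hypothetical observed correlation
rel_test = .80                              # hypothetical reliability of the test
r_corrected = r_observed / sqrt(rel_test)   # Spearman's correction for attenuation
gaps = c(observed  = bottom_quartile_gap(r_observed),
         corrected = bottom_quartile_gap(r_corrected))
print(round(gaps, 2))
```

With these assumed values the expected overestimation shrinks only modestly, from roughly 0.70 to 0.63 standard deviations. The gap disappears only when the corrected correlation reaches 1.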

The problem is that unreliability in the performance measure is not the major concern. It is not even clear how it applies when participants are asked to estimate their performance on a specific test. A test is an unreliable measure of an unobserved construct like ability, but a student who got 60% of multiple choice questions correct got 60% of the questions correct. There is no unreliability in manifest scores.

The confusion between these two issues has led to the false impression that the regression explanation has been examined and empirically falsified as a sufficient explanation of the DKE. For example, in a review paper Dunning (2011) wrote:

Fortunately, there are ways to estimate the degree of measurement unreliability and then correct for it. One can then assess what the relation is between perception and reality once unreliability in measuring actual performance has been eliminated. See Fig. 5.3, which displays students’ estimates of exam performance, in both percentile and raw terms, for a different college class (Ehrlinger et al., 2008, Study 1). As can be seen in the figure, correcting for measurement unreliability has only a negligible impact on the degree to which bottom performers overestimate their performance (see also Kruger & Dunning, 2002). The phenomenon remains largely intact. (p. 266).

The Dunning-Kruger Effect is a Statistical Necessity

Another article in 2002 also made the point that DKE is a statistical necessity, although the authors called it an artifact (Ackerman et al., 2002). The authors made their point with a simple simulation.

To understand whether this effect could be accounted for by regression to the mean, we simulated this analysis using two random variables (one representing objective knowledge and the other self-reported knowledge) and 500 observations (representing an N of 500). As in the Kruger and Dunning (1999) comparison, these random variables were correlated r=0.19. The observations were then divided into quartiles based on the simulated scores for the objective knowledge variable (n=125 observations per quartile). Simulated self-report and objective knowledge were then compared by quartile. As can be seen in Fig. 1, the plotting of simulated data for 500 subjects resulted in exactly the same phenomenon reported by Kruger and Dunning (1999)—an overestimation for those in the lowest quartile and an underestimation for those in the top quartile. Further analysis comparing the means of self-report and objective knowledge for each quartile revealed that the difference between the simulated self-reported (M=-0.21) and objective (M=-1.22) scores for the bottom quartile was significant t (124)= -10.09, P<0.001 (which would be ‘‘interpreted’’ as overestimation of performance). The difference between simulated self-reported (M=0.27) and objective (M=1.36) scores for the top quartile was also significant, t(124)=11.09, P<0.001, (‘‘interpreted’’ as underestimation by top performers). This illustration demonstrates the measurement problems associated with interpreting statistical significance when two variables are compared across groups selected for performance on one of the variables, and there is a low correlation between the two variables.

Unaware of Ackerman et al.’s (2002) article, Gignac & Zajenkowski (2020) used simulations to make the same point.

Here is an R-Script to perform the same simulation.

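# Simulate two standardized variables that correlate at r = accuracy
N = 500
accuracy = .6
obj = rnorm(N)
sub = obj*accuracy + rnorm(N)*sqrt(1-accuracy^2)
# Scatterplot with the quartile boundaries of the objective scores
plot(obj, sub)
abline(h = 0)
quarts = quantile(obj,c(0,.25,.5,.75,1))
abline(v = quarts,lty=2)
# Mean objective (x) and subjective (y) scores within each quartile
x = tapply(obj,cut(obj,quarts,include.lowest=TRUE),mean)
y = tapply(sub,cut(obj,quarts,include.lowest=TRUE),mean)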
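The setup described by Ackerman et al. (2002), with r = .19, N = 500, a quartile split, and t-tests within the extreme quartiles, can be sketched in the same way. This is an approximation under stated assumptions and not their original code: the seed is arbitrary, and a paired t-test is assumed because both scores come from the same simulated participants.

```r
# Sketch of the Ackerman et al. (2002) setup: two random variables, r = .19, N = 500
set.seed(2002)                          # arbitrary seed for reproducibility
N = 500
r = .19
obj = rnorm(N)                          # simulated objective knowledge
sub = obj*r + rnorm(N)*sqrt(1-r^2)      # simulated self-reported knowledge
quartile = cut(obj, quantile(obj, c(0,.25,.5,.75,1)),
               include.lowest = TRUE, labels = 1:4)
# Mean self-report minus mean objective score per quartile
diff_by_quartile = tapply(sub - obj, quartile, mean)
print(round(diff_by_quartile, 2))
# Paired t-tests in the bottom and top quartile
t_bottom = t.test(sub[quartile == "1"], obj[quartile == "1"], paired = TRUE)
t_top = t.test(sub[quartile == "4"], obj[quartile == "4"], paired = TRUE)
print(c(bottom = t_bottom$p.value, top = t_top$p.value))
```

The bottom quartile shows apparent overestimation and the top quartile apparent underestimation, both highly significant, although the data contain nothing but two weakly correlated random variables.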

It is reassuring that empirical studies mostly found support for a pattern that is predicted by a purely mathematical relationship. However, it is not clear that we need a term for it and naming it the Dunning-Kruger effect is misleading because Kruger and Dunning provided a psychological explanation for this statistically determined pattern.

Does the Simulation Provide Ironic Support for the Dunning-Kruger Effect?

Dunning (2011) observed that any valid criticism of the DKE would provide ironic support for the DKE. After all, the authors confidently proposed a false theory of the effect in full ignorance of their incompetence to realize that their graphs reveal a statistical relationship between any two variables rather than a profound insight into humans’ limited self-awareness.

I disagree. The difference is that students who have just taken an exam and have not yet received their results have no feedback or other valid information that might help them to make more accurate judgments about their performance. It is a rather different situation when other researchers propose alternative explanations and these explanations are ignored. This is akin to students who come to complain about ambiguous exam questions that other students answered correctly in large numbers. Resistance to valid feedback is not the DKE.

As noted above, Kruger and Dunning (2002) responded to Krueger and Mueller’s criticism and it is possible that they misunderstood Krueger and Mueller’s critique because it did not clearly distinguish between the statistical regression explanation and the unreliability explanation for the effect. However, in 2015 Dunning does cite Ackerman et al.’s article, but claims that the regression explanation has been addressed by controlling for unreliability.

To be sure, these findings and our analysis of them are not without critics. Other researchers have asserted that the Dunning-Kruger pattern of self-error is mere statistical artifact. For example, some researchers have argued that the pattern is simply a regression-to-the-mean effect (Ackerman, Beier, & Bowen, 2002; Burson, Larrick, & Klayman, 2006; Krueger & Mueller, 2002). Simply because of measurement error, perceptions of performance will fail to correlate perfectly with actual performance. This dissociation due to measurement error will cause poor performers to overestimate their performance and top performers to underestimate theirs, the pattern found, for example, in Fig. 1. In response, we have conducted studies in which we estimate and correct for measurement error, asking what the perception/reality link would look like if we had perfectly reliable instruments assessing performance and perception. We find that such a procedure reduces our pattern of self-judgment errors only trivially (Ehrlinger et al., 2008; Kruger & Dunning, 2002). (p. 157)

Either Dunning cited, but did not read Ackerman et al.’s article, or he was unable to realize that statistical regression and unreliable measures are two distinct explanations for the DKE.

Does it Matter?

In 2011, Dunning alludes to the fact that there are two distinct regression effects that may explain the DKE.

There are actually two different versions of this “regression effect” account of our data. Some scholars observe that Fig. 5.2 looks like a regression effect, and then claim that this constitutes a complete explanation for the Dunning–Kruger phenomenon. What these critics miss, however, is that just dismissing the Dunning–Kruger effect as a regression effect is not so much explaining the phenomenon as it is merely relabeling it. What one has to do is to go further to elucidate why perception and reality of performance are associated so imperfectly. Why is the relation so regressive? What drives such a disconnect for top and bottom performers between what they think they have achieved and what they actually have? (p. 266)

Here Dunning seems to be aware that unreliability in the performance measure is not necessary for regression to the mean. His response to this criticism is less than satisfactory. The main point of the regression to the mean model is that low-performers are bound to overestimate their performance because they are low performers. No additional explanation is needed other than uncertainty about one’s actual performance. Most important, the regression model assumes that low-performers and high-performers are no different in their meta-cognitive abilities to guess their actual performance. The DKE emerges even if errors are simulated as random noise.
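The claim that the pattern emerges without any group difference in insight can be checked directly. In the sketch below, every simulated person's estimate contains random error drawn from the same distribution. The correlation of .5 and the sample size are arbitrary assumptions, and the variable names are introduced here for illustration.

```r
# Same error distribution for everyone: no group differs in "insight"
set.seed(3)
N = 100000
r = .5                                  # assumed perception-reality correlation
obj = rnorm(N)                          # actual performance (z-scores)
noise = rnorm(N)*sqrt(1-r^2)            # random error, identical spread for all
sub = obj*r + noise                     # estimated performance
quartile = cut(obj, quantile(obj, c(0,.25,.5,.75,1)),
               include.lowest = TRUE, labels = 1:4)
# The spread of the random error is the same in every quartile ...
noise_sd = tapply(noise, quartile, sd)
# ... yet mean over- and underestimation still follow the DKE pattern
miscalibration = tapply(sub - obj, quartile, mean)
print(round(rbind(noise_sd, miscalibration), 2))
```

The random error has virtually the same standard deviation in all four quartiles, while mean miscalibration runs from clear overestimation in the bottom quartile to clear underestimation in the top quartile.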

In contrast, Kruger and Dunning’s main claim is that low-performers suffer from two shortcomings.

My colleagues and I have laid blame for this lack of self-insight among poor performers on a double-curse—their deficits in expertise cause them not only to make errors but also leave them unable to recognize the flaws in their reasoning. (Dunning, 2011, p. 265).

This review of the main arguments in this debate shows that the key criticism of Kruger and Dunning’s account of their findings has never been seriously addressed. As a result, hundreds of studies have been published as empirical support for an effect that follows from a statistical relationship between two imperfectly correlated variables.

This does not mean that the regression model implies that limited self-awareness is not a problem. The model still implies that low performers are bound to overestimate their performance and high performers are bound to underestimate their performance. The discrepancies between actual and estimated performance are real. They are just not necessarily due to differences in insight into one’s abilities. Although such differences may exist, it is difficult to test the influence of additional factors because regression to the mean alone will always produce the predicted pattern.

It is disconcerting that researchers have spent 20 years studying a statistical phenomenon as if it provides insights into humans’ ability to know themselves. The real question is not why low-performers overestimate their performance more than others. This has to be the case. The real question is why individuals often try to avoid feedback that provides them with more accurate knowledge of themselves. Of course, this question has been addressed in other lines of research on self-verification and positive illusions that rarely connect with the Dunning-Kruger literature. The reason may be that research on these topics is much more difficult and produces more inconsistent results than plotting aggregated difference scores for two variables.

Psychologists are not immune to the Dunning-Kruger Effect


Bar-Anan and Vianello (2018) published a structural equation model in support of a dual-attitude model that postulates explicit and implicit attitudes towards racial groups, political parties, and the self. I used their data to argue against a dual-attitude model. Vianello and Bar-Anan (2020) wrote a commentary that challenged my conclusions. I was a reviewer of their commentary and pointed out several problems with their new model (Schimmack, 2020). They did not respond to my review and their commentary was published without changes. I wrote a reply to their commentary. In the reply, I merely pointed to my criticism of their new model. Vianello and Bar-Anan wrote a review of my reply, in which they continue to claim that my model is wrong. I invited them to discuss the differences between our models, but they declined. In this blog post, I show that Vianello and Bar-Anan lack insight into the shortcomings of their model, which is consistent with the Dunning-Kruger effect that incompetent individuals lack insight into their own incompetence. On top of this, Vianello and Bar-Anan show willful ignorance by resisting arguments that undermine their motivated belief in dual-attitude models. As I show below, Vianello and Bar-Anan’s model has several unexplained results (e.g., negative loadings on method factors), fits worse than my model, and produces false evidence of incremental predictive validity for the implicit attitude factors.


The skill set of psychology researchers is fairly limited. In some areas expertise is needed to create creative experimental setups. In other areas, some expertise in the use of measurement instruments (e.g., EEG) is required. However, for the most part, once data are collected, little expertise is needed. Data are analyzed with simple statistical tools like t-tests, ANOVAs, or multiple regression. These statistical methods are implemented in simple commands and no expertise is required to obtain results from statistics programs like SPSS or R.

Structural equation modeling is different because researchers have to specify a model that is fitted to the data. With complex data sets, the number of possible models that can be specified increases exponentially and it is not possible to specify all models and to simply pick the model with the best fit. Moreover, there will be many models with similar fit and it requires expertise to pick plausible models. Unfortunately, psychologists receive little formal training in structural equation modeling because graduate training relies heavily on training by supervisors rather than formal training. As most supervisors never received training in structural equation modeling, they cannot teach their graduate students how to perform these analyses. This means that expertise in structural equation modeling varies widely.

An inevitable consequence of wide variation in expertise is that individuals with low expertise have little insight into their limited abilities. This is known as the Dunning-Kruger effect that has been replicated in numerous studies. Even incentives to provide accurate performance estimates do not eliminate the overconfidence of individuals with low levels of expertise (Ehrlinger et al., 2008).

The Dunning-Kruger effect explains Vianello and Bar-Anan’s (2020) response to my article that presents another ill-fitting model that makes little theoretical sense. This overconfidence may also explain why they are unwilling to engage in a discussion of their model with me. They may not realize that my model is superior because they were unable to compare the models or to run more direct comparisons of the models. As their commentary is published in the influential journal Perspectives on Psychological Science and as many readers lack the expertise to evaluate the merits of their criticism, it is necessary to explain clearly why their criticism of my models is invalid and why their new alternative model is flawed.

Reproducing Vianello and Bar-Anan’s Model

I learned the hard way that the best way to fit a structural equation model is to start with small models of parts of the data and then to add variables or other partial models to build a complex model. The reason is that bad fit in smaller models can be easily identified and lead to important model modifications, whereas bad fit in a complex model can have thousands of reasons that are difficult to diagnose. In this particular case, I saw no reason to even fit a complex model for attitudes to political parties, racial groups, and the self. Instead I fitted separate models for each attitude domain. Vianello and Bar-Anan (2020) take issue with this decision.

As for estimating method variance across attitude domains, that is the very logic behind an MTMM design (Campbell & Fiske, 1959; Widaman, 1985): Method variance is shared across measures of different traits that use the same method (e.g., among indirect measures of automatic racial bias and political preferences). Trait variance is shared across measures of the same trait that use different methods (e.g., among direct and indirect measures of racial attitude). Separating the MTMM matrix into three separate submatrices (one for each trait), as Schimmack did in his article, misses a main advantage of an MTMM design.

This criticism is based on an outdated notion of validation by means of correlations in a multi-trait-multi-method matrix. In these MTMM tables, every trait is measured with all methods. For example, the Big Five traits are measured with students’ self-ratings, mothers’ ratings, and fathers’ ratings (5 traits x 3 methods). This is not possible for validation studies of explicit and implicit measures because it is assumed that explicit measures measure explicit constructs and implicit measures measure implicit constructs. Thus, it is not possible to fully cross traits and methods. This problem is evident in all models by Bar-Anan and Vianello and myself. Bar-Anan and Vianello make the mistake of assuming that using implicit measures for several attitude domains solves this problem, but their assumption that we can use correlations between implicit measures in one domain and implicit measures in another domain to solve this problem is wrong. In fact, it makes matters worse because they fail to model method variance within a single attitude domain properly.

To show this problem, I first constructed measurement models for each attitude domain and then showed that combining well-fitting models of the three domains produces a better fitting model than Vianello and Bar-Anan’s model.

Racial Bias

In their revised model, Vianello and Bar-Anan postulate three method factors: one for explicit measures, one for IAT-related measures, and one for the Affect Misattribution Procedure (AMP) and the Evaluative Priming Task (EPT). It is not possible to estimate a separate method factor for all explicit measures, but it is possible to allow for one method factor that is unique to the IAT-related measures and one that is unique to the AMP and EPT. As a first step, I fitted this model to the measures of racial bias. The model appears to have good fit, RMSEA = .013, CFI = .973. In this model, the correlation between the explicit and implicit racial bias factors is r = .80.

However, it would be premature to stop the analysis here because overall fit values in models with many missing values are misleading (Zhang & Savalei, 2020). Even if fit were good, it is good practice to examine the modification indices to see whether some parameters are misspecified.

Inspection of the modification indices shows one very large modification index (MI) of 146.04 for the residual correlation between the feeling thermometer and the preference ratings. There is a very plausible explanation for this finding. These two measures are very similar and can share method variance. For example, socially desirable responding could have the same effect on both ratings. This was the reason why I included only one of the two measures in my model. An alternative is to include both ratings and allow for the correlated residual to model shared method variance.

As predicted by the MI, model fit improved, RMSEA = .006, CFI = .995. Vianello and Bar-Anan (2020) might object that this finding is post-hoc after peeking at the data, while their model is specified theoretically. However, this argument is weak. If they really theoretically predicted that feeling thermometer and direct ratings share no method variance, it is not clear what theory they have in mind. After all, shared rating biases are very common. Moreover, their model also assumes shared method variance between these measures, but it additionally predicts that this method variance influences dissimilar measures like the Modern Racism Scale and even ratings of other attitude objects. In short, neither their model nor my models are based on theories, in part because psychologists have neglected to develop and validate measurement theories. Even if it were theoretically predicted that feeling-thermometer and preference ratings do not share method variance, the large MI for this parameter would indicate that this theory is wrong. Thus, the data falsify this prediction. In the modified model, the implicit-explicit correlation increases from .80 to .90, providing even less support for the dual-attitude model.

Further inspection of the MI showed no plausible further improvements of the model. One important finding in this partial model is that there is no evidence of shared method variance between the AMP and EPT, r = -.04. Thus, closer inspection of the correlations within the racial attitude domain suggests two problems for Vianello and Bar-Anan’s model. There is evidence of shared method variance between two explicit measures and there is no evidence of shared method variance between two implicit measures, namely the AMP and EPT.

Next, I built a model for the political orientation domain starting with the specification in Vianello and Bar-Anan’s model. Once more, overall fit appears to be good, RMSEA = .014, CFI = .989. In this model, the correlation between the implicit and explicit factor is r = .90. However, inspection of the MI replicates a residual correlation between feeling thermometer and preference ratings, MI = 91.91. Allowing for this shared method variance improved model fit, RMSEA = .012, CFI = .993, but had little effect on the implicit-explicit correlation, r = .91. In this model, there was some evidence of shared method variance between the AMP and EPT, r = .13.

Next, I put these two well-fitting models together, leaving each model unchanged. The only new question is how measures of racial bias should be related to measures of political orientation. It is common to allow trait factors to correlate freely. This is also what Vianello and Bar-Anan did and I followed this common practice. Thus, there is no theoretical structure imposed on the trait correlations. I did not specify any additional relations for the method factors. If such relationships exist, this should lead to low fit. Model fit seemed to be good, RMSEA = .009, CFI = .982. The biggest MI was observed for the loading of the Modern Racism Scale (MRS) on the explicit political orientation factor, MI = 197.69. This is consistent with the item content of the MRS that combines racism with conservative politics (e.g., being against affirmative action). For that reason, I included the MRS in my measurement model of political orientation (Schimmack, 2020).

Vianello and Bar-Anan (2020) criticize my use of the MRS. “For instance, Schimmack chose to omit one of the indirect measures—the SPF—from the models, to include the Modern Racism Scale (McConahay, 1983) as an indicator of political evaluation, and to omit the thermometer scales from two of his models. We assume that Schimmack had good practical or theoretical reasons for his modelling decisions; unfortunately, however, he did not include those reasons.” If they had inspected the MI, they would have seen that my decision to use the MRS as a different method to measure political orientation was justified by the data as well as by the item-content of the scale.

After allowing for this theoretically expected relationship, model fit improves, chi2(df = 231) = 506.93, RMSEA = .007, CFI = .990. Next I examined whether the IAT method factor for racial bias is related to the IAT method factor for political orientation. Adding this relationship did not improve fit, chi2(df = 230) = 506.65, RMSEA = .007, CFI = .990. More important, the correlation was not significant, r = -.06. This is a problem for Vianello and Bar-Anan’s model that assumes the two method factors are identical. To test this hypothesis, I fitted a model with a single IAT method factor. This model had worse fit, chi2(df = 231) = 526.99, RMSEA = .007, CFI = .989. Thus, there is no evidence for a general IAT method factor.
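These model comparisons can be made explicit with a chi-square difference test computed from the reported fit statistics alone. The sketch below assumes that the models are nested and that the reported values are ordinary (unscaled) maximum likelihood chi-squares; robust estimators would require a scaled difference test. The helper `chisq_diff` is a name introduced here for illustration.

```r
# Chi-square difference test for two nested models, using reported fit statistics
chisq_diff = function(chisq_restricted, df_restricted, chisq_free, df_free) {
  d_chisq = chisq_restricted - chisq_free   # loss of fit due to the restriction
  d_df = df_restricted - df_free            # number of restrictions
  c(d_chisq = d_chisq, d_df = d_df,
    p = pchisq(d_chisq, d_df, lower.tail = FALSE))
}
# Two IAT method factors: uncorrelated (df = 231) vs. freely correlated (df = 230)
free_corr = chisq_diff(506.93, 231, 506.65, 230)
# Single IAT method factor (df = 231) vs. two freely correlated factors (df = 230)
single_factor = chisq_diff(526.99, 231, 506.65, 230)
print(round(rbind(free_corr, single_factor), 4))
```

Under these assumptions, freeing the correlation between the two IAT method factors does not improve fit significantly, whereas collapsing them into a single method factor worsens fit significantly.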

I next explored the possibility of a method factor for the explicit measures. I had identified shared method variance for the feeling thermometer and preference ratings for racial bias and for political orientation. I now modeled this shared method variance with method factors and let the two method factors correlate with each other. The addition of a correlation did not improve model fit, chi2(df = 230) = 506.93, RMSEA = .007, CFI = .990, and the correlation between the two explicit method factors was not significant, r = .00. Imposing a single method factor for both attitude domains reduced model fit, chi2(df = 229) = 568.27, RMSEA = .008, CFI = .987.

I also tried to fit a single method factor for the AMP and EPT. The model only converged by constraining two loadings. Then model fit improved slightly, chi2(df = 230) = 501.75, RMSEA = .007, CFI = .990. The problem for Vianello and Bar-Anan is that the better fit was achieved with a negative loading on the method factor. This is inconsistent with the idea that a general method factor inflates correlations across attitude domains.

In sum, there is no evidence that method factors are consistent across the two attitude domains. Therefore I retained the basic model that specified method variance within attitude domains. I then added the three criterion variables to the model. As in Vianello and Bar-Anan’s model, contact was regressed on the explicit and implicit racial bias factors and previous voting and intention to vote were regressed on the explicit and implicit political orientation factors. The residuals were allowed to correlate freely, as in Vianello and Bar-Anan’s model.

Overall model fit decreased slightly for CFI, chi2(df = 297) = 668.61, RMSEA = .007, CFI = .988. MI suggested an additional relationship between the explicit political orientation factor and racial contact. Modifying the model accordingly improved fit slightly, chi2(df = 296) = 660.59, RMSEA = .007, CFI = .988. There were no additional MI involving the two voting measures.

Results were different from Vianello and Bar-Anan’s results. They reported that the implicit factors had incremental predictive validity for all three criterion measures.

In contrast, the model I am developing here shows no incremental predictive validity for the implicit factors.

It is important to note that I created the measurement model before I examined predictive validity. After the measurement model was created, criterion variables were added and the data determined the pattern of results. It is unclear how Vianello and Bar-Anan developed a measurement model with non-existing method factors that produced the desired outcome of significant incremental validity.

To try to reproduce their full result, I also added self-esteem measures to the model. To do so, I first created a measurement model for the self-esteem measures. The basic measurement model had poor fit, chi2(df = 58) = 434.49, RMSEA = .019, CFI = .885. Once more, the MI suggested that feeling-thermometer and preference ratings shared method variance. Allowing for this residual correlation increased model fit, chi2(df = 57) = 165.77, RMSEA = .010, CFI = .967. Another MI suggested a loading of the speeded task on the implicit factor, MI = 54.59. Allowing for this loading further improved model fit, chi2(df = 56) = 110.01, RMSEA = .007, CFI = .983. The crucial correlation between the explicit and implicit factor was r = .36. The correlation in Vianello and Bar-Anan’s model was r = .30.

I then added the self-esteem model to the model with the other two attitude domains, chi2(df = 695) = 1309.59, RMSEA = .006, CFI = .982. Next I added correlations of the IAT method factor for self-esteem with the two other IAT-method factors. This improved model fit, chi2(df = 693) = 1274.59, RMSEA = .006, CFI = .983. The reason was a significant correlation between the IAT method factors for self-esteem and racial bias. I offered an explanation for this finding in my article. Most White respondents associate self with good and White with good. If some respondents are better able to control their automatic tendencies, they will show less pro-self and pro-White biases. In contrast, Vianello and Bar-Anan have no theoretical explanation for a shared method factor across attitude domains. There was no significant correlation between IAT method factors for self-esteem and political orientation. The reason is that political orientation has more balanced automatic tendencies so that method variance does not favor one direction over the other.

This model had better fit with fewer parameters than Vianello and Bar-Anan’s model, which had chi2(df = 679) = 1719.39, RMSEA = .008, CFI = .970. The critical results of predictive validity remained unchanged.

I also fitted Vianello and Bar-Anan’s model and added four parameters that I identified as missing from their model: (a) the loading of the MRS on the explicit political orientation factor and (b) the correlations between feeling-thermometer and preference ratings for each domain. Making these adjustments improved model fit considerably, chi2(df = 675) = 1235.59, RMSEA = .006, CFI = .984. This modest adjustment altered the pattern of results for the prediction of the three criterion variables. Unlike Vianello and Bar-Anan’s model, the implicit factors no longer predicted any of the three criterion variables.


My interactions with Vianello and Bar-Anan are symptomatic of social psychologists’ misapplication of the scientific method. Rather than using data to test theories, data are being abused to confirm pre-existing beliefs. This confirmation bias goes against philosophies of science that have demonstrated the need to subject theories to strong tests and to allow data to falsify theories. Verificationism is so ingrained in social psychology that Vianello and Bar-Anan ended up with a model that showed significant incremental predictive validity for all three criterion measures in their model, even though this model made several questionable assumptions. They may object that I am biased in the opposite direction, but I presented clear justifications for modeling decisions and my model fits better than their model. In my 2020 article, I showed that Bar-Anan also co-authored another article that exaggerated evidence of predictive validity that disappeared when I reanalyzed the data (Greenwald, Smith, Sriram, Bar-Anan, & Nosek, 2009). Ten years later, social psychologists claim that they have improved their research methods, but Vianello and Bar-Anan’s commentary in 2020 shows that social psychologists have a long way to go. If social psychologists want to (re)gain trust, they need to be willing to discard cherished theories that are not supported by data.


Bar-Anan, Y., & Vianello, M. (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147(8), 1264–1272. https://doi.org/10.1037/xge0000383

Ehrlinger, J., Johnson, K., Banner, M., Dunning, D., & Kruger, J. (2008). Why the unskilled are unaware: Further explorations of (absent) self-insight among the incompetent. Organizational Behavior and Human Decision Processes, 105(1), 98–121. https://doi.org/10.1016/j.obhdp.2007.05.002

Greenwald, A. G., Smith, C. T., Sriram, N., Bar-Anan, Y., & Nosek, B. A. (2009). Implicit race attitudes predicted vote in the 2008 U.S. Presidential election. Analyses of Social Issues and Public Policy (ASAP), 9(1), 241–253. https://doi.org/10.1111/j.1530-2415.2009.01195.x

Schimmack, U. (2019). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Vianello, M., & Bar-Anan, Y. (2020). Can the Implicit Association Test measure automatic judgment? The validation continues. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619897960

Zhang, X., & Savalei, V. (2020). Examining the effect of missing data on RMSEA and CFI under normal theory full-information maximum likelihood. Structural Equation Modeling: A Multidisciplinary Journal, 27(2), 219–239. https://doi.org/10.1080/10705511.2019.1642111

Cross-Cultural Comparisons of Personality: Beware of Method Factors

Ulrich Schimmack
Shigehiro Oishi


Personality ratings on a 25-item Big Five measure by two national samples (US, Japan) were analyzed with an item-level measurement model that separates method factors (acquiescence, halo bias) and trait factors. Results reveal a strong influence of halo bias on US responses that distorts cultural comparisons in personality. After correcting for halo bias, Japanese were more conscientious, extraverted, open to experience and less neurotic and agreeable. The results support cultural differences in positive illusions and raise questions about the validity of studies that rely on scale means to examine cultural differences in personality.


Cultural stereotypes imply cross-cultural differences in personality traits. However, cross-cultural studies of personality do not support the validity of these cultural stereotypes (Terracciano et al., 2005). Whenever two measures produce divergent results, it is necessary to examine the sources of these discrepancies. One obvious reason could be that cultural stereotypes are simply wrong. It is also possible that scientific studies of personality across cultures produce misleading results (Perugini & Richetin, 2007). One problem for empirical studies of cross-cultural differences in personality is that cultural differences tend to be small. Culture explains at most 10% of the variance and often the percentages are much smaller. For example, McCrae et al. (2010) found that culture explained only 1.5% of the variance in agreeableness ratings. As some of this variance is method variance, the variance due to actual differences in agreeableness is likely to be less than 1%. With small amounts of valid variance, method factors can have a strong influence on the pattern of mean differences across cultures.

One methodological problem in cross-cultural studies of personality is that personality measures are developed with a focus on the correlation of items with each other within a population. The item means are not relevant with the exception that items should avoid floor or ceiling effects. However, cross-cultural comparisons rely on differences in the item means. As item means have not been subjected to psychometric evaluations, it is possible that item means lack construct validity. Take “working hard” as an example. How hard people work could be influenced by culture. For example, in poor cultures people have to work harder to make a living. The item “working hard” may correctly reflect variation in conscientiousness within poor cultures and within rich cultures, but the differences between cultures would reflect environmental conditions rather than conscientiousness. As a result, it is necessary to demonstrate that cultural differences in item means are valid measures of cultural differences in personality.

Unfortunately, obtaining data from a large sample of nations is difficult and sample sizes are often rather small. For example, McCrae et al. (2010) examined convergent validity of Big Five scores with 18 nations. The only significant evidence of convergent validity was obtained for neuroticism, r = .44, and extraversion, r = .45. Openness and agreeableness even produced small negative correlations, r = -.27, r = -.05, respectively. The largest cross-cultural studies of personality had 36 overlapping nations (Allik et al., 2017; Schmitt et al., 2007). The highest convergent validity was r = .4 for extraversion and conscientiousness. Low convergent validity, r = .2, was observed for neuroticism and agreeableness, and the convergent validity for openness was 0 (Schimmack, 2020). These results show the difficulty of measuring personality across cultures and the lack of validated measures of cultures’ personality profiles.

Method Factors in Personality Measurement

It is well-known that self-ratings of personality are influenced by method factors. One factor is a stylistic factor in the use of response formats known as acquiescence bias (Cronbach, 1942, 1965). The other factor reflects individual differences in responding to the evaluative meaning of items known as halo bias (Thorndike, 1920). Both method factors can distort cross-cultural comparisons. For example, national stereotypes suggest that Japanese individuals are more conscientious than US American individuals, but mean scores of conscientiousness in cross-cultural studies do not confirm this stereotype (Oishi & Roth, 2009). Both method factors may artificially lower Japan’s mean score because Japanese respondents are less likely to use extreme scores (Min, Cortina, & Miller, 2016) and Asians are less likely to inflate their scores on desirable traits (Kim, Schimmack, & Oishi, 2012). In this article, we used structural equation modeling to separate method variance from trait variance to distinguish cultural differences in response tendencies from cultural differences in personality traits.

Convenience Samples versus National Samples

Another problem for empirical studies of national differences is that psychologists often rely on convenience samples. The problem with convenience samples is that personality can change with age and that there are regional differences in personality within nations (). For example, a sample of students at New York University may differ dramatically from a student sample at Mississippi State University or Iowa State University. Although regional differences tend to be small, national differences are also small. Thus, small regional differences can bias national comparisons. To avoid these biases it is preferable to compare national samples that cover all regions of a nation and a broad age range.

Modeling Approach

The purpose of our study is to advance research on cultural differences in personality by comparing a Japanese and a US national sample that completed the same Big Five personality questionnaire using a measurement model that distinguishes personality factors and method factors. The measurement model is an improved version of Anusic et al.’s (2009) halo-alpha-beta model (Schimmack, 2019). The model is essentially a tri-factor model.

Figure 1

That is, each item loads on three factors, namely (a) a primary loading on one of the Big Five factors, (b) a loading on an acquiescence bias factor, and (c) a loading on the evaluative bias/halo factor. As Big Five measures typically do not show a simple structure, the model also can include secondary loadings on other Big Five factors. This measurement model has been successfully fitted to several Big Five questionnaires (Schimmack, 2019). This is the first time the model is applied as a multiple-group model to compare measurement models for US and Japanese samples.

We first fitted a very restrictive model that assumed invariance across the two samples. Given the lack of psychometric cross-cultural comparisons, we expected that this model would not have acceptable fit. We then modified the model to allow for cultural differences in some primary factor loadings, secondary factor loadings, and item intercepts. This step makes our work exploratory. However, we believe that this exploratory work is needed as a first step towards psychometrically sound measurement of cultural differences.


Participants (N = 952 Japanese, 891 US) were recruited by Nikkei Research Inc. and its U.S. affiliate using a national probabilistic sampling method based on gender and age. The mean age was 44. The data have been used before to compare the influence of personality on life-satisfaction judgments, but without comparing mean levels in personality and life-satisfaction (Kim, Schimmack, Oishi, & Tsutsui, 2018).


The Big Five items were taken from the International Personality Item Pool (Goldberg et al., 2006). There were five items for each of the Big Five dimensions (Table 1).


We first fitted a model without mean structure to the data. A model with strict invariance for the two samples did not have acceptable fit using RMSEA < .06 and CFI > .95 as criterion values, RMSEA = .064, CFI = .834. However, CFI values should not be expected to reach .95 in models with single-item indicators (Anusic et al., 2009). Therefore, the focus is on RMSEA. We first examined modification indices (MI) of primary loadings. We used MI > 30 as a criterion to free parameters to avoid overfitting the model. We found seven primary loadings that would improve model fit considerably (n4, e3, a1, a2, a3, a4, c4). Freeing these parameters improved the model (RMSEA = .060, CFI = .857). We next examined loadings on the halo factor because it is likely that some items differ in their connotative meaning across languages. However, we found only two notable MIs (o1, c4). Freeing these parameters improved model fit (RMSEA = .057, CFI = .871). We identified six secondary loadings that differed notably across cultures. One was a secondary loading on neuroticism (e4), four were secondary loadings on agreeableness (n5, e1, e3, o4), and one was a secondary loading on conscientiousness (n3). Freeing these parameters improved model fit (RMSEA = .052, CFI = .894). We were satisfied with this measurement model and continued with the means model.

The first means model fixed the item intercepts and factor means to be identical across samples. This model had worse fit than the model without a means structure (RMSEA = .070, CFI = .803). The biggest MI was observed for the mean of the halo factor. Allowing for mean differences in halo improved model fit considerably (RMSEA = .060, CFI = .849). MIs next suggested allowing for mean differences in extraversion and agreeableness. We then also allowed for mean differences in the other factors. This further improved model fit (RMSEA = .058, CFI = .864), but not as much. MIs suggested seven items with different item intercepts (n1, n5, e3, o3, a5, c3, c5). Relaxing these parameters improved model fit close to the level for the model without a mean structure (RMSEA = .053, CFI = .888).
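For readers who are not familiar with the two fit indices reported above, the following Python sketch shows the standard textbook formulas for RMSEA and CFI. The chi-square values, degrees of freedom, and sample size in the usage lines are made-up illustration values, not results from this study, and the sketch uses the single-group RMSEA formula, ignoring the multiple-group adjustment that some SEM programs apply.

```python
import math


def rmsea(chi2, df, n):
    """Single-group RMSEA: misfit per degree of freedom, scaled by sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """CFI: improvement of the tested model over the baseline (independence) model."""
    misfit_model = max(chi2_model - df_model, 0.0)
    misfit_base = max(chi2_base - df_base, misfit_model)
    if misfit_base == 0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - misfit_model / misfit_base


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
example_rmsea = rmsea(chi2=200.0, df=100, n=101)
example_cfi = cfi(chi2_model=200.0, df_model=100, chi2_base=1100.0, df_base=100)
print(f"RMSEA = {example_rmsea:.3f}, CFI = {example_cfi:.3f}")
```

Because CFI is defined relative to the misfit of the baseline model, it can stay low when items are only weakly correlated, even if RMSEA is acceptable. This is one way to see why the two indices can disagree for item-level models.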

Table 1 shows the primary loadings and the loadings on the halo factor for the 25 items.

Table 1

The results show very similar primary loadings for most items. This means that factors have similar meaning in the two samples and that it is possible to compare the two cultures. Nevertheless, there are some differences that could bias comparisons based on item-sum-scores. The item “feeling comfortable around people” loads much more strongly on the extraversion factor in the US than in Japan. The agreeableness items “insult people” and “sympathize with others’ feelings” also load more strongly in the US than in Japan. Finally, “making a mess of things” is a conscientiousness item in the US, but not in Japan. The fact that item loadings are more consistent with the theoretical structure in the US sample can be attributed to the development of the items in the US.

A novel and important finding is that most loadings on the halo factor are also very similar across nations. For example, the item “have excellent ideas” shows a high loading for the US and Japan. This finding contradicts the idea that evaluative biases are culture-specific (Church et al., 2014). The only notable difference is the item “make a mess of things” that has no notable loading on the halo factor in Japan. Even in English, the meaning of this item is ambiguous and future studies should replace this item with a better item. The correlation between the halo loadings for the two samples is high, r = .96.

Table 2 shows the item means and the item intercepts of the model.

Table 2

The item means of the US sample are strongly correlated with the loadings on the halo factor, r = .81. This is a robust finding in Western samples. More desirable items are endorsed more. The reason could be that individuals actually act in desirable ways most of the time and that halo bias influences item means. Surprisingly, there is no notable correlation between item means and loadings on the halo factor for the Japanese sample, r = .08. This pattern of results suggests that US means are much more strongly influenced by halo bias than Japanese means. Further evidence is provided by inspecting the mean differences. For desirable items (low N, high E, O, A, & C), US means are always higher than Japanese means. For undesirable items, the US means are always lower than Japanese means, except for the item “stay in the background” where the means are identical. The difference scores are also positively correlated with the halo loadings, r = .90. In conclusion, there is strong evidence that halo bias distorts the comparison of personality in these two samples.
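The logic of this argument can be illustrated with a small Python sketch. It assumes a deliberately simplified model in which each item mean equals an item intercept plus the item's halo loading times the group's halo factor mean. All numbers (loadings, intercepts, halo means) are randomly generated, hypothetical values and not estimates from our data.

```python
import numpy as np


def item_means(intercepts, halo_loadings, halo_mean):
    """Item means implied by a simplified model: intercept + loading * halo factor mean."""
    return intercepts + halo_loadings * halo_mean


rng = np.random.default_rng(0)
n_items = 25

# Hypothetical halo loadings: positive for desirable, negative for undesirable items.
halo_loadings = rng.uniform(-0.6, 0.6, n_items)
# Hypothetical item intercepts near the scale midpoint, identical in both groups.
intercepts = 3.0 + rng.normal(0.0, 0.15, n_items)

means_strong_halo = item_means(intercepts, halo_loadings, halo_mean=1.0)
means_no_halo = item_means(intercepts, halo_loadings, halo_mean=0.0)
differences = means_strong_halo - means_no_halo

r_strong = np.corrcoef(means_strong_halo, halo_loadings)[0, 1]
r_none = np.corrcoef(means_no_halo, halo_loadings)[0, 1]
r_diff = np.corrcoef(differences, halo_loadings)[0, 1]
print(f"r(means, loadings): strong halo = {r_strong:.2f}, no halo = {r_none:.2f}")
print(f"r(mean differences, loadings) = {r_diff:.2f}")
```

In this simplified setup, a difference in the halo factor mean is enough to produce item mean differences that line up with the halo loadings, which is the pattern described above. Real data also contain trait differences and intercept differences, so the observed correlations are lower than in the sketch.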

The item intercepts show cultural differences in items after taking cultural differences in halo and the other factors into account. Notable differences were observed for some items. Even after controlling for halo and extraversion, US respondents report higher levels of being comfortable around people than Japanese. This difference fits cultural stereotypes. After correcting for halo bias, Japanese now score higher on getting chores done right away than Americans. This also fits cultural stereotypes. However, Americans still report paying more attention to detail than Japanese, which is inconsistent with cultural stereotypes. Extensive validation research is needed to examine whether these results reflect actual cultural differences in personality and behaviours.

Figure 2 shows the mean differences on the Big Five factors and the two bias factors.

Figure 2

Figure 2 shows a very large difference in halo bias. The difference is so large that it seems implausible. Maybe the model is overcorrecting, which would bias the mean differences for the actual traits in the opposite direction. There is little evidence of cultural differences in acquiescence bias. One open question is whether the strong halo effect is entirely due to evaluative biases. It is also possible that a modesty bias plays a role because modesty implies less extreme responses to desirable items and less extreme responses to undesirable items. To separate the two, it would be necessary to include frequent and infrequent behaviours that are not evaluative.

The most interesting result for the Big Five factors is that the Japanese sample scores higher in conscientiousness than the US sample after halo bias is removed. This reverses the mean differences in this sample and previous studies that show higher conscientiousness for US than Japanese samples (). The present results suggest that halo bias masks the actual difference in conscientiousness. However, other results are more surprising. In particular, the present results suggest that Japanese people are more extraverted than Americans. This contradicts cultural stereotypes and previous studies. The problem is that cultural stereotypes could be wrong and that previous studies did not control for halo bias. More research with actual behaviours and less evaluative items is needed to draw strong conclusions about personality differences between cultures.


It has been known for 100 years that self-ratings of personality are biased by connotative meaning. At least in North America it is common to see a strong correlation between the desirability of items and the means of self-ratings. There is also consistent evidence that Americans rate themselves in a more desirable manner than the average American (). However, this does not mean that Americans are seeing themselves as better than everybody else. In fact, self-ratings tend to be slightly less favorable than ratings of friends or family members (), indicating a general evaluative bias to rate oneself and close others favorably.

Given the pervasiveness of evaluative biases in personality ratings it is surprising that halo bias has received so little attention in cross-cultural studies of personality. One reason could be the lack of a good method to measure and remove halo variance from personality ratings. Despite early attempts to detect socially desirable responding, lie scales have shown little validity as bias measures (ref). The problem is that manifest scores on lie scales contain as much valid personality variance as bias variance. Thus, correcting for scores on these scales literally throws out the baby (valid variance) with the bathwater (bias variance). Structural equation modeling (SEM) solves this problem by splitting observed variances into unobserved or latent variances. However, personality psychologists have been reluctant to take advantage of SEM because item models require large samples and theoretical models were too simplistic and produced bad fit. Informed by multi-rater studies that emerged in the 1990s, we developed a measurement model of the Big Five that separates personality variance from evaluative bias variance (Anusic, et al., 2009; Schimmack, Kim, & 2012; Schimmack, 2019). Here we applied this model for the first time to cross-cultural data to examine whether cultures differ in halo bias. The results suggest that halo bias has a strong influence on personality ratings in the US, but not in Japan. The differences in halo bias distort comparisons on the actual personality traits. While raw scores suggest that Japanese people are less conscientious than Americans, the corrected factor means suggest the opposite. Japanese participants also appeared to be less neurotic, more extraverted, and more open to experience, which was a surprising result. Correcting for halo bias did not change the cultural differences in agreeableness. Americans were more agreeable than Japanese with and without correction for halo bias. Our results do not provide a conclusive answer about cultural differences in personality, but they shed new light on several questions in personality research.

Cultural Differences in Self-enhancement

One unresolved question in personality psychology is whether positive biases in self-perceptions, also known as self-enhancement, are unique to American or Western cultures or whether they are a universal phenomenon (Church et al., 2016). One problem is that different approaches are used to measure self-enhancement. The most widely used method is social comparison, where individuals compare themselves to an average person. These studies tend to show a persistent better-than-average effect in all cultures (ref). However, this finding does not imply that halo biases are equally strong in all cultures. Brown and Kobayashi (2002) found better-than-average effects in the US and Japan, but Japanese ratings of the self and others were less favorable than those in the US. Kim et al. (2012) explain this pattern with a general norm to be positive in North America that influences ratings of the self as well as ratings of others. Our results are consistent with this view and suggest that self-enhancement is not a universal tendency. More research with other cultures is needed to examine which cultural factors moderate halo biases.

Rating Biases or Self-Perception Biases

An open question is whether halo biases are mere rating biases or reflect distorted self-perceptions. One model suggests that participants are well aware of their true personality, but merely present themselves in a more positive light to others. Another model suggests that individuals truly believe that their personality is more desirable than it actually is. It is not easy to distinguish between these two models empirically.

Halo Bias and the Reference Group Effect

In an influential article, Heine et al. (2002) criticized cross-cultural comparisons in personality ratings as invalid. The main argument was that respondents adjust the response categories to cultural norms. This adjustment was called the reference group effect. For example, the item “insult people” is not answered based on the frequency of insults or a comparison of the frequency of insults to other behaviours. Rather it is answered in comparison to the typical frequency of insults in a particular culture. The main prediction made by the reference group effect is that responses in all cultures should cluster around the mid-point of a Likert-scale that represents the typical frequency of insults. As a result, cultures could differ dramatically in the actual frequency of insults, while means on the subjective rating scales are identical.

The present results are inconsistent with a simple reference group effect. Specifically, the US sample showed notable variation in item means that was related to item desirability. As a result, undesirable items like “insult people” had a much lower mean, M = 1.83, than the mid-point of the scale (3), and desirable items like “have excellent ideas” had a higher mean, M = 3.73, than the mid-point of the scale. This finding suggests that halo bias rather than a reference group effect threatens the validity of cross-cultural comparisons.

Reference group effects may play a bigger role in Japan. Here item means were not related to item desirability and clustered more closely around the mid-point of the scale. The highest mean was 3.56 for worry and the lowest mean was 2.45 for feeling comfortable around people. However, other evidence contradicts this hypothesis. After removing effects of halo and the other personality factors, item intercepts were still highly correlated across the two national samples, r = .91. This finding is inconsistent with culture-specific reference groups that would not produce consistent item intercepts.

Our results also provide a new explanation for the low conscientiousness of Japanese samples. A reference group effect would not predict a significantly lower level of conscientiousness. However, a stronger halo effect in the US explains this finding because conscientiousness is typically assessed with desirable items. Our results are also consistent with the finding that self-esteem and self-enhancement are more pronounced in the US than in Japan (Heine & Buchtel, 2009). These aforementioned biases inflate conscientiousness scores in the US. After removing this bias, Japanese rate themselves as more conscientious than US Americans.

Limitations and Future Directions

We echo previous calls for validation of personality scores of nations (Heine & Buchtel, 2009). The current results are inconsistent across questionnaires and even the low level of convergent validity may be inflated by cultural differences in response styles. Future studies should try to measure personality with items that minimize social desirability and use response formats that avoid the use of reference groups (e.g., frequency estimates). Moreover, results based on ratings should be validated with objective indicators of behaviours.

Future research also needs to take advantage of developments in psychological measurement and use models that can identify and control for response artifacts. The present model shows that it is possible to separate evaluative bias or halo variance from actual personality variance. Future studies should use this model to compare a larger number of nations.

The main limitation of our study is the relatively small number of items. The larger the number of items, the easier it is to distinguish item-specific variance, method variance, and trait variance. The measure also did not properly take into account that the Big Five are higher-order factors of more basic traits called facets. Measures like the BFI-2 or the NEO-PI3 should be used to study cultural differences at the facet level, which often shows unique influences of culture that are different from effects on the Big Five (Schimmack, 2020).

We conclude with a statement of scientific humility. The present results should not be taken as clear evidence about cultural differences in personality. Our article is merely a small step towards the goal of measuring personality differences across cultures. One obstacle in revealing such differences is that national differences appear to be relatively small compared to the variation in personality within nations. One possible explanation for this is that variation in personality is caused more by biological than cultural factors. For example, twin studies suggest that 40% of the variance in personality traits is caused by genetic variation within a population, whereas cross-cultural studies suggest that at most 10% of the variance is caused by cultural influences on population means. Thus, while uncovering cultural variation in personality is of great scientific interest, evidence of cultural differences between nations should not be used to stereotype individuals from different nations. Finally, it is important to distinguish between personality traits that are captured by Big Five traits and other personality attributes like attitudes, values, or goals that may be more strongly influenced by culture. The key novel contribution of this article is to demonstrate that cultural differences in response styles exist and distort national comparisons of personality with simple scale means. Future studies need to take response styles into account.


Cronbach, L. J. (1942). Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 33(6), 401–415. https://doi.org/10.1037/h0054677

Heine, S. J., & Buchtel, E. E. (2009). Personality: The universal and the culturally specific. Annual Review of Psychology, 60, 369–394. https://doi.org/10.1146/annurev.psych.60.110707.163655

Perugini, M., & Richetin, J. (2007). In the land of the blind, the one-eyed man is king. European Journal of Personality, 21(8), 977–981. https://doi.org/10.1002/per.649

Schimmack, U. (2020). Personality science: The science of human diversity. TopHat, 978-1-77412-253-2.    https://tophat.com/marketplace/social-science/psychology/full-course/personality-science-the-science-of-human-diversity-ulrich-schimmack/4303/

Terracciano, A. et al. (2005). National character does not reflect mean personality trait levels in 49 cultures. Science, 310, 96–100.

JPSP:PPID = Journal of Pseudo-Scientific Psychology: Pushing Paradigms – Ignoring Data


Ulrich Orth, Angus Clark, Brent Donnellan, Richard W. Robins (DOI: 10.1037/pspp0000358) present 10 studies that show the cross-lagged panel model (CLPM) does not fit the data. This does not stop them from interpreting a statistical artifact of the CLPM as evidence for their vulnerability model of depression. Here I explain in great detail why the CLPM does not fit the data and why it creates an artifactual cross-lagged path from self-esteem to depression. It is sad that the authors, reviewers, and editors were blind to the simple truth that a bad-fitting model should be rejected and that it is unscientific to interpret parameters of models with bad fit. Ignorance of basic scientific principles in a high-profile article reveals poor training and understanding of the scientific method among psychologists. If psychology wants to gain respect and credibility, it needs to take scientific principles more seriously.


Psychology is in a crisis. Researchers are trained within narrow paradigms, methods, and theories that populate small islands of researchers. The aim is to grow the island and to become a leading and popular island. This competition between islands is rewarded by an incentive structure that imposes the reward structure of capitalism on science. The winner gets to dominate the top journals that are mistaken as outlets of quality. However, just like Coke is not superior to Pepsi (sorry Coke fans), the winner is not better than the losers. They are just market leaders for some time. No progress is being made because the dominant theories and practices are never challenged and replaced with superior ones. Even the past decade that has focused on replication failures has changed little in the way research is conducted and rewarded. Quantity of production is rewarded, even if the products fail to meet basic quality standards, as long as naive consumers of research are happy.

This post is about the lack of training in the analysis of longitudinal data with a panel structure. A panel study essentially repeats the measurement of one or several attributes several times. Nine years of undergraduate and graduate training leave most psychologists without any training in how to analyze these data. This explains why the cross-lagged panel model (CLPM) was criticized four decades ago (Rogosa, 1980), but researchers continue to use it with the naive assumption that it is a plausible model to analyze panel data. Critical articles are simply ignored. This is the preferred way of dealing with criticism by psychologists. Here, I provide a detailed critique of CLPM using Orth et al.’s data (https://osf.io/5rjsm/) and simulations.

Step 1: Examine your data

Psychologists are not trained to examine correlation matrices for patterns. They are trained to submit their data to pre-specified (cookie-cutter) models and hope that the data fit the model. Even if the model does not fit, results are interpreted because researchers are not trained in modifying cookie cutter models to explore reasons for bad fit. To understand why a model does not fit the data, it is useful to inspect the actual pattern of correlations.

To illustrate the benefits of visual inspection of the actual data, I am using the data from the Berkeley Longitudinal Study (BLS), which is the first dataset listed in Orth et al.’s (2020) table that lists 10 datasets.

To ease interpretation, I break up the correlation table into three components, namely (a) correlations among self-esteem measures (se1-se4 with se1-se4), (b) correlations among depression measures (de1-de4 with de1-de4), and (c) correlations of self-esteem measures with depression measures (se1-se4 with de1-de4).

Table 1

Table 1 shows the correlation matrix for the four repeated measurements of self-esteem. The most important information in this table is how much the magnitude of the correlations decreases along the diagonals that represent different time lags. For example, the lag-1 correlations are .76, .79, and .74, which approximately average to a value of .76. The lag-2 correlations are .65 and .69, which averages to .67. The lag-3 correlation is .60.

The first observation is that correlations are getting weaker as the time-lag gets longer. This is what we would expect from a model that assumes self-esteem actually changes over time, rather than just fluctuating around a fixed set-point. The latter model implies that retest correlations remain the same over different time lags. So, we do have evidence that self-esteem changes over time, as predicted by the cross-lagged panel model.

The next question is how much retest correlations decrease with increasing time lags. The difference from lag-1 to lag-2 is .74 – .67 = .07. The difference from lag-2 to lag-3 is .67 – .60, which is also .07. This shows no leveling off of the decrease in these data. It is possible that the next wave would produce a lag-4 correlation of .53, which would be .07 lower than the lag-3 correlation. However, a difference of .07 is not very different from 0, which would imply that change asymptotes at .60. The data are simply insufficient to provide strong information about this.

The third observation is that the lag-2 correlation is much stronger than the square of the lag-1 correlations, .67 > .74^2 = .55. Similarly, the lag-3 correlation is stronger than the product of the lag-1 and lag-2 correlations, .60 > .74 * .67 = .50. This means that a simple autoregressive model with observed variables does not fit the data. However, this is exactly the model of Orth et al.’s CLPM.
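The comparison can be written out as a minimal Python sketch. It assumes a first-order autoregressive model with observed variables and the same lag-1 stability between all adjacent waves, in which case the implied lag-k correlation is simply the lag-1 correlation raised to the power k. The observed values are the rounded lag correlations used in the comparison above; the function name is just an illustrative label.

```python
def ar1_implied_correlations(lag1, max_lag):
    """Retest correlations implied by an AR(1) model with equal stability: r(k) = lag1 ** k."""
    return {k: lag1 ** k for k in range(1, max_lag + 1)}


# Rounded lag correlations for self-esteem, as used in the text above.
observed = {1: 0.74, 2: 0.67, 3: 0.60}
implied = ar1_implied_correlations(observed[1], max_lag=3)

for lag in sorted(observed):
    print(f"lag {lag}: observed r = {observed[lag]:.2f}, AR(1)-implied r = {implied[lag]:.2f}")
```

Under this stricter version of the model, the implied lag-3 correlation is .74^3, which rounds to .41. This is even lower than the .50 obtained from multiplying the observed lag-1 and lag-2 correlations, so the conclusion that the model predicts too much change does not depend on which comparison is used.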

It is easy to examine the fit of this part of the CLPM model, by fitting an autoregressive model to the self-esteem panel data.

se2-se4 PON se1-se3; ! This command regresses each measure on the previous measure (n on n-1).
! There is one thing I learned from Orth et al., and it was the PON command of MPLUS

Table 2

Table 2 shows the fit of the autoregressive model. While CFI meets the conventional threshold of .95 (higher is better), RMSEA shows terrible fit of the model (values of .06 or lower are considered acceptable). This is a problem for cookie-cutter researchers who think CLPM is a generic model that fits all data. Here we see that the model makes unrealistic assumptions and we already know what the problem is based on our inspection of the correlation table. The model predicts more change than the data actually show. We are therefore in a good position to reject the CLPM as a viable model for these data. This is actually a positive outcome. The biggest problem in correlational research is data that fit all kinds of models. Here we have data that actually disconfirm some models. Progress can be made, but only if we are willing to abandon the CLPM.
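To see how CFI can look acceptable while RMSEA looks terrible, it helps to compute both from the same chi-square. The sketch below uses the common textbook formulas (RMSEA with the N - 1 convention; some programs use N instead). The chi-square values and the sample size are hypothetical and are not taken from Table 2.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2: float, df: int, chi2_base: float, df_base: int) -> float:
    """Comparative fit index relative to the baseline (independence) model."""
    misfit = max(chi2 - df, 0.0)
    misfit_base = max(chi2_base - df_base, misfit, 0.0)
    return 1.0 - misfit / misfit_base if misfit_base > 0 else 1.0


# Hypothetical inputs: a model with 3 df and a baseline model with 6 df.
model_rmsea = rmsea(50.0, 3, 400)
model_cfi = cfi(50.0, 3, 1000.0, 6)
print(round(model_rmsea, 3))  # 0.198
print(round(model_cfi, 3))    # 0.953
```

With few degrees of freedom and strongly correlated variables, the baseline model fits so badly that CFI stays high even when the misfit per degree of freedom, which is what RMSEA reflects, is large.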

Now let’s take a look at the depression data, following the same steps as for the self-esteem data.

Table 3

The average lag-1 correlation is .43. The average lag-2 correlation is .45, and the lag-3 correlation is .40. These results are problematic for an autoregressive model because the lag-2 correlation is not even lower than the lag-1 correlation.

Once more it is hard to tell whether retest correlations are approaching an asymptote. In this case, the lag-1 minus lag-2 difference is -.02 and the lag-2 minus lag-3 difference is .05.

Finally, it is clear that the autoregressive model with manifest variables overestimates change. The lag-2 correlation is stronger than the square of the lag-1 correlations, .45 > .43^2 = .18, and the lag-3 correlation is stronger than the lag-1 * lag-2 correlation, .40 > .43*.45 = .19.

Given these results, it is not surprising that the autoregressive model fits the depression data even worse than the self-esteem data (Table 4).

de2-de4 PON de1-de3; ! regress each depression measure on the previous one.

Table 4

Even the CFI value is now in the toilet and the RMSEA value is totally unacceptable. Thus, the basic model of stability and change implemented in CLPM is inconsistent with the data. Nobody should proceed to build a more complex, bivariate model if the univariate models are inconsistent with the data. The only reason why psychologists do so all the time is that they do not think about CLPM as a model. They think CLPM is like a t-test that can be fitted to any panel data without thinking. No wonder psychology is not making any progress.

Step 2: Find a Model That Fits the Data

The second step may seem uncontroversial. If one model does not fit the data, there is probably another model that does fit the data and this model has a higher chance of being the model that reflects the causal processes that produced the data. However, psychologists have an uncanny ability to mess up even the simplest steps in data analysis. They have convinced themselves that it is wrong to fit models to data. The model has to come first so that the results can be presented as confirming a theory. However, what is the theoretical rationale of the CLPM? It is not motivated by any theory of development, stability, or change. It is as atheoretical as any other model. It only has the advantage that it became popular on an island of psychology and now people use it without being questioned about it. Convention and conformity are not pillars of science.

There are many alternative models to CLPM that can be tried. One model is more than 50 years old and was introduced by Heise (1969). It is also an autoregressive model, but it additionally allows for occasion-specific variance. That is, some factors may temporarily change individuals’ self-esteem or depression without any lasting effects on future measurements. This is a particularly appealing idea for a symptom checklist of depression that asks about depressive symptoms in the past four weeks. Maybe somebody’s cat died or it was a midterm period and depressive symptoms were present for a brief period, but these factors have no influence on depressive symptoms a year later.

I first fitted Heise’s model to the self-esteem data.

sse1 BY se1@1;
sse2 BY se2@1;
sse3 BY se3@1;
sse4 BY se4@1;
sse2-sse4 PON sse1-sse3 (stability);
se1-se4 (se_osv); ! occasion specific variance in self-esteem

Model fit for this model is perfect. Even the chi-square test is not significant (which in SEM is a good thing, because it means the model closely fits the data).

Model results show that there is significant occasion-specific variance. After taking this variance into account, the stability of the variance that is not occasion-specific, called state variance by Heise, is around r = .9 from one occasion to the next.

Fit for the depression data is also perfect.

There is even more occasion-specific variance in depressive symptoms, but the non-occasion-specific variance is even more stable than the non-occasion-specific variance in self-esteem.
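These estimates can be roughly recovered from the averaged retest correlations alone. The sketch below is a back-of-the-envelope calculation, not the fitted Mplus model. It assumes equal stability and an equal share of non-occasion-specific variance at every wave, so that the lag-k correlation equals share * stability^k. The function name is made up for the example.

```python
def heise_estimates(r_lag1: float, r_lag2: float) -> tuple[float, float]:
    """Return (stability, share of non-occasion-specific variance), assuming
    r_lag_k = share * stability**k with equal parameters across waves."""
    stability = r_lag2 / r_lag1
    share = r_lag1 ** 2 / r_lag2
    return stability, share


# Averaged correlations quoted in the text.
se_stab, se_share = heise_estimates(0.74, 0.67)
de_stab, de_share = heise_estimates(0.43, 0.45)

print(f"self-esteem: stability {se_stab:.2f}, occasion-specific share {1 - se_share:.2f}")
print(f"self-esteem: implied lag-3 correlation {se_share * se_stab ** 3:.2f}")
print(f"depression:  stability {de_stab:.2f}, occasion-specific share {1 - de_share:.2f}")
```

For self-esteem the stability comes out near .9 and the implied lag-3 correlation is close to the observed .60. For depression the crude stability estimate lands slightly above 1, which a fitted model would not allow; here it should be read only as a sign that the stable part of depression barely changes while the occasion-specific share is much larger.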

These results make perfect sense if we think about the way self-esteem and depression are measured. Self-esteem is measured with a trait measure of how individuals see themselves in general, ignoring ups and downs and temporary shifts in self-esteem. In contrast, depression is assessed with questions about a specific time period and respondents are supposed to focus on their current ups and downs. Their general disposition should be reflected in these judgments only to the extent that it influences their actual symptoms in the past weeks. These episodic measures are expected to have more occasion specific variance if they are valid. These results show that participants are responding to the different questions in different ways.

In conclusion, model fit and the results favor Heise’s model over the cookie-cutter CLPM.

Step 3: Putting the two autoregressive models together

Let’s first examine the correlations of self-esteem measures with depression measures.

The first observation is that the same-occasion correlations are stronger (more negative) than the cross-occasion correlations. This suggests that occasion specific variance in self-esteem is correlated with occasion specific variance in depression.

The second observation is that the lagged self-esteem to depression correlations (e.g., se1 with de2) do not become weaker (less negative) with increasing time lag, lag-1 r = -.36, lag-2 r = -.32, lag-3 r = -.33.

The third observation is that the lagged depression to self-esteem correlations (e.g., de1 with se2) do not decrease from lag-1 to lag-2, although they do become weaker from lag-2 to lag-3, lag-1 r = -.44, lag-2 r = -.45, lag-3 r = -.35.

The fourth observation is that the lagged self-esteem to depression correlations (se1 with de2) are weaker than the lagged depression to self-esteem correlations (de1 with se2). This pattern is expected because self-esteem is more stable than depressive symptoms. As illustrated in the Figure below, the path from de1 to se4 is stronger than the path from se1 to de4 because the path from se1 to se4 is stronger than the path from de1 to de4.
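The tracing logic can be made concrete with a small calculation. It assumes that the only link between the constructs is their time-1 correlation and that each construct then carries forward only through its own stability, with no cross-lagged paths. The same-occasion correlation of -.50 is a hypothetical value chosen for illustration; the two lag-3 retest correlations are the ones reported above.

```python
def traced_lagged_r(concurrent_r: float, stability: float) -> float:
    """Lagged cross-construct correlation implied when the time-1 correlation
    is carried forward only by the stability of the later-measured construct."""
    return concurrent_r * stability


concurrent_r = -0.50  # hypothetical correlation of se1 with de1
r_se1_se4 = 0.60      # lag-3 retest correlation of self-esteem (from the text)
r_de1_de4 = 0.40      # lag-3 retest correlation of depression (from the text)

r_de1_se4 = traced_lagged_r(concurrent_r, r_se1_se4)  # de1 -> se1 -> se4
r_se1_de4 = traced_lagged_r(concurrent_r, r_de1_de4)  # se1 -> de1 -> de4

print(f"de1 with se4: {r_de1_se4:.2f}")  # -0.30
print(f"se1 with de4: {r_se1_de4:.2f}")  # -0.20
```

Even without any causal effect in either direction, the depression to self-esteem correlation is stronger simply because self-esteem is the more stable variable.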

Regression analysis or structural equation modeling is needed to examine whether there are any additional lagged effects of self-esteem on depressive symptoms. However, a strong cross-lagged path from se1 to de4 would produce a stronger correlation of se1 and de4, if stability were equal or if the effect is strong. So, a stronger lagged self-esteem to depression correlation than a lagged depression to self-esteem correlation would imply a cross-lagged effect from self-esteem to depression, but the reverse pattern is inconclusive because self-esteem is more stable.

Like Orth et al. (2020) I found that Heise’s model did not converge. However, unlike Orth et al. I did not conclude from this finding that the CLPM model is preferable. After all, it does not fit the data. Model convergence is sometimes simply a problem of default starting values that work for most models but not for all models. In this case, the high stability of self-esteem produced a problem with default starting values. Just setting this starting value to 1 solved the convergence problem and produced a well-fitting result.

The model results show no negative lagged prediction of depression from self-esteem. In fact, a small positive relationship emerged, but it was not statistically significant.

It is instructive to compare these results with the CLPM results. The CLPM model is nested in the Heise model. The only difference is that the occasion-specific variances of depression and self-esteem are fixed to zero. As these parameters were constrained across occasions, this model has two fewer parameters and the model df increases from 24 to 26. Model fit decreased in the more parsimonious model. However, the overall fit is not terrible, although RMSEA should be below .06 [Interestingly, the CFI value changed from a value over .95 to a value of .94 when I estimated the model with MPLUS8.2, whereas Orth et al. used MPLUS8]. This shows the problem of relying on overall fit to endorse models. Overall fit is often good with longitudinal data because all models predict weaker correlations over longer time intervals. The direct model comparison shows that the Heise model is the better model.

In the CLPM model, self-esteem is a negative lagged predictor of depression. This is the key finding that Orth and colleagues have been using to support the vulnerability model of depression (low self-esteem leads to depression).

Why does the CLPM model produce negative lagged effects of self-esteem on depression? The reason is that the model underestimates the long-term stability of depression from time 1 to time 3 and time 4. To compensate for this it can use self-esteem that is more stable and then link self-esteem at time 2 with depression at time 3 (.745 * -.191) and self-esteem at time 3 with depression at time 4 (.742 * .739 * -.190). But even this is not sufficient to compensate for the misprediction of depression over time. Hence, the worse fit of the model. This can be seen by examining the model reproduced correlation matrix in the MPLUS Tech1 output.

Even with the additional cross-lagged path, the model predicts only a correlation of r = .157 from de1 to de4, while the observed correlation was r = .403. This discrepancy merely confirms what the univariate models showed. A model without occasion-specific variances underestimates long-term stability.

Interim Conclusion

Closer inspection of Orth et al.’s data shows that the CLPM does not fit the data. This is not surprising because it is well-known that the cross-lagged panel model often underestimates long-term stability. Even Orth has published univariate analyses of self-esteem that show a simple autoregressive model does not fit the data (Kuster & Orth, 2013). Here I showed that using the wrong model of stability creates statistical artifacts in the estimation of cross-lagged path coefficients. The only empirical support for the vulnerability model of depression is a statistical artifact.

Replication Study

I picked the My Work and I (MWI) dataset for a replication study. I picked it because it used the same measures and had a relatively large sample size (N = 663). However, the study is not an exact or direct replication of the previous one. One important difference is that measurements were repeated every two months rather than every year. The length of the time interval can influence the pattern of correlations.

There are two notable differences in the correlation table. First, the correlations increase with each measurement from .782 for se1 with se2 to .871 for se4 with se5. This suggests a response artifact, such as a stereotypic response style that inflates consistency over time. This is more likely to happen for shorter intervals. Second, the differences between correlations with different lags are much smaller. They were .07 in the previous study. Here the differences are .02 to .03. This means there is hardly any autoregressive structure, suggesting that a trait model may fit the data better.

The pattern for depression is also different from the previous study. First, the correlations are stronger, which makes sense, because the retest interval is shorter. Somebody who suffers from depressive symptoms is more likely to still suffer two months later than a year later.

There is a clearer autoregressive structure for depression and no sign of stereotypic responding. The reason could be that depression was assessed with a symptom checklist that asks about the previous four weeks. As this question covers a new time period each time, participants may avoid stereotypic responding.

The depression-self-esteem correlations also become stronger (more negative) over time from r = -.538 to r = -.675. This means that a model with constrained coefficients may not fit the data.

The higher stability of depression explains why there is no longer a consistent pattern of stronger lagged depression to self-esteem correlations (de1 with se2) above the diagonal than self-esteem to depression correlations (se1 with de2) below the diagonal. Five correlations are stronger one way and five correlations are stronger the other way.

For self-esteem, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .170, CFI = .920). Allowing for occasion-specific variance improved fit and fit was excellent (RMSEA = .002, CFI = .999). For depression, the autoregressive model without occasion-specific variance had poor fit (RMSEA = .113, CFI = .918). The model with occasion-specific variance fit better and had excellent fit (RMSEA = .029, CFI = .995). These results replicate the previous results and show that CLPM does not fit because it underestimates stability of self-esteem and depression.

The CLPM model also had bad fit in the original article (RMSEA = .105, CFI = .932). In comparison, the model with occasion specific variances had much better fit (RMSEA = .038, CFI = .991). Interestingly, this model did show a small, but statistically significant path from self-esteem to depression (effect size r = -.08). This raises the possibility that the vulnerability effect may exist for shorter time intervals of a few months, but not for longer time intervals of a year or more. However, Orth et al. do not consider this possibility. Rather, they try to justify the use of the CLPM to analyze panel data even though the model does not fit.


Orth et al. note “fit values were lowest for the CLPM” (p. 21) with a footnote that recognizes the problem of the CLPM, “As discussed in the Introduction, the CLPM underestimates the long-term stability of constructs, and this issue leads to misfit as the number of waves increases” (p. 63).

Orth et al. also note correctly that the cross-lagged effect of self-esteem on depression emerges more consistently with the CLPM model. By now it is clear why this is the case. It emerges consistently because it is a statistical artifact produced by the underestimation of stability in depression in the CLPM model. However, Orth et al.’s belief in the vulnerability effect is so strong that they are unable to come to a rational conclusion. Instead they propose that the CLPM model, despite its bad fit, shows something meaningful.

We argue that precisely because the prospective effects tested in the CLPM are also based on between-person variance, it may answer questions that cannot be assessed with models that focus on within-person effects. For example, consider the possible effects of warm parenting on children’s self-esteem (Krauss, Orth, & Robins, 2019): A cross-lagged effect in the CLPM would indicate that children raised by warm parents would be more likely to develop high self-esteem than children raised by less warm parents. A cross-lagged effect in the RI-CLPM would indicate that children who experience more parental warmth than usual at a particular time point will show a subsequent increase in self-esteem at the next time point, whereas children who experience less parental warmth than usual at a particular time point will show a subsequent drop in self-esteem at the next time point

Orth et al. then point out correctly that the CLPM is nested in other models and makes more restrictive assumptions about the absence of occasion specific variance or trait variance, but they convince themselves that this is not a problem.

As was evident also in the present analyses, the fit of the CLPM is typically not as good as the fit of the RI-CLPM (Hamaker et al., 2015; Masselink, Van Roekel, Hankin, et al., 2018). It is important to note that the CLPM is nested in the RI-CLPM (for further information about how the models examined in this research are nested, see Usami, Murayama, et al., 2019). That is, the CLPM is a special case of the RI-CLPM, where the variances of the two random intercept factors and the covariance between the random intercept factors are constrained to zero (thus, the CLPM has three additional degrees of freedom). Consequently, with increasing sample size, the RI-CLPM necessarily fits significantly better than the CLPM (MacCallum, Browne, & Cai, 2006). However, does this mean that the RI-CLPM should be preferred in model selection? Given that the two models differ in their conceptual meaning (see the discussion on between- and within-person effects above), we believe that the decision between the CLPM and RI-CLPM should not be based on model fit, but rather on theoretical considerations.

As shown here, the bad fit of CLPM is not an unfair punishment of a parsimonious model. The bad fit reveals that the model fails to model stability correctly. To disregard bad fit and to favor the more parsimonious model even if it doesn’t fit makes no sense. By the same logic, a model without cross-lagged paths would be more parsimonious than a model with cross-lagged paths and we could reject the vulnerability model simply because it is not parsimonious. For example, when I fitted the model with occasion specific variances and without cross-lagged paths, model fit was better than model fit of the CLPM (RMSEA = .041 vs. RMSEA = .109) and only slightly worse than model fit of the model with occasion specific variance and cross-lagged paths (RMSEA = .040).

It is incomprehensible to methodologists that anybody would try to argue in favor of a model that does not fit the data. If a model consistently produces bad fit, it is not a proper model of the data and has to be rejected. To prefer a model because it produces a consistent artifact that fits theoretical preferences is not science.

Replication II

Although the first replication mostly confirmed the results of the first study, one notable difference was the presence of statistically significant cross-lagged effects in the second study. There are a variety of explanations for this inconsistency. The lack of an effect in the first study could be a type-II error. The presence of an effect in the first replication study could be a type-I error. Finally, the difference in time intervals could be a moderator.

I picked the Your Personality (YP) dataset because it was the only dataset that used the same measures as the previous two studies. The time interval was 6 months, which is in the middle of the other two intervals. This made it interesting to see whether results would be more consistent with the 2-month or the 1-year intervals.

For self-esteem, the autoregressive model with occasion specific variance had a good fit to the data (RMSEA = .016, CFI = .999). Constraining the occasion specific variance to zero reduced model fit considerably (RMSEA = .160, CFI = .912). Results for depression were unexpected. The model with occasion specific variance showed non-significant and slightly negative residuals for the state variances. This finding implies that there are no detectable changes in depression over time and that depression scores only have a stable trait and occasion specific variance. Thus, I fixed the autoregressive parameters to 1 and the residual state variances to zero. This model is equivalent to a model that specifies a trait factor. Even this model had barely acceptable fit (RMSEA = .062, CFI = .962). Fit could be increased by relaxing the constraints on the occasion specific variance (RMSEA = .060, CFI = .978). However, a simple trait model fit the data even better (RMSEA = .000, CFI = 1.000). The lack of an autoregressive structure makes it implausible that there are cross-lagged effects on depression. If there is no new state variance, self-esteem cannot be a predictor of new state variance.

The presence of a trait factor for depression suggests that there could also be a trait factor for self-esteem and that some of the correlations between self-esteem and depression are due to correlated traits. Therefore I added a trait factor to the measurement model of self-esteem. This model had good fit (RMSEA = .043, CFI = .993) and fit was superior to the CLPM (RMSEA = .123, CFI = .883). The model showed no significant cross-lagged effect from self-esteem to depression and the parameter estimate was positive rather than negative, .07. This finding is not surprising given the lack of decreasing correlations over time for depression.

Replication III

The last openly shared datasets are from the California Families Project (CFP). I first examined the children’s data (CFP-C) because Orth et al. (2020) reported a significant vulnerability effect with the RI-CLPM.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .108, CFI = .908). Even the model with occasion-specific variance had poor fit (RMSEA = .091, CFI = .945). In contrast, a model with a trait factor and without occasion specific variance had good fit (RMSEA = .023, CFI = .997). This finding suggests that it is necessary to include a stable trait factor to model stability of self-esteem correctly in this dataset.

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .104, CFI = .878). Even the model with occasion-specific variance had poor fit (RMSEA = .103, CFI = .897). Adding a trait factor produced a model with acceptable fit (RMSEA = .051, CFI = .983).

The trait-state model fit the data well (RMSEA = .032, CFI = .989) and much better than the CLPM (RMSEA = .079, CFI = .914). The cross-lagged effect of self-esteem on depression was not significant, and only half the size of the effect in the RI-CLPM (-.05 vs. -.09). The difference is due to the constraint on the trait factor. Relaxing these constraints improves model fit and the vulnerability effect becomes non-significant.

Replication IV

The last dataset is based on the mothers’ self-reports in the California Families Project (CFP-M).

For self-esteem, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .139, CFI = .885). The model with occasion specific variance improved fit (RMSEA = .049, CFI = .988). However, the trait-state model had even better fit (RMSEA = .046, CFI = .993).

For depression, the autoregressive model without occasion-specific variance had bad fit (RMSEA = .127, CFI = .880). The model with occasion-specific variance had excellent fit (RMSEA = .000, CFI = 1.000). The trait-state model also had excellent fit (RMSEA = .000, CFI = 1.000).

The CLPM had bad fit to the data (RMSEA = .092, CFI = .913). The Heise model improved fit (RMSEA = .038, CFI = .987). The trait-state model had even better fit (RMSEA = .031, CFI = .992). The cross-lagged effect of self-esteem on depression was negative, but small and not significant, -.05 (95%CI = -.13 to .02).

Simulation Study 1

The first simulation demonstrates that a cross-lagged effect emerges when the CLPM is fitted to data with a trait factor and one of the constructs has more trait variance which produces more stability over time.

I simulated 64% trait variance and 36% occasion-specific variance for self-esteem.

I simulated 36% trait variance and 64% occasion-specific variance for depression.

The correlation between the two trait factors was r = -.7. This produced manifest correlations of r = -.7 * sqrt(.36) * sqrt(.64) = -.7 * .6 * .8 = -.34.

For self-esteem, the autoregressive model without occasion-specific variance had bad fit. For depression, the autoregressive model without occasion-specific variance also had bad fit. The CLPM model also had bad fit (RMSEA = .141, CFI = .820). Although the simulation did not include cross-lagged paths, the CLPM showed a significant cross-lagged effect from self-esteem to depression (-.25) and a weaker cross-lagged effect from depression to self-esteem (-.14).
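The artifact can be reproduced approximately without simulating any data. The sketch below works with the population correlations implied by the trait model described above and computes the standardized weights of a two-wave cross-lagged regression by hand. It is a simplified two-wave calculation, not the four-wave model fitted in Mplus, so the values only approximate the estimates reported above. The helper name is made up for the example.

```python
import math


def two_predictor_betas(r_y1: float, r_y2: float, r_12: float) -> tuple[float, float]:
    """Standardized regression weights of an outcome on two predictors,
    given the predictor-outcome correlations and the predictor correlation."""
    det = 1.0 - r_12 ** 2
    return (r_y1 - r_12 * r_y2) / det, (r_y2 - r_12 * r_y1) / det


se_trait, de_trait, trait_r = 0.64, 0.36, -0.70

# Correlations implied by a pure trait model with no cross-lagged paths.
r_se_se = se_trait  # retest correlation of self-esteem at any lag
r_de_de = de_trait  # retest correlation of depression at any lag
r_se_de = trait_r * math.sqrt(se_trait) * math.sqrt(de_trait)  # any se with any de

# Regress de2 on (de1, se1) and se2 on (se1, de1).
de_stability, se_to_de = two_predictor_betas(r_de_de, r_se_de, r_se_de)
se_stability, de_to_se = two_predictor_betas(r_se_se, r_se_de, r_se_de)

print(f"se1 -> de2: {se_to_de:.2f}")  # -0.24
print(f"de1 -> se2: {de_to_se:.2f}")  # -0.14
```

Both cross-lagged weights are negative even though the generating model contains none, and the path from the more stable construct (self-esteem) is the larger one, matching the pattern in the simulation results.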

Needless to say, the trait-state model had perfect fit to the data and showed cross-lagged path coefficients of zero.

This simulation shows that CLPM produces artificial cross-lagged effects because it underestimates long-term stability. This problem is well-known, but Orth et al. (2020) deliberately ignore it when they interpret cross-lagged parameters in CLPM with bad fit.

Simulation Study 2

The second simulation shows that a model with a significant cross-lagged path can fit the data, if this path is actually present in the data. The cross-lagged effect was specified as a moderate effect with b = .3. Inspection of the correlation matrix shows the expected pattern that cross-lagged correlations from se to de (se1 with de2) are stronger than cross-lagged correlations from de to se (se2 with de1). The differences are strongest for lag-1.

The model with the cross-lagged paths had perfect fit (RMSEA = .000, CFI = 1.000). The model without cross-lagged paths had worse fit and RMSEA was above .06 (RMSEA = .073, CFI = .968).


The publication of Orth et al.’s (2020) article in JPSP is an embarrassment for the PPID section of JPSP. The authors did not make an innocent mistake. Their own analyses showed across 10 datasets that CLPM does not fit their data. One would expect that a team of researchers would be able to draw the correct conclusion from this finding. However, the power of motivated reasoning is strong. Rather than admitting that the vulnerability model of depression is based on a statistical artifact, the authors try to rationalize why the model with bad fit should not be rejected.

The authors write “the CLPM findings suggest that individual differences in self-esteem predict changes in individual differences in depression, consistent with the vulnerability model” (p. 39).

This conclusion is blatantly false. A finding in a model with bad fit should never be interpreted. After all, the purpose of fitting models to data and examining model fit is to falsify models that are inconsistent with the data. However, psychologists have been brainwashed into thinking that the purpose of data analysis is only to confirm theoretical predictions and to ignore evidence that is inconsistent with theoretical models. It is therefore not a surprise that psychology has a theory crisis. Theories are nothing more than hunches that guided first explorations and are never challenged. Every discovery in psychology is considered to be true. This does not stop psychologists from developing and supporting contradictory models, which results in an ever-growing number of theories and confusion. It is like evolution without a selection mechanism. No wonder psychology is making little progress.

Numerous critics of psychology have pointed out that nil-hypothesis testing can be blamed for the lack of development because null-results are ambiguous. However, this excuse cannot be used here. Structural equation modeling is different from null-hypothesis testing because significant results like a high Chi-square value and derived fit indices provide clear and unambiguous evidence that a model does not fit the data. To ignore this evidence and to interpret parameters in these models is unscientific. The fact that authors, reviewers, and editors were willing to publish these unscientific claims in the top journal of personality psychology shows how poorly methods and statistics are understood by applied researchers. To gain respect and credibility, personality psychologists need to respect the scientific method.