
Personality and Self-Esteem

In the 1980s, personality psychologists agreed on the Big Five as a broad framework to describe and measure personality, that is, variation in psychological attributes across individuals.

You can think about the Big Five as a five-dimensional map. Like a two-dimensional map (or a three-dimensional globe), the Big Five are independent dimensions that create a space with coordinates that can be used to describe the vast number of psychological attributes that distinguish one person from another. One line of research in personality psychology correlates measures of personality attributes with Big Five measures to pinpoint their coordinates.
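As a toy illustration of this mapping idea (simulated data and arbitrary weights, not results from any study), an attribute's "coordinates" are simply its correlations with five orthogonal dimension scores:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Five independent (orthogonal) Big Five dimension scores.
big5 = rng.standard_normal((n, 5))  # columns: N, E, O, A, C

# A hypothetical attribute located mostly on the N and E axes.
attribute = -0.4 * big5[:, 0] + 0.3 * big5[:, 1] + rng.standard_normal(n)

# Its "coordinates" in the Big Five space: correlations with each dimension.
coords = np.array([np.corrcoef(attribute, big5[:, j])[0, 1] for j in range(5)])
print(coords.round(2))
```

With these weights the attribute lands at roughly (-.36, .27, 0, 0, 0), i.e., mainly on the low end of the first axis and the high end of the second.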

One important and frequently studied personality attribute is self-esteem, and dozens of studies have correlated self-esteem measures with Big Five measures. Robins, Tracy, and Trzesniewski (2001) reviewed some of these studies.

The results are robust, and there is little worry about their replicability. The strongest predictor of self-esteem is neuroticism vs. emotional stability: self-esteem is located at the low end of neuroticism (the emotionally stable end). The second predictor is extraversion vs. introversion, with self-esteem located at the high end of extraversion. The third predictor is conscientiousness, which shows a slight positive location on the conscientious vs. careless dimension. Openness vs. closedness also shows a slight tendency towards openness. Finally, the results for agreeableness are more variable; at least one correlation is negative, but most tend to be positive.

Evaluative Bias

Psychologists have a naive view of the validity of their measures. Although they sometimes compute reliability and examine convergent validity in methodological articles that are published in obscure journals like “Psychological Assessment,” they treat measures as perfectly valid in substantive articles that are published in journals like “Journal of Personality” or “Journal of Research in Personality.” Unfortunately, measurement problems can distort effect sizes and occasionally they can change the sign of a correlation.
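A small simulation illustrates the sign-flip problem: two traits with a truly negative correlation can appear positively correlated when both self-report scales share an evaluative (halo) bias. All weights below are arbitrary illustrations, not estimates from any dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# True (latent) trait scores with a small NEGATIVE correlation of -.10.
t1 = rng.standard_normal(n)
t2 = -0.10 * t1 + np.sqrt(1 - 0.10**2) * rng.standard_normal(n)

# A shared evaluative (halo) bias contaminates both self-report scales.
halo = rng.standard_normal(n)
s1 = t1 + 0.6 * halo + 0.3 * rng.standard_normal(n)
s2 = t2 + 0.6 * halo + 0.3 * rng.standard_normal(n)

r_true = np.corrcoef(t1, t2)[0, 1]       # negative by construction
r_observed = np.corrcoef(s1, s2)[0, 1]   # positive: shared bias flips the sign
print(round(r_true, 2), round(r_observed, 2))
```

Here the shared bias contributes more covariance (.36) than the true traits remove (-.10), so the observed correlation comes out positive even though the latent correlation is negative.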

Anusic et al. (2009) developed a measurement model for the Big Five that separates valid variance in the Big Five dimensions from rating biases. Rating biases can be content-free (acquiescence) or tied to the desirability of items (halo, evaluative bias). They showed that evaluative bias can obscure the location of self-esteem in the Big Five space. Here, I revisit this question with better data that measure the Big Five with a measurement model fitted to the 44 items of the Big Five Inventory (Schimmack, 2019a).

I used the same data, the Canadian subsample of Gosling and colleagues' large internet study, which collects data from visitors who receive feedback about their personality. I simply added the single-item self-esteem measure to the dataset. I then fitted three different models. The first model regressed the self-esteem item only on the Big Five dimensions; this model essentially replicates analyses with scale scores. I then added the method factors to the set of predictors.

Table 1. Regression of the Self-Esteem Item on the Big Five Factors

Model   N      E      O      A      C
M1     -.43    .30    .08   -.03    .16
M2     -.33    .19    .00     -      -

Results for the first model reproduce previous findings (see Table 1). However, results changed when the method factors were added. Most importantly, self-esteem now falls on the negative side of agreeableness, towards being more assertive. This makes sense given the selfless and other-focused nature of agreeableness. Agreeable people are less likely to think about themselves and may subordinate their own needs to the needs of others. In contrast, people with high self-esteem are more likely to focus on themselves. Even though this is not a strong relationship, it is noteworthy that it is negative rather than positive.

The other noteworthy finding is that evaluative bias is the strongest predictor of self-esteem. There are two interpretations of this finding, and it is not clear which one is correct.

One interpretation is that self-esteem is rooted in a trait to see everything related to the self in an overly positive way. This interpretation implies that responses to personality items are driven by the desirability of items and individuals with high self-esteem see themselves as possessing all kinds of desirable attributes that they do not have (or have to a lesser degree). They think that they are kinder, smarter, funnier, and prettier than others, when they are actually not. In this way, the evaluative bias in personality ratings is an indirect measure of self-esteem.

The other interpretation is that evaluative bias is a rating bias that influences all self-ratings, including self-ratings of self-esteem. On this account, the loading of the self-esteem item on the evaluative bias factor simply shows that self-esteem ratings are influenced by evaluative bias because self-esteem is a desirable attribute.

Disentangling these two interpretations requires a multi-method approach. If evaluative bias is merely a rating bias, it should not correlate with actual life outcomes. However, if evaluative bias reflects actual self-evaluations, it should be correlated with the outcomes associated with high self-esteem.


Hopefully, this blog-post will create some awareness that personality psychology needs to move beyond the use of self-ratings in mapping the location of personality attributes in the Big Five space.

The blog post also has important implications for theories of personality development that assign value to personality dimensions (Dweck, 2008). On this view, the goal of personality development is to become more agreeable and conscientious and less neurotic, among other things. However, I question whether personality traits have intrinsic value. That is, agreeableness is not intrinsically good, and low conscientiousness is not intrinsically bad. The presence of evaluative bias in personality items shows only that personality psychologists assign value to some traits and do not include items like “I am a clean-freak” in their questionnaires.

Without a clear evaluation, there is no direction to personality change. Becoming more conscientious is no longer a sign of personal growth and maturation, but rather a change that may have positive or negative consequences for individuals. Although these issues can be debated, it is problematic that current models of personality development do not even question the evaluation of personality traits and treat the positive nature of some traits as a fundamental assumption. I suggest it is worthwhile to think about personality the way we think about sexual orientation or attractiveness: although society has created strong evaluations that are hard to change, the goal should be to change these evaluations, not to change individuals to conform to these norms.

How Valid are Short Big-Five Scales?

The first measures of the Big Five used a large number of items to measure personality. This made it difficult to include personality measures in studies, as the assessment of personality would take up all of the survey time. Over time, shorter scales became available. One important short Big Five measure is the BFI-S (Lang et al., 2011). This 15-item measure has been used in several nationally representative, longitudinal studies such as the German Socio-Economic Panel (Schimmack, 2019a). These results provide unique insights into the stability of personality (Schimmack, 2019b) and the relationship of personality with other constructs such as life satisfaction (Schimmack, 2019c). Some of these results overturn textbook claims about personality. However, critics argue that these results cannot be trusted because the BFI-S is an invalid measure of personality.

Thus, it is of critical importance to evaluate the validity of the BFI-S. Here I use Gosling and colleagues' data to do so. Previously, I fitted a measurement model to the full 44-item BFI (Schimmack, 2019d). It is straightforward to evaluate the validity of the BFI-S by examining the correlations of the 3-item BFI-S scale scores with the latent factors based on all 44 BFI items. For comparison purposes, I also show the correlations for the BFI scale scores. The complete results for individual items are shown in the previous blog post (Schimmack, 2019d).

The measurement model for the BFI has seven independent factors. Five factors represent the Big Five, and two factors represent method factors: one represents acquiescence bias, and the other represents the evaluative bias that is present in all self-ratings of personality (Anusic et al., 2009). As all factors are independent, the squared coefficients can be interpreted as the amount of variance that a factor explains in a scale score.
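This variance decomposition is easy to verify by simulation: when the factors are orthogonal, the squared scale-factor correlations recover each factor's variance contribution. The weights below are arbitrary illustrations, not estimates from the model:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Three orthogonal factors: trait, evaluative bias (halo), acquiescence.
trait, halo, acq = rng.standard_normal((3, n))

# A scale score as a weighted sum of the factors plus residual error.
# Weights chosen so the total variance is exactly 1.
scale = 0.8 * trait + 0.4 * halo + 0.2 * acq + 0.4 * rng.standard_normal(n)

# Squared correlations recover the variance shares (.64, .16, .04).
for name, f in [("trait", trait), ("halo", halo), ("acquiescence", acq)]:
    r = np.corrcoef(scale, f)[0, 1]
    print(f"{name}: r = {r:.2f}, variance explained = {r**2:.2f}")
```

The remaining variance (here .16) is the residual, so the squared coefficients and the residual sum to the total variance of the scale score.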

The results show that the BFI-S scales are nearly as valid as the longer BFI scales (Table 1).


For example, the factor-scale correlations for neuroticism, extraversion, and agreeableness are nearly identical. The biggest difference was observed for openness, with a correlation of r = .76 for the BFI scale and r = .66 for the BFI-S scale. The only other notable systematic difference is the evaluative bias influence, which tends to be stronger for the longer scales, with the exception of conscientiousness. In the future, measurement models with an evaluative bias factor could be used to select items with low loadings on that factor to reduce the influence of this bias on scale scores. Given these results, one would expect the BFI and BFI-S to produce similar results. The next analyses tested this prediction.

Gender Differences

I examined gender differences three ways. First, I examined standardized mean differences at the level of latent factors in a model with scalar invariance (Schimmack, 2019d). Second, I computed standardized mean differences with the BFI scales. Finally, I computed standardized mean differences with the BFI-S scales. Table 2 shows the results. Results for the BFI and BFI-S scales are very similar. The latent mean differences are somewhat larger for neuroticism and agreeableness because they are not attenuated by random measurement error. The latent means also show very small gender differences for the method factors. Thus, mean differences based on scale scores are not biased by method variance.

Table 2. Standardized Mean Differences between Men and Women


Note. Positive values indicate higher means for women than for men.

In short, there is no evidence that using 3-item scales invalidates the study of gender differences.
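For reference, the standardized mean differences reported here are Cohen's d values: the mean difference divided by the pooled standard deviation. A minimal sketch with simulated factor scores (the effect size of .6 is illustrative, not taken from the tables):

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference (x minus y) using the pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) +
                  (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

rng = np.random.default_rng(3)
women = rng.normal(0.6, 1.0, 10_000)  # simulated factor scores
men = rng.normal(0.0, 1.0, 10_000)
print(round(cohens_d(women, men), 2))  # ≈ .6
```

With the sign convention of Table 2 (women minus men), a positive d means a higher mean for women.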

Age Differences

I demonstrated measurement invariance for different age groups (Schimmack, 2019d). Thus, I used simple correlations to examine the relationship between age and the Big Five. I restricted the age range from 17 to 70. Analyses of the full dataset suggest that older respondents have higher levels of conscientiousness and agreeableness (Soto, John, Gosling, & Potter, 2011).

Table 3 shows the results. The BFI and the BFI-S both show the predicted positive relationship with conscientiousness, and the effect size is practically identical. The effect size for the latent variable model is stronger because the relationship is not attenuated by random measurement error. Other relationships are weaker and also consistent across measures, except for openness. The latent variable model reveals the reason for the discrepancy: three items (#15 ingenious, #35 like routine work, and #10 sophisticated in art) showed unique relationships with age. The latent factor does not include the unique content of these items and shows a positive relationship between openness and age; the scale scores include this content and show a weaker relationship. The positive relationship of openness with age for the latent factor is rather surprising, as it is not found in nationally representative samples (Schimmack, 2019b). One possible explanation is that older individuals who take an online personality test are more open.
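The attenuation point can be made concrete with Spearman's classic correction formula: dividing an observed correlation by the square root of the product of the two measures' reliabilities (or validities) recovers the latent correlation. The numbers below are illustrative, not taken from Table 3:

```python
import math

def disattenuate(r_observed, rel_x, rel_y=1.0):
    """Spearman's correction for attenuation:
    r_true = r_observed / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Illustration: an observed scale-age correlation of .20, with a scale
# that captures 64% valid variance (a factor-scale correlation of .80);
# age is treated as measured without error (rel_y = 1).
print(round(disattenuate(0.20, 0.64), 2))  # 0.25
```

This is why the latent variable estimates are systematically larger than the scale-score correlations even when both reflect the same underlying relationship.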


In sum, the most important finding is that the 3-item BFI-S conscientiousness scale shows the same relationship with age as the BFI-scale and the latent factor. Thus, the failure to find aging effects in the longitudinal SOEP data with the BFI-S cannot be attributed to the use of an invalid short measure of conscientiousness. The real scientific question is why the cross-sectional study by Soto et al. (2011) and my analysis of the longitudinal SOEP data show divergent results.


Science has changed now that researchers can communicate and discuss research findings on social media. I strongly believe that open science outside of peer-controlled journals is beneficial for the advancement of science. However, the downside of open science on social media is that it becomes more difficult to evaluate the expertise of online commentators. True experts are able to back up their claims with scientific evidence. This is what I did here. I showed that Brenton Wiernik’s comment has as much scientific validity as a Donald Trump tweet. Whatever the reason for the lack of personality change in the SOEP data turns out to be, it is not the use of the BFI-S to measure the Big Five.

Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in-press article before it actually appears), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity hacking. They also conduct some psychometric tests of popular personality measures and claim that these measures fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn’t meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity: bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I have already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback.

For the present analyses I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases: an evaluative bias and an acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow some items to show more or less evaluative bias. Second, secondary loadings were added to allow some items to be influenced by more than one factor. Finally, items of the same construct were allowed to covary to account for similar wording or shared meaning (e.g., three art items from the openness factor were allowed to covary). The final model and the complete results can be found on OSF.

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit for all three fit indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that the model specification by Hussey and Hughes accounted for the bad fit. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes’s claim that the BFI lacks validity is based on an outdated and implausible measurement model.
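For readers unfamiliar with these fit indices, CFI and RMSEA are simple functions of the model and null-model chi-square values. A minimal sketch with hypothetical inputs (the numbers below are illustrations, not the fitted models reported here):

```python
import math

def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative Fit Index: 1 minus the ratio of model misfit
    to null-model misfit (chi-square in excess of df)."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, 0.0)
    return 1.0 - d_model / max(d_null, d_model, 1e-12)

def rmsea(chi2_model, df_model, n):
    """Root Mean Square Error of Approximation for sample size n."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))

# Hypothetical chi-square values for a 44-item model.
print(round(cfi(3000, 850, 60000, 946), 3),
      round(rmsea(3000, 850, 10000), 4))
```

Both indices penalize misfit relative to degrees of freedom, which is why a simple-structure model with many wrongly fixed loadings fails badly on CFI while a plausible model with method factors can fit well.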

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

emotionally stable (24): -.61, .27, .18
full of energy (11): .34, -.11, .58, .20
generate enthusiasm (16): .07, .44, .11, .50, .20
shy and inhibited (31): .18, .64, -.22, .17
ingenious (15): .57, .09, .21
active imagination (20): .13, .53, -
value art (30): .12, .46, .09, .16, .18
like routine work (35): -
like reflecting (40): -.08, .58, .27, .21
few artistic interests (41): -.26, -.09, .15
sophisticated in art (44): .07, .44, -
find faults w. others (2): .15, -.42, -.24, .19
helpful / unselfish (7): .44, .10, .29, .23
start quarrels (12): .13, .20, -.50, -.09, -.24, .19
trusting (22): .15, .33, .26, .20
cold and aloof (27): -.19, .14, -.46, -.35, .17
considerate and kind (32): .04, .62, .29, .23
like to cooperate (42): .15, -.10, .44, .28, .22
thorough job (3): .59, .28, .22
careless (8): -.17, -.51, -.23, .18
reliable worker (13): -
persevere until finished (28): .56, .26, .20
follow plans (38): .10, -.06, .46, .26, .20
easily distracted (43): .19, .09, -.52, -.22, .17

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have their highest loading on the primary factor. Exceptions are the items depressed/blue, full of energy, and generate enthusiasm, which have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients shows the amount of variance explained by each factor. By far the largest variance component is the intended construct, with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion's share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias, with correlations ranging from .36 for openness to .44 for extraversion. Although this means only 13 to 19 percent of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction; Schimmack, 2019c).

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age groups and the two largest gender groups. They claimed that the BFI lacks measurement invariance, but this claim was based on a cunning misrepresentation of the results (Schimmack, 2019a). The claim rests on the fact that the simple-structure model does not fit in any group. However, fit did not decrease when measurement invariance was imposed across groups. Thus, all groups showed the same structure, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

depressed/blue (4): .33/.30, -.18/-.11, .19/.20, -.45/-.50, .07/.05
relaxed (9): -.71/-.72, .24/.23, .19/.18
tense (14): .52/.49, -.17/-.14, .11/.13, -.27/-.32, .20/.20
worry (19): .58/.57, -.10/-.08, .05/.07, -.22/-.22, .17/.17
emotionally stable (24): -.58/-.58, .10/.06, .25/.30, .19/.17
moody (29): .41/.38, -.26/-.25, -.30/-.38, .18/.18
calm (34): -.55/-.59, -.02/-.03, .14/.13, .12/.13, -.27/-.24, .21/.19
nervous (39): .51/.49, -.21/.26, -.10/-.10, .08/.08, -.11/-.11, -.27/-.25, .18/.17
SUM: .78/.77, -.09/-.08, -.01/-.01, -.07/-.05, -.02/-.02, -.42/-.46, .05/.04
talkative (1): .09/.11, .69/.70, -.10/-.08, .24/.24, .19/.18
reserved (6): -.55/-.60, .08/.10, .21/.22, .19/.18
full of energy (11): .33/.32, -.09/-.04, .56/.59, .21/.20
generate enthusiasm (16): .04/.03, .44/.43, .12/.13, .48/.50, .20/.20
quiet (21): -.79/-.82, .03/.04, -.22/-.21, .17/.16
assertive (26): -.08/-.10, .39/.40, .12/.14, -.23/-.25, .18/.17, .26/.24, .20/.18
shy and inhibited (31): .19/.15, .61/.66, .23/.22, .18/.17
outgoing (36): .71/.71, .10/.07, .35/.38, .18/.18
SUM: -.02/-.02, .82/.82, .04/.05, -.04/-.06, .00/.00, .45/.44, .07/.06
original (5): .50/.54, -.12/-.12, .40/.39, .22/.20
curious (10): .40/.42, -.05/-.08, .32/.30, .25/.23
ingenious (15): .00/.00, .60/.56, .18/.16, .10/.04, .22/.20
active imagination (20): .50/.55, -.07/-.06, -.17/-.18, .29/.26, .23/.21
inventive (25): -.07/-.08, .51/.55, -.12/-.10, .37/.34, .21/.19
value art (30): .10/.03, .43/.52, .08/.07, .17/.14, .18/.19
like routine work (35): -.27/-.27, .10/.10, .09/.15, -.22/-.21, .17/.16
like reflecting (40): -.09/-.08, .58/.58, .28/.26, .22/.20
few artistic interests (41): -.25/-.29, -.10/-.09, .16/.15
sophisticated in art (44): .03/.00, .42/.49, -.08/-.08, .09/.09, .16/.16
SUM: .01/-.01, -.01/-.01, .74/.78, -.05/-.05, -.03/-.06, .38/.34, .20/.19
find faults w. others (2): .14/.17, -.42/-.42, -.24/-.24, .19/.19
helpful / unselfish (7): .45/.43, .09/.11, .29/.29, .23/.23
start quarrels (12): .12/.16, .23/.18, -.49/-.49, -.07/-.08, -.24/-.24, .19/.19
forgiving (17): .49/.46, -.14/-.13, .25/.24, .20/.19
trusting (22): -.14/-.16, .38/.32, .27/.25, .21/.19
cold and aloof (27): -.20/-.18, .14/.12, .44/.46, -.34/-.37, .18/.17
considerate and kind (32): .02/.01, .62/.61, .28/.30, .22/.23
rude (37): .10/.12, .12/.12, -.62/-.62, -.13/-.08, -.23/-.23, .19/.18
like to cooperate (42): .18/.11, -.09/-.10, .43/.45, .28/.29, .23/.22
SUM: -.07/-.08, .00/.00, -.07/-.07, .78/.77, .03/.03, .43/.44, .04/.04
thorough job (3): .58/.59, .29/.28, .23/.22
careless (8): -.16, -.49/-.51, .24/.23, .19/.18
reliable worker (13): -.10/-.09, .09/.10, .55/.55, .30/.31, .24/.24
disorganized (18): .13/.16, -.58/-.59, -.21/-.20, .17/.15
lazy (23): -.52/-.51, -.45/-.45, .18/.17
persevere until finished (28): .54/.58, .27/.25, .21/.19
efficient (33): -.11/-.07, .52/.58, .30/.29, .24/.23
follow plans (38): .00/.00, -.06/-.07, .45/.44, .27/.26, .21/.20
easily distracted (43): .17/.19, .07/.06, -.53/-.53, -.22/-.22, .18/.17
SUM: -.05/-.05, -.01/-.01, -.05/-.06, .04/.04, .81/.82, .43/.41, .03/.03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in fit might be considered evidence that the invariance assumption is violated. However, Table 2 shows that it is insufficient to examine changes in global fit indices; what matters is whether the decrease in fit has any substantive meaning. Given the results in Table 2, this is not the case.
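Global comparisons like this one often lean on the change in CFI, with a common heuristic (Cheung & Rensvold, 2002) treating a drop of more than about .01 as meaningful. A minimal sketch, where the threshold is a convention rather than a statistical test:

```python
def invariance_decision(cfi_free, cfi_constrained, threshold=0.01):
    """Return the CFI decrease after constraining parameters across groups
    and whether it exceeds the conventional delta-CFI threshold."""
    delta = cfi_free - cfi_constrained
    return delta, delta > threshold

# Comparison of the configural male-sample fit (.949) with the
# metric-invariance model (.925) reported above.
print(invariance_decision(0.949, 0.925))
```

Note that the heuristic would flag the metric model here even though the parameter estimates in Table 2 are virtually identical across groups, which is exactly why inspecting the actual parameters matters more than a global index.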

The next model imposed scalar invariance. Before presenting the results, it is helpful to know what scalar invariance implies. Take extraversion as an example. Assume that there are no notable gender differences in extraversion. However, extraversion has multiple facets that are represented by items in the BFI. One facet is assertiveness, and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. It is obvious that this is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it would require further examination of the pattern of mean differences at the level of the factors and the item residuals.

However, imposing scalar invariance produced only a trivial change in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 “valuing art,” with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item level. The pattern of mean differences at the factor level is consistent with previous studies, showing higher levels of neuroticism (d = .64) and agreeableness (d = .31), although the difference in agreeableness is relatively small compared to some other studies.

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the findings with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were statistically small. There were no notable differences in response styles.

In sum, the results show that the BFI shows reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in their own supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.


Hussey and Hughes accused personality researchers of validity hacking; that is, of not reporting the results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and reasonable invariance across gender and age groups. Thus, Hussey and Hughes’s claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors’ conclusions about other measures, such as Rosenberg’s self-esteem scale. I am still waiting for the authors to share all of their data so that I can examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for rating biases. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures, to control for shared rating biases.