
Personality Measurement with the Big Five Inventory

In one of the worst psychometric articles ever published (although the authors still have a chance to retract their in-press article before it actually appears), Hussey and Hughes argue that personality psychologists intentionally fail to test the validity of personality measures. They call this practice validity-hacking. They also conducted some psychometric tests of popular personality measures and claim that these measures fail to demonstrate structural validity.

I have demonstrated that this claim is blatantly false and that the authors failed to conduct a proper test of structural validity (Schimmack, 2019a). That is, the authors fitted a model to the data that is known to be false. Not surprisingly, they found that their model didn't meet standard criteria of model fit. This is exactly what should happen when a false model is subjected to a test of structural validity: bad models should not fit the data. However, a real test of structural validity requires fitting a plausible model to the data. I have already demonstrated with several Big Five measures that these measures have good structural validity and that scale scores can be used as reasonable measures of the latent constructs (Schimmack, 2019b). Here I examine the structural validity of the Big Five Inventory (Oliver John) that was used by Hussey and Hughes.

While I am still waiting to receive the actual data that were used by Hussey and Hughes, I obtained a much larger and better dataset from Sam Gosling that includes data from 1 million visitors to a website that provides personality feedback.

For the present analyses, I focused on the subgroup of Canadian visitors with complete data (N = 340,000). Subsequent analyses can examine measurement invariance with the US sample and samples from other nations. To examine the structure of the BFI, I fitted a structural equation model. The model has seven factors. Five factors represent the Big Five personality traits. The other two factors represent rating biases: an evaluative bias and an acquiescence bias. Initially, loadings on the method factors were fixed. This basic model was then modified in three ways. First, item loadings on the evaluative bias factor were relaxed to allow some items to show more or less evaluative bias. Second, secondary loadings were added to allow some items to be influenced by more than one factor. Finally, items of the same construct were allowed to covary to account for similar wording or shared meaning (e.g., three art items of the openness factor were allowed to covary). The final model and the complete results can be found on OSF.
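For readers who want to fit a similar model, the seven-factor specification can be sketched in lavaan-style syntax (as understood, e.g., by the Python package semopy). The item labels below are placeholders, not the actual BFI items; the real 44-item model with secondary loadings and residual correlations is on OSF.

```python
# Sketch of the seven-factor specification in lavaan-style syntax.
# Item labels (n1..c3) are placeholders; loadings on the method factors
# are fixed to 1 here for simplicity (in the actual model, evaluative
# bias loadings depend on item keying and were later relaxed).
big_five = {
    "neu": ["n1", "n2", "n3"],
    "ext": ["e1", "e2", "e3"],
    "ope": ["o1", "o2", "o3"],
    "agr": ["a1", "a2", "a3"],
    "con": ["c1", "c2", "c3"],
}
all_items = [item for items in big_five.values() for item in items]

lines = [f"{factor} =~ {' + '.join(items)}" for factor, items in big_five.items()]
# Method factors: every item loads on evaluative bias (evb) and acquiescence (acq).
lines.append("evb =~ " + " + ".join(f"1*{i}" for i in all_items))
lines.append("acq =~ " + " + ".join(f"1*{i}" for i in all_items))
# Method factors are independent of the trait factors and of each other.
for factor in big_five:
    lines.append(f"evb ~~ 0*{factor}")
    lines.append(f"acq ~~ 0*{factor}")
lines.append("evb ~~ 0*acq")

model_desc = "\n".join(lines)
print(model_desc)
```

Fitting such a description to data would be done with an SEM package (e.g., `semopy.Model(model_desc).fit(df)`); the point here is only the structure: five correlated trait factors plus two orthogonal method factors.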

Model fit was acceptable, CFI = .953, RMSEA = .030, SRMR = .032. In contrast, fitting a simple structure without method factors produced unacceptable fit on all three indices, CFI = .734, RMSEA = .068, SRMR = .110. This shows that Hussey and Hughes's model specification accounts for the bad fit they observed. It has been known for over 20 years that a simple structure does not fit Big Five data (McCrae et al., 1996). Thus, Hussey and Hughes's claim that the BFI lacks validity rests on an outdated and implausible measurement model.
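The comparison can be checked against common cutoffs in a few lines. The cutoff values below follow widely used conventions (e.g., Hu & Bentler, 1999); they are not taken from the article itself.

```python
# Conventional fit cutoffs: CFI >= .95, RMSEA <= .06, SRMR <= .08
def acceptable_fit(cfi, rmsea, srmr, cfi_min=0.95, rmsea_max=0.06, srmr_max=0.08):
    """Return True only if all three indices meet the cutoffs."""
    return cfi >= cfi_min and rmsea <= rmsea_max and srmr <= srmr_max

# Model with method factors, secondary loadings, residual correlations:
print(acceptable_fit(0.953, 0.030, 0.032))  # True
# Simple structure without method factors:
print(acceptable_fit(0.734, 0.068, 0.110))  # False
```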

Table 1 shows the factor loading pattern for the 44 BFI items on the Big Five factors and the two method factors. It also shows the contribution of the seven factors to the scale scores that are used to provide visitors with personality feedback and in many research articles that use scale scores as proxies for the latent constructs.

Table 1. Item (BFI #): standardized loadings (Big Five factors, evaluative bias, acquiescence; empty cells are omitted, so columns cannot always be aligned across rows).

emotionally stable (24): -0.61, 0.27, 0.18
full of energy (11): 0.34, -0.11, 0.58, 0.20
generate enthusiasm (16): 0.07, 0.44, 0.11, 0.50, 0.20
shy and inhibited (31): 0.18, 0.64, -0.22, 0.17
ingenious (15): 0.57, 0.09, 0.21
active imagination (20): 0.13, 0.53, -
value art (30): 0.12, 0.46, 0.09, 0.16, 0.18
like routine work (35): -
like reflecting (40): -0.08, 0.58, 0.27, 0.21
few artistic interests (41): -0.26, -0.09, 0.15
sophisticated in art (44): 0.07, 0.44, -
find faults w. others (2): 0.15, -0.42, -0.24, 0.19
helpful / unselfish (7): 0.44, 0.10, 0.29, 0.23
start quarrels (12): 0.13, 0.20, -0.50, -0.09, -0.24, 0.19
trusting (22): 0.15, 0.33, 0.26, 0.20
cold and aloof (27): -0.19, 0.14, -0.46, -0.35, 0.17
considerate and kind (32): 0.04, 0.62, 0.29, 0.23
like to cooperate (42): 0.15, -0.10, 0.44, 0.28, 0.22
thorough job (3): 0.59, 0.28, 0.22
careless (8): -0.17, -0.51, -0.23, 0.18
reliable worker (13): -
persevere until finished (28): 0.56, 0.26, 0.20
follow plans (38): 0.10, -0.06, 0.46, 0.26, 0.20
easily distracted (43): 0.19, 0.09, -0.52, -0.22, 0.17

Most of the secondary loadings are very small, although they are statistically highly significant in this large sample. Most items also have their highest loading on the primary factor. Exceptions are the items depressed/blue, full of energy, and generate enthusiasm, which have higher loadings on the evaluative bias factor. Except for two openness items, all items also have loadings greater than .3 on the primary factor. Thus, the loadings are consistent with the intended factor structure.

The most important results are the loadings of the scale scores on the latent factors. As the factors are all independent, squaring these coefficients gives the amount of variance each factor explains. By far the largest variance component is the intended construct, with correlations ranging from .76 for openness to .83 for extraversion. Thus, the lion's share of the reliable variance in scale scores reflects the intended construct. The next biggest contributor is evaluative bias, with correlations ranging from .36 for openness to .44 for extraversion. Although this means only 13 to 19 percent of the total variance in scale scores reflects evaluative bias, this systematic variance can produce spurious correlations when scale scores are used to predict other self-report measures (e.g., life satisfaction; Schimmack, 2019c).
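Because the factors are independent, the variance decomposition is simple arithmetic. A quick check with the extraversion values from the text:

```python
# Squaring a scale score's standardized loading on an independent factor
# gives the proportion of scale-score variance that factor explains.
construct_loading = 0.83   # extraversion scale on the extraversion factor
evb_loading = 0.44         # extraversion scale on the evaluative bias factor

construct_var = construct_loading ** 2   # 0.6889 -> ~69% of the variance
evb_var = evb_loading ** 2               # 0.1936 -> ~19% of the variance
print(f"construct: {construct_var:.0%}, evaluative bias: {evb_var:.0%}")
```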

In sum, a careful psychometric evaluation of the BFI shows that the BFI has good structural validity. The key problem is the presence of evaluative bias in scale scores. Although this requires caution in the interpretation of results obtained with BFI scales, it doesn’t justify the conclusion that the BFI is invalid.

Measurement Invariance

Hussey and Hughes also examined measurement invariance across age groups and gender. They claimed that the BFI lacks measurement invariance, but this claim rests on a cunning misrepresentation of the results (Schimmack, 2019a). It is based on the fact that the simple-structure model does not fit in any group. However, fit did not decrease further when measurement invariance was imposed across groups. Thus, all groups showed the same structure, but this fact was hidden in the supplementary results.

I replicated their analyses with the current dataset. First, I fitted the model for the whole sample separately to the male and female samples. Fit for the male sample was acceptable, CFI = .949, RMSEA = .029, SRMR = .033. So was fit for the female sample, CFI = .947, RMSEA = .030, SRMR = .037.

Table 2 shows the results side by side. There are no notable differences between the parameter estimates for males and females (m/f). This finding replicates results with other Big Five measures (Schimmack, 2019a).

Table 2. Item (BFI #): male/female standardized loadings; SUM rows give the scale-score loadings in the column order N, E, O, A, C, evaluative bias, acquiescence (empty cells in item rows are omitted, so columns cannot always be aligned).

depressed/blue (4): .33/.30, -.18/-.11, .19/.20, -.45/-.50, .07/.05
relaxed (9): -.71/-.72, .24/.23, .19/.18
tense (14): .52/.49, -.17/-.14, .11/.13, -.27/-.32, .20/.20
worry (19): .58/.57, -.10/-.08, .05/.07, -.22/-.22, .17/.17
emotionally stable (24): -.58/-.58, .10/.06, .25/.30, .19/.17
moody (29): .41/.38, -.26/-.25, -.30/-.38, .18/.18
calm (34): -.55/-.59, -.02/-.03, .14/.13, .12/.13, -.27/-.24, .21/.19
nervous (39): .51/.49, -.21/-.26, -.10/-.10, .08/.08, -.11/-.11, -.27/-.25, .18/.17
SUM (N): .78/.77, -.09/-.08, -.01/-.01, -.07/-.05, -.02/-.02, -.42/-.46, .05/.04
talkative (1): .09/.11, .69/.70, -.10/-.08, .24/.24, .19/.18
reserved (6): -.55/-.60, .08/.10, .21/.22, .19/.18
full of energy (11): .33/.32, -.09/-.04, .56/.59, .21/.20
generate enthusiasm (16): .04/.03, .44/.43, .12/.13, .48/.50, .20/.20
quiet (21): -.79/-.82, .03/.04, -.22/-.21, .17/.16
assertive (26): -.08/-.10, .39/.40, .12/.14, -.23/-.25, .18/.17, .26/.24, .20/.18
shy and inhibited (31): .19/.15, .61/.66, .23/.22, .18/.17
outgoing (36): .71/.71, .10/.07, .35/.38, .18/.18
SUM (E): -.02/-.02, .82/.82, .04/.05, -.04/-.06, .00/.00, .45/.44, .07/.06
original (5): .50/.54, -.12/-.12, .40/.39, .22/.20
curious (10): .40/.42, -.05/-.08, .32/.30, .25/.23
ingenious (15): .00/.00, .60/.56, .18/.16, .10/.04, .22/.20
active imagination (20): .50/.55, -.07/-.06, -.17/-.18, .29/.26, .23/.21
inventive (25): -.07/-.08, .51/.55, -.12/-.10, .37/.34, .21/.19
value art (30): .10/.03, .43/.52, .08/.07, .17/.14, .18/.19
like routine work (35): -.27/-.27, .10/.10, .09/.15, -.22/-.21, .17/.16
like reflecting (40): -.09/-.08, .58/.58, .28/.26, .22/.20
few artistic interests (41): -.25/-.29, -.10/-.09, .16/.15
sophisticated in art (44): .03/.00, .42/.49, -.08/-.08, .09/.09, .16/.16
SUM (O): .01/-.01, -.01/-.01, .74/.78, -.05/-.05, -.03/-.06, .38/.34, .20/.19
find faults w. others (2): .14/.17, -.42/-.42, -.24/-.24, .19/.19
helpful / unselfish (7): .45/.43, .09/.11, .29/.29, .23/.23
start quarrels (12): .12/.16, .23/.18, -.49/-.49, -.07/-.08, -.24/-.24, .19/.19
forgiving (17): .49/.46, -.14/-.13, .25/.24, .20/.19
trusting (22): -.14/-.16, .38/.32, .27/.25, .21/.19
cold and aloof (27): -.20/-.18, .14/.12, .44/.46, -.34/-.37, .18/.17
considerate and kind (32): .02/.01, .62/.61, .28/.30, .22/.23
rude (37): .10/.12, .12/.12, -.62/-.62, -.13/-.08, -.23/-.23, .19/.18
like to cooperate (42): .18/.11, -.09/-.10, .43/.45, .28/.29, .23/.22
SUM (A): -.07/-.08, .00/.00, -.07/-.07, .78/.77, .03/.03, .43/.44, .04/.04
thorough job (3): .58/.59, .29/.28, .23/.22
careless (8): -.16, -.49/-.51, .24/.23, .19/.18
reliable worker (13): -.10/-.09, .09/.10, .55/.55, .30/.31, .24/.24
disorganized (18): .13/.16, -.58/-.59, -.21/-.20, .17/.15
lazy (23): -.52/-.51, -.45/-.45, .18/.17
persevere until finished (28): .54/.58, .27/.25, .21/.19
efficient (33): -.11/-.07, .52/.58, .30/.29, .24/.23
follow plans (38): .00/.00, -.06/-.07, .45/.44, .27/.26, .21/.20
easily distracted (43): .17/.19, .07/.06, -.53/-.53, -.22/-.22, .18/.17
SUM (C): -.05/-.05, -.01/-.01, -.05/-.06, .04/.04, .81/.82, .43/.41, .03/.03

I then fitted a multi-group model with metric invariance. Despite the high similarity between the individual models, model fit decreased, CFI = .925, RMSEA = .033, SRMR = .062. Although RMSEA and SRMR were still good, the decrease in CFI might be considered evidence that the invariance assumption is violated. However, Table 2 shows that it is insufficient to examine changes in global fit indices alone. What matters is whether the decrease in fit has any substantive meaning. Given the results in Table 2, this is not the case.

The next model imposed scalar invariance. Before presenting the results, it is helpful to know what scalar invariance implies. Take extraversion as an example. Assume that there are no notable gender differences in extraversion. However, extraversion has multiple facets that are represented by items in the BFI. One facet is assertiveness, and the BFI includes an assertiveness item. Scalar invariance implies that there cannot be gender differences in assertiveness if there are no gender differences in extraversion. It is obvious that this is an odd assumption because gender differences can occur at any level in the hierarchy of personality traits. Thus, evidence that scalar invariance is violated does not imply that we cannot examine gender differences in personality. Rather, it would require further examination of the pattern of mean differences at the level of the factors and the item residuals.
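Formally (my notation, not the authors'), with item intercept τ, loading λ, and factor score η for item j of person i in group g:

```latex
% Measurement model for item j, person i, group g:
y_{ij} = \tau_j^{(g)} + \lambda_j^{(g)} \eta_i + \varepsilon_{ij}

% Metric invariance:       \lambda_j^{(m)} = \lambda_j^{(f)} \quad \text{for all } j
% Scalar invariance adds:  \tau_j^{(m)} = \tau_j^{(f)}

% Implied item-mean difference under scalar invariance:
E\big[y_j^{(m)}\big] - E\big[y_j^{(f)}\big] = \lambda_j \big( \alpha^{(m)} - \alpha^{(f)} \big)
```

Under scalar invariance, any group difference in an item mean must be proportional to the group difference in the factor mean α. A facet-specific gender difference (e.g., in assertiveness but not in extraversion) therefore appears as an intercept difference and violates scalar invariance, which is exactly the odd assumption described above.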

However, imposing scalar invariance produced only a negligible decrease in fit, CFI = .921, RMSEA = .034, SRMR = .063. Inspection of the modification indices showed the highest modification index for item O6 "valuing art," with an implied mean difference of 0.058. This implies that there are no notable gender differences at the item level. The pattern of mean differences at the factor level is consistent with previous studies, with women showing higher levels of neuroticism (d = .64) and agreeableness (d = .31), although the difference in agreeableness is relatively small compared to some other studies.
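For reference, the standardized mean difference d reported here is the mean difference divided by the pooled standard deviation. A minimal sketch with made-up group statistics (not the actual data), chosen to match the scale of the reported neuroticism difference:

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (Cohen's d) with pooled SD."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical latent factor means/SDs (standardized metric, equal groups):
d = cohens_d(0.64, 1.0, 170_000, 0.0, 1.0, 170_000)
print(round(d, 2))  # 0.64
```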

In sum, the results show that the BFI can be used to examine gender differences in personality and that the pattern of gender differences observed with the BFI is not a measurement artifact.

Age Differences

Hussey and Hughes used a median split to examine invariance across age-groups. The problem with a median split is that online samples tend to be very young. Figure 1 shows the age distribution for the Canadian sample. The median age is 22.

To create two age-groups, I split the sample into a group of under 30 and 30+ participants. The unequal sample size is not a problem because both groups are large given the large overall sample size (young N = 221,801, old N = 88,713). A published article examined age differences in the full sample, but the article did not use SEM to test measurement invariance (Soto, John, Gosling, & Potter, 2011). Given the cross-sectional nature of the data, it is not clear whether age differences are cohort differences or aging effects. Longitudinal studies suggest that age differences may reflect generational changes rather than longitudinal changes over time (Schimmack, 2019d). In any case, the main point of the present analyses is to examine measurement invariance across different age groups.
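The difference between the two splits is easy to see in a small simulation. The age distribution below is made up to resemble a right-skewed online sample with a median in the low twenties; it is not the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed ages (gamma-distributed, shifted to start at 15),
# illustrative only, not the real Gosling dataset.
ages = (rng.gamma(shape=2.0, scale=4.0, size=100_000) + 15).astype(int)

median_age = int(np.median(ages))   # a median split would cut here
young = ages < 30                   # fixed cutoff used in the text
old = ~young
print(median_age, young.sum(), old.sum())
```

With a skewed distribution like this, a median split compares teenagers with people in their early twenties, whereas the fixed cutoff at 30 yields unequal but still very large groups.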

Fit for the model with metric invariance was similar to the fit for the gender model, CFI = .927, RMSEA = .033, SRMR = .062. Fit for the model with scalar invariance was only slightly weaker for CFI and better for RMSEA. More important, inspection of the modification indices showed the largest difference for O10 “sophisticated in art” with a standardized mean difference of .068. Thus, there were no notable differences between the two age groups at the item level.

The results at the factor level reproduced the findings with scale scores by Soto et al. (2011). The older group had a higher level of conscientiousness (d = .61) than the younger group. Differences for the other personality dimensions were small. There were no notable differences in response styles.

In sum, the results show that the BFI has reasonable measurement invariance across age groups. Contrary to the claims by Hussey and Hughes, this finding is consistent with the results reported in their own supplementary materials. These results suggest that BFI scale scores provide useful information about personality and that published articles that used scale scores produced meaningful results.


Hussey and Hughes accused personality researchers of validity hacking; that is, of not reporting the results of psychometric tests because these tests would show that personality measures are invalid. This is a strong claim that requires strong evidence. However, closer inspection shows that the authors used an outdated measurement model and misrepresented the results of their invariance analyses. Here I showed that the BFI has good structural validity and reasonable invariance across gender and age groups. Thus, Hussey and Hughes's claims are blatantly false.

So far, I have only examined the BFI, but I have little confidence in the authors' conclusions about other measures like Rosenberg's self-esteem scale. I am still waiting for the authors to share all of their data so that I can examine all of their claims. At present, there is no evidence of v-hacking. Of course, this does not mean that self-ratings of personality are perfectly valid. As I showed, self-ratings of the Big Five are contaminated with evaluative bias. I presented a measurement model that can test for the presence of these biases and that can be used to control for them. Future validation studies might benefit from using this measurement model as a basis for developing better measures and better measurement models. Substantive articles might also benefit from using a measurement model rather than scale scores, especially when the BFI is used as a predictor of other self-report measures, to control for shared rating biases.