Hidden Invalidity of Personality Measures?

Sometimes journal articles have ironic titles. The article “Hidden invalidity among fifteen commonly used measures in social and personality psychology” (in press at AMPPS) is one of them. The authors (Ian Hussey & Sean Hughes) claim that personality psychologists engaged in validity-hacking (v-hacking) and claimed validity for personality measures when actual validation studies show that these measures have poor validity. As it turns out, these claims are false, and the article is an example of invalidity hacking, where the authors ignore and hide evidence that contradicts their claims.

The authors focus on several aspects of validity. Many measures show good internal consistency and retest-reliability. The authors ignore convergent and discriminant validity as important criteria of construct validity (Campbell & Fiske, 1959). The claim that many personality measures are invalid is based on examination of structural validity and measurement invariance across age groups and genders.

Yet, when validity was assessed comprehensively (via internal consistency, immediate and delayed test-retest reliability, factor structure, and measurement invariance for median age and gender) only 4% demonstrated good validity. Furthermore, the less commonly a test is reported in the literature, the more likely it was to be failed (e.g., measurement invariance). This suggests that the pattern of underreporting in the field may represent widespread hidden invalidity of the measures we use, and therefore pose a threat to many research findings. We highlight the degrees of freedom afforded to researchers in the assessment and reporting of structural validity. Similar to the better-known concept of p-hacking, we introduce the concept of validity hacking (v-hacking) and argue that it should be acknowledged and addressed.

Structural validity is important when researchers rely on manifest scale scores to test theoretical predictions that hold at the level of unobserved constructs. For example, gender differences in agreeableness are assumed to exist at the level of the construct. If a measurement model is invalid, mean differences between men and women on an (invalid) agreeableness scale may not reveal the actual differences in agreeableness.

The authors claim that “rigorous tests of validity are rarely conducted or reported” and that “many of the measures we use appear perfectly adequate on the surface and yet fall apart when subjected to more rigorous tests of validity beyond Cronbach’s α.” This claim is neither supported by citation nor consistent with the general practice in the development of psychological measures to explore the factor structure of items. For example, the Big Five were not conceived theoretically, but found empirically by employing exploratory factor analysis (or principal component analysis). Thus, the claim of widespread v-hacking by omitting structural analyses seems inconsistent with actual practices.

Based on a questionable description of the state of affairs, the authors suggest that they are the first to conduct empirical tests of structural validity.

“With this in mind, we examined the structural validity of fifteen well-known self-report measures that are often used in social and personality psychology using several best practices (see Table 1).”

The practice to present something as novel by omitting relevant prior studies has been called l-hacking (literature review hacking). It also makes it unnecessary to compare results with prior results and to address potentially inconsistent results.

This also allows the authors to make false claims about their data. “The sheer size of the sample involved (N = 81,986 individuals, N = 144,496 experimental sessions) allowed us to assess the psychometric properties of these measures with numbers that were far greater than those used in many earlier validation studies.” Contrary to this claim, Nye, Allemand, Gosling, and Roberts (2016) published a study of structural validity of the same personality measure (BFI) with over 150,000 participants. Thus, their study was neither novel nor did it have a larger sample size than prior studies.

The authors also made important and questionable choices that highlight the problem of researchers’ degrees of freedom in validation studies. In this case, their choice to fit a simple-structure model to the data ensured that they would obtain relatively bad fit if scales included reverse-scored items, which are good practice because they reduce the influence of acquiescence bias on scale scores. However, the presence of acquiescence bias will also produce weaker correlations between direct and reverse-scored items. This response style can be modeled by including a method factor in the measurement model. Prior articles showed that acquiescence bias is present and that including an acquiescence factor improves model fit (Anusic et al., 2009; Nye et al., 2016). The choice not to include a method factor contributed to the authors’ conclusion that Big Five scales are structurally invalid. Thus, the authors’ conclusion is based on their own choice of a poor measurement model rather than hidden invalidity of the BFI.

The authors justify their choice of a simple-structure model with the claim that most researchers who use these scales simply calculate sum scores and rely on these in their subsequent analyses, and that in doing so they are tacitly endorsing simple measurement models with no cross-loadings or method factors. This claim is plain wrong. The only purpose of reverse-scored items is to reduce the influence of acquiescence bias on scale scores, because aggregation of direct and reverse-scored items reduces the bias that is common to both types of items. If researchers assumed that acquiescence bias is absent, there would be no need for reverse-scored items. Moreover, aggregation of items does not imply that all items are pure indicators of the latent construct or that there are no additional relationships among items (see Nye et al., 2016). The main rationale for summing items is that they all have moderate to high loadings on the primary factor. When this is the case, most of the variance in sum scores reflects the common primary factor (see, e.g., Schimmack, 2019, for an example).
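The logic of balanced keying can be illustrated with a small simulation. This is only a sketch under assumed values, not a reanalysis of any data discussed here: the trait variance (1), the acquiescence standard deviation (0.5), the error variance (1), and all names (`simulate`, `pearson`) are hypothetical choices made for illustration.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def simulate(n=20000, seed=1):
    """Simulate two direct items and one reverse-keyed item (assumed model)."""
    rng = random.Random(seed)
    acq, d1, d2, r1 = [], [], [], []
    for _ in range(n):
        t = rng.gauss(0, 1)      # trait score (hypothetical variance of 1)
        a = rng.gauss(0, 0.5)    # acquiescence: tendency to agree with any item
        acq.append(a)
        d1.append(t + a + rng.gauss(0, 1))   # direct item 1
        d2.append(t + a + rng.gauss(0, 1))   # direct item 2
        raw_reverse = -t + a + rng.gauss(0, 1)
        r1.append(-raw_reverse)  # reverse scoring flips the sign of acquiescence
    direct_sum = [x + y for x, y in zip(d1, d2)]
    balanced_sum = [x + y for x, y in zip(d1, r1)]
    return {
        "direct_direct": pearson(d1, d2),
        "direct_reverse": pearson(d1, r1),
        "acq_direct_sum": pearson(acq, direct_sum),
        "acq_balanced_sum": pearson(acq, balanced_sum),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

Under these assumed variances, the expected correlation between two direct items is about .56 and the expected correlation between a direct and a reverse-scored item is about .33, which is the pattern that hurts the fit of a simple-structure model. The balanced sum is expected to be uncorrelated with acquiescence, whereas the sum of two direct items is not.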

The authors also developed their own coding scheme to determine whether a scale has good, mixed, or poor fit to the data based on four fit indices. A scale was said to have good fit if CFI was above .95, TLI was above .95, RMSEA was below .06, and SRMR was below .09. That is, to have good fit, a scale must meet all four criteria. A scale was said to have poor fit if it met none of the four criteria. All other possibilities were considered to be mixed fit. Only Conscientiousness met all four criteria. Agreeableness met 3 out of 4 (RMSEA = .063). Extraversion met 3 out of 4 (RMSEA = .075). Neuroticism met 3 out of 4 (RMSEA = .065). Openness met 1 out of 4 (SRMR = .060), but was misclassified as poor. Thus, although the authors fitted a highly implausible simple-structure model, the fit indices suggested that a single-factor model fitted the data reasonably well. Experienced SEM researchers would also wonder about the classification of Openness as poor fit given that CFI was .933 and RMSEA was .069.
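The coding scheme is simple enough to write down as a few lines of code. The following is a minimal sketch of the rule as described in this post, not the authors’ own code; the function names are hypothetical, and treating values exactly at a cutoff as passing is an assumption.

```python
def criteria_met(cfi, tli, rmsea, srmr):
    """Count how many of the four cutoffs a model meets.

    Cutoffs follow the description in the text; counting a value exactly
    at the cutoff as a pass is an assumption of this sketch.
    """
    checks = [cfi >= 0.95, tli >= 0.95, rmsea <= 0.06, srmr <= 0.09]
    return sum(checks)


def classify_fit(cfi, tli, rmsea, srmr):
    """Good = all four criteria met, poor = none met, mixed = anything else."""
    met = criteria_met(cfi, tli, rmsea, srmr)
    if met == 4:
        return "good"
    if met == 0:
        return "poor"
    return "mixed"


if __name__ == "__main__":
    # Openness: CFI, RMSEA, and SRMR are the values reported in the text.
    # TLI is not reported here; 0.920 is a placeholder, and any value
    # below .95 leads to the same classification.
    print(classify_fit(cfi=0.933, tli=0.920, rmsea=0.069, srmr=0.060))
```

Applied to the Openness values, the rule returns “mixed” because the SRMR criterion is met, which is why the classification as poor fit looks like an error.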

More important than meeting conventional cut-off values is to examine problems with a measurement model. In this case, one obvious problem is the lack of a method factor for acquiescence bias; or the presence of substantive variance that reflects lower-order traits (facets).

It is instructive to compare these results to Nye et al.’s (2016) prior results of structural validity. They found slightly worse fit for the simple-structure model, but they also showed that model fit improved when they modeled the presence of lower-order factors or acquiescence bias (2 factor, pos/neg.) in the data. An even better model fit would have been obtained by modeling facets and acquiescence bias in a single model (Schimmack, 2019).

In short, the problem with the Big Five Inventory is not that it has poor validity as a measure of the Big Five. Poor fit of a simple-structure simply shows that other content and method factors contribute to variance in Big Five scales. A proper assessment of validity would require quantifying how much of the variance in Big Five scales can be attributed to the variance in the intended construct. That is, how much of the variance in extraversion scores on the BFI reflects actual variation in extraversion? This fundamental question was not addressed in the “hidden invalidity” article.

The “hidden invalidity” article also examined measurement invariance across age groups (median split) and the two largest gender groups (male, female). The actual results are only reported in a Supplement. Inspecting the Supplement shows hidden validity. Big Five measures passed most tests of metric and scalar invariance by the authors’ own criteria.

Big 5 Inventory – A | age | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – A | age | fit_metric | 0.020 | 0.044 | -0.013 | 0.001 | Passed
Big 5 Inventory – A | age | fit_scalar | -0.013 | 0.000 | 0.000 | 0.002 | Passed
Big 5 Inventory – A | sex | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – A | sex | fit_metric | 0.023 | 0.046 | -0.014 | 0.001 | Passed
Big 5 Inventory – A | sex | fit_scalar | -0.014 | -0.003 | 0.001 | 0.003 | Passed
Big 5 Inventory – C | age | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – C | age | fit_metric | 0.029 | 0.054 | -0.017 | 0.001 | Passed
Big 5 Inventory – C | age | fit_scalar | -0.013 | -0.003 | 0.001 | 0.003 | Passed
Big 5 Inventory – C | sex | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – C | sex | fit_metric | 0.031 | 0.055 | -0.018 | 0.002 | Passed
Big 5 Inventory – C | sex | fit_scalar | -0.005 | 0.006 | -0.002 | 0.001 | Passed
Big 5 Inventory – E | age | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – E | age | fit_metric | 0.045 | 0.081 | -0.032 | 0.001 | Passed
Big 5 Inventory – E | age | fit_scalar | -0.013 | -0.001 | 0.000 | 0.003 | Passed
Big 5 Inventory – E | sex | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – E | sex | fit_metric | 0.042 | 0.078 | -0.030 | 0.002 | Passed
Big 5 Inventory – E | sex | fit_scalar | -0.007 | 0.007 | -0.003 | 0.001 | Passed
Big 5 Inventory – N | age | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – N | age | fit_metric | 0.026 | 0.054 | -0.020 | 0.002 | Passed
Big 5 Inventory – N | age | fit_scalar | -0.022 | -0.010 | 0.004 | 0.005 | Failed
Big 5 Inventory – N | sex | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – N | sex | fit_metric | 0.032 | 0.061 | -0.022 | 0.001 | Passed
Big 5 Inventory – N | sex | fit_scalar | -0.011 | 0.001 | -0.001 | 0.003 | Passed
Big 5 Inventory – O | age | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – O | age | fit_metric | 0.036 | 0.068 | -0.014 | 0.001 | Passed
Big 5 Inventory – O | age | fit_scalar | -0.041 | -0.024 | 0.005 | 0.006 | Failed
Big 5 Inventory – O | sex | fit_configural | NA | NA | NA | NA | Poor
Big 5 Inventory – O | sex | fit_metric | 0.035 | 0.065 | -0.014 | 0.002 | Passed
Big 5 Inventory – O | sex | fit_scalar | -0.043 | -0.028 | 0.006 | 0.006 | Failed
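To make the pattern in the supplement rows listed above easy to check, the outcomes can be tallied in a few lines. This is a minimal sketch: the tuples are transcribed by hand from the metric and scalar rows above, and the names `RESULTS` and `tally` are hypothetical.

```python
from collections import Counter

# (scale, grouping variable, invariance test, outcome), transcribed from the
# metric and scalar rows of the supplement listing above.
RESULTS = [
    ("A", "age", "metric", "Passed"), ("A", "age", "scalar", "Passed"),
    ("A", "sex", "metric", "Passed"), ("A", "sex", "scalar", "Passed"),
    ("C", "age", "metric", "Passed"), ("C", "age", "scalar", "Passed"),
    ("C", "sex", "metric", "Passed"), ("C", "sex", "scalar", "Passed"),
    ("E", "age", "metric", "Passed"), ("E", "age", "scalar", "Passed"),
    ("E", "sex", "metric", "Passed"), ("E", "sex", "scalar", "Passed"),
    ("N", "age", "metric", "Passed"), ("N", "age", "scalar", "Failed"),
    ("N", "sex", "metric", "Passed"), ("N", "sex", "scalar", "Passed"),
    ("O", "age", "metric", "Passed"), ("O", "age", "scalar", "Failed"),
    ("O", "sex", "metric", "Passed"), ("O", "sex", "scalar", "Failed"),
]


def tally(results):
    """Count outcomes separately for each type of invariance test."""
    return Counter((test, outcome) for _scale, _group, test, outcome in results)


counts = tally(RESULTS)
for test in ("metric", "scalar"):
    passed = counts[(test, "Passed")]
    failed = counts[(test, "Failed")]
    print(f"{test}: {passed} passed, {failed} failed")
```

All 10 tests of metric invariance and 7 of the 10 tests of scalar invariance were passed; the three failures are Neuroticism by age and Openness by age and by sex.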

However, readers of the article don’t get to see this evidence. Instead they are presented with a table that suggests Big Five measures lack measurement invariance.

Aside from the misleading presentation of the results, the results are not very informative because they don’t reveal whether deviations from a simple-structure pose a serious threat to the validity of Big Five scales. Unfortunately, the authors’ data are currently not available to examine this question.

Own Investigation

Incidentally, I had just posted a blog post about measurement models of Big Five data (Schimmack, 2019), based on open data from another study (Beck, Condon, & Jackson, 2019) that used a large online dataset with the IPIP-100 items. I showed that it is possible to fit a measurement model to the IPIP-100. To achieve model fit, the model included secondary loadings, some correlated residuals, and method factors for acquiescence bias and evaluative (halo) bias. These results show that a reasonable measurement model can fit Big Five data, as was demonstrated in several previous studies (Anusic et al., 2009; Nye et al., 2016).

Here, I examine measurement invariance for gender and age groups. I also modified and improved the measurement model, by using several of the rejected IPIP-100 items as indicators of the halo factor. Item analysis showed that the items “quick to understand things,” “carry conversations to a higher level,” “take charge,” “try to avoid complex people,” “wait for others to lead the way,” “will not probe deeply into a subject” loaded more highly on the halo factor than on the intended Big Five factor. This makes these items ideal candidates for the construction of a manifest measure of evaluative bias.

The sample consisted of 9,309 Canadians, 140,479 US Americans, 5,804 British, and 5,091 Australians between the ages of 14 and 60 (see https://osf.io/23k8v/ for data). Data were analyzed with MPLUS 8.2 using robust maximum likelihood (see https://osf.io/23k8v/ for complete syntax). The final model met the standard criteria for acceptable fit (CFI = .965, RMSEA = .015, SRMR = .032).

Table 1. Factor Loadings and Item Intercepts for Men (First) and Women (Second)

3 | .48/.47 | -.07/-.08 | -.21/-.21 | .14/.13 | -0.34/-0.45
10 | -.61/-.60 | .16/.16 | .13/.13 | 0.21/0.22
17 | -.66/-.63 | .09/.09 | .14/.13 | .15/.14 | 0.71/0.70
46 | .73/.73 | .11/.09 | -.20/-.20 | .13/.12 | -0.19/-0.20
56 | .56/.54 | -.15/-.16 | -.20/-.19 | .13/.12 | -0.46/-0.46
SUM | .83/.83 | -.02/-.02 | .00/.00 | .03/.03 | -.06/-.07 | -.26/-.26 | .04/.04
16 | .12/.12 | -.72/-.69 | -.20/-.19 | .13/.12 | 0.31/0.32
33 | -.32/-.32 | .60/.61 | .24/.21 | .08/.09 | .19/.20 | .15/.14 | 0.56/0.58
38 | .20/.20 | -.71/-.70 | -.09/-.10 | -.24/-.24 | .13/.12 | -0.17/-0.18
60 | .09/.08 | -.69/-.67 | -.21/-.21 | .14/.13 | -0.07/-0.08
88 | -.10/-.10 | .72/.72 | .16/.14 | .26/.26 | .14/.14 | 0.44/0.46
93 | -.15/-.15 | .72/.72 | .10/.09 | .16/.16 | .12/.11 | 0.03/0.03
SUM | -.23/-.23 | .81/.81 | .00/.00 | .14/.12 | .05/.05 | .24/.24 | .08/.07
5 | .10/.09 | .40/.42 | -.07/-.08 | .50/.47 | .19/.17 | 1.32/1.29
27 | -.70/-.74 | -.23/-.22 | .15/.14 | -1.12/-1.09
52 | .08/.08 | .73/.76 | -.14/-.14 | .29/.27 | .17/.15 | 1.21/1.16
53 | -.64/-.65 | -.37/-.34 | .18/.16 | -1.49/-1.40
SUM | .03/.02 | .03/.03 | .79/.82 | .00/.00 | -.07/-.07 | .44/.40 | .00/.00
8 | -.59/-.53 | -.22/-.24 | .15/.15 | -0.65/-0.71
12 | -.59/-.51 | -.22/-.23 | .14/.14 | -0.49/-0.52
35 | -.63/-.54 | -.23/-.24 | .15/.14 | -0.79/-0.83
51 | .63/.61 | .02/.02 | .15/.16 | 0.62/0.72
89 | .74/.72 | .20/.23 | .16/.18 | 0.80/0.95
94 | .58/.53 | .10/.12 | .12/.10 | .16/.16 | 0.45/0.50
SUM | .00/.00 | .00/.00 | .00/.00 | .87/.84 | .02/.03 | .24/.26 | .00/.00
40 | .42/.46 | .08/.08 | .14/.13 | 0.15/0.38
43 | .09/.09 | .63/.66 | .14/.13 | .13/.12 | -0.26/-0.13
63 | -.76/-.79 | -.18/-.17 | .12/.10 | 0.19/0.18
64 | -.69/-.74 | -.04/-.04 | .12/.11 | -0.08/0.01
68 | .12/.12 | -.06/-.06 | -.06/-.07 | .14/.12 | .43/.48 | -.03/-.03 | .15/.14 | 0.38/0.39
79 | -.72/-.76 | -.19/-.18 | .12/.11 | -0.11/-0.11
SUM | .03/.02 | .01/.01 | -.01/-.01 | .03/.02 | .87/.89 | .15/.14 | .00/.00
15 | .53/.51 | .19/.18 | 1.38/1.38
23 | .10/.10 | .20/.20 | .61/.62 | .16/.16 | 0.79/0.82
90 | .43/.41 | .14/.15 | .36/.35 | .15/.14 | 0.65/0.64
95 | .13/.13 | -.49/-.46 | .15/.13 | -0.73/-0.70
97 | -.41/-.39 | .14/.12 | -.10/-.11 | -.41/-.40 | .14/.13 | -0.38/-0.38
99 | -.11/-.11 | -.54/-.53 | .16/.15 | -0.86/-0.87
SUM | .06/.05 | .29/.28 | .00/.00 | -.04/-.04 | .03/.03 | .77/.77 | .00/.00
SUM | .10/.10 | .03/.03 | -.04/-.05 | .14/.12 | -.19/-.21 | -.17/-.17 | .71/.69

The factor loadings show that items load on the primary factors and that these factor loadings are consistent for men and women. Secondary loadings tended to be weak, although even the small loadings were highly significant and consistent across both genders; so were loadings on the two method factors. The results for the sum scores show that most of the variance in sum scores was explained by the primary factor with effect sizes ranging from .71 to .89.
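To translate these loadings into shares of variance, the loading can be squared. This is a minimal sketch that assumes the reported values are standardized loadings of a sum score on its primary factor; it ignores the smaller contributions of other factors, and the function name is hypothetical.

```python
def variance_explained(loading):
    """Share of variance accounted for by a factor with this standardized loading."""
    return loading ** 2


# Lowest and highest primary-factor loadings of the sum scores reported above.
for loading in (0.71, 0.89):
    print(f"loading {loading:.2f} -> variance explained {variance_explained(loading):.2f}")
```

Under this assumption, the primary factor accounts for roughly 50% to 79% of the variance in the sum scores.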

Item-intercepts show the deviation from the middle of the scale in standardized units (standardized mean differences from 3.5). The assumption of equal item-intercepts was relaxed for four items (#3, #40, #43, #64), but even for these items the standardized mean differences were small. The largest difference was observed for following a schedule (M = 0.15, F = 0.38). Constraining these coefficients would reduce fit, but it would have a negligible effect on gender differences on the Big Five traits.

Table 2 and Figure 1 show the standardized mean differences between men and women for latent variables and for sum scores. The results for sum scores were based on the estimated means and variances in the Tech4 output of MPLUS (see output file on OSF).



Given the high degree of measurement invariance and the fairly high correlations between latent scores and sum scores, the results are very similar and replicate previous findings that most gender differences are small, but that women score higher on neuroticism and agreeableness. These results show that these differences cannot be attributed to hidden invalidity of Big Five measures. In addition, the results show a small difference in evaluative bias. Men are more likely to describe their personality in an overly positive way. However, given the effect size and the modest contribution of halo bias to sum scores, it has a small effect on mean differences in scales. Along with unreliability, it attenuates the gender differences in agreeableness from d = .80 to d = .56.


Hussey and Hughes claim that personality psychologists were hiding invalidity of personality measures by not reporting tests of structural validity. They also claim that personality measures fail tests of structural validity. The first claim is false because personality psychologists have examined factor structures and measurement invariance for the Big Five (e.g., Anusic et al., 2009; Nye et al., 2016). Thus, Hussey and Hughes misrepresent the literature and fail to cite relevant work. The second claim is inconsistent with Nye et al.’s results and with my new examination of structural invariance in personality ratings. Thus, Hussey and Hughes’s article does not make a contribution to the advancement of psychological science. Rather, it is an example of poor scholarship, where authors make strong claims (validity hacking) with weak evidence.

The substantive conclusion is that men and women have similar measurement models of personality and that it is possible to use sum scores to compare them. Thus, past results that are based on sum scores reflect valid personality differences. This is not surprising because men and women speak the same language and are able to communicate with each other about personality traits of men and women. There is also no evidence to suggest that memory retrieval processes underlying personality ratings differ between men and women. Thus, there are no reasons to expect a lack of structural invariance in personality ratings.

A more important question is whether gender differences in self-ratings reflect actual differences in personality. One threat to validity could be social comparison processes, where women compare themselves to other women and men compare themselves to other men. However, social comparison would attenuate gender differences and cannot explain the moderate to large differences in neuroticism and agreeableness. Nevertheless, future research should examine gender differences using measures of actual behavior and informant ratings. Although sum scores are mostly valid, it is preferable to use latent variable models for these studies because latent variable models make it possible to test assumptions that are merely assumed to hold in studies with sum scores.

7 thoughts on “Hidden Invalidity of Personality Measures?”

  1. One thing I am concerned about is that personality items are usually very abstract. It seems likely that the abstractness of personality items could lead to misleading conclusions of measurement invariance, since if there is e.g. a difference between men and women in which kinds of things they find anxiety-inducing, then that difference might be abstracted over by the sorts of questions commonly asked in personality tests.

    1. Traits are usually defined as cross-situationally consistent dispositions. So, we are not focusing on responses to a specific situation (e.g., fear of snakes). Of course, this doesn’t mean there are no specific traits and that they are never relevant.

      1. Yes, but in terms of models, a cross-situationally consistent disposition would basically be a latent factor that affects behavior across different situations, right? So trait anxiety would be a latent factor that affects anxiety in the presence of snakes, when packing for a vacation, when one’s relatives are sick, etc. And then one can ask whether this trait factor is invariant with respect to variables such as sex.

        However, when actually measuring the factor, the items usually abstract over the specific situation of interest. What this means in practice depends a lot on how the respondents handle the abstraction, but it seems plausible to me that the respondents would handle the abstraction by averaging over their experience across the relevant situations. So if we imagine that for each person there’s an average anxiety level, and each item-response is just that average anxiety level plus some noise, then that would be a semi-plausible model of how people respond to anxiety scales.

        But under that model, even if the underlying trait is not invariant with respect to sex in the different situations where the trait is present, then the scale would still be shown to be measurement invariant across the sexes. However, many of the problems that come when dealing with non-MI measures would still arise.
