The Evaluative Factor in Self-Ratings of Personality Disorder: A Threat to Construct Validity


This blog post reanalyzes data from a study that examined the validity of the Levels of Personality Functioning Scale and the Personality Inventory for DSM-5. I show that the halo-Big Five model fits the correlations among the 25 DSM-5 scales and that the halo factor shows high convergent validity with the Levels of Personality Functioning factor. Whereas personality disorder researchers interpret the general factor in self-report measures of personality functioning as a broad disposition, research on this factor with measures of normal personality suggests that it reflects a response style that is unique to self-ratings. While the evidence is not conclusive, it is problematic that personality disorder researchers ignore the potential contribution of response styles in the assessment of personality disorders with self-reports.


Concerns about the validity of self-reports are as old as self-reports themselves. Some psychologists distrust self-reports so much that they interpret low correlations between self-ratings and behavioral measures as evidence that the behavioral measure is valid (Greenwald et al., 1998). Other psychologists accept self-reports uncritically as valid measures (Baumeister, Campbell, Krueger, & Vohs, 2003). The uncritical acceptance of self-reports may be traced back to the philosophy of operationalism in psychology. According to operationalism, constructs are defined by methods, as in the infamous saying that intelligence is whatever an IQ test measures. Similarly, personality traits like extraversion might be operationalized by self-reports of personality. From this view, extraversion is whatever a self-report measure of extraversion measures.

Most psychometricians today would reject operationalism and distinguish between constructs and measures. As a result, it is possible to critically examine whether a measure measures the construct it was designed to measure. This property of a measure is called construct validity (Cronbach & Meehl, 1955). From this perspective, it is possible that an IQ test may be a biased measure of intelligence or that a self-report measure of extraversion is an imperfect measure of extraversion. To examine construct validity, it is necessary to measure the same construct with multiple (i.e., at least two) independent methods (Campbell & Fiske, 1959). If two independent measures measure the same construct, they should be positively correlated. This property of measures is called convergent validity.

The most common approach to measure personality with multiple methods is to ask acquaintances to provide informant reports of personality. This approach has been used to demonstrate that self-ratings of many personality traits have convergent validity with reports by others (Connelly & Ones, 2010). The same method has also been used to demonstrate convergent validity for the measurement of maladaptive personality traits (Markon, Quilty, Bagby, & Krueger, 2013). These studies also show that convergent validity is lower than the reliability of self-ratings and informant ratings. This finding indicates that some of the reliable variance in these ratings is method variance (Campbell & Fiske, 1959). To increase the validity of self-ratings of personality, it is necessary to examine the factors that produce method variance and minimize their contribution to the variance in self-ratings.
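The logic of this inference can be made concrete with a small calculation. The sketch below uses hypothetical values (a reliability of .80 and a self-informant correlation of .45, neither taken from the cited studies). It also makes the simplifying assumptions that self-ratings and informant ratings are equally valid and that their method variance is independent. Under these assumptions, the self-informant correlation estimates the share of valid variance, and the rest of the reliable variance is method variance.

```python
def decompose_variance(reliability: float, convergent_r: float) -> dict:
    """Split rating variance into valid, method, and random error shares.

    Assumes (hypothetically) that both raters are equally valid and that
    their method variance is independent, so the self-informant
    correlation equals the share of valid variance.
    """
    if not 0 <= convergent_r <= reliability <= 1:
        raise ValueError("expected 0 <= convergent_r <= reliability <= 1")
    return {
        "valid": convergent_r,                   # shared across raters
        "method": reliability - convergent_r,    # reliable but rater-specific
        "error": 1 - reliability,                # unreliable variance
    }


# Hypothetical illustration values, not estimates from any cited study.
shares = decompose_variance(reliability=0.80, convergent_r=0.45)
for source, share in shares.items():
    print(f"{source:>6}: {share:.2f}")
```

With these made-up numbers, 35% of the variance would be reliable variance that is not shared with informants, which is the portion that needs an explanation.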

Research on the unique variance in self-ratings of personality has demonstrated that a large portion of this variance reflects a general evaluative factor (Anusic, Schimmack, Lockwood, & Pinkus, 2009). This factor is present in self-ratings and informant ratings, but does not show convergent validity across raters (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Moreover, it is related to other measures of self-enhancement such as inflated ratings of desirable traits like attractiveness and intelligence (Anusic et al., 2009). It also predicts self-ratings of well-being, but not informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2012), suggesting that it is not a substantive trait. Finally, using personality items that are less evaluative reduces correlations among personality factors (Bäckström & Björklund, 2020). Taken together, these findings suggest that self-reports are influenced by the desirability of traits and that a consistent bias produces artificial correlations between items. This bias is often called halo bias (Thorndike, 1920) or socially desirable responding (Campbell & Fiske, 1959).
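A small simulation can illustrate how a rater-specific evaluative bias produces this pattern. The sketch below is an assumption-laden toy model, not a reanalysis of any cited data: two true traits are simulated as independent, each rater adds their own halo to both trait ratings, and the halo weight (0.7) and noise level (0.5) are arbitrary illustration values. With these values, the population correlations are about .28 for different traits rated by the same rater, about .57 for the same trait across raters, and zero for different traits across raters.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n=5000, halo_weight=0.7, noise_sd=0.5, seed=1):
    """Simulate self- and informant ratings of two independent traits."""
    rng = random.Random(seed)
    self_e, self_a, inf_e, inf_a = [], [], [], []
    for _ in range(n):
        trait_e = rng.gauss(0, 1)    # true trait 1
        trait_a = rng.gauss(0, 1)    # true trait 2, independent of trait 1
        halo_self = rng.gauss(0, 1)  # evaluative bias of the self-rater
        halo_inf = rng.gauss(0, 1)   # independent bias of the informant
        self_e.append(trait_e + halo_weight * halo_self + rng.gauss(0, noise_sd))
        self_a.append(trait_a + halo_weight * halo_self + rng.gauss(0, noise_sd))
        inf_e.append(trait_e + halo_weight * halo_inf + rng.gauss(0, noise_sd))
        inf_a.append(trait_a + halo_weight * halo_inf + rng.gauss(0, noise_sd))
    return {
        "within_self_cross_trait": pearson(self_e, self_a),
        "cross_rater_same_trait": pearson(self_e, inf_e),
        "cross_rater_cross_trait": pearson(self_e, inf_a),
    }


results = simulate()
for label, r in results.items():
    print(f"{label}: r = {r:.2f}")
```

In this toy model the correlation between the two self-rated traits is entirely artificial, because the true traits are unrelated, and it disappears once the two traits are rated by different raters.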

It seems plausible that socially desirable responding is an even bigger problem for the use of self-reports in the measurement of personality disorders that are intrinsically undesirable (i.e., nobody wants to have a disorder). Yet, researchers of personality disorders have largely ignored the possibility that socially desirable responding biases self-ratings of personality. Rather, they have interpreted the presence of a general evaluative factor as evidence for a substantive factor that is either interpreted as a broad risk factor in a hierarchical model of factors that contribute to personality disorders (Morey, Krueger, & Skodol, 2013) or as an independent factor that reflects severity of personality disorders (Morey, 2017). These substantive interpretations have been challenged by evidence that the general factor in self-reports of personality disorders is highly correlated with the halo factor in self-ratings of normal personality (McCabe, Oltmanns, & Widiger, 2022). Using existing data, I was able to show that the halo factor in self-ratings of the Big Five personality factors was highly correlated with the general factor in the Levels of Personality Functioning Scale (Morey, 2017), r = .88 (Schimmack, 2022a), and the general factor in the Computerized Adaptive Test of Personality Disorders (CAT-PD), r = .94 (Schimmack, 2022b). In addition, the general factor in the Levels of Personality Functioning items is highly correlated with the general factor in the CAT-PD items, r = .86. These results suggest that the same factor contributes to correlations among self-ratings of personality and that this factor reflects the desirability of the items.

In this post, I extend this investigation to another measure of maladaptive personality traits, namely the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012). I also provide further evidence about the amount of variance in PID-5 scales that is explained by the general factor. McCabe et al.’s (2022) findings suggested that most of the variance in some scales reflects the general factor. For example, the general factor explained over 60% of the variance in Perceptual Dysregulation, Unusual Beliefs, Deceitfulness, Irresponsibility, Distractibility, and Impulsivity. If these self-report measures are to be used to diagnose personality disorders, it is vital to examine whether this variance reflects substantive problems or a mere response style to agree or disagree with desirable items.

I used Hopwood et al.’s (2018) data from Study 2 to fit a model to the correlations among the 25 PID-5 scales. The dataset also included ratings for the Levels of Personality Functioning Scale (LPFS; Morey, 2017). Based on previous analyses, I used 10 LPFS items to validate the general factor in the PID-5 (Schimmack, 2022a). To ensure robustness of the model, I fitted the same model to two random splits of the full dataset and retained only parameters that were statistically significant in both halves. The final model had acceptable fit, CFI = .928, RMSEA = .06, and better fit than exploratory factor analyses of the PID-5 (Markon, Quilty, Bagby, & Krueger, 2013).
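The split-half selection rule can be sketched in a few lines. This is only an illustration of the logic under stated assumptions, not the procedure or software used for the analysis: the function names, the parameter labels, the estimates and standard errors, the two-sided criterion of |z| > 1.96, and the extra requirement that the sign agrees across halves are all hypothetical. The model fitting itself is not shown.

```python
import random


def split_halves(n_rows, seed=0):
    """Return two disjoint, sorted lists of row indices of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    half = n_rows // 2
    return sorted(indices[:half]), sorted(indices[half:])


def is_significant(estimate, se, z_crit=1.96):
    """Two-sided z-test of a parameter estimate against zero."""
    return se > 0 and abs(estimate / se) > z_crit


def retain_replicated(fit_a, fit_b, z_crit=1.96):
    """Keep parameters that are significant, with the same sign, in both fits.

    Each fit maps a parameter label to an (estimate, standard error) pair.
    """
    kept = []
    for name, (est_a, se_a) in fit_a.items():
        if name not in fit_b:
            continue
        est_b, se_b = fit_b[name]
        same_sign = est_a * est_b > 0
        if (same_sign
                and is_significant(est_a, se_a, z_crit)
                and is_significant(est_b, se_b, z_crit)):
            kept.append(name)
    return kept


# Made-up estimates for three hypothetical loadings in each half.
half_a = {"scale_1 <- general": (0.62, 0.04),
          "scale_1 <- factor_2": (-0.30, 0.05),
          "scale_2 <- factor_3": (0.09, 0.05)}
half_b = {"scale_1 <- general": (0.60, 0.04),
          "scale_1 <- factor_2": (-0.28, 0.05),
          "scale_2 <- factor_3": (0.04, 0.05)}

rows_a, rows_b = split_halves(1000)
print(len(rows_a), len(rows_b))
print(retain_replicated(half_a, half_b))
```

The point of such a rule is that a parameter that reaches significance in only one half is treated as a possible chance finding and dropped from the final model.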

The main finding was that the general factor in the PID-5 correlated r = .837, SE = .016, with the factor based on the 10 LPFS items. This finding supports the hypothesis that the same, or at least highly similar, factors influence self-ratings on measures of normal personality and maladaptive personality. Table 1 shows the loadings of the 25 PID-5 scales on the general factor and on five additional factors that are likely to correspond to the Big Five factors of normal personality. It also shows the contribution of unique factors to each scale that may be valid unique variance of dysfunctional personality.

The results replicate McCabe et al.’s (2022) finding that all PID-5 scales load on a general factor. Although the loadings are not as high, they are substantial, and 23 of the 25 loadings are above .5. Table 1 also shows that all scales have unique variance that is not explained by the factors of the halo-Big Five model. Of the 25 loadings on the uniqueness factor, 18 are above .5. Finally, the loadings on the Big Five factors are consistent with factor analyses of the PID-5, but the magnitude of these loadings is relatively modest. Only 7 of the 25 loadings are above .5, and only 5 of the 25 scales have a higher loading on a Big Five factor than on the general factor. These results are consistent with a similar analysis of the CAT-PD scales (Schimmack, 2022b).
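To relate loadings to explained variance, it helps to recall that a standardized loading is squared to obtain a variance share when the factors are uncorrelated. A loading of .5 therefore corresponds to 25% of the variance in a scale. The sketch below assumes standardized loadings, uncorrelated factors, and a scale that loads on the general factor and on only one Big Five factor. The example loadings (.60 and .40) are hypothetical and are not taken from Table 1.

```python
import math


def variance_shares(general, big_five):
    """Variance shares implied by standardized loadings on uncorrelated factors.

    The unique share is whatever the two loadings leave unexplained.
    """
    explained = general ** 2 + big_five ** 2
    if explained > 1:
        raise ValueError("squared loadings cannot sum to more than 1")
    return {
        "general": general ** 2,
        "big_five": big_five ** 2,
        "unique": 1 - explained,
    }


def loading_for_share(share):
    """Standardized loading needed to explain a given share of variance."""
    return math.sqrt(share)


# Hypothetical loadings for one scale.
shares = variance_shares(general=0.60, big_five=0.40)
for source, share in shares.items():
    print(f"{source:>8}: {share:.0%} of variance, loading {math.sqrt(share):.2f}")

# Loading implied when a factor explains 60% of the variance in a scale.
print(f"{loading_for_share(0.60):.2f}")
```

Under these assumptions, a general factor that explains more than 60% of the variance in a scale implies a loading above roughly .77, which is why loadings just above .5 are substantial but clearly lower.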

In conclusion, self-reports of maladaptive personality that have been proposed as instruments for clinical diagnoses of personality disorders are strongly influenced by a general factor that is common to different instruments such as the Levels of Personality Functioning Scale (Morey, 2017), the Computerized Adaptive Test of Personality Disorders (CAT-PD; Simms, Goldberg, Roberts, Watson, Welte, & Rotterman, 2011), and the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012).


Concerns about the influence of response styles on self-ratings are as old as self-reports. Campbell and Fiske (1959) demonstrated that self-ratings of personality traits are more highly correlated with each other than with informant ratings of the same traits. Confirmatory factor analyses of multi-trait-multi-rater data revealed the presence of an evaluative factor that is largely independent across different raters (Anusic et al., 2009). This factor has several names, but it reflects the desirability of items independent of their descriptive meaning. Clinical researchers interested in personality disorders also observed a general factor, but interpreted it as a substantive factor. As I demonstrated in several studies, the halo factor in ratings of normal personality is strongly correlated with the general factor in self-report instruments used to diagnose personality disorders. This finding challenges the prevalent interpretation of the general factor as a valid dimension of personality disorder. At a minimum, the results suggest that ratings of maladaptive and undesirable traits are influenced by socially desirable responding. Despite the long history of research on social desirability, researchers who study personality disorders have downplayed or ignored this possibility. For example, McCabe et al. (2022) dismiss the response style explanation with the argument that “it is more likely that persons are providing reasonably forthright self-descriptions.” They do so while ignoring the finding that the evaluative factor in self-ratings lacks convergent validity with informant ratings (Anusic et al., 2009), even though they mention earlier that Anusic et al.’s (2009) results provided support for this hypothesis. They also admit that their results “are consistent with, or at least don’t rule out, the social undesirability or evaluative bias hypothesis” (p. 151), but then conclude that “some persons do indeed have many undesirable traits whereas other persons have many desirable traits” (p. 151) without citing any evidence for this claim. In fact, it is much more common for respondents to have none of the personality disorders, defined as a score in the upper 10% of a scale (44%), and few participants have more than 10 disorders (5%). This asymmetry is more consistent with a response style that attenuates scores on undesirable traits than with broad dispositions that make some people have many undesirable traits.

Markon et al. (2013) advocate the use of informant ratings to control for response styles in self-ratings on the PID-5, but do not use the informant ratings to examine the presence of biases in self-ratings. More broadly, numerous articles claim to examine the validity of personality disorder instruments and all of these articles conclude that these instruments are valid (Krueger et al., 2012; Long, Reinhard, Sellbom, & Anderson, 2021; Morey, 2017; Simms et al., 2011). Some authors are also inconsistent across articles. For example, Ringwald, Manuck, Marsland, and Wright (2022) note that “many studies suggest it [the general factor] is primarily the product of rater-specific variance” (p. 1316), but Ringwald, Emery, Khoo, Clark, Kotelnikova, Scalco, Watson, Wright, and Simms (2022) neither model the general factor nor mention that response styles could influence scores on the CAT-PD scales. Evidence that the general factor in personality disorder instruments is strongly correlated with the evaluative factor in ratings of normal personality requires further investigation. Claims that personality disorder measures are valid are misleading and fail to acknowledge the possibility that response styles produce method variance. The presence of method variance does not invalidate measures because validity is a quantitative construct (Cronbach & Meehl, 1955). Markon et al. (2013) demonstrate convergent validity between self-reports and informant ratings of the PID-5 traits. Thus, there is evidence that self-ratings have some validity. The goal of future validation research should be to identify method factors and develop revised measures with higher validity, just as personality researchers are trying to reduce the evaluative bias in measures of normal personality (Bäckström & Björklund, 2020; Wood, Anglim, & Horwood, 2022). However, this might be more difficult for measures of disorders because disorders are intrinsically undesirable. Thus, it may be necessary to use statistical controls or a multi-rater assessment to increase the validity of self-report instruments designed to measure maladaptive traits.

Meanwhile, personality disorder researchers continue to disregard the possibility that a large portion of the variance in self-report measures is merely a response style, and they make claims about construct validity based on methods that cannot separate valid construct variance from method variance (e.g., Widiger & Crego, 2019). Most of the claims that personality disorder instruments are valid are based on correlations of one self-report measure with another or on evidence that factor analyses of personality disorder scales yield a factor structure similar to that of normal personality traits (i.e., the Big Five). Neither finding warrants the claim that maladaptive personality scales measure maladaptive personality traits. Instead, the finding that the halo-Big Five model can be fitted to correlations among personality disorder scales suggests that these scales merely have more evaluative content and are more strongly influenced by socially desirable responding. Multi-method evidence is needed to demonstrate that the general factor reflects a substantive trait and that specific traits are maladaptive; that is, that they produce intrapersonal or interpersonal problems for individuals with these traits. For now, claims about the validity of personality disorder instruments are invalid because they fail to meet basic standards of construct validity and fail to quantify the amount of method variance in these scales (Campbell & Fiske, 1959; Cronbach & Meehl, 1955).
