All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish between studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Hierarchical Factor Analysis

One important scientific activity is to find common elements among objects. Well-known scientific examples are the color wheel in physics, the periodic table in chemistry, and the Linnaean Taxonomy in biology. A key feature of these systems is the assumption that objects are more or less similar along some fundamental features. For example, similar animals in the Linnaean Taxonomy share prototypical features because they have a more recent common ancestor.

The prominence of classification systems in mature sciences suggests that psychology could also benefit from a classification of psychological objects. A key goal of psychological science is to understand humans’ experiences and behaviors. At a very abstract level, the causes of experiences and behaviors can be separated into situational and personality factors (Lewin, 1935). The influence of personality factors can be observed when individuals act differently in the same situation. The influence of situations is visible when the same person acts differently in different situations.

Personality psychologists have worked on a classification system of personality factors for nearly a century, starting with Allport and Odbert’s (1936) catalogue of trait words and Thurstone’s (1934) invention of factor analysis. Factor analysis has evolved, and there are many different options to conduct a factor analysis. The most important development was the invention of confirmatory factor analysis (Jöreskog, 1969). Confirmatory factor analysis (CFA) has several advantages over traditional factor analytic models, which are called exploratory factor analyses (EFA) to distinguish them from confirmatory analyses. The most important advantage is the ability to test hierarchical models of personality traits (Marsh & Myers, 1986). The specification of hierarchical models with CFA is called hierarchical factor analysis. Despite the popularity of hierarchical trait models, personality researchers continue to rely on exploratory factor analysis as the method of choice. This methodological choice impedes progress in the search for a structural model of personality traits.

Metaphorical Science

A key difference between EFA and CFA is that EFA is atheoretical. The main goal is to capture the most variance in the observed variables with a minimum of factors. This purely data-driven criterion implies that both the number and the nature of the factors are arbitrary. In contrast, CFA models aim to fit the data, and it is possible to compare models with different numbers of factors. For example, EFA would have no problem showing a single first factor even if feminine and masculine traits were independent (Marsh & Myers, 1986). However, such a model might show bad fit, and model comparison could show that a model with two factors fits the data better. The lack of model fit in traditional EFA applications may explain Goldberg’s attempt to explore hierarchical structures with a series of EFA models that specify different numbers of factors, starting with a single factor and adding one more factor at each step. For all solutions, factors are rotated based on some arbitrary criterion; Goldberg prefers Varimax rotation. As a consequence, factors within the same model are uncorrelated. His Figure 2 shows the results when this approach was applied to a large number of personality items.
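The point that a first EFA factor always emerges can be illustrated with a toy calculation (the numbers here are made up for illustration and are not from any dataset discussed in this post). For two pairs of items measuring two independent traits, the correlation matrix is block-diagonal, and the largest eigenvalue of each 2x2 block is 1 + r:

```python
# Toy example: two "masculine" items correlated r with each other, two
# "feminine" items correlated r with each other, zero correlation across
# the pairs. The eigenvalues of each 2x2 block [[1, r], [r, 1]] are 1 + r
# and 1 - r, so a first factor still "captures" (1 + r) / n of the total
# variance even though no general factor exists.

def first_factor_share(r, n_items=4):
    """Proportion of total variance captured by the first factor of a
    block-diagonal correlation matrix with within-pair correlation r."""
    largest_eigenvalue = 1 + r           # top eigenvalue of [[1, r], [r, 1]]
    return largest_eigenvalue / n_items  # total variance of n standardized items is n

share = first_factor_share(0.6)
print(f"First factor captures {share:.0%} of the variance")  # 40%
```

Because both blocks yield the same top eigenvalue, the "first factor" is in fact an arbitrary blend of the two independent traits, which is exactly the kind of arbitrariness that model comparison in CFA would expose.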

To reinforce the impression that this method reveals a hierarchical structure, factors at different levels are connected with arrows that point from the higher levels to the lower levels. Furthermore, correlations between factor scores are used to show how strongly factors at different levels are related. Readers may falsely interpret the image as evidence for a hierarchical model with a general factor on top. Goldberg openly admits that his method does not test hierarchical causal models and that none of the levels may correspond to actual personality factors.

“To many factor theorists, the structural representations included in this article are not truly ‘hierarchical,’ in the sense that this term is most often used in the methodological literature (e.g., Yung, Thissen, & McLeod, 1999). For those who define hierarchies in conventional ways, one might think of the present procedure in a metaphorical sense” (p. 356).

The difference between a conventional and an unconventional hierarchical model is best explained by the meaning of a directed arrow in a hierarchical model. In a conventional model, an arrow implies a causal effect, and causal effects of a common cause produce a correlation between the variables that share that common cause (PSY100). For example, in Figure 1, the general factor correlates r = .79 with the first factor at level 2 and r = .62 with the second factor at level 2. The causal interpretation of these path coefficients would imply that the correlation between the two level-2 factors is .79 x .62 = .49. Yet, it is clear that this prediction is false because factors at the same level were specified to be independent. It therefore makes no sense to draw the arrows in this direction. Goldberg realizes this, but does it anyway.
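The path-tracing arithmetic can be made explicit in a few lines of code (a sketch of the textbook tracing rule, using the loadings reported above):

```python
# Path-tracing rule: if a common cause g sends standardized paths a1 and a2
# to two factors, the model-implied correlation between those two factors
# is the product a1 * a2.

def implied_correlation(a1, a2):
    """Correlation implied by a shared common cause with loadings a1, a2."""
    return a1 * a2

r_implied = implied_correlation(0.79, 0.62)
print(f"implied r = {r_implied:.2f}")  # 0.49
# But Varimax-rotated factors at the same level are orthogonal by
# construction (observed r = 0), so the causal reading of the arrows
# cannot be right.
```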

“While the author has found it useful to speak of the correlations between factor scores at different levels as ‘path coefficients,’ strictly speaking they are akin to part-whole correlations, but again the non-traditional usage can be construed metaphorically.”

It would have been better to draw the arrows in the opposite direction, because we can interpret the reversed path coefficients as information about the loss of information when the number of factors is reduced by one. For example, the correlation of r = .79 between the first level-2 factor and the general factor implies that the general factor still captures .79^2 = 62% of the variance of the first level-2 factor, and 38% of the variance is lost in the one-factor model. Goldberg fittingly called his approach ass-backwards, which means the arrows need to be interpreted in the reverse direction.
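In code, the reversed reading is just the squared correlation (illustrative, using the value from Figure 1):

```python
# Reversed reading of the arrows: the squared correlation between a level-2
# factor and the general factor is the share of its variance that survives
# the reduction to one factor; the remainder is thrown away.

def variance_retained(r):
    return r ** 2

def variance_lost(r):
    return 1 - r ** 2

r = 0.79  # correlation between the first level-2 factor and the general factor
print(f"retained: {variance_retained(r):.0%}")  # 62%
print(f"lost:     {variance_lost(r):.0%}")      # 38%
```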

The key advantages of Goldberg’s approach are that researchers did not need to buy additional software before R made CFA free of charge, did not have to learn structural equation modeling, and did not have to worry about model fit. A hierarchical structure with a general factor can always be found, even if the first factor is unrelated to some of the lower factors (see Figure 3 in Goldberg).

There is also no need to demonstrate consistency across datasets. The factors in the two models show different relations to the five factors at the lowest level. This is the beauty of metaphorical science: every analysis provides a new metaphor that reflects personality without any ambition to reveal fundamental factors that influence human behavior.

Metaphorical Pathological Traits

It would be unnecessary to mention Goldberg’s metaphorical hierarchical models if personality researchers had ignored his approach and used CFA to test hierarchical models. In fact, there have been no notable applications of Goldberg’s approach in mainstream personality psychology. However, the method has gained popularity among clinical psychologists interested in personality disorders. A highly cited article by Kotov et al. (2017) claims that Goldberg’s method “supported the presence of a p factor, but also suggested that multiple meaningful structures of different generality exist between the six spectra and a p factor” (p. 463). I do not doubt that meaningful metaphors can be found to describe maladaptive traits, but it is problematic that interpretability is the sole criterion to justify the claim of a hierarchical structure of personality factors that may cause intrapersonal and interpersonal problems. Although Kotov et al. (2017) mention confirmatory factor analysis as a potential research tool, they do not mention that Goldberg’s method is fundamentally different from hierarchical CFA.

The most highly cited application of Goldberg’s method is published in an article by Wright, Thomas, Hopwood, Markon, Pincus, and Krueger (2012). The data are undergraduate (N = 2,461) self-ratings on the 220 items of the Personality Inventory for DSM-5. The 220 items are scored to provide information about 25 maladaptive traits that are correlated with each other. Wright et al. show that the correlations among the 25 scales can be represented with five correlated factors, but they do not provide fit indices of the five-factor solution. Correlations among the five factors ranged from r = .043 to .437.

Figure 1 in Wright et al. (2012) shows Goldberg’s hierarchical structure.

Naive interpretation of the structure and path coefficients suggests the presence of a strong general factor that contributes to personality pathology. This general factor appears to explain a large amount of variance in internalizing and externalizing personality problems. Internalizing and externalizing factors explain considerable variance in four of the five primary factors, but psychoticism appears to be rather weakly related to the other traits and the p-factor. However, this interpretation of the results is only metaphorical.

A proper interpretation of the hierarchy focuses on the variance that is lost when five factors are reduced to fewer factors. For example, combining the internalizing and externalizing factors into a single p-factor implies that 72% of the variance in internalizing traits is retained and 28% is lost. For externalizing traits, only 32% of the variance is retained and 68% is lost. Combined, the reduction of two factors to one leads to a loss of 96% of the variance. This implies that the two factors are orthogonal, because reducing two independent factors to one leads to a loss of 50% of the variance in each and a loss of 100% of the variance in both (with each factor counting as 100%, for a total of 200%). Thus, rather than supporting the presence of a strong p-factor, Figure 1 actually suggests that there is no strong general factor. This is not surprising when we look at the correlations among the factors. Negative affect (internalizing) correlated weakly with the externalizing factors antagonism, r = .04, and disinhibition, r = .09. These correlations suggest that internalizing and externalizing traits are independent rather than sharing a common influence of a general pathology factor.
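The bookkeeping behind these numbers is simple enough to check in a few lines (retained shares taken from the text above; the 200% baseline treats each factor's variance as 100%):

```python
# Variance bookkeeping for collapsing two factors into one. Each factor's
# variance counts as 100%, so two factors provide 200% in total. Two truly
# independent factors collapsed into one lose 100% of that total.

def combined_loss(retained_1, retained_2):
    """Summed proportion of variance lost across two factors."""
    return (1 - retained_1) + (1 - retained_2)

loss = combined_loss(0.72, 0.32)  # internalizing, externalizing (from Figure 1)
print(f"combined loss: {loss:.0%} of a possible 200%")  # 96%
# 96% is close to the 100% expected for two orthogonal factors, so the
# hierarchy provides little evidence for a strong p-factor.
```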

Hierarchical Confirmatory Factor Analysis

I used the correlation matrix in Wright et al.’s (2012) Table 1 to build a hierarchical model with CFA. The first model imposed a hierarchical structure with four levels and a correlation between the top two factors. It would have been possible to specify a general factor instead, but with only two indicators the loadings on such a general factor are not identified. The model had good fit, chi2(df = 2) = 2.51, CFI = 1.000, RMSEA = .010.

The first observation is that the top two factors are only weakly correlated, r = .11. This supports the conclusion that there is no evidence for a general factor of personality pathology that contributes substantially to correlations among specific PID-5 scales. The second observation is that many factors at higher levels are identical to lower-level traits. Thus, the observation that there are factors at all levels is illusory. The NA factor at the highest level is practically identical with the NA factor at the lowest level. The duplication of factors at various levels is unnecessary and confusing. Therefore, I built a truly hierarchical CFA model that does not specify the number of levels in the hierarchy a priori. This model also had good fit, chi2(df = 2) = 2.51, CFI = 1.000, RMSEA = .010.

The model shows that detachment and negative affect are related to each other by a shared factor (F1-1) that could be interpreted as internalizing. Similarly, Antagonism and Disinhibition share a common factor (F1-2) that could be labeled externalizing. At a higher level, a general factor relates these two factors as well as psychoticism. The loadings on the general factor are high, suggesting that scores on the PID-5 scales are correlated with each other because they share a single common factor. The low correlations between Negative Affect and externalizing are attributed to a negative relationship of the externalizing factor (F1-2) and Negative Affect.

The good fit of these models does not imply that they capture the true nature of the relationships among PID-5 scales. It is also not clear whether the p-factor is a substantive factor or reflects response styles. However, unlike Goldberg’s method, HCFA can be used to test hierarchical models of personality traits. Thus, researchers who are speculating about hierarchical structures need to subject their models to empirical tests with HCFA. Goldberg’s method is metaphorical, unsuitable, and unscientific. It creates the illusion that it reveals hierarchical structures, but it merely shows which variances are lost in models with fewer factors. In contrast, HCFA can be used to test models that aim to explain variance rather than throwing it away.

Personality Disorder Research: A Pathological Paradigm

Every scientist who has read Kuhn’s influential book “The Structure of Scientific Revolutions” might wonder about the long-term impact of their work. According to Kuhn, scientific progress is marked by periods of normal growth and periods of revolutionary change (Stanford Encyclopedia of Philosophy, 2004). During times of calm, scientific research is guided by paradigms. Paradigms are defined as “the key theories, instruments, values and metaphysical assumptions that guide research and are shared among researchers within a field.” Paradigm shifts occur when one or more of these fundamental assumptions are challenged and shown to be false.

Revolutionary paradigm shifts can have existential consequences for scientists who are invested in a paradigm. Just as revolutionary technologies may threaten incumbent technologies (e.g., electric vehicles), scientific research may lose its value after a paradigm shift. For example, while the general principles of operant conditioning hold, many of the specific studies with Skinner boxes lost their significance after the demise of behaviorism. Similarly, the replicability revolution invalidated many social psychological experiments on priming after it became apparent that selective publication of significant results from small samples produces results that cannot be replicated, even though replicability is a hallmark feature of science.

Personality research has seen a surprisingly long period of paradigmatic growth over the past 40 years. In the 1980s, a consensus emerged that many personality traits can be related to five higher-order factors that came to be known as the Big Five. Paradigmatic research on the Big Five has produced thousands of results that show how the Big Five are related to other trait measures, genetic and environmental causes, and life outcomes. The key paradigmatic assumption of the Big Five paradigm is that self-reports of personality are accurate measures of the Big Five traits. Aside from this basic assumption, the Big Five paradigm is surprisingly vague about other common features of paradigms. For example, there is no dominant theory about the nature of the Big Five traits (i.e., what are these dimensions?). It is also unclear why there would be five rather than four or six higher-order traits. Moreover, there is no agreement about the relationship among the Big Five (are they independent or correlated?), or the relationship of specific traits to the Big Five (e.g., is trust related to neuroticism, agreeableness, or both?). These questions can be considered paradigmatic questions that researchers aim to answer by conducting studies with self-report measures of the Big Five.

Research on personality disorders has an even longer history with its roots in psychiatry and psychodynamic theories. However, the diagnosis of personality disorders also witnessed a scientific revolution when clinical psychologists started to examine disorders from the perspective of Big Five theories of normal personality (Millon & Frances, 1987). Psychological research on personality disorders was explicitly framed as developing an alternative model of personality disorders that would replace the model of personality disorders developed by psychiatrists (Widiger & Simonsen, 2005). Currently, the scientific revolution is ongoing and the Diagnostic and Statistical Manual of Mental Disorders lists several approaches to the diagnosis of personality disorders.

The common assumptions of the Personality Disorder Paradigm in Clinical Psychology are that (a) there is no clear distinction between normal and disordered personality, and disorders are defined by arbitrary values on a continuum (Markon, Krueger, & Watson, 2005); (b) the Big Five traits account for most of the meaningful variance in personality disorders (Costa & McCrae, 1980); and (c) self-reports of personality disorders are valid measures of actual personality disorders (Markon, Quilty, Bagby, & Krueger, 2013).

Over the past two decades, paradigmatic research within the PDP has examined personality disorders using the following paradigmatic steps: (a) write items that are intended to measure personality disorders or maladaptive traits, (b) administer these items to a sample of participants, (c) demonstrate that these items have internal consistency and can be summed to create scale scores, and (d) examine the correlations among scale scores. These studies typically show that personality disorder scales (PDS) are correlated and that five factors can represent most, but not all, of these correlations (Kotov et al., 2017). Four of these five factors appear to be similar to four of the Big Five factors, but Openness is not represented in factor models of personality disorders and Psychoticism is not represented in the Big Five. This has led to paradigmatic questions about the relationship between Openness and Psychoticism among personality disorder researchers.

Another finding has been that factor analytic models that allow for correlations among factors show replicable patterns of correlations. This finding is surprising because the Big Five factors and measures of the Big Five were developed using factor models that impose independence on factors, and items were selected to be representative of these orthogonal factors. The correlations among personality scales have been the topic of various articles in the Big Five paradigm and the personality disorder paradigm. This research has led to hierarchical models of personality with a single factor at the top of the hierarchy (Musek, 2007). In the personality disorder paradigm, this factor is often called “general personality pathology,” the “general factor of personality pathology,” or simply the p-factor (Asadi, Bagby, Krueger, Pollock, & Quilty, 2021; Constantinou et al., 2022; Hopwood, Good, & Morey, 2018; Hyatt et al., 2021; McCabe, Oltmanns, & Widiger, 2022; Oltmanns, Smith, Oltmanns, & Widiger, 2018; Shields, Giljen, Espana, & Tackett, 2021; Uliaszek, Al-Dajani, & Bagby, 2015; Van den Broeck, Bastiaansen, Rossi, Dierckx, De Clercq, & Hofmans, 2014; Widiger & Oltmanns, 2017; Williams, Scalco, & Simms, 2018).

It is symptomatic of a pathological paradigm that researchers within the paradigm have uncritically accepted that the general factor in a factor analysis represents a valid construct, while alternative interpretations of this finding are ignored or dismissed with flawed arguments. Most of the aforementioned articles do not even mention alternative explanations for the general factor in self-ratings. Others mention, but dismiss, the possibility that this general factor at least partially reflects method variance in self-ratings. McCabe et al. (2022) note that “the results of the current study are consistent with, or at least don’t rule out, the social undesirability or evaluation bias hypothesis” (p. 151). They dismiss this alternative explanation with a reference to a single study from 1983 that showed “much of the variance in socially desirability scales was substantively meaningful individual differences (McCrae & Costa, 1983)” (p. 151). Notably, the authors cite several more recent articles that provided direct evidence for the presence of evaluative biases in self-ratings of personality (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Backstrom, Bjorklund, & Larsson, 2009; Chang, Connelly, & Geeza, 2012; Pettersson, Turkheimer, Horn, & Menatti, 2012), but do not explain why these studies do not challenge their interpretation of the general factor in self-ratings of personality disorders.

The strongest evidence for the interpretation of the general factor as a method factor comes from multi-trait-multi-method studies (Campbell & Fiske, 1959). True traits should show convergent validity across raters. In contrast, method factors produce correlations among ratings by the same rater, but not across different raters. Most factors are likely to be a mixture of trait and method variance. Thus, it is essential to quantify the amount of method and trait variance and avoid general statements of validity (Cronbach & Meehl, 1955; Schimmack, 2021). A few studies of personality disorders have used multiple methods. However, most publications have not analyzed these data using a multi-trait-multi-method approach to separate trait and method variance. I could find only one article that modeled multi-method data (Blackburn, Donnelly, Logan, & Renwick, 2004). Consistent with multi-method studies of normal personality, the results showed modest convergent validity across raters and a clear method factor that often explained more variance in self-report scales than the trait factors. However, this finding has been ignored by subsequent researchers. To revisit this issue, I analyzed three multi-method datasets.
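The MTMM logic can be sketched with the path-tracing rule (the loadings below are hypothetical, not estimates from any of the studies discussed here): a shared method factor inflates same-rater correlations, whereas cross-rater correlations can only reflect shared trait variance.

```python
# Model: rating = trait_loading * trait + method_loading * rater_bias + error.
# The implied correlation between two ratings sums the products of loadings
# over every factor the two ratings share.

def implied_r(share_trait, t1, t2, share_method, m1, m2):
    """Path-tracing correlation between two ratings (hypothetical loadings)."""
    r = 0.0
    if share_trait:   # both ratings target the same trait
        r += t1 * t2
    if share_method:  # both ratings come from the same rater
        r += m1 * m2
    return r

# Same rater, two independent traits: correlation is driven by halo bias alone.
print(f"{implied_r(False, 0.7, 0.7, True, 0.5, 0.5):.2f}")  # 0.25
# Two raters, same trait: convergent validity, free of shared method variance.
print(f"{implied_r(True, 0.7, 0.7, False, 0.5, 0.5):.2f}")  # 0.49
```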

Study 1

Niemeyer, L. M., Grosz, M. P., Zimmermann, J., & Back, M. D. (2022). Assessing maladaptive personality in the forensic context: Development and validation of the Personality Inventory for DSM-5 Forensic Faceted Brief Form (PID-5-FFBF). Journal of Personality Assessment, 104(1), 30-43. DOI: 10.1080/00223891.2021.1923522

This study was conducted in Germany with male prisoners. Personality was measured with self-reports and informant ratings by the prisoners’ psychologist or social worker and a penal service officer. This made it possible to separate method and trait variance, where trait variance is defined as variance that is shared among the three raters.

Normal personality was assessed with a 15-item measure of the Big Five. Given the small sample, scale scores rather than items were used as indicators of normal personality. Maladaptive personality was measured with a forensic adaptation of the German version of the PID-5 Faceted Brief Form. This measure has 25 scales. Given the small sample size, the focus was on scales that can serve as indicators of four higher-order factors that are related to neuroticism, extraversion, agreeableness, and conscientiousness. A fifth psychoticism factor could not be identified in this dataset. The scales were Anxiousness, Separation Insecurity, and Depression for Neuroticism (Negative Affectivity); Withdrawal, Intimacy Avoidance, and Anhedonia for low Extraversion (Detachment); Manipulativeness, Deceitfulness, and Grandiosity for low Agreeableness (Antagonism); and Impulsivity, Irresponsibility, and Distractibility for low Conscientiousness (Disinhibition).

There are multiple ways to separate method and trait variance in hierarchical multi-trait-multi-method models. I used Anusic et al.’s (2009) approach that first modeled the hierarchical structure separately for each rater and then defined trait factors at the highest level of the hierarchy. This approach makes it possible to examine the amount of convergent validity for the higher-order factors that are the primary focus in this analysis. Additional agreement for the unique variance in facets was modeled using a bi-factor approach where additional facet factors reflect only the unique variance in facets.
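To make concrete what separating method and trait variance means for a single standardized scale score, here is a minimal decomposition sketch (the loadings are placeholders, not estimates from this study):

```python
# Each standardized scale score is decomposed into trait variance (shared
# across raters), rater-specific method variance, and residual variance.
# With standardized loadings, the three shares sum to 1.

def decompose(trait_loading, method_loading):
    trait = trait_loading ** 2
    method = method_loading ** 2
    unique = 1.0 - trait - method  # residual + scale-specific variance
    return trait, method, unique

t, m, u = decompose(0.6, 0.5)  # hypothetical loadings
print(f"trait: {t:.0%}, method: {m:.0%}, unique: {u:.0%}")  # 36%, 25%, 39%
```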

The first model assumed that there are no secondary loadings, no correlations among the four trait factors, and no correlations among the rater-specific indicators. This model had poor fit, CFI = .732, RMSEA = .079.

The second model added correlations among the four Big Five factors. The general personality pathology model predicts correlations among the four factors: Neuroticism should be negatively correlated with the other three factors, and the other three factors should be positively correlated with each other. Allowing for these correlations improved model fit, CFI = .754, RMSEA = .076, but overall fit was still poor. Furthermore, the pattern of correlations did not conform to predictions. Mainly, neuroticism was positively correlated with agreeableness, r = .24, and extraversion was negatively correlated with agreeableness, r = -.12.

Exploration of the other relationships suggested that secondary loadings accounted for some of the correlations among the trait factors. Namely, Anxiousness and Depression had negative loadings on Extraversion and Anhedonia had a secondary loading on Neuroticism (N-E); Deceitfulness had a secondary loading on Conscientiousness and Impulsivity and Irresponsibility had secondary loadings on Agreeableness (A-C); finally, Impulsivity, Irresponsibility, and Distractibility had secondary loadings on Neuroticism (N-C). Adding these secondary loadings to the measurement model of each rater improved model fit, and RMSEA suggested acceptable fit, CFI = .853, RMSEA = .060. In this model, none of the correlations among the trait factors were significant at the .01 level, and the pattern still did not conform to predictions of the g-factor model. However, it was possible to replace the correlations among the four factors with a fixed loading pattern. This model had only slightly worse fit, CFI = .850, RMSEA = .060. Loadings on this factor ranged from .19 for extraversion to .46 for conscientiousness. At the same time, a model without correlations or a GFP factor fit the data equally well, CFI = .850, RMSEA = .060.

The next models examined potential method factors. Evaluative bias factors were added for each of the three raters with fixed loadings. This improved model fit, CFI = .854, RMSEA = .059. The standardized loadings on the halo factor in self-ratings ranged from .31 (Extraversion) to .67 (Conscientiousness). A general factor could not be identified for one of the informant factors, and none of the loadings on the other informant factor were significant at alpha = .01. This suggests that evaluative biases were mostly present in self-ratings. Removing the method factors for the two informants did not change model fit, CFI = .854, RMSEA = .059.

To examine the unconstrained relationships among the self-rating factors, I replaced the method factor with free correlations among the four self-rating factors. Model fit decreased a bit, indicating that the more parsimonious model did not severely distort the pattern of correlations, CFI = .854, RMSEA = .060. The pattern of correlations matched predictions of the halo model, but only one of the correlations was significant at alpha = .01.

In conclusion, model comparisons suggested the presence of an evaluative bias factor in self-ratings of German male prisoners and provided no evidence that a general personality pathology factor produces correlations among measures of normal and maladaptive personality. Of course, these results from a relatively small sample drawn from a unique population cannot be generalized to other populations, but the results are consistent with multi-method studies of normal personality in various samples.

Study 2

Brauer, K., Sendatzki, R., & Proyer, R. T. (2022). Localizing gelotophobia, gelotophilia, and katagelasticism in domains and facets of maladaptive personality traits: A multi-study report using self- and informant ratings. Journal of Research in Personality, 98, Article 104224.

This study examined personality disorders in a German community sample. For each target, one close other provided informant ratings. Personality disorders were assessed with the German version of the brief PID-5 that is designed to measure five higher-order dimensions of personality pathology, namely Negative Affectivity (Neuroticism), Detachment (low Extraversion), Antagonism (low Agreeableness), Disinhibition (low Conscientiousness), and Psychoticism (not represented in the Big Five model of normal personality).

With only two methods, it is necessary to make assumptions about the validity of each rater. A simple way of doing so is to constrain the unstandardized loadings of self-ratings and informant ratings to be equal, under the assumption that self-ratings and informant ratings are approximately equally valid. The first model assumed that there is no method variance and that the five factors are independent. This model had poor fit, CFI = .295, RMSEA = .236. I then allowed the five factors to correlate freely with each other. Model fit improved but remained low, CFI = .605, RMSEA = .204. The pattern of correlations conformed to the predictions of the general personality pathology model. I then added method factors for self-ratings and informant ratings to the model. Loadings on these factors were constrained to be equal for all five scales. This modification increased model fit considerably, CFI = .957, RMSEA = .067. Loadings on the method factors were substantial (> .4). Furthermore, several of the trait correlations were no longer significant at the .01 level, suggesting that some of these correlations were spurious and reflected unmodeled method variance.
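The equal-validity assumption has a simple arithmetic consequence that can be sketched in code (the correlation below is hypothetical, not a value from this study): if self- and informant ratings load equally on the same trait factor, their implied correlation is the squared loading, so the common validity coefficient is the square root of the cross-method correlation.

```python
import math

# Under the equal-validity constraint, self- and informant ratings share one
# trait factor with a common standardized loading v, so their implied
# correlation is v * v. The loading can therefore be recovered as sqrt(r).

def implied_validity(cross_method_r):
    """Common loading implied by a self-informant correlation, assuming
    equally valid raters and no shared method variance (illustrative)."""
    return math.sqrt(cross_method_r)

v = implied_validity(0.36)  # hypothetical self-informant agreement
print(f"implied validity loading: {v:.2f}")  # 0.60
```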

The next model examined whether some of the variance in the method factors reflects an actual g-factor of personality pathology. To do so, I removed the correlations among the trait factors and let the two method factors correlate. A correlation between these method factors can be interpreted as convergent validity for independent measures of the g-factor. This modification produced a reduction in model fit, CFI = .865, RMSEA = .107, and showed a significant correlation between the two method factors, r = .33. This finding suggests that one-third of the variance in these factors may reflect a real g-factor. However, model fit suggests that this model misrepresents the actual pattern of correlations in the data. Exploratory analyses suggested that Extraversion and Psychoticism were negatively related to Conscientiousness and that Psychoticism had a stronger loading on the method factor. Adding these modifications raised model fit to acceptable levels, CFI = .957, RMSEA = .064. In this model, the two method factors remained correlated, but the confidence interval shows that a substantial amount of the variance is unique to each method factor, r = .27, 95%CI = .08 to .44, although the lower bound of the confidence interval is close to zero. In sum, these results provide further evidence that the general factor in self-ratings of personality pathology partially reflects method variance rather than a general disposition to have many personality disorders.

Study 3

Oltmanns, J. R., & Widiger, T. A. (2021). The self- and informant-personality inventories for ICD-11: Agreement, structure, and relations with health, social, and satisfaction variables in older adults. Psychological Assessment, 33(4), 300–310.

The data for this study come from a longitudinal study of personality disorders. Participants nominated close relatives who provided informant ratings. Personality disorders were measured using the Personality Inventory for ICD-11 and an informant version of the same questionnaire. The questionnaire assesses five dimensions. Four of these dimensions correspond to the Big Five, namely Negative Affectivity (Neuroticism), Detachment (low Extraversion), Dissociality (low Agreeableness), and Disinhibition (low Conscientiousness). The fifth dimension is called Anankastia, which may be related to a maladaptive form of high Conscientiousness (e.g., perfectionism).

I used the published MTMM matrix in Table 2 to examine the presence of a general personality factor and method variance. Like the original authors, I fitted a four-factor model with Disinhibition and Anankastia as opposite indicators of a bipolar Conscientiousness factor. The four factors were clearly identified, but the model without method factors and correlations among the four factors did not fit the data, CFI = .279, RMSEA = .243. Allowing for correlations among the four factors improved model fit, but fit remained poor, CFI = .443, RMSEA = .232. The pattern of correlations was consistent with the p-factor predictions. The next model added method factors for self-ratings and informant ratings. All loadings except those for the Anankastia scales were fixed. The Anankastia loadings were free because high Conscientiousness is desirable and should load less on a factor that reflects undesirable content in other scales. The inclusion of method factors improved model fit, CFI = .844, RMSEA = .131, but fit was not acceptable. All scales except the Anankastia scales had notable (> .4) loadings on the method factors. Only the trait correlation between Agreeableness and Conscientiousness was significant at alpha = .01, r = .22. The next model removed all of the other correlations and allowed for a correlation between the two method factors. This model had similar fit, CFI = .845, RMSEA = .123. The correlation between the two method factors was r = .24. Exploratory analysis showed rater-specific correlations between the Disinhibition and Anankastia (i.e., low and high Conscientiousness) scales. Adding these parameters to the model improved model fit, CFI = .920, RMSEA = .091, but did not alter the correlation between the two method factors, r = .26. Freeing the loading of the Negative Affectivity scale on the method factors further improved model fit, CFI = .948, RMSEA = .076, but did not alter the correlation between the two method factors.
Freeing the loading of the Disinhibition scales on the method factors further improved model fit, CFI = .972, RMSEA = .058, but the correlation between the two method factors remained the same, r = .24. The 95% confidence interval ranged from .14 to .33. These results are consistent with Study 2.
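For readers who want to check fit indices like those reported throughout, CFI and RMSEA are simple functions of the model chi-square, its degrees of freedom, the baseline (independence) model chi-square, and the sample size. A sketch under the standard definitions; the chi-square values below are hypothetical placeholders, since the studies report only the indices:

```python
import math

def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index: improvement over the baseline (independence) model."""
    d_m = max(chi2_m - df_m, 0.0)   # model misfit beyond chance
    d_b = max(chi2_b - df_b, d_m)   # baseline misfit (at least d_m)
    return 1.0 - d_m / d_b if d_b > 0 else 1.0

def rmsea(chi2_m, df_m, n):
    """Root Mean Square Error of Approximation for a sample of size n."""
    return math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))

# hypothetical chi-square values, chosen only to illustrate the formulas
fit_cfi = cfi(chi2_m=250.0, df_m=80, chi2_b=2000.0, df_b=105)
fit_rmsea = rmsea(chi2_m=250.0, df_m=80, n=500)
```

By these definitions, the example values give CFI ≈ .91 and RMSEA ≈ .065; a model that fit perfectly (chi-square no larger than its degrees of freedom) would give CFI = 1 and RMSEA = 0.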

General Discussion

Factor analyses of personality disorder questionnaires have suggested the presence of one general factor that predicts higher scores on all measures of maladaptive personality traits. A major limitation of these studies was the reliance on a single method, typically self-reports. Mono-method studies are unable to distinguish between method and trait variance (Campbell & Fiske, 1959). Although a few studies have used multiple methods to measure personality traits, personality disorder researchers did not analyze these data with multi-method models. I provide results of multi-method modeling of three datasets. All three datasets show a method factor in self-reports of personality disorders that is either independent of informant ratings by observers (Study 1) or only weakly related to informant ratings by close others (Studies 2 and 3). These results are consistent with the presence of method factors in self-ratings of normal personality (Anusic et al., 2009; Biesanz & West, 2004; Chang et al., 2012). It is therefore not surprising that the same results were obtained. However, it is surprising that personality disorder researchers have ignored this evidence. The reason does not appear to be a lack of awareness that multi-method data are important. For example, nearly a decade ago several prominent personality disorder researchers noted that “because most of these studies (including our own study) are based on self-report measures, substantial parts of the multitrait-multimethod matrix currently remain unexplored” (Zimmermann, Altenstein, Krieger, Holtforth, Pretsch, Alexopoulus, Spitzer, Benecke, Krueger, Markon, & Leising, 2014). One possible explanation for the lack of multi-method analyses might be that they threaten the construct validity of personality disorder instruments: a substantial portion of the variance in personality disorder scales could turn out to reflect method factors.
Nearly two decades ago, Blackburn and colleagues presented a multi-method model of personality showing that method factors explained more variance than trait factors (Blackburn, Donnelly, Logan, & Renwick, 2004). However, this article received only 43 citations, whereas articles that interpret method variance as a general trait have garnered over 1,000 citations (Kotov et al., 2017). The uncritical reliance on self-ratings reveals a pathological paradigm that is built on false assumptions. Self-reports of personality disorders are not highly valid measures of personality disorders. Even if scale scores are internally consistent and reliable over time, they cannot be accepted as unbiased measures of actual pathology. The limitations of self-reports are well known in other domains and have led to reforms in clinical assessments of other disorders. For example, the DSM-5 now explicitly states that the diagnosis of Attention Deficit Hyperactivity Disorder (ADHD) requires assessment with symptom ratings by at least two raters (Martel, Schimmack, Nikolas, & Nigg, 2015). The present results show the importance of a multi-rater approach to assessing personality disorders.

In contrast, the assessment of personality disorders in the DSM-5 is not clearly specified. A traditional system is still in place, but two alternative approaches are also mentioned. One approach is called Criterion A. It is based on the assumption that distinct personality disorders are highly correlated and that it is sufficient to measure a single dimension that is assumed to reflect severity of dysfunction (Sharp & Wall, 2021). A popular self-report measure of Criterion A is the Levels of Personality Functioning Scale (Morey et al., 2011). Aside from conceptual problems, it has been shown that the LPFS is nearly identical to measures of evaluative bias in self-ratings of normal personality (Schimmack, 2022). The present results show that most of this variance is rater-specific and reflects method factors. Thus, there is currently no evidence for the construct validity of measures of general personality functioning. Unless such evidence can be provided with multi-method data, Criterion A should be removed from the next version of the DSM.

Meta-Psychological Reflections

Psychology is not a paradigmatic science that rests on well-established assumptions. Instead, psychology is best characterized as a collection of mini-paradigms that are based on assumptions that are not shared across paradigms. For example, many experimentalists would reject evidence based on cross-sectional studies of self-reports. In contrast, research on personality disorders rests nearly entirely on the assumption that self-reports provide valid information about personality disorders. While most researchers would probably acknowledge that method factors exist, there are no scientific attempts to assess and minimize their impact. Instead, method effects are minimized by false appeals to the validity of self-reports. For example, without any empirical evidence and in total disregard of existing evidence, Widiger and Oltmanns (2017) state “It is evident that most persons are providing reasonably accurate and honest self-descriptions. It would be quite unlikely that such a large degree of variance would reflect simply impression management” (p. 182). The need for a multi-method assessment is often acknowledged in the limitations section and delegated to future research that is never conducted, even when these data are available (Asadi, Bagby, Krueger, Pollock, & Quilty, 2021).

These examples are symptomatic of a pathological paradigm in need of a scientific revolution. Unlike other pathological paradigms in psychology, the assessment of mental illnesses has huge practical implications that require a thorough examination of the assumptions that underpin clinical diagnoses. At present, the diagnosis of personality disorder is not based on scientific evidence, and claims about the validity of personality disorder measures are unscientific. Most likely, a valid assessment of personality disorders requires a multi-method approach that follows the steps outlined by Cronbach and Meehl (1955). To make progress, clinical psychologists need better training in psychometrics, and journals need to prioritize multi-method studies over quick and easy studies that rely exclusively on self-reports with online samples (e.g., Hopwood et al., 2018). Most importantly, it is necessary to overcome the human tendency to confirm prior beliefs and to allow empirical data to disconfirm fundamental assumptions. It is also important to listen to scientific criticism, even if it challenges fundamental assumptions of a paradigm. Too often scientists take such criticisms as personal attacks because their self-esteem and identity are wrapped up in their paradigmatic achievements. As a result, threats to the paradigm become existential threats. A healthy distinction between self and work is needed to avoid defensive reactions to valid criticism. Clinical psychologists should be the first to realize the importance of this recommendation for the long-term well-being of their science and their own well-being.

The Evaluative Factor in Self-Ratings of Personality Disorder: A Threat to Construct Validity


This blog post reanalyzes data from a study that examined the validity of the Levels of Personality Functioning Scale and the Personality Inventory for DSM-5. I show that the halo-Big Five model fits the correlations among the 25 DSM-5 scales and that the halo factor shows high convergent validity with the Levels of Personality Functioning factor. Whereas personality disorder researchers interpret the general factor in self-report measures of personality functioning as a broad disposition, research on this factor with measures of normal personality suggests that it reflects a response style that is unique to self-ratings. While the evidence is not conclusive, it is problematic that personality disorder researchers ignore the potential contribution of response styles in the assessment of personality disorders with self-reports.


Concerns about the validity of self-reports are as old as self-reports themselves. Some psychologists distrust self-reports so much that they interpret low correlations between self-ratings and behavioral measures as evidence that the behavioral measure is valid (Greenwald et al., 1998). On the other hand, other psychologists often uncritically accept self-reports as valid measures (Baumeister, Campbell, Krueger, & Vohs, 2003). This uncritical acceptance of self-reports may be traced back to the philosophy of operationalism in psychology. Accordingly, constructs are defined by methods, as in the infamous saying that intelligence is whatever an IQ test measures. Similarly, personality traits like extraversion might be operationalized by self-reports of personality. Accordingly, extraversion is whatever a self-report measure of extraversion measures.

Most psychometricians today would reject operationalism and distinguish between constructs and measures. As a result, it is possible to critically examine whether a measure measures the construct it was designed to measure. This property of a measure is called construct validity (Cronbach & Meehl, 1955). From this perspective, it is possible that an IQ test may be a biased measure of intelligence or that a self-report measure of extraversion is an imperfect measure of extraversion. To examine construct validity, it is necessary to measure the same construct with multiple (i.e., at least two) independent methods (Campbell & Fiske, 1959). If two independent measures measure the same construct, they should be positively correlated. This property of measures is called convergent validity.

The most common approach to measure personality with multiple methods is to ask acquaintances to provide informant reports of personality. This approach has been used to demonstrate that self-ratings of many personality traits have convergent validity with reports by others (Connelly & Ones, 2010). The same method has also been used to demonstrate convergent validity for the measurement of maladaptive personality traits (Markon, Quilty, Bagby, & Krueger, 2013). These studies also show that convergent validity is lower than the reliability of self-ratings and informant ratings. This finding indicates that some of the reliable variance in these ratings is method variance (Campbell & Fiske, 1959). To increase the validity of self-ratings of personality, it is necessary to examine the factors that produce method variance and minimize their contribution to the variance in self-ratings.
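Campbell and Fiske's logic can be made concrete with a back-of-the-envelope partition. If self- and informant ratings share only the trait and both raters are assumed equally valid, their convergent correlation equals the proportion of trait variance in each rating, and reliable variance beyond that is method variance. A sketch with hypothetical but typical values (the reliability and convergent correlation below are illustrative, not taken from the cited studies):

```python
def variance_partition(reliability, convergent_r):
    """Partition one rating's variance under a minimal two-method model in
    which self- and informant ratings share only the trait and are equally
    valid: trait variance equals the convergent correlation, method variance
    is reliable variance not shared across raters, and error is the
    unreliable remainder."""
    trait = convergent_r                 # trait variance (upper bound)
    method = reliability - convergent_r  # reliable but rater-specific
    error = 1.0 - reliability            # unreliable variance
    return trait, method, error

# e.g., reliability .85 and self-informant convergent validity .45
trait, method, error = variance_partition(0.85, 0.45)
```

Under these assumptions, almost as much reliable variance is method-specific (.40) as trait-based (.45), which is why convergent validity below reliability implies method variance.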

Research on the unique variance in self-ratings of personality has demonstrated that a large portion of this variance reflects a general evaluative factor (Anusic, Schimmack, Lockwood, & Pinkus, 2009). This factor is present in self-ratings and informant ratings, but does not show convergent validity across raters (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Moreover, it is related to other measures of self-enhancement such as inflated ratings of desirable traits like attractiveness and intelligence (Anusic et al., 2009). It also predicts self-ratings of well-being, but not informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2012), suggesting that it is not a substantive trait. Finally, using personality items that are less evaluative reduces correlations among personality factors (Bäckström & Björklund, 2020). Taken together, these findings suggest that self-reports are influenced by the desirability of traits and that a consistent bias produces artificial correlations between items. This bias is often called halo bias (Thorndike, 1920) or socially desirable responding (Campbell & Fiske, 1959).

It seems plausible that socially desirable responding is an even bigger problem for the use of self-reports in the measurement of personality disorders, which are intrinsically undesirable (i.e., nobody wants to have a disorder). Yet, researchers of personality disorders have largely ignored the possibility that socially desirable responding biases self-ratings of personality. Rather, they have interpreted the presence of a general evaluative factor as evidence for a substantive factor that is either interpreted as a broad risk factor in a hierarchical model of factors that contribute to personality disorders (Morey, Krueger, & Skodol, 2013) or as an independent factor that reflects severity of personality disorders (Morey, 2017). These substantive interpretations have been challenged by evidence that the general factor in self-reports of personality disorders is highly correlated with the halo factor in self-ratings of normal personality (McCabe, Oltmanns, & Widiger, 2022). Using existing data, I was able to show that the halo factor in self-ratings of the Big Five personality factors was highly correlated with the general factor in the Levels of Personality Functioning Scale (Morey, 2017), r = .88 (Schimmack, 2022a), and the general factor in the Computerized Adaptive Test of Personality Disorders (CAT-PD), r = .94 (Schimmack, 2022b). In addition, the general factor in the Levels of Personality Functioning items is highly correlated with the general factor in the CAT-PD items, r = .86. These results suggest that the same factor contributes to correlations among self-ratings of personality and that this factor reflects the desirability of the items.

In this post, I extend this investigation to another measure of maladaptive personality traits, namely the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012). I also provide further evidence about the amount of variance in PID-5 scales that is explained by the general factor. McCabe et al.’s (2022) findings suggested that a large amount of the variance in some scales reflects mostly the general factor. For example, the general factor explained over 60% of the variance in Perceptual Dysregulation, Unusual Beliefs, Deceitfulness, Irresponsibility, Distractibility, and Impulsivity. If these self-report measures are used to diagnose personality disorders, it is vital to examine whether this variance reflects substantive problems or a mere response style to agree or disagree with desirable items.

I used Hopwood et al.’s (2018) data from Study 2 to fit a model to the correlations among the 25 PID-5 scales. The dataset also included ratings for the Levels of Personality Functioning Scale (Morey, 2017). Based on previous analyses, I used 10 items to validate the general factor in the PID-5 (Schimmack, 2022a). To ensure robustness of the model, I fitted the same model to two random splits of the full dataset and retained only parameters that were statistically significant across both models. The final model had acceptable fit, CFI = .928, RMSEA = .06, and better fit than exploratory factor analyses of the PID-5 (Markon, Quilty, Bagby, & Krueger, 2013).
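The split-half robustness check described above can be sketched generically: randomly split the sample, fit the same model to each half (the fitting step is whatever SEM software one uses), and retain only parameters that are significant in both halves. The parameter names below are hypothetical placeholders, not taken from the actual model:

```python
import random

def random_split(n_rows, seed=1):
    """Shuffle row indices and split them into two non-overlapping halves."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    half = n_rows // 2
    return idx[:half], idx[half:]

def retained_parameters(sig_in_half_a, sig_in_half_b):
    """Keep only parameters that were statistically significant in BOTH halves."""
    return sorted(set(sig_in_half_a) & set(sig_in_half_b))

half_a, half_b = random_split(100)
kept = retained_parameters({"g_by_anxiety", "g_by_depression", "n_by_anxiety"},
                           {"g_by_anxiety", "n_by_anxiety", "e_by_sociability"})
```

The intersection step is deliberately strict: a parameter that capitalizes on chance in one half is unlikely to replicate in the other, so only stable parameters survive.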

The main finding was that the general factor in the PID-5 correlated r = .837, SE = .016, with the factor based on the 10 LPFS items. This finding supports the hypothesis that the same, or at least highly similar factors, influence self-ratings on measures of normal personality and maladaptive personality. Table 1 shows the loadings of the 25 PID-5 scales on the general factor and on five additional factors that are likely to correspond to the Big-Five factors of normal personality. It also shows the contribution of unique factors to each scale that may be valid unique variance of dysfunctional personality.

The results replicate McCabe et al.’s (2022) finding that all PID-5 scales load on a general factor. Although the loadings are not as high, they are substantial, and 23 out of the 25 loadings are above .5. Table 1 also shows that all scales have unique variance that is not explained by the factors of the halo-Big5 model; 18 of the 25 loadings on the uniqueness factor are above .5. Finally, the loadings on the Big Five factors are consistent with factor analyses of the PID-5, but the magnitude of these loadings is relatively modest. Only 7 of the 25 loadings are above .5, and only 5 of the 25 scales have a higher loading on a Big-Five factor than on the general factor. These results are consistent with a similar analysis of the CAT-PD scales (Schimmack, 2022b).

In conclusion, self-reports of maladaptive personality that have been proposed as instruments for clinical diagnoses of personality disorders are strongly influenced by a general factor that is common to different instruments such as the Levels of Personality Functioning Scale (Morey, 2017), the Computerized Adaptive Test of Personality Disorders (CAT-PD; Simms, Goldberg, Roberts, Watson, Welte, & Rotterman, 2011), and the Personality Inventory for DSM-5 (PID-5; Krueger, Derringer, Markon, Watson, & Skodol, 2012).


Concerns about the influence of response styles on self-ratings are as old as self-reports. Campbell and Fiske (1959) demonstrated that self-ratings of personality traits are more highly correlated with each other than with informant ratings of the same traits. Confirmatory factor analyses of multi-trait-multi-rater data revealed the presence of an evaluative factor that is largely independent across different raters (Anusic et al., 2009). This factor has several names, but it reflects the desirability of items independent of their descriptive meaning. Clinical researchers interested in personality disorders also observed a general factor, but interpreted it as a substantive factor. As I demonstrated in several studies, the halo factor in ratings of normal personality is strongly correlated with the general factor in self-report instruments to diagnose personality disorders. This finding challenges the prevalent interpretation of the general factor as a valid dimension of personality disorder. At a minimum, the results suggest that ratings of maladaptive and undesirable traits are influenced by socially desirable responding. Despite the long history of research on social desirability, researchers who study personality disorders have downplayed or ignored this possibility. For example, McCabe et al. (2022) dismiss the response style explanation with the argument that “it is more likely that persons are providing reasonably forthright self-descriptions,” while ignoring the finding that the evaluative factor in self-ratings lacks convergent validity with informant ratings (Anusic et al., 2009), even though they mention earlier that Anusic et al.’s (2009) results support this hypothesis. They also admit that their results “are consistent with, or at least don’t rule out, the social undesirability or evaluative bias hypothesis” (p. 151), but then conclude that “some persons do indeed have many undesirable traits whereas other persons have many desirable traits” (p. 151) without citing any evidence for this claim. In fact, it is much more common for respondents to have none of the personality disorders, defined as a score in the upper 10% of a scale (44% of respondents), than to have more than 10 disorders (5%). This asymmetry is more consistent with a response style that attenuates scores on undesirable traits than with a broad disposition that makes some people have many undesirable traits.
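The asymmetry argument can be checked with a toy simulation: give each simulated respondent a single shared response-style (halo) component plus scale-specific noise, define a "disorder" as a score in a scale's top decile, and count disorders per person. This illustrates the mechanism under assumed parameters; it is not an analysis of the authors' data:

```python
import random

def simulate_disorder_counts(n_people=2000, n_scales=25, halo_sd=1.0, seed=7):
    """Scores = person-level halo bias (shared across all scales) + noise.
    A 'disorder' is a score in a scale's top decile. Returns the proportion
    of people with zero disorders and with more than 10 disorders."""
    rng = random.Random(seed)
    halo = [rng.gauss(0.0, halo_sd) for _ in range(n_people)]
    counts = [0] * n_people
    for _ in range(n_scales):
        scale = [halo[p] + rng.gauss(0.0, 1.0) for p in range(n_people)]
        cutoff = sorted(scale)[int(0.9 * n_people)]  # 90th-percentile cutoff
        for p, score in enumerate(scale):
            if score > cutoff:
                counts[p] += 1
    prop_none = sum(c == 0 for c in counts) / n_people
    prop_many = sum(c > 10 for c in counts) / n_people
    return prop_none, prop_many

prop_none, prop_many = simulate_disorder_counts()
```

A single shared bias component is enough to make "no disorders" far more common than "many disorders," qualitatively matching the 44% versus 5% pattern, without any person actually having a broad pathological disposition.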

Markon et al. (2013) advocate the use of informant ratings to control for response styles in self-ratings of the PID-5, but do not use the informant ratings to examine the presence of biases in self-ratings. More broadly, numerous articles claim to examine the validity of personality disorder instruments, and all of these articles conclude that these instruments are valid (Krueger et al., 2012; Long, Reinhard, Sellbom, & Anderson, 2021; Morey, 2017; Simms et al., 2011). Some authors are also inconsistent across articles. For example, Ringwald, Manuck, Marsland, and Wright (2022) note that “many studies suggest it [the general factor] is primarily the product of rater-specific variance” (p. 1316), but Ringwald, Emery, Khoo, Clark, Kotelnikova, Scalco, Watson, Wright and Simms (2022) neither model the general factor nor mention that response styles could influence scores on the CAT-PD scales. Evidence that the general factor in personality disorder instruments is strongly correlated with the evaluative factor in ratings of normal personality requires further investigation. Claims that personality disorder measures are valid are misleading and fail to acknowledge the possibility that response styles produce method variance. The presence of method variance does not invalidate measures because validity is a quantitative construct (Cronbach & Meehl, 1955). Markon et al. (2013) demonstrate convergent validity between self-reports and informant ratings of the PID-5 traits. Thus, there is evidence that self-ratings have some validity. The goal of future validation research should be to identify method factors and develop revised measures with higher validity, just as personality researchers are trying to reduce the evaluative bias in measures of normal personality (Bäckström & Björklund, 2020; Wood, Anglim, & Horwood, 2022). However, this might be more difficult for measures of disorders because disorders are intrinsically undesirable. Thus, it may be necessary to use statistical controls or a multi-rater assessment to increase the validity of self-report instruments designed to measure maladaptive traits.
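Confidence intervals for correlations like those quoted above (e.g., r = .24, 95% CI .14 to .33) are conventionally computed via the Fisher z-transform. A sketch; the sample size below is an assumed value chosen only to illustrate the computation, not a figure reported in the articles:

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via the Fisher z-transform."""
    z = math.atanh(r)            # variance-stabilizing transform of r
    se = 1.0 / math.sqrt(n - 3)  # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

lo, hi = fisher_ci(0.24, 380)    # n = 380 is an assumed, not reported, value
```

The interval is symmetric on the z scale but asymmetric around r after back-transformation, which is why published CIs for correlations are typically slightly lopsided.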

Meanwhile, personality disorder researchers continue to disregard the possibility that a large portion of the variance in self-report measures is merely a response style and make claims about construct validity based on inappropriate methods to separate valid construct variance from method variance (e.g., Widiger & Crego, 2019). Most of the claims that personality disorder instruments are valid are based on correlations of one self-report measure with another or evidence that factor analyses of personality disorder scales have a similar factor structure to factor analysis of normal personality traits (i.e., the Big Five). Neither finding warrants the claim that maladaptive personality scales measure maladaptive personality traits. Instead, the finding that the halo-Big Five model can be fitted to correlations among personality disorder scales suggests that these scales merely have more evaluative content and are more strongly influenced by socially desirable responding. Multi-method evidence is needed to demonstrate that the general factor reflects a substantive trait and that specific traits are maladaptive; that is, produce intrapersonal or interpersonal problems for individuals with these traits. For now, claims about the validity of personality disorder instruments are invalid because they fail to meet basic standards of construct validity and fail to quantify the amount of method variance in these scales (Campbell & Fiske, 1959; Cronbach & Meehl, 1955).

Validity of the Computerized Adaptive Test of Personality Disorders (CAT-PD)


Clinical psychologists have developed alternative models of personality disorders to the traditional model that was developed by psychiatrists in the 20th century. The new model aims to integrate modern research on normal personality with clinical observations of maladaptive traits. The alternative model aims to explain why specific personality disorders are related, which is called co-morbidity in the medical literature. Ringwald et al. (2022) tested this assumption with the Computerized Adaptive Test of Personality Disorders (CAT-PD). They fitted a model with six correlated factors to the covariance matrix among the 33 CAT-PD scales. Four of these six factors overlapped with Big Five factors. They concluded that their study supports “the validity of the CAT-PD for assessing multiple levels of the pathological trait hierarchy.” I fitted a model of normal personality to the data. This model assumes that self-ratings of personality are influenced by six independent traits, the Big Five and a general evaluative factor called halo. I was able to identify all six factors in the CAT-PD, although additional relationships among the 33 scales were also present. I cross-validated this model and showed high (r > .8) correlations of the factors with factors in a Big Five questionnaire. I show that the Big Five factors explain only a modest amount of variance in most CAT-PD scales. Based on these results, I conclude that these factors reflect normal variation in personality rather than a distinct level in a hierarchical model of pathological traits. Rather, the Big Five traits are normal traits that are risk factors for specific types of personality disorders, but extreme levels of a normal trait are not in themselves pathological. Furthermore, a large portion of the variance in self-ratings of traits is method variance. Thus, valid assessment of personality disorders requires a multi-rater approach.


The notion of personality disorders has a long history in psychiatry that is based on clinical observations and psychoanalytic theories. It is now recognized that the old system for diagnosing personality disorders is no longer compatible with modern theories of personality, but there is no consensus among clinical psychologists and psychiatrists about the definition and assessment of personality disorders. This confusing state of affairs is reflected in the presence of several competing conceptualizations of personality disorders in the Diagnostic and Statistical Manual of Mental Disorders (DSM-5).

Simms et al. (2011) introduced the Computerized Adaptive Test of Personality Disorders (CAT-PD) as one potential model of personality disorders. The CAT-PD aims to measure 33 maladaptive personality traits (CAT-PD-SF). In this blog post, I take a critical look at the claim that the CAT-PD is capable of measuring personality disorders at various levels of a hierarchical model of personality functioning (Ringwald, Emery, Khoo, Clark, Kotelnikova, Scalco, Watson, Wright, & Simms, 2022).

The notion of a hierarchy of disorders implies that the 33 dimensions of the CAT-PD measure distinct disorders, where more extreme levels on these dimensions indicate higher levels of personality dysfunction. Correlations among the scales measuring the 33 dimensions suggest that they share common causes. These shared causes explain why some primary disorders covary (i.e., comorbidity in the terminology of categorical diagnoses). They may also reflect broader dimensions of disorders. Ringwald et al. (2022) used confirmatory factor analysis to test this hypothesis. They tested one model with five factors and another with six factors. The six-factor model had better fit, so I focus on it. The six factors are called Negative Affectivity, Detachment, Disinhibition, Antagonism, Psychoticism, and Anankastia.

Table 1 shows the CAT-PD scales with the highest loadings on these factors.

Table 1
Negative Affectivity: Anxious, Affect Lability
Detachment: Social Withdrawal, Anhedonia
Antagonism: Callousness, Domineering
Disinhibition: Non-Planfulness, Irresponsibility
Psychoticism: Unusual Experiences, Unusual Beliefs
Anankastia: Workaholism, Perfectionism

The CFA model imposed no restrictions on the correlations among the six factors and an inspection of the correlation matrix showed that the six factors are correlated to varying degrees (Table 2).

Although some of these correlations are moderate to strong, the results are consistent with the assumption that all six dimensions reflect different constructs (discriminant validity). The authors discuss the surprising finding that Disinhibition (e.g., Non-Planfulness) and Anankastia (e.g., Perfectionism) appear as independent factors. This would suggest that some people have both disorders, although the two seem to be related to low and high Conscientiousness, respectively. Correlations with an independent measure of Conscientiousness show that Disinhibition is negatively correlated with Conscientiousness, r = -.64, but Anankastia was not positively correlated with Conscientiousness, r = .11. There are two explanations for these results. One explanation is that there is a general factor of personality functioning that has a positive influence on all personality disorders, even if they seem to express themselves in seemingly opposite ways. That is, the general factor increases the risk of being both irresponsible and perfectionistic. An alternative explanation is that self-reports of personality disorders are influenced by the same response styles that influence self-ratings of normal personality (Anusic et al., 2009). Either interpretation implies that a general factor contributes to the pattern of correlations among the 33 CAT-PD scales. The authors discuss this issue in their limitations section.

“A challenge facing modeling bipolar factors is the shared impairment that generally creates positive manifolds in the correlations among maladaptive scales. This can be circumvented with separate modeling of impairment in a distress or dysfunction factor, as has long been done in the IIP-SC (Alden et al., 1990) but has not been attempted in any comprehensive way in a published five- or six-factor pathological trait inventory.” (p. 26)

It is not clear why the authors did not fit this model to their data. I pursued this avenue of future research based on measurement models of normal personality. Accordingly, it is possible to distinguish a general evaluative factor (halo) that produces positive correlations among desirable traits and the Big Five as largely independent factors. I refer to this model as the Halo-Big-Five model. Four of the Big Five factors are closely related, if not identical, to four of the CAT-PD factors: Neuroticism corresponds to Negative Affectivity, Detachment corresponds to Introversion (low Extraversion), Antagonism corresponds to low Agreeableness, and Disinhibition corresponds to low Conscientiousness. Openness is not strongly related to personality disorder, but the item content of the Fantasy Proneness scale (e.g., “I sometimes get lost in daydreams”) and the Peculiarity scale (“I am considered to be eccentric”) might be related to Openness. Furthermore, the correlation between the two scales was high, r = .588. This was the highest correlation for both scales, except for equally high correlations with the Cognitive Problems scale (e.g., “I often space out and lose track of what’s going on.”), r = .64 and .57. Thus, these scales were used as markers of an Openness to Experience factor. Modification indices were used to allow for additional loadings on these predefined theoretical factors, but model fit remained lower than the fit of the 6-factor model, suggesting additional factors were present. I split the dataset into random halves and created a model that generalized across the two halves. The model is a lot simpler than the six-factor model (405 vs. 345 degrees of freedom; more degrees of freedom means fewer free parameters) because it did not use free parameters for theoretically unimportant and small relationships. This explains why the model has superior fit to the 6-factor model on fit indices that take simplicity into account.

The main limitation is the lack of empirical evidence that these factors correspond to the same factors that can be found with self-ratings of normal personality traits. To examine this question, I used a dataset that included ratings of normal personality on the Big Five Inventory-2 and ratings on the Levels of Personality Functioning Scale (Hopwood et al., 2018). I first fitted the Halo-Big-Five model to the covariances among the 33 CAT-PD scales. Overall model fit was lower, indicating some differences between the datasets, but it was still acceptable, CFI = .939, RMSEA = .060. All parameters replicated as statistically significant at alpha = .05. Thus, the model shows some generalizability across datasets. I then combined this model with a prior model of the Big Five and a Levels of Personality Functioning factor (Schimmack, 2022). Given the large number of items, I further simplified the model by using the BFI-2 facet scales as indicators of the Big Five factors. This reduced the number of normal-personality variables from 60 items to 15 scales. This model had acceptable fit, CFI = .902, RMSEA = .066.

I combined this model with the CAT-PD model by allowing the general factors and the corresponding Big Five traits to correlate. In addition, I allowed for correlated residuals between facets and related CAT-PD scales. For example, I allowed for a unique relation between the Depressiveness facet of the BFI-2 and the Depression scale of the CAT-PD. The combined model retained acceptable fit, CFI = .892, RMSEA = .058. The key finding was that the CAT-PD factors were highly correlated with the BFI-2 factors and that the general CAT-PD factor was highly correlated with the LPFS factor (Table 3).

These results provide empirical evidence for the interpretation of the CAT-PD factors as the Big Five factors of normal personality. As a result, it is possible to describe the variance in CAT-PD scales as a function of (a) a general factor that reflects the desirability of a trait, (b) variance that is explained by variation in normal personality, and (c) residual variance that may reflect maladaptive expressions of normal personality. Table 4 shows how much these different factors contribute to variance in the 33 CAT-PD scales.

Note: Only effect sizes > .2 (4% explained variance) are shown in Table 4.

The general factor makes a strong contribution to most CAT-PD scales. 21 of the 33 effect sizes are larger than .6 (36% explained variance), and only 4 effect sizes are below .4 (16% explained variance), namely Exhibitionism, .33, Romantic Disinterest, .28, Perfectionism, .37, and Workaholism, .32. In comparison, the effect sizes for the Big Five traits are more moderate. Only 3 effect sizes are above .6, namely Anxiousness (N, .60), Exhibitionism (E, .68), and Social Withdrawal (E, -.66). As a result, most CAT-PD scales have a substantial amount of unique variance that is not explained by the general factor or the Big Five factors. 19 of the 33 effect sizes for unique variance were above .6 (36% explained variance), and not a single effect size was below .4 (16% explained variance). Although these effect sizes may be inflated by random and systematic measurement error, the results suggest that the constructs measured with the CAT-PD scales are related, but not identical, to the factors that produce variation in normal personality.
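The mapping between these standardized effect sizes and explained variance is simply the square of the coefficient. A minimal sketch of the thresholds used above (the scale values are taken from the text; the function name is my own):

```python
# Convert a standardized effect size (correlation or factor loading)
# into the proportion of variance it explains, by squaring it.
def explained_variance(r: float) -> float:
    """Proportion of variance explained by a standardized coefficient r."""
    return r ** 2

# The thresholds referenced in the text.
for r in (0.2, 0.4, 0.6):
    print(f"r = {r:.1f} -> {explained_variance(r):.0%} explained variance")

# One of the reported general-factor effect sizes (Exhibitionism, .33).
print(f"Exhibitionism: {explained_variance(0.33):.1%} explained variance")
```

This makes explicit why an effect size of .6 corresponds to 36% explained variance and .4 to only 16%: explained variance shrinks quadratically as the coefficient decreases.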


Normal Personality Factors and Maladaptive Personality Traits

Correlations of personality measures with other measures provide valuable information about the construct validity of personality measures (Cronbach & Meehl, 1955; Schimmack, 2021). Unfortunately, there are no generally accepted psychometric standards for evaluating construct validity. Ringwald et al. (2022) claim that their results provide evidence of construct validity and evidence that the CAT-PD can be used to measure personality pathology at multiple levels of a hierarchy. I think this conclusion is premature and ignores key steps in a program of validation research. First, construct validation requires a clear definition of the construct that is the target of measurement. After all, it is impossible to evaluate whether a measure measures an intended construct if the construct is not properly defined. Moreover, the CAT-PD has 33 scales, and each scale is intended to measure a distinct construct. Thus, construct validation of the CAT-PD requires clear definitions of 33 constructs. The constructs are well defined, but it is questionable that all of them can be considered disorders (CAT-PD Manual). For example, the Domineering scale is intended to measure “a general need for power and the tendency to be controlling, dominant, and forceful in interpersonal relationships,” and the Submissiveness scale is intended to measure “the yielding of power to others, over-accommodation of others’ needs and wishes, exploitation by others, and lack of self-confidence in decision-making, often to the extent that one’s own needs are ignored, minimized, or undermined.” Labeling these scales as measures of personality pathology implies that variation along these dimensions is pathological and that the scales are valid measures of actual behavioral tendencies that cause intrapersonal or interpersonal problems for individuals who score high on them. As the CAT-PD is a relatively novel questionnaire, there is insufficient evidence that the CAT-PD scales assess pathology rather than normal variation in personality. An even bigger problem is the claim that the CAT-PD can be used to measure multiple levels in a hierarchy of personality disorders. The first problem is the assumption that disorders have a hierarchical structure. It would be difficult to apply the notion of a hierarchy to physical disorders. Take cancer as an example (Fowler et al., 2020). Cancers are distinguished by their location, such as lung cancer, breast cancer, or brain cancer. As cancer can spread, some patients may have more than one cancer (i.e., cancer in multiple locations). While some cancers are more likely to co-occur, nobody has proposed a hierarchy of cancers in which known or unknown causes of co-morbidity are themselves considered a disorder. In probabilistic models it is also problematic to call every cause of a disorder a disorder. For example, not having sickle cell anemia is a risk factor for malaria infection, but it would be questionable at best to call normal blood cells a disorder.

I demonstrated that the Big Five factors explain some, but not all, of the correlations among the CAT-PD scales. Ringwald et al.’s factor model is similar, and the authors reach a comparable conclusion that four of their six factors correspond to four of the Big Five factors. They even claim that there is convergent validity between measures of the Big Five and the CAT-PD factors. The problem is that convergent validity implies that two measures assess the same construct (Campbell & Fiske, 1959). However, the Big Five factors are used to describe the correlations among traits that reflect normal variation in personality. In contrast, Ringwald et al. (2022) claim that their factors reflect a level in a hierarchy of personality pathology. Unless we pathologize normal personality or normalize pathology, these are different constructs. Thus, high correlations between Big Five factors and CAT-PD factors do not show convergent validity. Rather, they show a lack of discriminant validity (Campbell & Fiske, 1959). There is, however, a simple way to reconcile the notion of personality disorders with the finding that Big Five factors are related to measures of personality disorders. It is possible to consider the Big Five factors as risk factors for specific personality disorders. For example, high agreeableness may be a risk factor for dysfunctional forms of submissiveness, and low agreeableness could be a risk factor for dysfunctional forms of dominance. Importantly, high or low agreeableness alone is not sufficient to be considered a disorder. This model is consistent with the substantial amount of variance in CAT-PD scales that is not explained by variation in normal personality. The key difference between this model and Ringwald et al.’s model is that covariance among CAT-PD scales does not reflect a broader disorder, but normal variation in personality.
One advantage of this model is that it can explain the weak correlations between the Big Five traits, with the exception of Neuroticism, and well-being (Schimmack, 2022). If the Big Five were broad pathological traits, we should expect them to lower quality of life. It is more likely that personality traits are risk factors and that only the actual manifestation of a disorder lowers well-being. This model predicts that the unique variance in CAT-PD scales is related to lower well-being and mental health problems. This prediction needs to be examined in future studies.

General Factor

The other main finding was that a general factor explains a large amount of variance in many CAT-PD scales and that this factor is strongly correlated with the halo factor in self-ratings of normal personality. Some researchers interpret this factor as a substantive factor, whereas others view it as a response artifact. The present findings create problems for the interpretation of this general factor as one that produces co-morbidity among personality disorders, because it is related to opposing disorders. Taking Domineering and Submissiveness as examples, the general factor is positively related to Domineering, .65, and Submissiveness, .61. It is unclear how a substantive trait could make somebody both dominant and forceful in interpersonal relationships and over-accommodating of others’ needs. A more plausible explanation is that some respondents react to the negative description of these traits and present themselves in an overly positive manner. This is consistent with multi-rater studies of normal personality that show low correlations between the general factors of different raters (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). Similar studies with measures of personality disorders are lacking. Markon, Quilty, Bagby, and Krueger compared self-ratings and informant ratings and found moderate agreement. However, the sources of disagreement remained unknown. A multi-trait-multi-rater analysis of these data could reveal the amount of rater agreement for the general factor in PD ratings.


I presented evidence that the Halo-Big-Five model fits self-ratings of normal personality as well as ratings of personality disorders, and that the corresponding factors are very highly (r > .8) correlated with each other. This finding raises concerns about hierarchical models of personality disorders. I presented an alternative model that treats normal personality as a set of risk factors for specific personality disorders and the halo factor as a rating bias in self-ratings. Future research needs to go beyond self-ratings to separate substance from style. Furthermore, indicators of mental health and well-being are needed to distinguish normal personality from personality disorders.

The Levels of Personality Functioning Scale Lacks Construct Validity


An influential model of personality disorders assumes a general factor of personality functioning that underlies the presence of personality disorder symptoms. To measure this factor, Morey (2017) developed the Levels of Personality Functioning Scale (LPFS). The construct and the measure of general personality functioning, however, remain controversial. Here I analyze data that were used to claim validity of the LPFS using structural equation modeling. I demonstrate that two factors account for 88% of the variance in LPFS scores. One factor reflects the desirability of items (70%) and the other factor reflects the scoring direction of the items (12%). I then show that the evaluative factor in the LPFS correlates highly, r = .9, with a similar evaluative factor in ratings of normal personality when all items are scored in terms of desirability. Based on previous evidence from multi-method studies of normal personality, I interpret this factor as a response style that is unique to individual raters. Thus, most of the variance in LPFS scores reflects evaluative rating biases rather than levels of personality functioning. I also identified 10 items from the LPFS that are mostly free of actual personality variance but correlate strongly with the evaluative factor. These items can be used as an independent measure of evaluative biases in self-ratings. The main conclusion of this article is that theories of personality disorders lack a clear concept and that self-report measures of personality disorders lack construct validity. Future research on personality disorders needs to rely on more rigorous construct validation with philosophically justifiable definitions of disorders and multi-method validation studies.


A major problem in psychology is that it is too easy to make up concepts and theories about human behavior that are based on overgeneralizations from single incidents or individuals to humans in general. A second problem is that pre-existing theories and beliefs often guide research and produce results that appear to confirm those pre-existing beliefs. A third problem is that psychology lacks a coherent set of rules to validate measures of psychological constructs (Markus & Borsboom, 2013). As a result, it is possible that large literatures are based on invalid measures (e.g., Schimmack, 2021). In this blog post, I present evidence that an influential model of personality disorders is likewise based on flawed measures.

What Are Personality Disorders?

The notion of personality disorders has a long history that predates modern conceptions of personality (Zachar, 2017). An outdated view equated personality disorders with extreme (statistically abnormal) scores on measures of personality (Schneider, 1923). The problem with this definition is that abnormality can even be a sign of perfect functioning, as in the performance of a Formula 1 race car or an Olympic athlete.

Personality disorders were formalized in the third Diagnostic and Statistical Manual of Mental Disorders, but the diagnosis of personality disorders remained controversial; at least, much more controversial than the diagnosis of mental disorders with clear symptoms of dysfunction such as delusions and hallucinations. The current DSM-5 contains two competing models of personality disorders. Without a clear conception of personality disorders, their diagnosis remains controversial (Zachar, 2017).

A main obstacle to developing a scientific model of personality disorders is that historic models of personality disorders are difficult to reconcile with the contemporary models of normally functioning personality that have emerged in the past decades. To reconcile them, it may be necessary to start with a blank slate and rethink the concept of personality disorders.

Distinguishing Personality Disorders from (Normal) Personality

There is no generally accepted theory of personality. However, an influential model of personality assumes that individuals have different dispositions to respond to the same situation. These dispositions develop during childhood and adolescence in complex interactions between genes and environments that are poorly understood. By the beginning of early adulthood, these dispositions are fairly stable and change relatively little throughout adulthood. While there are hundreds of dispositions that influence specific behaviors in specific situations, these dispositions are related to one or more of five broad personality dispositions that are called the Big Five. Neuroticism is a general disposition to experience more negative feelings such as anxiety, anger, or sadness. Extraversion is a broad disposition toward engagement that is reflected in sociability, assertiveness, and vigor. Openness is a general disposition to engage in mental activities. Agreeableness is a general disposition to care about others. Finally, conscientiousness is a general disposition to control impulses and persist in the pursuit of long-term goals. Variation along these personality traits is considered normal. It exists either because it has no major effect on life outcomes, because the genetic effects are too complex to be subjected to selection, or because traits have different costs and benefits. This short description of normal personality is sufficient to discuss various models of personality disorders (Zachar & Krueger, 2013).

The vulnerability model of personality disorders can be illustrated with high neuroticism. High neuroticism predicts lower well-being and is a risk factor for the development of mood disorders. Even during times when individuals do not have clinical levels of anxiety or depression, they report elevated levels of negative moods. Thus, one could argue that high neuroticism is a personality disorder because it makes individuals vulnerable to mental health problems. However, even in this example it is not clear whether neuroticism should be considered a risk factor for a disorder or a disorder itself. As many mood disorders are episodic, while neuroticism is stable, one could argue that neuroticism is a risk factor that triggers a disorder only in combination with other factors (e.g., stress). The same is even more true for other personality traits. For example, low conscientiousness is one of several predictors of some criminal behaviors. This finding might be used to argue that low conscientiousness is a criterion for diagnosing a personality disorder (e.g., psychopathy). However, it is also possible to think of low conscientiousness as a risk factor rather than a diagnostic feature of a personality disorder. In line with this argument, Zachar and Krueger (2013) suggest that “vulnerabilities are not disorders” (p. 1020). A simple analogy may suffice. White skin is a risk factor for skin cancer. This does not mean that White skin is a skin disease, and it is possible to avoid the clinically relevant outcome of skin cancer by staying out of the sun, wearing proper clothing, or applying sunscreen. Even if we recognize that personality can be a risk factor for various disorders, this would not justify the label of a personality disorder. The term implies that something about a person’s personality impedes their proper functioning. In contrast, the term risk factor merely implies that personality can contribute to the dysfunction of something else.

The pathoplasticity model uses the term personality disorder for personality traits that influence the outcome of other psychiatric disorders. Zachar and Krueger (2013) suggest that people with a personality disorder develop mental health problems earlier in life or more often. This merely makes such traits risk factors, which were already discussed under the vulnerability model. More broadly, personality traits may influence specific behaviors of patients suffering from mental health problems. For example, personality may influence whether depressed patients commit suicide or not: men are more likely to commit suicide than women despite similar levels of depression. Understanding these personality effects is surely important for the treatment of patients, but it does not justify the label of personality disorders. In this example, the disorder is depression, and treatment has to assess suicidality. The personality factors that influence suicidality are not part of the disorder.

The spectrum model views personality disorders as milder manifestations of more severe mental health problems that share a common cause. This model blurs the distinction between normal and disordered personality. At what level is anxiety still normal, and at what level is it a mild manifestation of an anxiety disorder? A more reasonable distinction between normal and clinical anxiety is whether anxiety is rational (e.g., gunfire at a mall) or irrational (e.g., fear of being abducted by aliens). Models of normal personality traits are not able to capture these distinctions.

The decline-in-functioning model assumes that personality disorders are the result of traumatic brain injury, severe emotional trauma, or severe psychiatric disorder. As all behavior is regulated by the brain, brain damage can lead to dramatic changes in behavior. However, it seems odd to call these changes in behavior a personality disorder. With regard to traumatic life events, it is not clear that they reliably produce major changes in personality. Avoidance after a traumatic injury is typically situation specific rather than a change in a broader general disposition. This model also ignores that the presence of a brain injury, other mental illnesses, or drug use is used as an exclusion criterion in the diagnosis of a personality disorder (Skodol et al., 2011).

The impairment-distress model more directly links personality to disorder or dysfunction. The basic assumption is that personality is associated with clinically significant impairment or distress. I think association is insufficient. For example, gender is correlated with neuroticism and the prevalence of anxiety disorders. It would be difficult to argue that this makes gender a personality disorder. To justify the notion of a personality disorder, personality needs to be a cause of distress, and treatment of personality disorders should alleviate distress. Once more, high neuroticism might be the best candidate for a personality disorder. High neuroticism predicts higher levels of distress, and treatment with antidepressant medication or psychotherapy can lower neuroticism levels and distress levels. However, the impairment-distress model does not solve the problems of the vulnerability model. Is high neuroticism sufficient to be considered an impairment, or is it merely a risk factor that can lead to impairment in combination with other factors?

This leaves the capacity-failure model as the most viable conceptualization of a personality disorder (Zachar, 2017). The capacity-failure model postulates that personality disorders represent dysfunctional deviations from the normal functions of personality. This model is a straightforward extension of conceptions of bodily functioning to personality. Organs and other body parts have clear functions and can be assessed in terms of their ability to carry out these functions (e.g., hearts pump blood). When organs are unable to perform these functions, patients are sick and suffer. Zachar (2017) points out a key problem with the extension of biological functions to personality: “The difficulty with all capacity failure models is that they rely on speculative inferences about normal, healthy functioning” (p. 1020). The reason is that personality refers to variation in systems and processes that serve a specific function. While the processes have a clear function, it is often less clear what function variation in these processes serves. Take anxiety as an example. Anxiety is a universal human emotion that evolved to alert people to potential danger. Humans without this mechanism might be considered to have a disorder. However, neuroticism reflects variation in the process that elicits anxiety. Some people are more sensitive and others are less sensitive to danger. To justify the notion of a personality disorder, it is not sufficient to specify the function of anxiety. It is also necessary to specify the function of variation in anxiety across individuals. This is a challenging task, and current research on personality disorders has failed to specify personality functions that would make it possible to measure and diagnose personality disorders from a capacity-failure perspective.

To summarize, the reviewed conceptualizations of personality disorders provide insufficient justification for a distinction between normal personality and personality disorders. While some personality types may be associated with some negative outcomes, these correlations do not provide an empirical basis for a categorical distinction between personality and personality disorders. This leaves the capacity-failure model as the last option (Zachar, 2017), but it faces its own key problem: while it is relatively easy to specify the function of body parts, it is difficult to specify the functions of personality traits. What is the function of extraversion or introversion? Personality refers to variation in basic psychological processes. While we can specify the function of being selfish or altruistic, it is much harder to specify the function of having a disposition to be more selfish or more altruistic (agreeableness). Without a clear function of these personality dispositions, however, it is impossible to define personality dysfunction. This is a challenging task, and current research on personality disorders has failed to specify personality functions that could serve as a foundation for theories of personality disorders.

The Criterion-A Model of Personality Disorders

Given the lack of a theory of personality disorders, it is not surprising that personality disorder researchers have conflicting views about the measurement of personality disorders (it is difficult to measure something if you do not know what you are trying to measure). One group of researchers argues for a one-dimensional model of personality disorders called personality pathology severity (Morey, 2017; Morey et al., 2022). This model is based on the assumption that the specific items or symptoms used to diagnose personality disorders are correlated and “show a substantial first or general factor” (p. 650). To measure this general dimension of personality disorder with self-ratings, Morey (2017) developed the Levels of Personality Functioning Scale–Self Report (LPFS–SR).

A major problem of this measure is the lack of a sound conceptual basis. That is, it is not clear what levels of personality functioning are. As noted before, it is not even clear what function individual personality traits have. It is even less clear what personality functioning is, because personality is not a unidimensional trait. Take a car as an analogy. One could evaluate the functioning of a car and order cars in terms of their level of functioning. However, to do so, we would evaluate the functioning of all of the car’s parts, and the level of functioning would be a weighted sum of the checks of each individual part. The level of functioning does not exist independently of the functioning of the parts. For the diagnosis of cars, it is entirely irrelevant whether the functioning of one part is related to the functioning of another part. A general factor of dysfunction might be present (newer cars are more likely to have functioning parts than older cars), but the general factor is not the construct of interest. The construct of dysfunction requires assessing the functioning of all parts that are essential for a car to carry out its function.

In short, the concept of levels of personality functioning is fundamentally flawed. Yet validation studies claim that the Levels of Personality Functioning Scale is a valid measure of the severity of personality disorders (Hopwood et al., 2018). Unfortunately, validation research by the authors who developed a test is often invalid because they only look for information that confirms their beliefs (Cronbach, 1989; Zimmermann, 2022). Ideally, validation research would be carried out by measurement experts who do not have a conflict of interest because they are not attached to a particular theory. In this spirit, I examined the construct validity of the Levels of Personality Functioning Scale using Hopwood et al.’s (2018) data.

Structure of the LPFS-SR

Hopwood et al. (2018) did not conduct a factor analysis of the 80 LPFS-SR items. The omission of such a basic psychometric analysis is problematic even by the low standards of test validation in psychology (Markus & Borsboom, 2013). The reason might be that other researchers had already demonstrated that the assumed structure of the questionnaire does not fit the data (Sleep et al., 2020). Sleep et al. were also unable to find a model that fits the data. Thus, my analyses provide the first viable model of the correlations among the LPFS-SR items. Viable, of course, does not mean perfect or true. However, the model provides important insights into the structure of the LPFS-SR and shows that many of the assumptions made by Morey (2017) are not supported by evidence.

I started with an exploratory factor analysis to examine the dimensionality of the LPFS-SR. Consistent with other analyses, I found that the LPFS-SR is multidimensional (Sleep et al., 2020). However, whereas Sleep et al. (2020) suggest that three or four factors might be sufficient, I found that even the Bayesian Information Criterion suggested 7 factors. Less parsimonious criteria suggested even more factors (Table 1).

I next examined whether the four-factor model corresponds to the theoretical assignment of items to the four scales. The criterion was that an item had its highest loading on the predicted factor and that this loading was greater than .3. Using this criterion, only 33 of the 80 items had the expected factor loadings. Moreover, the correlations among the four factors were low. One factor had nearly zero correlations with the other three factors, r = .05 to .13. The correlations among the other three factors were moderate, r = .30 to .56, but they do not support the notion of a strong general factor.

Exploratory factor analysis has serious limitations as a validation tool. For example, it is unable to model hierarchical structures, although Morey (2017) assumed a hierarchical structure with four primary factors and one higher-order factor. The most direct test of this model requires structural equation modeling (confirmatory factor analysis). EFA also has problems separating content and method factors. As some of the items are reverse scored, it is likely that acquiescence bias distorts the pattern of correlations. SEM can be used to specify an independent acquiescence factor to control for this bias (Anusic et al., 2009). Thus, I conducted more informative analyses with structural equation modeling (SEM), which are often called confirmatory factor analyses. However, the label confirmatory is misleading because it seems to imply that SEM can only be used to confirm theoretical structures. In fact, the main advantage of SEM is that it is a highly flexible tool that can represent hierarchies, model method factors, and reveal residual correlations among items with similar content. This statistical tool can be used both to explore data and to confirm models. A danger in the exploratory use of CFA is overfitting. However, overfitting is mainly a problem for weak parameters that have little effect on the main conclusions. In my explorations, I set the minimum modification index to 20, which limits the type-I error probability to about 1/129,128. Most parameters in the final model meet the 5-sigma criterion (z = 5, chi-square(1) = 25) that is used in particle physics to guard against type-I errors. Moreover, I posted all exploratory models, and I encourage others to improve on my model.
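The type-I error probability implied by the modification-index threshold can be verified directly. A modification index is a 1-df chi-square statistic, and for 1 df the survival function reduces to a complementary error function, which the Python standard library provides. A minimal check (the function name is my own):

```python
import math

# A modification index (MI) is distributed as chi-square with 1 df under H0.
# For 1 df, P(X > x) = erfc(sqrt(x / 2)), so no stats library is needed.
def chi2_1df_pvalue(x: float) -> float:
    return math.erfc(math.sqrt(x / 2))

# MI threshold of 20 used in the text: p is roughly 1 in 129,000.
p = chi2_1df_pvalue(20)
print(f"P(MI > 20) = {p:.3e} ~ 1 in {1 / p:,.0f}")

# The 5-sigma criterion: a z of 5 corresponds to chi-square(1) = z**2 = 25.
print(f"z = 5 -> chi-square(1) = {5 ** 2}")
```

The computed probability is on the order of 7.7 in a million, consistent with the roughly 1/129,128 figure cited above.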

The final model had acceptable fit according to the standard of .06 for the Root Mean Square Error of Approximation, RMSEA = .030. However, the Comparative Fit Index was below the criterion value of .95 that is often used to evaluate overall model fit, CFI = .922. Another way to evaluate the model is to compare it to the fit of the EFA models in Table 1. Accordingly, the model had better fit in a comparison of the Bayesian Information Criterion (179,033.304 vs. 181,643.255), Akaike's Information Criterion (177,270.345 vs. 177,485.265), and RMSEA (.030 vs. .031), but not the CFI (.922 vs. .932). The difference between fit indices is explained by the trade-off between parsimony and precision. The CFA model is more parsimonious (2,958 degrees of freedom) than the EFA model with 10 factors (2,405 degrees of freedom). Using the remaining 553 degrees of freedom would produce even better fit, but at the risk of overfitting, and none of the smaller modification indices suggested substantial changes to the model. The final model had 12 factors that I will describe in order of their contribution to the variance in LPFS scale scores.
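The parsimony trade-off is built into the RMSEA formula, because misfit is divided by the degrees of freedom. A small sketch of the standard point estimate; the chi-square values and sample size below are made-up numbers for illustration, not values from the reported models:

```python
import math

def rmsea(chisq: float, df: int, n: int) -> float:
    """Point estimate of the Root Mean Square Error of Approximation."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

# Hypothetical numbers: identical raw misfit, different parsimony.
print(rmsea(chisq=5000, df=2958, n=1000))  # more parsimonious (CFA-like) model
print(rmsea(chisq=5000, df=2405, n=1000))  # less parsimonious (EFA-like) model
```

With the same chi-square, the model with more degrees of freedom gets the lower (better) RMSEA, which is why a more parsimonious model can win on RMSEA while losing on CFI.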

The most important factor is a general factor that showed notable positive loadings (> .3) for 64 of the 80 items (80%). This factor correlated r = .837 with the LPFS scale scores. Thus, 70% of the variance in scale scores reflects a single factor. This finding is consistent with the aim of the LPFS to measure predominantly a single construct of severity of personality functioning (Morey, 2017; Morey et al., 2022). However, the presence of this factor does not automatically validate the measure because it is not clear whether this factor represents core personality functioning. An alternative interpretation of this factor assumes that it reflects a response style to agree more with desirable items that is known as socially desirable responding or halo bias (Anusic et al., 2009). I will examine this question later on when I relate LPFS factors to factors of normal personality.

The second factor reflects the scoring of the items: 68 items were directly coded and 12 were reverse coded. For the sake of parsimony and identifiability, loadings on this factor were fixed to 1 or -1. Thus, all items loaded on this factor by definition. More important, this factor correlated r = .428 with LPFS scores. Thus, response sets explained another 18% of the variance in LPFS scores. Together, these two factors explained 70 + 18 = 88% of the total variance in LPFS scores.
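A small simulation illustrates why a factor with loadings fixed to 1 or -1 captures acquiescence: an agreement tendency pushes raw responses to all items upward, so once a reverse-keyed item is recoded, the style works in the opposite direction. A sketch with numpy; the variance of the style factor is an arbitrary assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
content = rng.normal(size=n)          # substantive trait
acq = rng.normal(scale=0.5, size=n)   # acquiescence (agreement) style

# Two direct-keyed items and one reverse-keyed item for the same trait.
direct1 = content + acq + rng.normal(size=n)
direct2 = content + acq + rng.normal(size=n)
reverse_raw = -content + acq + rng.normal(size=n)
reverse_scored = -reverse_raw         # after recoding, acq loads -1

# Acquiescence inflates direct-direct correlations and deflates
# direct-reverse correlations relative to the trait alone.
r_dd = np.corrcoef(direct1, direct2)[0, 1]
r_dr = np.corrcoef(direct1, reverse_scored)[0, 1]
print(round(r_dd, 2), round(r_dr, 2))
```

The asymmetry between the two correlations is exactly the pattern that a ±1 method factor absorbs, leaving the content factors free of the style variance.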

The first content factor had 13 notable loadings (> .3). The highest loadings were for the items “Sometimes I am too harsh on myself” (.61), “The standards that I set for myself often seem to be too demanding, or not demanding enough” (.51), and “I tend to feel either really good or really bad about myself” (.483). This factor correlated only r = .154 with the LPFS scale scores. Thus, it adds at most 2% to the explained variance in LPFS scale scores. The contribution could be less because this factor is correlated with other content factors.

The second content factor had 8 notable loadings (> .3). The highest loadings were for the items “I have many satisfying relationships, both personally and on the job” (.487), “I work on my social relationships because they are important to me” (.445), and “Getting close to others just leaves me vulnerable and isn’t worth the risk” (.440). This factor seems to capture investment in social relationships. The correlation of this factor with LPFS scores is r = .120 and the factor contributes at most 1.4% to the total variance of LPFS scores.

The third content factor had 6 notable loadings (> .3). The highest loadings were for the items “The key to a successful relationship is whether I get my needs met” (.490), “I’m only interested in relationships that can provide me with some comfort” (.476), and “I can only get close to someone who can acknowledge and address my needs” (.416). This factor seems to reflect a focus on exchange versus communal relationships. It correlated r = .098 with LPFS scale scores and contributes less than 1% of the total variance in LPFS scores.

The 4th content factor had 7 notable loadings (> .3). The highest loadings were for the items “I have some difficulty setting goals” (.683), “I have difficulties setting and completing goals” (.639), and “I have trouble deciding between different goals” (.534). The item content suggests that this factor reflects problems with implementing goals. It correlates r = .070 with LPFS scores and explains less than 1% of the total variance in LPFS scores.

The 5th factor had only 3 notable loadings (> .3). The three items were “When others disapprove of me, it’s difficult to keep my emotions under control” (.572), “I have a strong need for others to approve of me” (.498), and “In close relationships, it is as if I cannot live with the other person” (.334). This factor might be related to need for approval or anxious attachment. It correlates r = .057 with LPFS scores and explains less than 1% of the total variance in these scores.

The 6th factor had 4 notable loadings (> .3). The highest loadings were for the items “Feedback from others plays a big role in determining what is important to me” (.427), “My personal standards change quite a bit depending upon circumstances.” (.365), and “My motives are mainly imposed upon me, rather than being a personal choice.” (.322). This factor seems to capture a strong dependence on others. It correlates r = .050 with LPFS scores and contributes less than 1% of the total variance.

The 7th factor was a mini-factor with only three items and only one item had a loading greater than .3. The item was “My life is basically controlled by others.” The items of this factor all had secondary loadings on the previous factor, suggesting that it may be a method artifact and not a specific content factor. It correlated only r = .037 with LPFS scale scores and has a negligible contribution to the total variance in LPFS scores.

The 8th factor is also a mini-factor with only three items. Two items had notable loadings (> .3), namely “I can appreciate the viewpoint of other people even when I disagree with them” (.484) and “I can’t stand it when there are sharp differences of opinion” (.379).

The 9th factor had 4 items with notable loadings (> .3), but two loadings were negative. The two items with positive loadings were “I don’t pay much attention to, or care very much about, the effect I have on other people” (.351) and “I don’t waste time thinking about my experiences, feelings, and actions” (.301). The two items with negative loadings were “My emotions rapidly shift around” (-.381) and “although I value close relationships, sometimes strong emotions get in the way” (-.319). This factor seems to capture emotionality. The correlation with LPFS scores is trivial, r = .008.

The 10th factor is also a mini-factor with only three items. Two items had notable loadings, namely “People think I am pretty good at reading the feelings and motives of others in most situations” (-.567) and “I typically understand other peoples’ feelings better than they do” (-.633). The content of these items suggests that the factor is related to emotional intelligence. Its correlation with LPFS scores is trivial, r = -.007.

In addition, there were 41 correlated residuals. Correlated residuals are essentially mini-factors with two items, but it is impossible to determine the loadings of items on these factors. Most of these correlated residuals were small (.1 to .2). Only two item pairs had correlated residuals greater than .3, namely “I don’t have a clue about why other people do what they do” correlated with “I don’t understand what motivates other people at all” (.453) and “I can only get close to somebody who understands me very well” correlated with “I can only get close to someone who can acknowledge and address my needs” (.367). Whether these correlated residuals reflect important content that requires more items or whether they are merely method factors due to similar wording is an open question, but it does not affect the interpretation of the LPFS scores because these mini-factors do not substantially contribute to the variance in LPFS scores.

The main finding is that the factor analysis of the LPFS items revealed two major factors and many minor factors. One of the major factors is a method factor that reflects the scoring of the items. The other factor reflects a general disposition to score higher or lower on desirable attributes. This factor accounts for 70% of the total variance in LPFS scores. The important question is whether this factor reflects actual personality functioning – whatever this might be – or a response style to agree more strongly with desirable items and to disagree more with undesirable items.

Validation of the General Factor of the LPFS

A basic step in construct validation research is to demonstrate that correlations with other measures are consistent with theoretical expectations (Cronbach & Meehl, 1955; Markus & Borsboom, 2013; Schimmack, 2021). The focus is not only on positive correlations with related measures, but also on the absence of correlations with measures that are not expected to be correlated. This is often called convergent and discriminant validity (Campbell & Fiske, 1959). Moreover, validity is a quantitative construct and the magnitude of correlations is also important. If the LPFS is a measure of core personality functioning, it should correlate with life outcomes (convergent validity). This hypothesis could not be examined with these data because no life outcomes were measured. Another prediction is that LPFS scores should not correlate with measures of response styles (discriminant validity). This hypothesis could be examined because the dataset contained a measure of the Big Five personality traits, and it is possible to separate content and response styles in Big Five measures because multi-method studies show that the Big Five are largely independent (Anusic et al., 2009; Biesanz & West, 2004; Chang, Connelly, & Geeza, 2012; DeYoung, 2006). Additional evidence shows that the evaluative factor in personality ratings predicts self-ratings of well-being, but is a weak or no predictor of informant ratings of well-being (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020). This is a problem for the interpretation of this factor as a measure of personality functioning because low functioning should produce distress that is noticeable to others. Thus, a high correlation between the evaluative factor in ratings of personality and personality disorder would suggest that the factor reflects a rating bias rather than personality functioning.

I first fitted a measurement model to the Big Five Inventory – 2 (Soto & John, 2017). In this case, it was possible to use a confirmatory approach because the structure of the BFI-2 is well-known. I modeled 15 primary factors with loadings on the Big Five factors as higher-order factors. In addition, the model included one factor for evaluative bias and one factor for acquiescence bias based on the scoring of items. This model had reasonable fit, but some problems were apparent. The conscientiousness facet “Responsibility” seemed to combine two separate facets that were represented by two items each. I also had problems with the first two items of the Agreeableness facet Trust. Thus, these items were omitted from the model. These modifications are not substantial and do not undermine the interpretation of the factors in the model. The model also included several well-known secondary relationships. Namely, Anxiety (N) and Depression (N) had negative loadings on Extraversion, Respectfulness (A) had a negative loading on Extraversion, Assertiveness (E) had a negative loading on Agreeableness, Compassion (A) had a positive loading on N, and Productiveness (C) had a positive loading on E. Finally, there were 5 pairs of correlated residuals due to similar item content. The fit of this final model was acceptable, CFI = .906, RMSEA = .045. Only two primary loadings on the 15 facet factors were less than .4, but still greater than .3.

I then combined the two models without making any modifications to either model. The only additional parameters were used to relate the two models to each other. One parameter regressed the general factor of the LPFS model on the evaluative bias factor in the BFI model. Another one did the same for the two response style factors. Modification indices suggested several additional relationships that were added to the model. The fit of the final model was acceptable, CFI = .875, RMSEA = .032. Difficulty with goal setting (LPFS content factor 4) was strongly negatively related to the Productiveness facet of conscientiousness, r = -.81, and slightly positively related to the Compassion facet of agreeableness, r = .178. The Emotionality factor (LPFS content factor 9) was strongly correlated with Neuroticism, r = .776. The first content factor was also strongly correlated with the Depression facet of neuroticism, r = .72, and moderately negatively correlated with agreeableness, r = -.264. The need for approval factor (content factor 5) was also strongly correlated with neuroticism, r = .608, and moderately negatively related to the Assertiveness facet of extraversion, r = -.249. Content factor 2 (“close relationships”) was moderately negatively related to the Trust facet of agreeableness, r = -.408, and weakly negatively related to the Assertiveness facet of extraversion, r = -.117. A focus on exchange relationships (content factor 3) was moderately negatively correlated with agreeableness, r = -.379. Finally, content factor 10 had a moderate correlation with extraversion. In addition, 14 LPFS items had small to moderate loadings on some Big Five factors.
Only three items had loadings greater than .3, namely “my emotions rapidly shift around” on Neuroticism, r = .404, “Sometimes I’m not very cooperative because other people don’t live up to my standards” on Agreeableness, and “It seems as if most other people have their life together more than I” on the depression facet of Neuroticism, r = .310.

These relationships imply that some of the variance in LPFS scores can be predicted from the BFI factors, but the effect sizes are small. Neuroticism correlates only r = .123 and explains only 1.5% of the variance in LPFS total scores. Correlations are also weak for Extraversion, r = -.104, Agreeableness, r = -.096, and Conscientiousness, r = -.045. Thus, if the LPFS is a measure of core personality functioning, we would have to assume that core personality functioning is largely independent of variation along the Big Five factors of normal personality.

In contrast to these weak relationships, the evaluative bias factor in self-ratings of normal personality is strongly correlated with the general factor of the LPFS scored in terms of higher desirability, r = .901. Given the strong contribution of the general factor to LPFS scores, it is not surprising that the evaluative factor of the Big Five explains a large amount of the variance in LPFS scores, r = .748. In this case, it is not clear whether the correlation coefficient should be squared because evaluative bias in BFI ratings is not a pure measure of evaluative bias. A model with more than two measures of evaluative bias would be needed to quantify how much a general – questionnaire independent – evaluative bias factor contributes to LPFS scores. Nevertheless, the present results confirm that the evaluative factor in ratings of normal personality is strongly related to the evaluative factor in ratings of personality disorders (McCabe, Oltmanns, & Widiger, 2022).

Making Lemonade: A New Evaluative Bias Measure

My analyses provide clear evidence that most of the variance in LPFS scores reflects a general evaluative factor that correlates strongly with an evaluative factor in ratings of normal personality. In addition, the analyses showed that only some items in the LPFS are substantially related to normal personality. This implies that many LPFS items measure desirability without measuring normal personality. This provides an opportunity to develop a measure of evaluative bias that is independent of normal personality. This measure can be used to control for evaluative bias in self-ratings. A new measure of evaluative bias would be highly welcome (to avoid the pun: desirable) because existing social desirability scales lack validity, in part because they confound bias and actual personality content.

To minimize the influence of acquiescence bias, I tried to find an equal number of direct and reverse coded items. I selected items with high loadings on the evaluative factor and low loadings on the LPFS content factors or the Big Five factors. This produced a 10-item scale with 6 negative and 4 positive items.

Almost no close relationship turns out well in the end.
I can’t even imagine living a life that I would find satisfying.
I don’t have many positive interactions with other people.
I have little understanding of how I feel or what I do.
I tend to let others set my goals for me, rather than come up with them on my own.
I’m not sure exactly what standards I’ve set for myself.

I can appreciate the viewpoint of other people even when I disagree with them.
I work on my close relationships, because they are important to me.
I’m very aware of the impact I’m having on other people.
I’ve got goals that are reasonable given my abilities.
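The selection rule described above can be written down explicitly. The loading matrix here is hypothetical (not the actual LPFS estimates); the sketch just shows the filter: retain items with a high loading on the evaluative factor and low loadings everywhere else, then check the balance of direct and reverse keying:

```python
import numpy as np

# Hypothetical loadings: columns = [evaluative factor,
# largest content-factor loading, largest Big Five loading].
loadings = np.array([
    [0.62, 0.10, 0.08],
    [0.55, 0.35, 0.05],   # too much content variance -> rejected
    [0.58, 0.05, 0.12],
    [0.20, 0.02, 0.04],   # too little evaluative variance -> rejected
])
keying = np.array([-1, -1, 1, 1])  # -1 = reverse coded, 1 = direct coded

# Filter: evaluative loading > .4 and all other loadings < .3 in magnitude.
keep = (loadings[:, 0] > 0.4) & (np.abs(loadings[:, 1:]).max(axis=1) < 0.3)
print(np.nonzero(keep)[0], keying[keep])
```

The same two-criterion filter, applied to the full 80-item loading matrix, would yield the kind of 10-item set listed above.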

I added these 10 items to the Big Five model and specified a social desirability factor and an acquiescence factor. This model had acceptable fit, CFI = .892, RMSEA = .043. Three items had weak (< .3) loadings on one of the Big Five factors, indicating that the SD items were mostly independent of actual Big Five content. Thus, SD scores are practically independent of variance in normal personality as measured with the BFI-2. The correlation between the evaluative factor and the SD factor was r = .877 and the correlation with the SD scale was r = .79. This finding suggests that it is possible to capture a large portion of the evaluative variance in self-ratings of personality with the new 10-item social desirability scale. Future research with other measures of evaluative bias (cf. Anusic et al., 2009) and multi-method assessment of personality is needed before this measure can be used to control for socially desirable responding.


Morey (2017) introduced the Levels of Personality Functioning Scale (LPFS) as a self-report measure of general personality pathology, core personality functioning, or the severity of personality dysfunction. Hopwood et al. (2018) conducted a validation study of the LPFS and concluded that their results support the validity of the LPFS. More recently, Morey et al. (2022) reiterated the claim that the LPFS has demonstrated strong validity. However, several commentaries pointed out problems with these claims (Sleep & Lynam, 2022). Sleep and Lynam (2022) suggested that the “LPFS may be assessing little more than general distress” (p. 326). They also suggested that overlap between LPFS content and normal personality content is a problem. As shown here as well, some LPFS items relate to neuroticism, conscientiousness, or agreeableness. However, it is not clear why this is a problem. It would be rather odd if core personality functioning were unrelated to normal personality. Moreover, the fact that some items are related to Big Five factors does not imply that the LPFS measures little more than normal personality. The present results show that LPFS scores are only weakly related to the Big Five factors. The real problem is that LPFS scores are much more strongly related to the evaluative factor in normal personality ratings than to measures of distress such as neuroticism or its depression facet.

A major shortcoming in the debate among clinical researchers interested in personality disorders is the omission of research on the measurement of normal personality. Progress in the measurement of normal personality was made in the early 2000s, when some articles combined multi-method measurement with latent variable modeling (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). These studies show that the general evaluative factor is unique to individual raters. Thus, it lacks convergent validity as a measure of a personality trait that is reflected in observable behaviors. The high correlation between this factor and the general factor in measures of personality disorders provides further evidence that the factor is a rater-specific bias rather than a disposition to display symptoms of severe personality disorders, because dysfunction of personality should be visible in social situations.

One limitation of the present study is that it used only self-report data. The interpretation of the general factor in self-ratings of normal personality is based on previous validation studies with multiple raters, but it would be preferable to conduct a multi-method study of the LPFS. The main prediction is that the general factor in the LPFS should show low convergent validity across raters. One study with self and informant ratings of personality disorders provided initial evidence for this hypothesis, but structural equation modeling would be needed to quantify the amount of convergent validity in evaluative variance across raters (Quilty, Cosentino, & Bagby, 2018).

In conclusion, while it is too early to dismiss the presence of a general factor of personality disorders, the present results raise serious concerns about the construct validity of the Level of Personality Functioning Scale. While LPFS scores reflect a general factor, it is not clear that this general factor corresponds to a general disposition of personality functioning. First, conceptual analysis questions the construct of personality functioning. Second, empirical analyses show that the general factor correlates highly with evaluative bias in personality ratings. As a result, researchers interested in personality disorders need to rethink the concept of personality disorders, use a multi-method approach to the measurement of personality disorders, and develop measurement models that separate substantive variance from response artifacts. They also need to work more closely with personality researchers because a viable theory of personality disorders has to be grounded in a theory of normal personality functioning.


Biesanz, J. C., & West, S. G. (2004). Towards Understanding Assessments of the Big Five: Multitrait-Multimethod Analyses of Convergent and Discriminant Validity Across Measurement Occasion and Type of Observer. Journal of Personality, 72(4), 845–876.

Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy: Proceedings of a symposium in honor of Lloyd G. Humphreys (pp. 147–171). Urbana: University of Illinois Press.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Chang, L., Connelly, B. S., & Geeza, A. A. (2012). Separating method factors and higher order traits of the Big Five: A meta-analytic multitrait–multimethod approach. Journal of Personality and Social Psychology, 102(2), 408–426.

DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91(6), 1138–1151.

Hopwood, C. J., Good, E. W., & Morey, L. C. (2018). Validity of the DSM–5 Levels of Personality Functioning Scale–Self Report. Journal of Personality Assessment, 100(6), 650–659. DOI: 10.1080/00223891.2017.1420660

Quilty, L. C., Cosentino, N., & Bagby, R. M. (2018). Response bias and the personality inventory for DSM-5: Contrasting self- and informant-report. Personality Disorders: Theory, Research, and Treatment, 9(4), 346–353.

Kim, H., Schimmack, U., & Oishi, S. (2012). Cultural differences in self- and other-evaluations and well-being: A study of European and Asian Canadians. Journal of Personality and Social Psychology, 102(4), 856–873.

Markus, K. A., & Borsboom, D. (2013). Frontiers of test validity theory: Measurement, causation, and meaning. Routledge/Taylor & Francis Group.

McCabe, G. A., Oltmanns, J. R., & Widiger, T. A. (2022). The general factors of personality disorder, psychopathology, and personality. Journal of Personality Disorders, 36(2), 129–156.

Morey, L. C. (2017). Development and initial evaluation of a self-report form of the DSM–5 Level of Personality Functioning Scale. Psychological Assessment, 29(10), 1302–1308.

Morey, L. C., McCredie, M. N., Bender, D. S., & Skodol, A. E. (2022). Criterion A: Level of personality functioning in the alternative DSM–5 model for personality disorders. Personality Disorders: Theory, Research, and Treatment, 13(4), 305–315.

Schimmack, U., & Kim, H. (2020). An integrated model of social psychological and personality psychological perspectives on personality and wellbeing. Journal of Research in Personality, 84, Article 103888.

Sleep, C. E., & Lynam, D. R. (2022). The problems with Criterion A: A comment on Morey et al. (2022). Personality Disorders: Theory, Research, and Treatment, 13(4), 325–327.

Sleep, C. E., Weiss, B., Lynam, D. R., & Miller, J. D. (2020). The DSM-5 section III personality disorder criterion a in relation to both pathological and general personality traits. Personality Disorders: Theory, Research, and Treatment, 11(3), 202–212.

Skodol, A. E. (2011). Scientific issues in the revision of personality disorders for DSM-5. Personality and Mental Health, 5, 97–111.

Zachar, P. (2017). Personality Disorder: Philosophical Problems. In: Schramme, T., Edwards, S. (eds) Handbook of the Philosophy of Medicine. Springer, Dordrecht.

Zachar, P., & Krueger, R. F. (2013). Personality disorder and validity: A history of controversy. In K. W. M. Fulford, M. Davies, R. G. T. Gipps, G. Graham, J. Z. Sadler, G. Stanghellini, & T. Thornton (Eds.), The Oxford handbook of philosophy and psychiatry (pp. 889–910). Oxford University Press.

Zimmermann, J. (2022). Beyond defending or abolishing Criterion A: Comment on Morey et al. (2022). Personality Disorders: Theory, Research, and Treatment, 13(4), 321–324.

Beyond Hedonism: A Cross-Cultural Study of Subjective Life-Evaluations

Abstract (summary)

In a previous blog post (Schimmack, 2022), I estimated that affective balance (pleasure vs. pain) accounts for about 50% of the variance in subjective life-evaluations (life-satisfaction judgments). This suggests that respondents also use other information to evaluate their lives, but it is currently unclear what additional information respondents use to make life-satisfaction judgments. In this blog post, I analyzed data from Diener’s Second International Student Survey and found two additional predictors of life-satisfaction judgments, namely a general satisfaction factor (a disposition to report higher levels of satisfaction) and a weighted average of satisfaction with several life domains (financial satisfaction, relationship satisfaction, etc.). This key finding was robust across eight world regions. Another notable finding was that East Asians score much lower and Latin Americans score much higher on the general satisfaction factor than students from other world regions. Future research needs to uncover the causes of individual and cultural variation in general satisfaction.


Philosophers have tried to define happiness for thousands of years (Sumner, 1996). These theories of the good life were objective theories that aimed to find universal criteria that make lives good. Despite some influential theories, this enterprise has failed to produce a consensual theory of the good life. One possible explanation for this disappointing outcome is that there is no universal and objective way to evaluate lives, especially in modern, pluralistic societies.

It may not be a coincidence that social scientists in the United States in the 1960s looked for alternative ways to study the good life. Rather than imposing a questionable objective definition of the good life on survey participants, they left it to participants to define for themselves what their ideal life would look like. The first widely used subjective measure of well-being asked participants to rate their lives on a scale from 0 = worst possible life to 10 = best possible life. This measure is still in use; for example, the Gallup World Poll uses it to rank countries in terms of citizens’ average well-being.

Empirical research on subjective well-being might provide useful input for philosophical attempts to define the good life (Kesebir & Diener, 2008). For example, hedonistic theories of well-being would predict that life-evaluations are largely determined by the amount of pleasure and pain that individuals experience in their daily lives (Kahneman, 1999). In contrast, eudaimonic theories would receive some support from evidence that individuals’ subjective life-evaluations are based on doing good even if these good deeds do not increase pleasure. Of course, empirical data do not provide a simple answer to difficult and maybe unsolvable philosophical questions, but it is equally implausible that a valid theory of well-being is unrelated to people’s evaluations of their lives (Sumner, 1996).

Although philosophers could benefit from empirical data and social scientists could benefit from the conceptual clarity of philosophy, attempts to relate the two are rare (Kesebir & Diener, 2008). This is not the place to examine the reasons for this lack of collaboration. Rather, I want to contribute to this important question by examining the predictors of life-satisfaction judgments. In a previous blog post, I reviewed 60 years of research to examine how much of the variance in subjective life-evaluations is explained by positive affect (PA) and negative affect (NA), the modern terms for the hedonic tone (good vs. bad) of everyday experiences (Schimmack, 2022). After taking measurement error into account, I found a correlation of r = .7 between affective balance (Positive Affect – Negative Affect) and subjective life-evaluations. By conventional standards in the social sciences, this is a strong correlation, suggesting that a good life is a happy life (Kesebir & Diener, 2008). However, a correlation of r = .7 implies that feelings explain only about half of the variance (we have to square .7 to get the amount of explained variance) in life-evaluations. This suggests that there is more to a good life than just feeling good. However, it is unclear what additional aspects of human lives contribute to subjective life-evaluations. To examine this question, I analyzed data from Diener’s Second International Student Survey (see, e.g., Kuppens, Realo, & Diener, 2008). Over 9,000 students from 48 different nations contributed to this study. Subjective life-evaluations were measured with Diener et al.’s (1985) Satisfaction with Life Scale. I only used the first three items because the last two items have lower validity, especially in cross-cultural comparisons (Oishi, 2006). Positive Affect was measured with two items (feeling happy, feeling cheerful). Negative Affect was measured with three items (angry, sad, and worried).
The main additional predictors that might explain additional variance in life-satisfaction judgments were 18 questions about domain satisfaction. Domains ranged from satisfaction with the self to satisfaction with textbooks. The main empirical question is whether domain satisfaction predicts life-satisfaction only because it increases affective balance. For example, good social relationships may increase PA and decrease NA. In this case, effects of social relationships on life-satisfaction would be explained by higher PA and lower NA, and satisfaction with social relationships would make no unique contribution to the prediction of life-satisfaction. However, satisfaction with grades might be different. Students might be satisfied with their lives if they get good grades, even if getting good grades does not increase PA or may even increase NA because studying and working hard is not always pleasurable.
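The squaring step used here (r = .7 implying roughly half the variance) can be verified with a quick simulation: for a single predictor, the regression R-squared equals the squared correlation. A sketch with numpy, using standardized simulated variables as an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
affect = rng.normal(size=n)   # affective balance (standardized)
noise = rng.normal(size=n)
# Construct life-satisfaction with a population correlation of .7:
life_sat = 0.7 * affect + np.sqrt(1 - 0.7**2) * noise

r = np.corrcoef(affect, life_sat)[0, 1]
slope, intercept = np.polyfit(affect, life_sat, 1)
resid = life_sat - (slope * affect + intercept)
r_squared = 1 - resid.var() / life_sat.var()
print(round(r, 2), round(r_squared, 2))  # r near .7, R-squared near .49
```

The simulated correlation recovers .7 and the explained variance recovers .49, illustrating why a strong correlation still leaves about half of the variance in life-evaluations unexplained.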

The Structure of Domain Satisfaction

A common observation in studies of domain satisfaction is that satisfaction judgments in one domain tend to be positively correlated with satisfaction judgments in other domains. There are two explanations for this finding. One explanation is that personality factors influence satisfaction (Heller et al., 2004; Payne & Schimmack, 2021; Schneider & Schimmack, 2010). Individuals high in neuroticism or negative affectivity tend to be less satisfied with most life domains, especially those who are prone to depression (rather than anxiety). On the flip side, individuals who are prone to positive illusions tend to be more satisfied, presumably because they have overly positive perceptions of their lives (Schimmack & Kim, 2020). However, another factor that contributes to positive correlations among domain satisfaction ratings is response styles. Two individuals with the same level of satisfaction may use different numbers on the response scale. Separating personality effects from response styles is difficult and requires a measure of response styles or personality, and neither was available in this dataset. Thus, I was only able to identify a factor that reflects a general tendency to provide higher or lower satisfaction ratings without being able to identify the nature of this factor.

A simple way to identify a general satisfaction factor is to fit a bi-factor model to the data. I constrained the unstandardized loadings for all 18 domains to be equal. This model had good fit, and only one modification index, for financial satisfaction, suggested a change to the model. Freeing this parameter showed a weaker loading for financial satisfaction. Nevertheless, the general satisfaction factor was clearly identified. The remaining variances in the 18 domains still showed a complex pattern of correlations. The pattern of these correlations, however, is not particularly relevant for the present topic because the key question is how much of this remaining variance in domain satisfaction judgments contributes to subjective life-evaluations.

To examine this question, I used a formative measurement model. A formative measurement model is merely a weighted average of domains. The weights are empirically derived to maximize prediction of subjective life-evaluations. Thus, the 18 domain satisfaction judgments are used to create two predictors of subjective life-evaluations. One predictor is a general satisfaction factor that reflects a general tendency to report higher levels of satisfaction. The other predictor is the satisfaction in life domains after removing the influence of the general satisfaction factor.
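As a sketch of how a formative model derives its weights, the least-squares logic can be illustrated with simulated data. Everything below (sample size, data, variable names) is a placeholder, not the actual survey analysis, which was done in a latent-variable framework:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated residual domain-satisfaction scores (n students x 18 domains)
# after removing a general satisfaction factor; purely illustrative data.
n, k = 1000, 18
domains = rng.normal(size=(n, k))
true_weights = rng.uniform(0, 0.3, size=k)
life_sat = domains @ true_weights + rng.normal(scale=1.0, size=n)

# Formative "measurement" model: the weights are chosen empirically, by
# least squares, to maximize prediction of the life-satisfaction criterion.
weights, *_ = np.linalg.lstsq(domains, life_sat, rcond=None)
formative_score = domains @ weights  # weighted average of the 18 domains
```

Unlike a reflective factor, the weights here are not estimated from the correlations among the domains themselves; they are whatever coefficients best predict the external criterion.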

Predicting Subjective Life-Evaluations

To examine whether the two domain satisfaction predictors add to the prediction of subjective life-evaluations, above and beyond PA and NA, I regressed LS on affective balance, general satisfaction, and domain satisfaction. I allowed for different coefficients across 7 world regions (Northern Europe/Anglo, Southern Europe, Eastern Europe, East Asia, South Asia, Latin America, & Africa). Table 1 shows the results.

The first finding is that all three predictors explain unique variance in subjective life-evaluations. This shows that the two domain satisfaction factors contribute to life-satisfaction judgments above and beyond affective balance. The second observation is that the general satisfaction factor is a stronger predictor than affective balance, and the difference is significant in several regions (i.e., the 95% confidence intervals do not overlap, p < .01). Thus, it is important to study this powerful predictor of subjective life-evaluations in future research. Does it reveal personality effects or is it a mere response style? Third, the weighted average of domain satisfaction is also a stronger predictor than affective balance, except in Africa. This suggests that bottom-up effects of life domains contribute to life-evaluations. An important question for future research is how life domains can be satisfying even if they do not produce high levels of pleasure or low levels of pain. Finally, there is considerable unexplained variance, so future studies need to examine additional predictors of life-satisfaction judgments that produce this variation.

Table 2 shows the relationship of the general satisfaction factor with PA, NA, and affective balance. The key finding is that the general satisfaction factor was positively related to PA, negatively related to NA, and positively related to affective balance. This finding shows that the general satisfaction factor not only predicts unique variance in life-satisfaction judgments, but also predicts variance that is shared with affective balance. Thus, even well-being researchers who focus only on the shared variance between affective balance and life-satisfaction have to take the general satisfaction factor into account. The general satisfaction factor also contributes to the correlation between PA and NA. For example, for Anglo nations, the correlations of r = .50 with PA and r = -.55 with NA imply a negative correlation of r = -.28 between PA and NA. An important question is how much of this relationship reflects real personality effects versus simple response styles.
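The implied correlation follows from standard path-tracing rules: under a single shared factor, the expected correlation between two indicators is the product of their correlations with the factor. A quick check of the numbers above:

```python
# Path-tracing: the correlation induced between PA and NA by a shared factor
# equals the product of the factor's correlations with each variable.
r_factor_pa = 0.50   # general satisfaction factor with PA (Anglo nations)
r_factor_na = -0.55  # general satisfaction factor with NA
implied_r_pa_na = r_factor_pa * r_factor_na
print(round(implied_r_pa_na, 3))
```

This gives -.275, which rounds to the r = -.28 reported above.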

Table 3 shows the results for the weighted average of domain satisfaction after removing the variance due to the general satisfaction factor. The pattern is similar, but the effect sizes are weaker, indicating that the general factor is more strongly related to affective balance than specific life domains.

In conclusion, domain satisfaction judgments can be divided into two components. One component represents a general disposition to provide higher satisfaction ratings. The other component represents satisfaction with specific life domains. Both components predict affective balance. In addition, both components predict subjective life-evaluations above and beyond affective balance. However, there remains substantial unexplained variance in life-satisfaction judgments that is unrelated to affective balance and satisfaction with life domains.

The Contribution of Life Domains to the Weighted Average of Domain Satisfaction

Table 4 shows the domains that made a statistically significant contribution to the prediction of subjective life evaluations.

Strong effects (r > .3) are highlighted in green, whereas non-significant results are highlighted in red. The first observation is that subjective life-evaluations are influenced by many life domains, each with a small influence, rather than a few life domains with a strong influence. This finding suggests that subjective life-evaluations take a global view of a person’s life rather than being driven by a few, easily accessible life domains. The only exception was Africa, where only two domains dominated the prediction of subjective life-evaluations. Whether this is a true cultural difference or a method problem remains to be examined in future research.

The second observation is that financial satisfaction and satisfaction with social relationships were the strongest and most consistent predictors of life-satisfaction judgments across world regions. These effects are consistent with evidence that changes in social relationships or income predict changes in life-satisfaction judgments (Diener, Lucas, & Scollon, 2006).

It is also important to remember that the difference between a statistically significant and a non-significant result is not itself statistically significant. Many of the confidence intervals are wide and overlap. Overall, the results suggest more similarity than differences across students from different world regions. Future research needs to examine whether some of the cultural differences are replicable. For example, academic abilities seem to be more important in both East and South Asia than in Latin America.

Regional Differences in Predictors of Subjective Well-Being

Table 5 shows the differences between world regions in the components that contribute to subjective life-evaluations. In this table, the values for global satisfaction are means, whereas the other values are intercepts: for PA and NA, the intercepts remove the influence of global satisfaction differences and domain-specific differences; for life-satisfaction, they remove the influence of all predictors.

Red highlights show differences that imply lower well-being in comparison to the reference region Northern Europe/Anglo. The results are consistent with overall lower well-being in the other regions which is consistent with national representative surveys by Gallup.

Probably the most interesting finding is that East Asia has a very large negative difference for the global satisfaction factor. The complementary finding is Latin America’s high score on the general satisfaction factor. These findings are consistent with evidence that East Asia has lower well-being, and Latin American nations have higher well-being, than objective indicators of well-being like income predict. Thus, general satisfaction is likely to be a unique predictor of well-being above and beyond income and objective living conditions. The important question is whether this is merely a method artifact, as some have argued, or a real personality difference between cultures.

Homo Hedonimus: Is there more to life than maximizing pleasure and minimizing pain?


Social scientists started measuring subjective life-evaluations as well as positive and negative affective experiences in the 1960s. Sixty years of research have established that life-satisfaction judgments and the balance of PA and NA are strongly correlated in Western countries. The choice of affect items has a relatively small effect on the magnitude of the correlation. In contrast, systematic measurement error plays a stronger role. Systematic measurement error can inflate or attenuate true correlations. The existing results suggest that two sources of systematic measurement error have opposite effects. Evaluative bias inflates the observed correlation, whereas rater-specific measurement error attenuates it. The latter effect is stronger. As a result, multi-method studies produce stronger correlations. At present, I would interpret the data as evidence that the true correlation is around r = .7 ± .2 (i.e., between .5 and .9). This implies that affective balance explains about half of the variance in life-evaluations. Cross-cultural studies suggest that the true correlation might be lower in Asian cultures, but the difference is relatively small (.6 vs. .5, without controlling for systematic measurement error).

The finding that affective balance explains only some of the variance in life-satisfaction judgments raises an interesting new question that has not received much attention. What, in addition to pleasure and pain, leads to positive life-evaluations? An exploration of this question requires the measurement of LS, PA, and NA, and the specification of a causal model with affective balance as a predictor of life-satisfaction. The few studies that have examined this question have found that domain satisfaction (Schimmack et al., 2002), intentional living (Busseri, 2015), and environmental mastery (Payne & Schimmack, 2021) are substantial unique predictors of subjective life-evaluations. These results are preliminary. Existing datasets and new studies can reveal additional predictors. Evidence of cultural variation in the importance of affective experiences needs to be replicated, and additional moderators should be explored. Identifying a reliable set of predictors of life-satisfaction judgments can provide insights into individuals’ implicit definitions of the good life. This information may be useful for evaluating objective theories of well-being and for evaluating the validity of life-satisfaction judgments as measures of subjective well-being. The present results are inconsistent with a view of humans as homo hedonimus, who cares only about affective experiences, but they do suggest that pleasure and pain cannot be ignored in a theory of human well-being.

Literature Review

Positive Affect (PA) and Negative Affect (NA) are scientific constructs. People have expressed their feelings for thousands of years. Across many cultures, some emotion terms have similar meanings and are related to similar antecedents and consequences. However, I am not aware of any everyday expressions of feelings that use the terms Positive Affect or Negative Affect. Yet, the scientific concepts of PA and NA were created to make scientific claims about everyday experiences like happiness, sadness, fear, satisfaction, or frustration. The distinction between PA and NA implies that a major distinction between affects is that some affects are positive and others are negative. Yet, psychologists do not have a consensual definition of Positive Affect and Negative Affect.

While PA and NA were used occasionally in the scientific literature, the terms became popular after Bradburn developed the first measures of PA and NA and reported the results of empirical studies with these scales. The first report did not even use the term affect and referred to the scales as measures of positive and negative feelings (Bradburn & Caplovitz, 1965). The terms positive affect and negative affect were introduced in the follow-up report (Bradburn, 1969).

To understand Bradburn’s concepts of PA and NA, it is useful to examine the social and historical context that led to the development of the first PA and NA scales. The scales were developed to “provide periodic inventories of the nation’s [USA] psychological well-being” (p. 1). However, the introduction also mentions the goal to “better understand the patterning of psychological adjustment” (p. 2) and “to determine the nature of mental health, as well as to determine the causes of mental illness” (p. 2). This sweeping agenda creates conceptual confusion because it is no longer clear how PA and NA are related to well-being and mental health. Although it is likely that PA and NA are related to some extent to well-being and mental health, it is unlikely that well-being or mental health can be defined in terms of PA and NA. Even if this were possible, it would only clarify the meaning of well-being and mental health, not the meaning of PA and NA.

More helpful is Bradburn’s stated objective for developing his PA and NA scales. The goal was to “measure a wide range of pleasurable and unpleasurable experiences apt to be common in a heterogeneous population” (Bradburn & Caplovitz, 1965; p. 16). This statement of the objective makes it clear that Bradburn used the term positive affect to refer to pleasurable experiences and the term negative affect to refer to unpleasant experiences. Bradburn (1969) is even more explicit. His assumption for the validity of the self-report measure was that “people tend to code their experiences in terms of (among other things) their affective tone – positive, neutral, or negative. For our purposes, the particular content of the experience is not important. We are concerned with the pleasurable or unpleasurable character associated with the experience” (p. 54). Other passages also make it clear that Bradburn’s goal was to measure the hedonic tone of everyday experiences. In short, the distinction between PA and NA is based on the hedonic tone of the affective experiences. PA feels good and NA feels bad.

Bradburn’s (1969) final chapter provides the most important information about the sometimes implicit assumptions underlying his approach to the study of psychological well-being, mental health, or happiness. “We are implicitly stating our belief that the modern concept of mental health is really a concern about the subjective sense of well-being, or what the Greeks called eudaimonia” (p. 225). It is also noteworthy that Bradburn did not reduce happiness to the balance of PA and NA. “By naming our forest “psychological well-being,” we have not meant to imply that concepts such as self-actualization, self-esteem, ego-strength, or autonomy, …., are irrelevant to our study… While we have said relatively little about these particular trees, we do not doubt that they are an integral and important part of the whole” (p. 224). Accordingly, Bradburn rejected the hedonistic idea that well-being can be reduced to the balance of pleasure and pain, but he assumed that PA and NA are important to the conception of a good life.

However, defining well-being in terms of PA, NA, and other good things in life is not a satisfactory definition of well-being. A complete theory of well-being would have to list the additional ingredients and justify their inclusion in a definition of well-being. Philosophers and some psychologists have tried to defend different conceptions of the good life (Sumner, 1996). The main limitation of these proposals is that it is difficult to defend one conception of the good life as superior to another. The key problem is that it is difficult to find a universal, objective criterion that can be used to evaluate individuals’ lives (Sumner, 1996).

One solution to this problem is to take a subjective perspective. Accordingly, individuals can choose their own ideals and evaluate their lives accordingly. In the 1960s, social scientists developed subjective measures of well-being. One of the first measures was Cantril’s ladder, which asked respondents to place their actual lives on a ladder from 0 = worst possible life to 10 = best possible life. This measure does not impose any criteria on the life-evaluations and continues to be used to this day. It is a subjective measure of well-being because respondents can use any information that they consider important to rate their lives. In theory, they could rely exclusively on the hedonic tone of their everyday experiences. In this case, we would expect a strong correlation between affective balance and life-evaluations. However, it is also possible that individuals pursue other goals that do not aim to maximize pleasure and minimize pain. In this case, the correlation between affective balance and life-evaluations would be attenuated. It is therefore interesting to examine empirically how much of the variance in life-evaluations or life-satisfaction judgments is explained by the hedonic tone of everyday experiences. Subsequently, I review the relevant studies that have examined this question over the past 50 years.

Bradburn (1969) simply states that “the difference between the numbers of positive and negative feelings is a good predictor of a person’s overall ratings of his own happiness” (p. 225), but he did not provide quantitative information about the amount of explained versus unexplained variance.

The next milestone in well-being research was Andrews and Withey’s (1976) examination of the validity of well-being measures. They included Bradburn’s items but modified the response format from a dichotomous yes/no format to a frequency format. They assumed that this might produce negative correlations between the PA and NA scales, but this expectation was not confirmed. More interesting is how much the balance of PA and NA correlated with subjective well-being ratings. The key finding was that affect balance scores correlated only r = .43 with a 7-point life-satisfaction rating and r = .47 with a 7-point happiness scale, while the two global ratings correlated r = .63 with each other. Corrected for unreliability, this suggests that affective balance is strongly correlated with global life-evaluations, ((.43 + .47)/2)/sqrt(.63) = .57. Nevertheless, a substantial portion of the variance in global life-satisfaction judgments remains unexplained, 1 - .57^2 = 68%. This finding undermines theories of well-being that define well-being exclusively in terms of the amount of PA and NA (e.g., Bentham’s hedonism; Kahneman, 1999). However, the evidence is by no means conclusive. Systematic measurement error in the PA and NA scales might severely attenuate the true influence of PA and NA on global life-evaluations, given the low convergent validity between self-ratings and informant ratings of affective experiences (Schneider & Schimmack, 2009).
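The correction for unreliability in this paragraph can be reproduced step by step, using the classic disattenuation logic with the r = .63 convergence between the two global ratings as the reliability estimate:

```python
from math import sqrt

# Average the two observed correlations of affect balance with the global
# ratings, then divide by the square root of their intercorrelation (.63),
# which serves as the reliability estimate for a single global rating.
r_ls, r_happy, r_global = 0.43, 0.47, 0.63
r_corrected = ((r_ls + r_happy) / 2) / sqrt(r_global)
unexplained = 1 - r_corrected ** 2
print(f"corrected r = {r_corrected:.2f}")           # .57
print(f"unexplained variance = {unexplained:.0%}")  # 68%
```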

Nearly a decade later, Diener (1984) published a highly influential review article on the field of subjective well-being research. In this article, he coined the term subjective well-being (SWB) for research on global life-satisfaction judgments and affective balance. SWB was defined as high life-satisfaction, high PA and low NA. Diener noted that the relationship among the three components of his SWB construct is an empirical question. He also pointed out that the relationship between PA and NA had received a lot of attention, whereas the relationship between affective balance and life-satisfaction “has not been as thoroughly researched” (p. 547). Surprisingly, this statement still rings true nearly 40 years later, despite a few attempts by Diener and his students, including myself, to study this relationship.

For the next twenty years, the relationship between PA and NA became the focus of attention and fueled a heated debate among proponents of independence (Watson, Clark, & Tellegen, 1988), bipolarity (Russell, 1980), and models of separate, yet negatively correlated, dimensions (Diener, Smith, & Fujita, 1995). A general agreement is that time frame, response format, and item selection influence the correlations among PA and NA measures (Watson, 1988). This raises a question about the validity of different PA and NA scales. If different scales produce different correlations between PA and NA, different scales may also produce different correlations between life-evaluations and affective balance. However, this question has not been systematically examined to this day.

To make matters worse, the debate about the structure of affect also led to confusion about the meaning of the terms PA and NA. Starting in the 1980s, Watson and colleagues used the terms as labels for the first two VARIMAX-rotated factors in exploratory factor analyses of correlations among affect ratings (Watson & Tellegen, 1985). They also used these labels for their Positive Affect and Negative Affect scales that were designed to measure these two factors (Watson, Clark, & Tellegen, 1988). They defined Positive Affect as a state of high energy, full concentration, and pleasurable engagement, and Negative Affect as a state of subjective distress and unpleasurable engagement. An alternative model based on the unrotated factors, however, identifies a first factor that distinguishes affects based on their hedonic tone. Watson et al. (1988) refer to this factor as the pleasantness-unpleasantness factor. Thus, PA is no longer equivalent to pleasant affect, and NA is no longer equivalent to unpleasant affect.

To avoid conceptual confusion, different labels have been proposed for measures that focus on hedonic tone and measures that focus on the PANAS dimensions. Some researchers have suggested using pleasant affect and unpleasant affect for measures of hedonic tone. Others have proposed labeling Watson and Tellegen’s factors Positive Activation and Negative Activation. In the broader context of research on well-being, PA and NA are often used in Bradburn’s tradition to refer to the hedonic tone of affective experiences, and I will follow this tradition. I will refer to the PANAS scales as measures of Positive Activation and Negative Activation.

While it is self-evident that the PANAS scales are different from measures of hedonic tone, it is still possible that the difference between Positive Activation and Negative Activation is a good measure of affective balance. That is, individuals who often experience positive activation and rarely experience negative activation are in a pleasant affective state most of the time. In contrast, individuals who experience a lot of Negative Activation and rarely experience Positive Activation are expected to feel bad most of the time. Whether the PANAS scales are capable of measuring hedonic tone as well as other measures is an empirical question that has not been examined.

The next important article was published by Lucas, Diener, and Suh (1996). The authors aimed to examine the relationship between the cognitive component of SWB (i.e., life-satisfaction) and the affective component of SWB (i.e., PA and NA) using a multitrait-multimethod approach (Campbell & Fiske, 1959). Study 1 used self-ratings and informant ratings of life-satisfaction on the Satisfaction with Life Scale and PANAS scores to examine this question. The key finding was that same-construct correlations were higher (i.e., LS r = .48, PA r = .43, NA r = .26) than different-construct correlations (i.e., LS-PA rs = .28, .31; LS-NA rs = -.16, -.21; PA-NA rs = -.02, -.14). This finding was interpreted as evidence that “life satisfaction is discriminable from positive and negative affect” (p. 616). The main problem with this conclusion is that the results do not directly examine the discriminant validity of life-satisfaction and affective balance. As affective balance is made up of two distinct components, PA and NA, it is self-evident that LS cannot be reduced to PA or NA alone. However, it is possible that life-satisfaction is strongly related to the balance of PA and NA. To examine this question, it would have been necessary to compute an affective balance score or to use a latent variable model to regress life-satisfaction onto PA and NA. The latter approach can be applied to the published correlation matrix. I conducted a multiverse analysis with five different models that make different assumptions about the validity of self-ratings and informant ratings. The results were very similar and suggested that affective balance explains about half of the variance in life-satisfaction judgments, rs = .68 to .75.
The higher amount of explained variance is partly explained by the lower validity of Bradburn’s scales (Watson, 1988) and partly due to the use of a multi-method approach, as mono-method relationships were only r = .6 for self-ratings at Time 1 and r = .5 for self-ratings at Time 2 (Lucas et al., 1996). In conclusion, Lucas et al.’s study provided evidence that life-satisfaction judgments are not redundant with affective balance when affective balance is measured with the PANAS scales. However, it is possible that other measures of PA and NA are more valid and explain more variance in life-evaluations.
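Regressing LS on PA and NA from a published correlation matrix only requires the standardized normal equations. The sketch below shows the algebra; the correlations are illustrative placeholders, not Lucas et al.'s actual multimethod matrix:

```python
import numpy as np

# Hypothetical standardized correlations (placeholders, not published values)
r_ls_pa, r_ls_na, r_pa_na = 0.45, -0.35, -0.25

R_xx = np.array([[1.0, r_pa_na],
                 [r_pa_na, 1.0]])    # predictor intercorrelations (PA, NA)
r_xy = np.array([r_ls_pa, r_ls_na])  # correlations of predictors with LS

beta = np.linalg.solve(R_xx, r_xy)   # standardized regression weights
r_squared = float(r_xy @ beta)       # variance in LS explained by PA and NA
multiple_r = r_squared ** 0.5        # multiple correlation
```

The same algebra underlies regression from correlation input in any SEM program; with the actual published correlations it yields the kind of multiverse estimates reported above.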

A couple of years later, Diener and colleagues presented the first article that focused on the influence of affective balance on life-satisfaction judgments (Suh, Diener, Oishi, & Triandis, 1998). The main focus of the article was cultural variation in the relationship between life-satisfaction and affective balance. Study 1 examined correlations in the World Value Survey that used Bradburn’s scales. Correlations with a single-item life-satisfaction judgment ranged from a maximum of r = .57 in West Germany to a minimum of r = .20 in Nigeria. The correlation for the US sample was r = .48, which closely replicates Andrews and Withey’s results. Study 2 used the more reliable Satisfaction with Life Scale and hedonic items with an amount-of-time response format. This produced stronger correlations. The correlation for the US sample was r = .64. This is consistent with Lucas et al.’s (1996) mono-method results. This article suggested that affect contributes to subjective well-being but does not determine it, and that culture moderates the use of affect in life-evaluations.

Diener and colleagues followed up on this finding by suggesting that the influence of neuroticism and extraversion on subjective well-being is mediated by affective balance (Schimmack, Diener, & Oishi, 2002). The article also explored whether domain satisfaction might explain additional variance in life-satisfaction judgments. The key finding was that affective balance made a unique contribution to life-satisfaction judgments (b = .45), but two life domains also made unique contributions (i.e., academic satisfaction, b = .27; romantic satisfaction, b = .23). Affective balance mediated the effects of extraversion and neuroticism. Schimmack et al. (2002) followed up on these findings by examining the mediating role of affective balance across cultures. They replicated Suh et al.’s (1998) finding that culture moderates the relationship between affective balance and life-satisfaction and found a strong relationship in the two Western cultures (US, Germany) in a structural equation model that controlled for random measurement error, r = .76. The stronger relationship might be due to the use of affect items that focus on hedonic tone.

The next big development in well-being research was the creation of Positive Psychology, the study of all things positive. Positive psychology promoted eudaimonic conceptions of well-being that are rooted in objective theories of well-being (Sumner, 1996). These theories clash with subjective theories of well-being that leave it to individuals to choose how they want to live their lives. An influential article by Keyes, Shmotkin, and Ryff (2002) pitted these two conceptions of well-being against each other, using the Midlife in the U.S. (MIDUS) sample (N = 3,032). The life-satisfaction item was Cantril’s ladder. The PA and NA items were ad-hoc items with an amount-of-time response format. This explains why the MIDUS PA and NA scales are strongly negatively correlated, r = -.62. PA and NA were also strongly correlated with LS, PA r = .52, NA r = -.46. The article did not examine the relationship between life-satisfaction and affective balance because the authors treated LS, PA, and NA as indicators of a latent variable. According to this model, neither life-satisfaction nor affective balance measures well-being. Instead, well-being is an unobserved construct that is reflected in the shared variance among LS, PA, and NA. Using the published correlations and assuming a reliability of .7 for the single-item life-satisfaction item (Busseri, 2015), I obtained a correlation of r = .66 between life-satisfaction and affective balance. This correlation is stronger than the correlation with the PANAS scales in Lucas et al.’s (1996) study, suggesting that hedonic PA and NA scales are more valid measures of the hedonic tone of everyday experiences and produce correlations around r = .7 with life-satisfaction judgments in the United States.
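The r = .66 estimate can be approximated from the published numbers: the correlation of LS with the standardized difference score PA − NA follows from the reported correlations, and dividing by the square root of the assumed .70 reliability disattenuates it. This is a back-of-the-envelope sketch; the reported value likely reflects additional corrections:

```python
from math import sqrt

# Published MIDUS correlations and the assumed reliability of the
# single-item life-satisfaction measure (Busseri, 2015).
r_ls_pa, r_ls_na, r_pa_na, rel_ls = 0.52, -0.46, -0.62, 0.70

# Correlation of LS with the standardized difference score PA - NA:
# cov(LS, PA - NA) / sd(PA - NA), with all variables standardized.
r_ls_balance = (r_ls_pa - r_ls_na) / sqrt(2 - 2 * r_pa_na)
r_corrected = r_ls_balance / sqrt(rel_ls)  # disattenuated for LS reliability
print(f"observed r = {r_ls_balance:.2f}, corrected r = {r_corrected:.2f}")
```

The disattenuated value comes out at about .65, close to the reported r = .66.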

In the 21st century, psychologists’ interest in the determinants of life-satisfaction judgments decreased for a number of reasons. Positive psychologists were more interested in exploring eudaimonic conceptions of well-being. They also treated life-satisfaction judgments as indicators of hedonic well-being and treated life-satisfaction judgments and affective measures as interchangeable indicators of hedonic well-being. Another blow to research on life-satisfaction was Kahneman’s suggestion that life-satisfaction judgments are unreliable and invalid (Kahneman, 1999; Schwarz & Strack, 1999) and his suggestion to focus on affective balance as the only criterion for well-being. Kahneman et al. (2006) reported that income predicted life-satisfaction judgments, but not measures of affective balance. However, this finding was not interpreted as a discovery that income influences well-being independently of affect, but rather as evidence that life-satisfaction judgments are invalid measures of well-being.

In contrast, sociologists continued to focus on subjective well-being and used life-satisfaction judgments as key indicators of well-being in important panel studies such as the General Social Survey, the German Socio-Economic Panel (SOEP), and the World Value Survey. Economists rediscovered happiness but relied on life-satisfaction judgments to make policy recommendations (Diener, Lucas, Schimmack, & Helliwell, 2008). Although Gallup measures all three components of SWB, it relies exclusively on life-satisfaction judgments to rank nations in terms of happiness (World Happiness Reports).

In 2008, I used data from a pilot study for the SOEP to replicate the finding that affective balance mediates the effects of extraversion and neuroticism on life-satisfaction (Schimmack, Schupp, & Wagner, 2008). The study also controlled for evaluative biases in self-ratings. In addition, unemployment and regional differences between former East and West Germany were unique predictors of life-satisfaction judgments. The unique effect of affective balance on life-satisfaction was r = .50. One reason for this weaker relationship is that the model controlled for shared method variance among life-satisfaction and affect ratings.

Kuppens, Realo, and Diener (2008) followed up on Suh et al.’s (1996) finding that culture moderates the relationship between affective balance and life-satisfaction. While they replicated this moderation, the use of a multi-level model with unstandardized scores made it difficult to assess the magnitude of the moderator effects. Furthermore, the authors examined moderation of the effects of PA and NA separately rather than evaluating cultural variation in the relationship between affective balance and life-satisfaction. Finally, the use of PA and NA scales makes it impossible to evaluate measurement equivalence across nations. Using the same data, I examined the relationship between affective balance and life-satisfaction using a multi-group structural equation model with a largely equivalent measurement model across seven world regions (Northern Europe/Anglo, Southern Europe, Eastern Europe, East Asia, South Asia, Latin America, and Africa). I replicated the finding that the correlation in Western countries is around r = .6 (Northern Europe/Anglo, r = .64; Southern Europe, r = .59). The weakest relationships were found in East Asia (r = .52) and South Asia (r = .51). While this difference was statistically significant, the effect size is rather small and suggests that affective balance contributes to life-satisfaction judgments in all cultures. A main limitation of this study is that it is unclear how much cultural differences in response styles contribute to the moderator effect. A comparison of the intercepts of life-satisfaction (i.e., mean differences after controlling for mean differences in PA and NA) showed that all regions had lower life-satisfaction intercepts than the Northern Europe/Anglo comparison group. This shows that factors unrelated to PA and NA (e.g., income; Kahneman et al., 2006) produce cultural variation in life-satisfaction judgments.

Zou, Schimmack, and Gere (2013) published a replication study of Lucas et al.’s sole multi-method study. The study was not a direct replication. Instead, it addressed several limitations of Lucas et al.’s study. Most importantly, it directly examined the relationship between life-satisfaction and affective balance. It also ensured that correlations are not attenuated by biases in life-satisfaction judgments by adding averaged domain satisfaction judgments as a predictor. The study also used hedonic indicators to measure PA and NA rather than assuming that the rotated Positive Activation and Negative Activation factors fully capture hedonic tone. Finally, the sample size was five times larger than in Lucas et al.’s study and included students and middle-aged individuals (i.e., their parents). The results showed convergent and discriminant validity for life evaluations (global & averaged domain satisfaction), PA, and NA. Most importantly, the correlation between the life-evaluation factor and the affective balance factor was r = .90. While this correlation still leaves about 20% of the variance in life-evaluations unexplained, it does suggest that the hedonic tone of life experiences strongly influences subjective life-evaluations. However, there are reasonable concerns that this correlation overestimates the importance of hedonic experiences. One problem is that judgments of hedonic tone over an extended period of time may be biased by life-evaluations. To address this concern, it would be necessary to demonstrate that affect ratings are based on actual affective experiences rather than being inferred from life-evaluations.

Following a critical discussion of Diener’s SWB concept (Busseri & Sadava, 2011), Busseri tackled the issue empirically using the MIDUS data. Busseri (2015) examined how LS, PA, and NA are related to predictors of SWB and explicitly tested which predictors have a unique influence on life-satisfaction judgments above and beyond the influence of PA and NA. The main problem was that the chosen predictors had weak relationships with the well-being components. The main exception was the Intentional Living scale, an average of ratings of how much effort respondents invest into work, finances, relationships, health, and life overall. This scale had a strong unique relationship with life-evaluations, b = .44, that was as strong as the unique effect of PA, b = .42, and stronger than the unique effect of NA, b = -.16. The study also replicated Kahneman et al.’s (2006) finding that income is a unique predictor of LS and unrelated to PA and NA, but even the effect of income is statistically small, b = .05. Using the published correlation matrix and correcting LS for unreliability, I found a correlation of r = .58 between LS and affective balance. The unique relationship after controlling for other predictors was r = .52, suggesting that most of the relationship between affective balance and life-satisfaction is direct and not spurious due to third variables that influence both.

Payne and Schimmack (2022) followed up on Zou et al.’s (2013) study with a multiverse analysis. PA and NA were measured with different sets of items, ranging from pure hedonic items (good, bad) and happiness and sadness items to models of PA and NA as higher-order factors of several positive (joy, love, gratitude) and negative (anger, fear, sadness) affects (Diener et al., 1995). They also compared results for mono-method (only self-ratings) and multi-method (ratings by all three family members) measurement models. Finally, results were analyzed separately for students, mothers, and fathers as targets. The key finding was that item selection had a very small influence, whereas the comparison of mono-method and multi-method models made a bigger difference. The mono-method results ranged from r = .64, 95%CI = .58 to .71, to r = .69, 95%CI = .63 to .75. The multi-method results ranged from r = .71, 95%CI = .62 to .81, to r = .86, 95%CI = .80 to .92. These estimates are somewhat lower than Zou et al.’s (2013) results and suggest that the true relationship is less than r = .9.

In Study 2, Payne and Schimmack (2022) conducted the first direct comparison of PANAS items with hedonic tone items using an online sample. They found that PANAS NA was virtually identical with other NA measures. This refutes the interpretation of PANAS NA as a measure of negative activation that is distinct from hedonic tone. However, PANAS PA was distinct from other PA measures and was a weaker predictor of life-evaluations. A latent variable model with the PANAS items produced a correlation of r = .78, 95%CI = .73 to .82. An alternative measure that focuses on hedonic tone, the Scale of Positive and Negative Experiences (SPANE; Diener & Biswas-Diener, 2009), yielded a slightly stronger correlation, r = .83, 95%CI = .79 to .86. In a combined model, the SPANE PA factor was a stronger predictor than the PANAS PA factor. Thus, PANAS scales are likely to underestimate the contribution of affect to life-evaluations, but the difference is small. The correlations might be stronger than in other studies due to the use of an online sample.

To summarize, correlations between affective balance and life-evaluations range from r = .5 to r = .9. Several methodological factors contribute to this variation, and studies that use more valid PA and NA scales and control for measurement error produce stronger correlations. In addition, culture can moderate this relationship, but it is not clear whether culture influences response styles or actual differences in the contribution of affect to life-evaluations. A reasonable estimate of the true correlation is r = .7 (+/- .2), which suggests that about 50% of the variance in life-evaluations is accounted for by variation in the hedonic tone of everyday experiences. An important direction for future research is to identify the unique predictors of life-evaluations that explain the remaining variance. Hopefully, it will not take another 60 years to get a better understanding of the determinants of individuals’ life-evaluations. A better understanding of the determinants of life-satisfaction judgments is also crucial for their construct validation before they can be used to make claims about nations’ well-being and to inform public policy recommendations.
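The variance-explained arithmetic behind this summary is simply the squared correlation; a one-liner shows how strongly the estimate depends on where in the r = .5 to .9 range the true correlation falls:

```python
# Share of variance in life-evaluations explained by affective balance
# for the lower bound, point estimate, and upper bound discussed above
for r in (0.5, 0.7, 0.9):
    print(f"r = {r}: {round(r ** 2 * 100)}% of variance explained")
# r = 0.5 -> 25%, r = 0.7 -> 49%, r = 0.9 -> 81%
```

The spread from 25% to 81% explained variance is why pinning down the methodological moderators of this correlation matters so much.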

Democracy and Citizens’ Happiness

For 30 years, I have been interested in cultural differences. I maintained a database of variables that vary across cultures, starting with Hofstede’s seminal rankings of 40 nations. Finding interesting variables was difficult and time-consuming. The world has changed. Today it is easy to find interesting data on happiness, income, or type of government. Statistical software is also free (the R project). This has changed the social sciences. Nowadays, the new problem is that data can be analyzed in many ways and that results can be inconclusive. As a result, social scientists can disagree even when they analyze the same data. Here I focus on predictors of national differences in happiness.

Happiness has been defined in many ways and any conclusion about national differences in happiness depends on the definition of happiness. The most widely used definition of happiness in the social sciences is subjective well-being. Accordingly, individuals define for themselves what they consider to be a good life and evaluate how close their actual lives are to their ideal lives. The advantage of this concept of well-being is that it does not impose values on the concept of happiness. Individuals in democratic countries could evaluate their lives based on different criteria than individuals in non-democratic countries. Thus, subjective well-being is not biased in favor of democracy, even though subjective conceptions of happiness emerged along with democracy in Western countries.

The most widely used measure of subjective well-being is Cantril’s ladder. Participants rate their lives on a scale from 0 = worst possible life to 10 = best possible life. This measure leaves it to participants to define what the worst or best possible life is. The best possible life in Denmark could be a very different life than the best possible life in Zimbabwe. Ratings on Cantril’s ladder are imperfect measures of subjective well-being and could distort comparisons of countries, but these ratings are currently used to compare the happiness of over 100 countries (WHR).

The Economist Intelligence Unit (EIU) has created ratings of countries’ forms of government that provide a measure of democracy (the Democracy Index). Correlating the 2020 happiness means of countries with the democracy index produces a strong (linear) correlation of r = .68 (rank correlation r = .71).

This finding has been used to argue that democracies are better societies because they provide more happiness for their citizens (Williamson, 2022).

“So the eastward expansion of democracy isn’t some US-led conspiracy to threaten Russia; it reflects the fact that, when given the choice, citizens tend to choose democracy and hope over autocracy and fear. They know instinctively that it brings a greater chance for happiness.”

Although I am more than sympathetic to this argument, I doubt that democracy alone is sufficient to produce more happiness. A strong correlation between democracy and happiness is insufficient to make this argument. It is well known that many predictors of nations’ happiness scores are strongly correlated with each other. One well-known predictor is nations’ wealth or purchasing power. Money does buy essential goods. The best predictor of happiness is the median income per person, which reflects the spending power of average citizens and is not distorted by international trade or rich elites.

While it is known that purchasing power is a predictor of well-being, it is often overlooked how strong the relationship is. The linear correlation across nations is r = .79 (rank r = .82). It is often argued that the relationship between income and happiness is not linear and that money is more important in poorer countries. However, the correlation with log income is only slightly higher, r = .83.

This might suggest that purchasing power and democracy are both important for happiness. However, purchasing power and democracy are also strongly correlated, (linear r = .72, rank = .75). Multiple regression analysis can be used to see whether both variables independently contribute to the prediction of happiness.

Of course, dollars cannot be directly compared to ratings on a democracy index. To make the results comparable, I scored both variables from 0 for the lowest possible score to 1 for the highest possible score. For purchasing power, this variable ranged from Madagascar ($398) to Luxembourg ($26,321). For democracy, this variable ranged from Myanmar (1.02) to Norway (9.75).
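The 0-to-1 rescaling described above is a standard min-max normalization. A sketch using the reported democracy endpoints (Myanmar at 1.02, Norway at 9.75; the intermediate score of 8.54 is a hypothetical example):

```python
def minmax(x, lo, hi):
    """Rescale x so the lowest observed country maps to 0
    and the highest observed country maps to 1."""
    return (x - lo) / (hi - lo)

# Democracy Index, rescaled between Myanmar (1.02) and Norway (9.75)
print(minmax(1.02, 1.02, 9.75))            # -> 0.0
print(minmax(9.75, 1.02, 9.75))            # -> 1.0
print(round(minmax(8.54, 1.02, 9.75), 2))  # hypothetical country -> 0.86
```

After this transformation, each regression coefficient can be read as the predicted change in happiness when moving from the lowest- to the highest-scoring country on that predictor.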

The results show that purchasing power is a much stronger predictor of happiness than democracy.

The model predicts that a country with the lowest standing on purchasing power and democracy has a score of 3.63 on Cantril’s happiness measure. Increasing wealth to the maximum level without changing democracy would increase happiness to 3.63 + 3.13 = 6.76. In contrast, keeping purchasing power at the lowest level and increasing democracy to the highest level would increase happiness only to 3.63 + 0.48 = 4.11. One problem with statistical analyses across nations is that the sample size is limited by the number of nations. As a result, the positive relationship with democracy is not statistically significant and it is possible that the true effect is zero. In contrast, the effect of purchasing power is highly significant and it is unlikely (less than 5%) that the increase is less than 2.5 points.
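The predicted values above follow directly from the regression equation; a sketch using the coefficients reported in the text (intercept 3.63, purchasing power 3.13, democracy 0.48, with both predictors on the 0-1 scale):

```python
# Coefficients taken from the regression reported above
INTERCEPT = 3.63
B_PURCHASING_POWER = 3.13
B_DEMOCRACY = 0.48

def predicted_happiness(purchasing_power, democracy):
    """Predicted Cantril-ladder mean for a country with the given
    0-1 standings on purchasing power and democracy."""
    return INTERCEPT + B_PURCHASING_POWER * purchasing_power + B_DEMOCRACY * democracy

print(round(predicted_happiness(0, 0), 2))  # lowest on both     -> 3.63
print(round(predicted_happiness(1, 0), 2))  # max wealth only    -> 6.76
print(round(predicted_happiness(0, 1), 2))  # max democracy only -> 4.11
```

As the next paragraph notes, these are conditional predictions from a small cross-national sample, not causal effects.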

Do these results imply that democracy is not very important for citizens’ happiness? Not necessarily. A regression analysis ignores the correlation between the predictor variables. It is possible that the correlation between purchasing power and democracy reflects at least in part a causal effect of democracy on wealth. For example, democratic governments may invest more in education and innovation and achieve higher economic growth. Democracies may also produce better working conditions and policies that benefit the working class rather than wealthy elites.

I will not repeat the mistake of many other social scientists of ending with a strong conclusion that fits their world views based on weak and inconclusive data. The main aim of this blog post is to warn readers that social science is much more complicated than the natural sciences. “Follow the science” makes a lot of sense when large clinical trials show strong effectiveness of drugs or vaccines. The social sciences can provide valuable information, but they do not provide simple rules that can be followed to increase human well-being. This does not mean that social science is irrelevant. Ideally, social scientists would provide factual information and leave the interpretation to educated consumers.

Interpreting discrepancies between Self-Perceptions and IAT scores: Who is defensive?

In 1998, Anthony G. Greenwald and colleagues introduced the Implicit Association Test. Since then, Implicit Association Tests have been used in thousands of studies with millions of participants to study stereotypes and attitudes. The most prominent and controversial example is the race IAT, which has been used to argue that many White Americans have more negative attitudes towards African Americans than they admit to others or even to themselves.

The popularity of IATs can be attributed to the use of IATs on the Project Implicit website that provides visitors of the website with the opportunity to take an IAT and to receive feedback about their performance. Over 1 million visitors have received feedback about their performance on the race IAT (Howell, Gaither, & Ratliff, 2015).

Providing participants with performance feedback can be valuable and educational. Coaches provide feedback to athletes so that they can improve their performance, and professors provide feedback on midterms so that students can improve their performance on finals. However, the value of feedback depends on its accuracy. As psychological researchers know, providing participants with false feedback is unethical and requires extensive debriefing to justify its use in research. It is therefore crucial to examine the accuracy of performance feedback on the race IAT.

At face value, IAT feedback is objective and reflects participants’ responses to the stimuli that were presented during an IAT. However, this performance feedback should come with a warning that performance could vary across repeated administration of a test. For example, the retest reliability of performance on the race IAT has been estimated to be between r = .2 and r = .5. Even using a value of r = .5 implies that there is only a 75% probability that somebody with a score above average receives a score above average again on a second test (Rosenthal and Rubin, 1982).
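The 75% figure comes from Rosenthal and Rubin's binomial effect size display, which translates a correlation into the probability of landing above the median on one dichotomized variable given an above-median score on the other:

```python
def besd_success_rate(r):
    """Binomial Effect Size Display (Rosenthal & Rubin, 1982):
    with both variables split at the median, the probability of
    scoring above the median on test 2 given an above-median
    score on test 1 is .50 + r/2."""
    return 0.5 + r / 2

# Retest reliability estimates for the race IAT range from r = .2 to r = .5
print(round(besd_success_rate(0.5), 2))  # -> 0.75
print(round(besd_success_rate(0.2), 2))  # -> 0.6
```

At the lower end of the reliability range, feedback consistency is barely better than a coin flip.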

However, the Project Implicit website gives the false impression that performance on IATs is rather consistent, while avoiding quantitative information about reliability.


Unreliability is not the only reason why performance feedback on the Project Implicit website could be misleading. Another problem is that visitors may be given the impression that performance on the race IAT reveals something about themselves that goes beyond performance on this specific task. One possible interpretation of race IAT scores is that they reveal implicit attitudes or evaluations of Black and White Americans. These implicit attitudes can differ from the attitudes individuals believe they have, which are called explicit attitudes. In fact, Greenwald et al. (1998) introduced IATs as a method that can detect implicit attitudes that differ from explicit attitudes, and this dual-attitude model has fueled interest in IATs.

The Project Implicit website does not provide a clear explanation of what Implicit Association Tests measure. Regarding the race IAT, visitors are told that it is not a measure of prejudice, but that it does measure their biases, even if these biases are not endorsed or contradict conscious beliefs.


However, other frequently asked questions imply that IATs measure implicit stereotypes and attitudes. One question asks how IATs measure implicit attitudes, implying that they can measure implicit attitudes (and that implicit attitudes exist).


Another one implies that performance on the race IAT reveals implicit attitudes that reflect cultural biases.

In short, while Project Implicit may not provide a clear explanation of what is being tested with an Implicit Association Test, it is strongly implied that test performance reveals something about participants’ racial biases that may contradict their self-perceptions.

An article by Howell, Gaither, and Ratliff (2015) makes this assumption explicit. The article examines how visitors of the Project Implicit website respond to performance feedback on the race IAT. Its key claim is that “people are generally defensive in response to feedback indicating that their implicit attitudes differ from their explicit attitudes” (p. 373). This statement rests on two assumptions. First, it makes the assumption of dual-attitude models that there are explicit and implicit attitudes, as suggested by Greenwald et al. (1998). Second, it implies that performance on a single race IAT provides highly valid information about implicit attitudes. These assumptions place researchers in the position of experts who know individuals’ implicit attitudes, just as a psychoanalyst is in a superior position to understand the true meaning of a dream. If test takers reject this verdict, they are considered defensive because they are unwilling to accept the truth.

To measure defensiveness, Howell et al. (2015) used answers to three questions that visitors of the Project Implicit website answered after receiving performance feedback on the race IAT:
(a) the IAT does not reflect anything about my thoughts or feelings, unconscious or otherwise;
(b) whether I like my IAT score or not, it captures something important about me (reversed);
(c) the IAT reflects something about my automatic thoughts and feelings concerning this topic (reversed).
Responses were made on a scale from 1 = strongly disagree to 4 = strongly agree. On this scale, a score of 2.5 would imply neither agreement nor disagreement with these statements.

There was hardly any difference in defensiveness scores between White (M = 2.31, SD = 0.68), Black (M = 2.38, SD = 0.74), or biracial (M = 2.33, SD = 0.73) participants. For White participants, a larger pro-White discrepancy was correlated with higher defensiveness scores, partial r = .16. The same result was found for Black participants, partial r = .13. A similar trend emerged for biracial participants. While these correlations are weak, they suggest that all three racial groups were less likely to believe in the accuracy of the feedback when the IAT scores showed a stronger pro-White bias than the self-ratings implied.

Howell et al. (2015) interpret these results as evidence of defensiveness. Accordingly, “White individuals want to avoid appearing racist (O’Brien et al., 2010) and Black individuals value pro-Black bias (Sniderman & Piazza, 2002)” (p. 378). However, this interpretation of the results rests on the assumption that the race IAT is an unbiased measure of racial attitudes. Howell et al. (2015) ignore a plausible alternative explanation of their results. The alternative explanation is that performance feedback on the race IAT is biased in favor of pro-White attitudes. One source of this bias could be the scoring of IATs which relies on the assumption that neutral attitudes correspond to a zero score. This assumption has been challenged in numerous articles (e.g., Blanton, Jaccard, Strauts, Mitchell, & Tetlock, 2015). It is also noteworthy that other implicit measures of racial attitudes show different results than the race IAT (Judd et al., 1995; Schimmack & Howard, 2021). Another problem is that there is little empirical support for dual-attitude models (Schimmack, 2021). Thus, it is impossible for IAT scores to provide truthful information that is discrepant from individuals’ self-knowledge (Schimmack, 2021).

Of course, people are defensive when they are confronted with unpleasant information and inconvenient truths. A prime example of defensiveness is the response of the researchers behind Project Implicit to valid scientific criticism of their interpretation of IAT scores.


Despite several inquiries about questionable or even misleading statements on the frequently asked questions page, Project Implicit visitors are not informed that the wider scientific community has challenged the interpretation of performance feedback on the race IAT as valid information about individuals’ implicit attitudes. The simple fact that a single IAT score provides insufficient information to make valid claims about an individual’s attitudes or behavioral tendencies is also missing. Visitors should be informed that the most plausible and benign reason for a discrepancy between their test scores and their beliefs is that the test scores could be biased. However, Project Implicit is unlikely to provide visitors with this information because the website is used for research purposes and willingness to participate in research might decrease if participants were told the truth about the mediocre validity of IATs.

Proponents of IATs often argue that taking an IAT can be educational. However, Howell et al. (2015) point out that even this alleged benefit is elusive because individuals are more likely to believe themselves than the race IAT feedback. Thus, rejection of IAT feedback, whether it is based on defensiveness or valid concerns about the validity of the test results, might undermine educational programs that aim to reduce actual racial biases. It is therefore problematic to use the race IAT in education and intervention programs.