Cross-Cultural Comparisons of Personality: Beware of Method Factors

Ulrich Schimmack
Shigehiro Oishi


Personality ratings on a 25-item Big Five measure by two national samples (US, Japan) were analyzed with an item-level measurement model that separates method factors (acquiescence, halo bias) from trait factors. Results reveal a strong influence of halo bias on US responses that distorts cultural comparisons in personality. After correcting for halo bias, Japanese were more conscientious, extraverted, open to experience and less neurotic and agreeable. The results support cultural differences in positive illusions and raise questions about the validity of studies that rely on scale means to examine cultural differences in personality.


Cultural stereotypes imply cross-cultural differences in personality traits. However, cross-cultural studies of personality do not support the validity of these cultural stereotypes (Terracciano et al., 2005). Whenever two measures produce divergent results, it is necessary to examine the sources of these discrepancies. One obvious reason could be that cultural stereotypes are simply wrong. It is also possible that scientific studies of personality across cultures produce misleading results (Perugini & Richetin, 2007). One problem for empirical studies of cross-cultural differences in personality is that cultural differences tend to be small. Culture explains at most 10% of the variance and often the percentages are much smaller. For example, McCrae et al. (2010) found that culture explained only 1.5% of the variance in agreeableness ratings. As some of this variance is method variance, the variance due to actual differences in agreeableness is likely to be less than 1%. With small amounts of valid variance, method factors can have a strong influence on the pattern of mean differences across cultures.

One methodological problem in cross-cultural studies of personality is that personality measures are developed with a focus on the correlation of items with each other within a population. The item means are not relevant with the exception that items should avoid floor or ceiling effects. However, cross-cultural comparisons rely on differences in the item means. As item means have not been subjected to psychometric evaluations, it is possible that item means lack construct validity. Take “working hard” as an example. How hard people work could be influenced by culture. For example, in poor cultures people have to work harder to make a living. The item “working hard” may correctly reflect variation in conscientiousness within poor cultures and within rich cultures, but the differences between cultures would reflect environmental conditions rather than conscientiousness. As a result, it is necessary to demonstrate that cultural differences in item means are valid measures of cultural differences in personality.
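The “working hard” argument can be illustrated with a small simulation. The sketch below does not analyze any real data; the sample size, the loading of 0.7, and the intercept shift of 0.5 are hypothetical values chosen only for illustration. Both groups have the same distribution of conscientiousness, but the item has a higher intercept in one group because of environmental demands.

```python
import numpy as np

def simulate_item(n, loading, intercept, rng):
    """Simulate a latent trait and one item that measures it.

    item = intercept + loading * trait + residual
    The trait has mean 0 and variance 1 in every group, so any group
    difference in the item mean comes from the intercept alone.
    """
    trait = rng.normal(0.0, 1.0, size=n)
    residual = rng.normal(0.0, np.sqrt(1.0 - loading**2), size=n)
    item = intercept + loading * trait + residual
    return trait, item

rng = np.random.default_rng(0)

# Hypothetical values: identical loading, different item intercepts.
trait_rich, item_rich = simulate_item(5000, 0.7, 0.0, rng)
trait_poor, item_poor = simulate_item(5000, 0.7, 0.5, rng)

# Within each group the item is a valid indicator of the trait.
r_rich = np.corrcoef(trait_rich, item_rich)[0, 1]
r_poor = np.corrcoef(trait_poor, item_poor)[0, 1]

# Between groups the item means differ although the trait means do not.
item_gap = item_poor.mean() - item_rich.mean()
trait_gap = trait_poor.mean() - trait_rich.mean()

print(f"within-group validity: r = {r_rich:.2f} (rich), r = {r_poor:.2f} (poor)")
print(f"item mean difference = {item_gap:.2f}, trait mean difference = {trait_gap:.2f}")
```

In this sketch the item correlates with the trait equally well in both groups, yet a comparison of item means would wrongly suggest a group difference in the trait.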

Unfortunately, obtaining data from a large sample of nations is difficult and sample sizes are often rather small. For example, McCrae et al. (2010) examined convergent validity of Big Five scores with 18 nations. The only significant evidence of convergent validity was obtained for neuroticism, r = .44, and extraversion, r = .45. Openness and agreeableness even produced small negative correlations, r = -.27, r = -.05, respectively. The largest cross-cultural studies of personality had 36 overlapping nations (Allik et al., 2017; Schmitt et al., 2007). The highest convergent validity was r = .4 for extraversion and conscientiousness. Low convergent validity, r = .2, was observed for neuroticism and agreeableness, and the convergent validity for openness was 0 (Schimmack, 2020). These results show the difficulty of measuring personality across cultures and the lack of validated measures of cultures’ personality profiles.

Method Factors in Personality Measurement

It is well-known that self-ratings of personality are influenced by method factors. One factor is a stylistic factor in the use of response formats known as acquiescence bias (Cronbach, 1942, 1965). The other factor reflects individual differences in responding to the evaluative meaning of items known as halo bias (Thorndike, 1920). Both method factors can distort cross-cultural comparisons. For example, national stereotypes suggest that Japanese individuals are more conscientious than US American individuals, but mean scores of conscientiousness in cross-cultural studies do not confirm this stereotype (Oishi & Roth, 2009). Both method factors may artificially lower Japan’s mean score because Japanese respondents are less likely to use extreme scores (Min, Cortina, & Miller, 2016) and Asians are less likely to inflate their scores on desirable traits (Kim, Schimmack, & Oishi, 2012). In this article, we used structural equation modeling to separate method variance from trait variance to distinguish cultural differences in response tendencies from cultural differences in personality traits.

Convenience Samples versus National Samples

Another problem for empirical studies of national differences is that psychologists often rely on convenience samples. The problem with convenience samples is that personality can change with age and that there are regional differences in personality within nations (). For example, a sample of students at New York University may differ dramatically from a student sample at Mississippi State University or Iowa State University. Although regional differences tend to be small, national differences are also small. Thus, small regional differences can bias national comparisons. To avoid these biases it is preferable to compare national samples that cover all regions of a nation and a broad age range.

Modeling Approach

The purpose of our study is to advance research on cultural differences in personality by comparing a Japanese and a US national sample that completed the same Big Five personality questionnaire using a measurement model that distinguishes personality factors and method factors. The measurement model is an improved version of Anusic et al.’s (2009) halo-alpha-beta model (Schimmack, 2019). The model is essentially a tri-factor model.

Figure 1

That is, each item loads on three factors, namely (a) a primary loading on one of the Big Five factors, (b) a loading on an acquiescence bias factor, and (c) a loading on the evaluative bias/halo factor. As Big Five measures typically do not show a simple structure, the model also can include secondary loadings on other Big Five factors. This measurement model has been successfully fitted to several Big Five questionnaires (Schimmack, 2019). This is the first time the model has been applied in a multiple-group analysis to compare measurement models for US and Japanese samples.
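In equation form, the structure described above can be sketched as follows. The notation is introduced here for illustration only and is not taken from the cited sources: x_ij is the response of person i to item j, tau_j is the item intercept, T_i,k(j) is the person's score on the Big Five factor k(j) to which item j is assigned, A_i is the acquiescence factor, H_i is the halo factor, and epsilon_ij is the item residual.

```latex
x_{ij} = \tau_j
       + \lambda_j \, T_{i,k(j)}
       + \sum_{m \neq k(j)} \gamma_{jm} \, T_{i,m}
       + a_j \, A_i
       + h_j \, H_i
       + \varepsilon_{ij}
```

Here lambda_j is the primary loading, the gamma_jm are secondary loadings (most of which would be fixed to zero), and a_j and h_j are the loadings on the acquiescence and halo factors. In a multiple-group analysis, each of these parameters and the factor means can be constrained to be equal across groups or freed.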

We first fitted a very restrictive model that assumed invariance across the two samples. Given the lack of psychometric cross-cultural comparisons, we expected that this model would not have acceptable fit. We then modified the model to allow for cultural differences in some primary factor loadings, secondary factor loadings, and item intercepts. This step makes our work exploratory. However, we believe that this exploratory work is needed as a first step towards psychometrically sound measurement of cultural differences.


Participants (N = 952 Japanese, 891 US) were recruited by Nikkei Research Inc. and its U.S. affiliate using a national probabilistic sampling method based on gender and age. The mean age was 44. The data have been used before to compare the influence of personality on life-satisfaction judgments, but without comparing mean levels in personality and life-satisfaction (Kim, Schimmack, Oishi, & Tsutsui, 2018).


The Big Five items were taken from the International Personality Item Pool (Goldberg et al., 2006). There were five items for each of the Big Five dimensions (Table 1).


We first fitted a model without mean structure to the data. A model with strict invariance for the two samples did not have acceptable fit using RMSEA < .06 and CFI > .95 as criterion values, RMSEA = .064, CFI = .834. However, CFI values should not be expected to reach .95 in models with single-item indicators (Anusic et al., 2009). Therefore, the focus is on RMSEA. We first examined modification indices (MI) of primary loadings. We used MI > 30 as a criterion to free parameters to avoid overfitting the model. We found seven primary loadings that would improve model fit considerably (n4, e3, a1, a2, a3, a4, c4). Freeing these parameters improved the model (RMSEA = .060, CFI = .857). We next examined loadings on the halo factor because it is likely that some items differ in their connotative meaning across languages. However, we found only two notable MIs (o1, c4). Freeing these parameters improved model fit (RMSEA = .057, CFI = .871). We then identified six secondary loadings that differed notably across cultures. One was a secondary loading on neuroticism (e4), four were secondary loadings on agreeableness (n5, e1, e3, o4), and one was a secondary loading on conscientiousness (n3). Freeing these parameters improved model fit (RMSEA = .052, CFI = .894).

We were satisfied with this measurement model and continued with the means model. The first model fixed the item intercepts and factor means to be identical. This model had worse fit than the model without a means structure (RMSEA = .070, CFI = .803). The biggest MI was observed for the mean of the halo factor. Allowing for mean differences in halo improved model fit considerably (RMSEA = .060, CFI = .849). MIs next suggested allowing for mean differences in extraversion and agreeableness. We next allowed for mean differences in the other factors. This further improved model fit (RMSEA = .058, CFI = .864), but not as much. MIs suggested seven items with different item intercepts (n1, n5, e3, o3, a5, c3, c5). Relaxing these parameters improved model fit close to the level for the model without a mean structure (RMSEA = .053, CFI = .888).
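For readers unfamiliar with the two fit indices, the sketch below shows the commonly used point-estimate formulas for RMSEA and CFI. It is a minimal illustration, not the computation used for the reported values: the chi-square values, degrees of freedom, and sample size are hypothetical, the RMSEA formula is the single-group version, and SEM programs may use slightly different variants (for example, a multiple-group adjustment).

```python
import math

def rmsea(chi2, df, n):
    """Single-group RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """CFI: 1 minus the model's noncentrality relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model, 0.0)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical input values, not taken from the study.
fit_rmsea = rmsea(chi2=200.0, df=100, n=401)
fit_cfi = cfi(chi2_model=200.0, df_model=100, chi2_base=1100.0, df_base=100)

print(f"RMSEA = {fit_rmsea:.3f}, CFI = {fit_cfi:.3f}")
```

Because CFI compares the model with a baseline model of uncorrelated items, it tends to be low when the items are only weakly correlated to begin with, which is one way to understand why it is less informative than RMSEA for item-level models.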

Table 1 shows the primary loadings and the loadings on the halo factor for the 25 items.

Table 1

The results show very similar primary loadings for most items. This means that factors have similar meaning in the two samples and that it is possible to compare the two cultures. Nevertheless, there are some differences that could bias comparisons based on item-sum-scores. The item “feeling comfortable around people” loads much more strongly on the extraversion factor in the US than in Japan. The agreeableness items “insult people” and “sympathize with others’ feelings” also load more strongly in the US than in Japan. Finally, “making a mess of things” is a conscientiousness item in the US, but not in Japan. The fact that item loadings are more consistent with the theoretical structure in the US sample can be attributed to the development of the items in the US.

A novel and important finding is that most loadings on the halo factor are also very similar across nations. For example, the item “have excellent ideas” shows a high loading for the US and Japan. This finding contradicts the idea that evaluative biases are culture-specific (Church et al., 2014). The only notable difference is the item “make a mess of things”, which has no notable loading on the halo factor in Japan. Even in English, the meaning of this item is ambiguous and future studies should replace this item with a better item. The correlation between the halo loadings for the two samples is high, r = .96.

Table 2 shows the item means and the item intercepts of the model.

Table 2

The item means of the US sample are strongly correlated with the loadings on the halo factor, r = .81. This is a robust finding in Western samples. More desirable items are endorsed more. The reason could be that individuals actually act in desirable ways most of the time and that halo bias influences item means. Surprisingly, there is no notable correlation between item means and loadings on the halo factor for the Japanese sample, r = .08. This pattern of results suggests that US means are much more strongly influenced by halo bias than Japanese means. Further evidence is provided by inspecting the mean differences. For desirable items (low N, high E, O, A, & C) US means are always higher than Japanese means. For undesirable items, the US means are always lower than Japanese means, except for the item “stay in the background” where the means are identical. The difference scores are also positively correlated with the halo loadings, r = .90. In conclusion, there is strong evidence that halo bias distorts the comparison of personality in these two samples.
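The logic behind these correlations can be sketched with model-implied item means. In the example below, all numbers are hypothetical and are not the values from Table 1 or Table 2: the halo loadings, the intercepts, and the size of the group difference on the halo factor are assumptions made for illustration, and the trait factor means are set to be equal in both groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 25

# Hypothetical halo loadings: negative for undesirable items,
# positive for desirable items.
halo_loadings = np.linspace(-0.6, 0.6, n_items)

# Hypothetical item intercepts near the scale midpoint, plus small
# group-specific intercept differences.
intercepts = 3.0 + rng.normal(0.0, 0.3, n_items)
intercept_diff = rng.normal(0.0, 0.1, n_items)

halo_mean_a = 1.0  # assumed: group A has a higher mean on the halo factor
halo_mean_b = 0.0  # reference group

# Model-implied item means with equal trait factor means (set to 0).
means_a = intercepts + intercept_diff + halo_loadings * halo_mean_a
means_b = intercepts + halo_loadings * halo_mean_b

# Group differences in item means track the halo loadings.
diff = means_a - means_b
r_diff = np.corrcoef(diff, halo_loadings)[0, 1]
r_means_a = np.corrcoef(means_a, halo_loadings)[0, 1]
r_means_b = np.corrcoef(means_b, halo_loadings)[0, 1]

print(f"r(difference, halo loadings) = {r_diff:.2f}")
print(f"r(group A means, halo loadings) = {r_means_a:.2f}")
print(f"r(group B means, halo loadings) = {r_means_b:.2f}")
```

Under these assumptions, a mean difference on the halo factor alone is sufficient to produce item mean differences that are strongly correlated with the halo loadings, even though the groups do not differ on any trait factor.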

The item intercepts show cultural differences in items after taking cultural differences in halo and the other factors into account. Notable differences were observed for some items. Even after controlling for halo and extraversion, US respondents report higher levels of being comfortable around people than Japanese. This difference fits cultural stereotypes. After correcting for halo bias, Japanese now score higher on getting chores done right away than Americans. This also fits cultural stereotypes. However, Americans still report paying more attention to detail than Japanese, which is inconsistent with cultural stereotypes. Extensive validation research is needed to examine whether these results reflect actual cultural differences in personality and behaviours.

Figure 2 shows the mean differences on the Big Five factors and the two bias factors.

Figure 2

Figure 2 shows a very large difference in halo bias. The difference is so large that it seems implausible. Maybe the model is overcorrecting, which would bias the mean differences for the actual traits in the opposite direction. There is little evidence of cultural differences in acquiescence bias. One open question is whether the strong halo effect is entirely due to evaluative biases. It is also possible that a modesty bias plays a role because modesty implies less extreme responses to desirable items and less extreme responses to undesirable items. To separate the two, it would be necessary to include frequent and infrequent behaviours that are not evaluative.

The most interesting result for the Big Five factors is that the Japanese sample scores higher in conscientiousness than the US sample after halo bias is removed. This reverses the mean differences in this sample and previous studies that show higher conscientiousness for US than Japanese samples (). The present results suggest that halo bias masks the actual difference in conscientiousness. However, other results are more surprising. In particular, the present results suggest that Japanese people are more extraverted than Americans. This contradicts cultural stereotypes and previous studies. The problem is that cultural stereotypes could be wrong and that previous studies did not control for halo bias. More research with actual behaviours and less evaluative items is needed to draw strong conclusions about personality differences between cultures.


It has been known for 100 years that self-ratings of personality are biased by connotative meaning. At least in North America it is common to see a strong correlation between the desirability of items and the means of self-ratings. There is also consistent evidence that Americans rate themselves in a more desirable manner than the average American (). However, this does not mean that Americans are seeing themselves as better than everybody else. In fact, self-ratings tend to be slightly less favorable than ratings of friends or family members (), indicating a general evaluative bias to rate oneself and close others favorably.

Given the pervasiveness of evaluative biases in personality ratings it is surprising that halo bias has received so little attention in cross-cultural studies of personality. One reason could be the lack of a good method to measure and remove halo variance from personality ratings. Despite early attempts to detect socially desirable responding, lie scales have shown little validity as bias measures (ref). The problem is that manifest scores on lie scales contain as much valid personality variance as bias variance. Thus, correcting for scores on these scales literally throws out the baby (valid variance) with the bathwater (bias variance). Structural equation modeling (SEM) solves this problem by splitting observed variances into unobserved or latent variances. However, personality psychologists have been reluctant to take advantage of SEM because item models require large samples and theoretical models were too simplistic and produced bad fit. Informed by multi-rater studies that emerged in the 1990s, we developed a measurement model of the Big Five that separates personality variance from evaluative bias variance (Anusic et al., 2009; Schimmack, Kim, & 2012; Schimmack, 2019).

Here we applied this model for the first time to cross-cultural data to examine whether cultures differ in halo bias. The results suggest that halo bias has a strong influence on personality ratings in the US, but not in Japan. The differences in halo bias distort comparisons on the actual personality traits. While raw scores suggest that Japanese people are less conscientious than Americans, the corrected factor means suggest the opposite. Japanese participants also appeared to be less neurotic, more extraverted and open to experiences, which was a surprising result. Correcting for halo bias did not change the cultural differences in agreeableness. Americans were more agreeable than Japanese with and without correction for halo bias. Our results do not provide a conclusive answer about cultural differences in personality, but they shed new light on several questions in personality research.

Cultural Differences in Self-enhancement

One unresolved question in personality psychology is whether positive biases in self-perceptions, also known as self-enhancement, are unique to American or Western cultures or whether they are a universal phenomenon (Church et al., 2016). One problem is that there are different approaches to the measurement of self-enhancement. The most widely used method is social comparison, where individuals compare themselves to an average person. These studies tend to show a persistent better-than-average effect in all cultures (ref). However, this finding does not imply that halo biases are equally strong in all cultures. Brown and Kobayashi (2002) found better-than-average effects in the US and Japan, but Japanese ratings of the self and others were less favorable than those in the US. Kim et al. (2012) explain this pattern with a general norm to be positive in North America that influences ratings of the self as well as ratings of others. Our results are consistent with this view and suggest that self-enhancement is not a universal tendency. More research with other cultures is needed to examine which cultural factors moderate halo biases.

Rating Biases or Self-Perception Biases

An open question is whether halo biases are mere rating biases or reflect distorted self-perceptions. One model suggests that participants are well aware of their true personality, but merely present themselves in a more positive light to others. Another model suggests that individuals truly believe that their personality is more desirable than it actually is. It is not easy to distinguish between these two models empirically.

Halo Bias and the Reference Group Effect

In an influential article, Heine et al. (2002) criticized cross-cultural comparisons in personality ratings as invalid. The main argument was that respondents adjust the response categories to cultural norms. This adjustment was called the reference group effect. For example, the item “insult people” is not answered based on the frequency of insults or a comparison of the frequency of insults to other behaviours. Rather it is answered in comparison to the typical frequency of insults in a particular culture. The main prediction made by the reference group effect is that responses in all cultures should cluster around the mid-point of a Likert-scale that represents the typical frequency of insults. As a result, cultures could differ dramatically in the actual frequency of insults, while means on the subjective rating scales are identical.

The present results are inconsistent with a simple reference group effect. Specifically, the US sample showed notable variation in item means that was related to item desirability. As a result, undesirable items like “insult people” had a much lower mean (M = 1.83) than the mid-point of the scale (3), and desirable items like “have excellent ideas” had a higher mean (M = 3.73) than the mid-point of the scale. This finding suggests that halo bias rather than a reference group effect threatens the validity of cross-cultural comparisons.

Reference group effects may play a bigger role in Japan. Here item means were not related to item desirability and clustered more closely around the mid-point of the scale. The highest mean was 3.56 for worry and the lowest mean was 2.45 for feeling comfortable around people. However, other evidence contradicts this hypothesis. After removing effects of halo and the other personality factors, item intercepts were still highly correlated across the two national samples, r = .91. This finding is inconsistent with culture-specific reference groups that would not produce consistent item intercepts.

Our results also provide a new explanation for the low conscientiousness of Japanese samples. A reference group effect would not predict a significantly lower level of conscientiousness. However, a stronger halo effect in the US explains this finding because conscientiousness is typically assessed with desirable items. Our results are also consistent with the finding that self-esteem and self-enhancement are more pronounced in the US than in Japan (Heine & Buchtel, 2009). These aforementioned biases inflate conscientiousness scores in the US. After removing this bias, Japanese rate themselves as more conscientious than US Americans.

Limitations and Future Directions

We echo previous calls for validation of personality scores of nations (Heine & Buchtel, 2009). The current results are inconsistent across questionnaires and even the low level of convergent validity may be inflated by cultural differences in response styles. Future studies should try to measure personality with items that minimize social desirability and use response formats that avoid the use of reference groups (e.g., frequency estimates). Moreover, results based on ratings should be validated with objective indicators of behaviours.

Future research also needs to take advantage of developments in psychological measurement and use models that can identify and control for response artifacts. The present model demonstrates that evaluative bias or halo variance can be separated from actual personality variance. Future studies should use this model to compare a larger number of nations.

The main limitation of our study is the relatively small number of items. The larger the number of items, the easier it is to distinguish item-specific variance, method variance, and trait variance. The measure also did not properly take into account that the Big Five are higher-order factors of more basic traits called facets. Measures like the BFI-2 or the NEO-PI3 should be used to study cultural differences at the facet level, which often shows unique influences of culture that are different from effects on the Big Five (Schimmack, 2020).

We conclude with a statement of scientific humility. The present results should not be taken as clear evidence about cultural differences in personality. Our article is merely a small step towards the goal of measuring personality differences across cultures. One obstacle in revealing such differences is that national differences appear to be relatively small compared to the variation in personality within nations. One possible explanation for this is that variation in personality is caused more by biological than cultural factors. For example, twin studies suggest that 40% of the variance in personality traits is caused by genetic variation within a population, whereas cross-cultural studies suggest that at most 10% of the variance is caused by cultural influences on population means. Thus, while uncovering cultural variation in personality is of great scientific interest, evidence of cultural differences between nations should not be used to stereotype individuals from different nations. Finally, it is important to distinguish between personality traits that are captured by Big Five traits and other personality attributes like attitudes, values, or goals that may be more strongly influenced by culture. The key novel contribution of this article is to demonstrate that cultural differences in response styles exist and distort national comparisons of personality with simple scale means. Future studies need to take response styles into account.


Cronbach, L. J. (1942). Studies of acquiescence as a factor in the true-false test. Journal of Educational Psychology, 33(6), 401–415.

Heine, S. J., & Buchtel, E. E. (2009). Personality: The universal and the culturally specific. Annual Review of Psychology, 60, 369–394.

Perugini, M., & Richetin, J. (2007). In the land of the blind, the one-eyed man is king. European Journal of Personality, 21(8), 977–981.

Schimmack, U. (2020). Personality science: The science of human diversity. TopHat, 978-1-77412-253-2.

Terracciano, A. et al. (2005). National character does not reflect mean personality trait levels in 49 cultures. Science, 310, 96–100.
