
Why do men report higher levels of self-esteem than women?

Self-esteem is one of the most popular constructs in personality/social psychology. The common approach to studying self-esteem is to give participants a questionnaire with one or more questions (items). To study gender differences, the scores on these items are summed or averaged separately for men and women, and the two averages are compared. If this difference score is not zero, the data show a gender difference. Of course, the difference will never be exactly zero, so it is impossible to confirm the nil-hypothesis that men and women are exactly the same. A more interesting question is whether gender differences in self-esteem are fairly consistent across different samples and how large the average gender difference is. To answer this question, psychologists conduct meta-analyses. A meta-analysis combines findings from many small samples into one large sample.
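The core computation of a fixed-effect meta-analysis can be sketched in a few lines: each sample's effect size is weighted by the inverse of its sampling variance, so larger samples count more. The effect sizes and sample sizes below are hypothetical, purely for illustration.

```python
import math

def meta_analyze(effects, variances):
    """Inverse-variance weighted (fixed-effect) meta-analytic average."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))  # standard error of the pooled estimate
    return pooled, se

# Hypothetical gender differences (Cohen's d) from three small samples;
# the sampling variance of d is approximated by 4/N for equal group sizes.
effects = [0.30, 0.15, 0.20]
sample_sizes = [100, 400, 250]
variances = [4.0 / n for n in sample_sizes]
pooled, se = meta_analyze(effects, variances)
print(round(pooled, 3), round(se, 3))
```

Note how the pooled estimate sits closest to the effect from the largest sample, which is the whole point of the weighting.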

The first comprehensive meta-analysis of self-esteem reported a small difference between men and women, with men reporting slightly higher levels of self-esteem than women (Kling et al., 1999). What does a small difference look like? First, imagine that you have to predict whether 50 men and 50 women are above or below the average (median) in self-esteem, but the only information that you have is their gender. If there were no difference between men and women, gender would provide no information and you might just flip a coin, with a 50% chance of guessing correctly. However, given the information that men are slightly more likely to be above average in self-esteem, you guess above average for men and below average for women. This blatant stereotype helps you to be correct 54% of the time, but you are still incorrect in your guesses 46% of the time.
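The 54% figure can be recovered with a short calculation. Assuming two equal-variance normal distributions separated by d, the probability that a randomly drawn man falls above the overall median is the standard normal CDF evaluated at d/2:

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def guessing_accuracy(d):
    """Probability of correctly classifying a person as above/below the overall
    median from group membership alone, assuming two equal-variance normal
    distributions whose means differ by Cohen's d."""
    return norm_cdf(d / 2.0)

print(round(guessing_accuracy(0.21), 2))  # ~0.54 for the meta-analytic d = .21
print(round(guessing_accuracy(0.0), 2))   # 0.5: no information, coin flip
```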

Another way to get a sense of the magnitude of the effect size is to compare it to well-known, large gender differences. One of the largest gender differences that is also easy to measure is height. Men are 1.33 standard deviations taller than women, while the difference in self-esteem ratings is only 0.21 standard deviations. This means the difference in self-esteem is only about 16% of the difference in height (0.21 / 1.33 ≈ .16).

A more recent meta-analysis found an even smaller difference of d = .11 (Zuckerman & Hall, 2016). Such a small average difference increases the probability that gender differences in self-esteem ratings are even smaller, or even reversed, in some populations. That is, while the difference in height is so large that it can be observed in all human populations, the difference in self-esteem is so small that it may not be universally observed.

Another problem with small effects is that they are more susceptible to the influence of systematic measurement error. Unfortunately, psychologists rarely examine the influence of measurement error on their measures. Thus, this possibility has not been explored.

Another problem is that psychologists tend to rely on convenience samples, which makes it difficult to generalize findings to the general population. For example, psychology undergraduate samples select for specific personality traits that may make male or female psychology students less representative of their respective gender.

It is therefore problematic to draw premature conclusions about gender differences in self-esteem on the basis of meta-analyses of self-esteem ratings in convenience samples.

What Explains Gender Differences in Self-Esteem Ratings?

The most common explanations for gender differences in self-esteem are gender roles (Zuckerman & Hall, 2016) or biological differences (Schmitt et al., 2016). However, there are few direct empirical tests of these hypotheses. Even biologically oriented researchers recognize that self-esteem is influenced by many different factors, including environmental ones. It is therefore unlikely that biological sex differences have a direct influence on self-esteem. A more plausible model would assume that gender differences in self-esteem are mediated by a trait that shows stronger gender differences and that predicts self-esteem. The same holds for social theories. It seems unlikely that women rely on gender stereotypes to evaluate themselves. It is more plausible that they rely on attributes that show gender differences. For example, Western societies have different beauty standards for men and women, and women tend to give lower ratings of their own attractiveness (Gentile et al., 2009). Thus, a logical next step is to test mediation models. Surprisingly, few studies have explored well-known predictors of self-esteem as potential mediators of gender differences in self-esteem.

Personality Traits and Self-Esteem

Since the 1980s, thousands of studies have measured personality from the perspective of the Five Factor Model. The Big Five capture variation in negative emotionality (Neuroticism), positive energy (Extraversion), curiosity and creativity (Openness), cooperation and empathy (Agreeableness), and goal-striving and impulse-control (Conscientiousness). Given the popularity of self-esteem and the Big Five in personality research, many studies have examined the relationship between the Big Five and self-esteem, while other studies have examined gender differences in the Big Five traits.

Studies of gender differences show the biggest and most consistent differences for neuroticism and agreeableness. Women tend to score higher on both dimensions than men. The results for the Big Five and self-esteem are more complicated. Simple correlations show that higher self-esteem is associated with lower Neuroticism and higher Extraversion, Openness, Agreeableness, and Conscientiousness (Robins et al., 2001). The problem is that Big Five measures have a denotative and an evaluative component. Being neurotic does not only mean responding more strongly with negative emotions; it is also undesirable. Using structural equation modeling, Anusic et al. (2009) separated the denotative and evaluative components and found that self-esteem was strongly related to the evaluative component of personality ratings. This evaluative factor in personality ratings was first discovered by Thorndike (1920) one hundred years ago. The finding that self-esteem is related to overly positive self-ratings of personality is also consistent with a large literature on self-enhancement. Individuals with high self-esteem tend to exaggerate their positive qualities.

Interestingly, there are very few published studies of gender differences in self-enhancement. One possible explanation for this is that there is only a weak relationship between gender and self-enhancement. The rationale is that gender is routinely measured, so many studies of self-enhancement could have examined gender differences. It is also well known that psychologists are biased against null-findings. Thus, ample data without publications suggest that there is no strong relationship. However, a few studies have found stronger self-enhancement for men than for women. For example, one study showed that men overestimate their intelligence more than women do (von Stumm et al., 2011). There is also evidence that self-enhancement and halo predict biases in intelligence ratings (Anusic et al., 2009). However, it is not clear whether gender differences are related to halo or are specific to ratings of intelligence.

In short, a review of the literature on gender and personality and personality and self-esteem suggests three potential mediators of the gender differences in self-esteem. Men may report higher levels of self-esteem because they are lower in neuroticism, lower in agreeableness, or higher in self-enhancement.

Empirical Test of the Mediation Model

I used data from the Gosling–Potter Internet Personality Project (Gosling, Vazire, Srivastava, & John, 2004). Participants were visitors to a website who were interested in taking a personality test and receiving feedback about their personality. The advantage of this sampling approach is that it creates a very large dataset with millions of participants. The disadvantage is that men and women who visited this site might differ in personality traits or self-esteem. The questionnaire included a single-item measure of self-esteem. This item shows the typical gender difference in self-esteem (Bleidorn et al., 2016).

To separate descriptive factors of the Big Five from evaluative bias and acquiescence bias, I fitted a measurement model to the 44-item Big Five Inventory. I demonstrated earlier that this measurement model has good fit for Canadian participants (Schimmack, 2019). To test the mediation model, I added gender and self-esteem to the model. In this study, gender was measured with a simple dichotomous male vs. female question.

Gender was a predictor of all seven factors (Big Five + Halo + Acquiescence). Exploratory analyses examined whether gender had unique relationships with specific BFI items. These relationships could be due to unique relationships of gender with specific personality traits called facets. However, few notable relationships were observed. Self-esteem was predicted by all seven personality traits and gender. However, openness to experience showed only weak relationships with self-esteem. To stabilize the model, this path was fixed to zero.

I fitted the model to data from several nations. I selected nations with (a) a large number of complete records (N = 10,000) and (b) familiarity with English as a first or common second language (e.g., India yes, Japan no), while trying to sample a diverse range of cultures, because gender differences in self-esteem tend to vary across cultures (Bleidorn et al., 2016; Zuckerman & Hall, 2016). I fitted the model to samples from four nations: the US, the Netherlands, India, and the Philippines, with N = 10,000 for each nation. Table 1 shows the results.

The first two rows show the fit of the Canadian model to the other four nations. Fit is a bit lower for Asian samples, but still acceptable.

The results for sex differences in the Big Five are a bit surprising. Although all four samples show the typical gender difference in neuroticism, the effect sizes are relatively small. For agreeableness, the gender differences in the two Asian samples are negligible. This raises some concerns about the conclusion that gender differences in personality traits are universal and possibly due to evolved genetic differences (Schmitt et al., 2016). The most interesting novel finding is that there are no notable gender differences in self-enhancement. This also implies that self-enhancement cannot mediate gender differences in self-esteem.

The strongest predictor of self-esteem is self-enhancement. Effect sizes range from d = .27 in the Netherlands to d = .45 in the Philippines. The second strongest predictor is neuroticism. As neuroticism also shows consistent gender differences, neuroticism partially mediates the effect of gender on self-esteem. Although weak, agreeableness is a consistent negative predictor of self-esteem. This replicates Anusic et al.’s (2009) finding that the sign of the relationship reverses when halo bias in agreeableness ratings is removed from measures of agreeableness.

The total effects show the gender differences in the four samples. Consistent with the meta-analyses, the gender differences in self-esteem are weak, with effect sizes ranging from d = .05 to d = .15. Personality explains some of this relationship. The unexplained direct effect of gender is very small.
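The decomposition of a total effect into a direct path and indirect (mediated) paths follows a simple rule: each indirect effect is the product of the gender-to-mediator path and the mediator-to-outcome path, and the total effect is the direct effect plus the sum of the indirect effects. The path coefficients below are hypothetical, not the estimates from Table 1.

```python
# Hypothetical standardized paths from gender to three mediators ...
a_paths = {"neuroticism": -0.25, "agreeableness": -0.20, "halo": 0.00}
# ... and from those mediators to self-esteem.
b_paths = {"neuroticism": -0.30, "agreeableness": -0.10, "halo": 0.40}
direct_effect = 0.02  # hypothetical unexplained direct path of gender

# Each indirect effect is the product of the two paths through that mediator.
indirect_effects = {m: a_paths[m] * b_paths[m] for m in a_paths}
total_effect = direct_effect + sum(indirect_effects.values())

print(indirect_effects)
print(round(total_effect, 3))
```

With these made-up numbers, neuroticism carries the largest indirect effect, and halo carries none because gender is unrelated to it, mirroring the pattern described in the text.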


A large literature and several meta-analyses have documented small but consistent gender differences in self-ratings of self-esteem. Few studies have examined whether these differences are mere rating biases or tested causal models of these gender differences. This article addressed these questions by examining seven potential mediators: the Big Five traits as well as halo bias and acquiescence bias.

The results replicated previous findings that gender differences in self-esteem are small, d < .2. They also showed that neuroticism is a partial mediator of gender differences in self-esteem. Women tend to be more sensitive to negative information, and this disposition predicts lower self-esteem. It makes sense that a general tendency to focus on negative information also extends to evaluations of the self. Women appear to be more self-critical than men. A second mediator was agreeableness. Women tend to be more agreeable, and agreeable people tend to have lower self-esteem. However, this relationship was only observed in the Western nations and not in the Asian nations. This cultural difference explains why gender differences in self-esteem tend to be stronger in Western than in Asian cultures. Finally, a general evaluative bias in self-ratings of personality was the strongest predictor of self-esteem, but showed no notable gender differences. Gender also still had a very small relationship with self-esteem after accounting for the personality mediators.

Overall, these results are more consistent with models that emphasize similarities between men and women (Men and Women are from Earth) than models that emphasize gender differences (Women are from Venus and Men are from Mars). Even if evolutionary theories of gender differences are valid, they explain only a small amount of the variance in personality traits and self-esteem. As one evolutionary psychologist put it, "it is undeniably true that men and women are more similar than different genetically, physically and psychologically" (p. 52). The results also undermine claims that women internalize negative stereotypes about themselves and have notably lower self-esteem as a result. Given the small effect sizes, it is surprising how much empirical and theoretical attention gender differences in self-esteem have received. One reason is that psychologists often ignore effect sizes and only care about the direction of an effect. Given the small effect size of gender on self-esteem, it seems more fruitful to examine factors that produce variation in self-esteem for both men and women.

Lies, Damn Lies, and Experiments on Attitude Ratings

Ten years ago, social psychology had a once-in-a-lifetime opportunity to realize that most of its research is bullshit. Their esteemed colleague Daryl Bem published a hoax article about extrasensory perception in their esteemed Journal of Personality and Social Psychology. The editors felt compelled to write a soul-searching editorial about research practices in their field that could produce such nonsense results. However, 10 years later, social psychologists continue to use the same questionable practices to publish bullshit results in JPSP. Moreover, they are willfully ignorant of any criticism of their field, which is producing mostly pseudo-scientific garbage. Just now, Wegener and Petty, two social psychologists at Ohio State University, wrote an article that downplays the importance of replication failures in social psychology. At the same time, they published a JPSP article that shows they haven't learned anything from 10 years of discussion about research practices in psychology. I treat the first author as an innocent victim who is being trained in the dark art of research practices that have given us social priming, ego-depletion, and time-reversed sexual arousal.

The authors report seven studies. We don’t know how many other studies were run. The seven studies are standard experiments with one or two (2 x 2) experimental manipulations between subjects. The studies are quick online studies with Mturk samples. The main goal was to show that some experimental manipulations influence some ratings that are supposed to measure attitudes. Any causal effect on these measures is interpreted as a change in attitudes.

The problem for the authors is that their experimental manipulations have small effects on the attitude measures. So, taken individually, studies 1–6 would not show any effects. At no point did they consider this a problem and increase sample sizes. However, they were able to fix the problem by combining studies that were similar enough into one dataset. This was also done by Bem to produce significant results for time-reversed causality. It is not a good practice, but that doesn't bother editors and reviewers at the top journal of social psychology. After all, they all do not know how to do science.

So, let’s forget about the questionable studies 1-6 and focus on the preregistered replication study with 555 Mturk workers (Study 7). The authors analyze their data with a mediation model and find statistically significant indirect effects. The problem with this approach is that mediation no longer has the internal validity of an experiment. Spurious relationships between mediators and the DV can inflate these indirect effects. So, it is also important to demonstrate that there is an effect by showing that the manipulation changed the DV (Baron & Kenny, 1986). The authors do not report this analysis. The authors also do not provide information about standardized effect sizes to evaluate the practical significance of their manipulation. However, the authors did provide covariance matrices in a supplement and I was able to run the analyses to get this information.

Here are the results.

The main effect for the bias manipulation is d = -.04, p = .38, 95%CI = -.12, .05.

The main effect for the untrustworthiness manipulation is d = .01, p = .75, 95%CI = -.07, .10.

Neither effect is significant. Moreover, the effect sizes are so small, and thanks to the large sample size the confidence intervals are so narrow, that we can reject the hypothesis that the manipulations have even a small effect, d = .2.
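The logic of this inference is a simple equivalence check: if the entire 95% confidence interval falls inside (-.2, .2), effects of at least small size can be rejected. A minimal sketch, using the intervals reported above:

```python
def within_equivalence_bounds(ci_low, ci_high, bound=0.2):
    """True if the whole confidence interval lies strictly inside (-bound, bound),
    i.e. effects of at least |d| = bound can be rejected."""
    return -bound < ci_low and ci_high < bound

print(within_equivalence_bounds(-0.12, 0.05))  # bias manipulation
print(within_equivalence_bounds(-0.07, 0.10))  # untrustworthiness manipulation
```

This is the confidence-interval version of an equivalence test; a formal two one-sided tests (TOST) procedure would reach the same conclusion here.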

So, here we see the total failure of social psychology to understand what it is doing and its inability to make a real contribution to the understanding of attitudes and attitude change. This didn't stop Rich Petty from co-authoring an article about psychology's contribution to addressing the Covid-19 pandemic. Now, it would be unfair to blame 150,000 deaths on social psychology, but it is a fact that 40 years of trivial experiments have done little to help us change real-world attitudes, such as attitudes toward wearing masks.

I can only warn young, idealistic students against choosing social psychology as a career path. I speak from experience. I was a young, idealistic student eager to learn about social psychology in the 1990s. If I could go back in time, I would have done something else with my life. In 2010, I thought social psychology might actually change for the better, but in 2020 it is clear that most psychologists want to continue with their trivial experiments that tell us nothing about social behaviour. If you just can't help it and want to study social phenomena, I recommend personality psychology or other social sciences.

A Hierarchical Factor Analysis of Openness to Experience

In this blog post I report the results of a hierarchical factor analysis of 16 primary openness to experience factors. The data were obtained and made public by Christensen, Cotter, and Silvia (2019). The dataset contains correlations for 138 openness items taken from four different Big Five measures (NEO-PI3, HEXACO, BFAS, & Woo). The sample size was N = 802.

The authors used network analysis to examine the relationship among the items. In the network graph, the authors identified 10 clusters (communities) of items. Some of these clusters combine overlapping constructs in different questionnaires. For example, aesthetic appreciation is represented in all four questionnaires.

This is a good first step, but Figure 1 leaves many questions unanswered. Mainly, it does not provide quantitative information about the relationship of the clusters to each other. The main reason is that network analysis does not have a representation of the forces that bind items within a cluster together. This information was presented in a traditional correlation table based on sum scores of items. The problem with sum scores is that correlations between sum scores can be distorted by secondary loadings. Moreover, there is no formal test that 10 clusters provide an accurate representation of item-relationships. As a result, there is no test of this model against other plausible models. The advantage of structural equation modeling with latent variables is that it is possible to represent unobserved constructs like Openness and to test the fit of a model to the data.

Despite the advantages of structural equation modeling (SEM), many researchers are reluctant to use it for a number of unfortunate reasons. First, structural equation modeling has been called Confirmatory Factor Analysis (CFA). This has led to the misperception that SEM can only be used to test theoretical models. However, it is not clear how one would derive a theoretical model that perfectly fits data without exploration. I use SEM to explore the structure of Openness without an a priori theoretical model. This is no more exploratory than visual inspection of a network representation of a correlation matrix. There is no good term for this use of SEM because the term exploratory factor analysis is used for a different mathematical model. So, I simply call it SEM.

Another reason why SEM may not be used is that model fit can show that a specified model does not fit the data. It can be time consuming and require thought to create a model that actually fits the data. In contrast, EFA and network models always provide a solution even if the solution is suboptimal. This makes SEM harder to use than other exploratory methods. However, with some openness to new ideas and persistence, it is also always possible to find a fitting model with SEM. This does not mean it is the correct model, but it is also possible to compare models to each other with fit indices.

SEM is a very flexible tool and its capabilities have often not been fully recognized. While higher-order or two-level models are fairly common, models with more than two levels are rare, but can be easily fit to data that have a hierarchical structure. This is a useful feature of SEM because theoretical models have postulated that personality is hierarchically structured with several levels: The global level, aspects, facets, and even more specific traits called nuances below facets. However, nobody has attempted to fit a hierarchical model to see whether Openness has an aspect, a facet, and a nuance level. Christensen et al.’s data seemed ideally suited to examine this question.

One limitation of SEM is that modeling becomes increasingly difficult as the number of items increases. On the other hand, three items per construct are sufficient to create a measurement model at the lowest level of the hierarchy. I therefore first conducted simple CFA analyses of items belonging to the same scale and retained items with high loadings on the primary factor and no notable residual correlations with other items. I did not use the 20 aspect items because they were not designed to measure clean facets of Openness. This way, I only needed to fit a total of 48 items for the 16 primary scales of Openness in the three questionnaires:

NEO: Artistic, Ideas, Fantasy, Feeling, Active, Values
HEXACO: Artistic, Inquisitive, Creative, Unconventional
Woo: Artistic, Culture, Tolerance, Creative, Depth, Intellect

Exploratory analysis showed that the creative scales in the HEXACO and Woo measures did not have unique variance and could be represented by a single primary factor. This was also the case for the artistic construct in the HEXACO and Woo measures. However, the NEO artistic items showed some unique variance and were modeled as a distinct construct, although this could just be some systematic method variance in the NEO items.

The final model (MPLUS syntax) had reasonably good fit to the data, RMSEA = .042, CFI = .903. This fit was obtained after exploratory analyses of the data and simply shows that it was possible to find a model that fits the data. A truly confirmatory test would require new data and fit is expected to decrease because the model may have overfitted the data. To obtain good model fit it was necessary to include secondary loadings of items. Cross-validation can be used to confirm that these secondary loadings are robust. All of this is not particularly important because the model is exploratory and provides a first attempt at fitting a hierarchical factor model to the Openness domain.

In Figure 2, the boxes represent primary factors that capture the shared variance among three items. The first noteworthy difference from the network model is that there are 14 primary constructs compared to 10 clusters in the network model. However, NEO-Artistic (N-Artistic) is strongly related to the W/H-Artistic factor, and the two could be combined while allowing for some systematic measurement error in the NEO items. So, conceptually, there are only 13 distinct constructs. This still leaves three more constructs than the network analysis identified. The reason for this discrepancy is that there is no strict criterion at which point a cluster may reflect two related sub-clusters.

Figure 2 shows a hierarchy with four levels. For example, creativity (W/H-Creative) is linked to Openness through an unmeasured facet (Facet-2) and artistic (W/H-Artistic). This also means that creative is only weakly linked to Openness as the indirect path is the product of the three links, .9 * .7 * .5 = .3. This means that Openness explains only 9% of the variance in the creativity factor.
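This path-tracing rule, multiplying the standardized loadings along the path and squaring the product to get the variance explained, can be written as a one-line helper:

```python
from functools import reduce

def indirect_loading(path):
    """Product of standardized loadings along a path in a hierarchical model."""
    return reduce(lambda a, b: a * b, path)

# The three links from the creativity example above: Openness -> Facet-2 ->
# W/H-Artistic -> W/H-Creative, with loadings .9, .7, and .5 (from Figure 2).
loading = indirect_loading([0.9, 0.7, 0.5])
variance_explained = loading ** 2
print(loading, variance_explained)
```

The exact product is .315, which rounds to the .3 reported in the text, and squaring it gives roughly 10% of the variance, matching the "only 9%" ballpark.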

In factor analysis it is common to treat loadings greater than .6 as markers that can be used to measure a construct and to interpret its meaning. I highlighted constructs that are related .6 or higher to the Openness factor. The most notable marker is the NEO-Ideas factor with a direct loading of .9. This suggests that the core feature of Openness is being open to new ideas. Other markers are Woo's curiosity factor and, mediated by the Facet-2 factor, the HEXACO inquisitive factor. So, core features of Openness are being open to new ideas, curious, and inquisitive. Although these labels sound very similar, the actual constructs are not redundant. The other indicators that meet the .6 threshold are artistic and unconventional.

Other primary factors differ greatly in their relatedness to the Openness factor. Openness to Feeling’s relationship is particularly weak, .4 * .4 = .16, and suggests that openness to feelings is not a feature of Openness or that the NEO-Feelings items are poor measures of this construct.

Finally, it is noteworthy that the model provides no support for the Big Five Aspects Model, which postulates a level with two factors between Openness and the Openness facets. It is particularly troubling that the intellect aspect is most strongly related to Woo's intellectual efficiency factor (W-Intellect, effect size r = .6) and only weakly related to the ideas factor (N-Ideas, r = .2) and the curiosity factor (W-Curious, r = .2). As Figure 2 shows, (self-rated) intellectual abilities are a distinct facet and not a broader aspect with several subordinate facets. The openness aspect is most strongly related to artistic (W/H-Artistic, r = .4), with weaker relationships to feelings, fantasy, and ideas (all r = .2). The problem with the development of the Big Five Aspects Model was that it relied on Exploratory Factor Analysis, which is unable to test hierarchical structures in data. Future research on hierarchical structures of personality should use hierarchical factor analysis.

In conclusion, SEM is capable of fitting hierarchical models to data. It is therefore ideally suited to test hierarchical models of personality. Why is nobody doing this? Orthodoxy has relegated SEM to confirmatory analyses of models that never fit the data, because we need to explore before we can build theories. It requires high openness to new ideas, unconventionality, curiosity, and inquisitiveness to break with conventions and to use SEM as a flexible and powerful statistical tool for data exploration.

Open SOEP: Spousal Similarity in Personality

Abstract: I examined spousal similarity in personality using four waves of data over a 12-year period in the German Socio-Economic Panel. There is very little spousal similarity in actual personality traits like the Big Five. However, there is high similarity between spouses in halo rating bias.

Spousal similarity in personality is an interesting topic for several reasons. First, there are conflicting folk ideas about spousal similarity. One saying assumes that "birds of a feather flock together"; another says that "opposites attract." Second, there is large interest in the characteristics people find attractive in a mate. Do extraverts find other extraverts more attractive? Would assertive (low agreeableness) individuals prefer a mate who is as assertive as they are, or rather somebody who is submissive (high agreeableness)? Third, we might wonder whether spouses become more similar to each other over time. Finally, twin studies of heritability assume that mating is random; this assumption is questionable if spouses are similar in personality.

Given so many reasons to study spousal similarity in personality, it is surprising how little attention this topic has received. A literature search retrieved only a few articles with few citations: Watson, Beer, and McDade-Montez (2014) [20 citations], Humbad, Donnellan, Iacono, McGue, and Burt (2010) [30 citations], and Rammstedt and Schupp (2008) [25 citations]. One possible explanation for this lack of interest could be that spouses are not similar in personality traits. It is well known that psychology has a bias against null-results, that is, the lack of statistical relationships. Another possibility is that spousal similarity is small and difficult to detect in the small convenience samples that are typical in psychology. In support of the latter explanation, two of the three studies had large samples and did report spousal similarity in personality.

Humbad et al. (2010) found rather small correlations between husbands’ and wives’ personality scores in a sample of 1,296 married couples. With the exception of traditionalism, r = .49, all correlations were below r = .2, and the median correlation was r = .11. They also found that spousal similarity did not change over time, suggesting that the little similarity there is can be attributed to assortative mating (marrying somebody with similar traits).

Rammstedt and Schupp (2008) used data from the German Socio-Economic Panel (SOEP), an annual survey of representative household samples. In 2005, the SOEP included for the first time a short 15-item measure of the Big Five personality traits. The sample included 6,909 couples. This study produced several correlations greater than r = .2, for agreeableness, r = .25, conscientiousness, r = .31, and openness, r = .33. The lowest correlation was obtained for extraversion, r = .10. A cross-sectional analysis with length of marriage showed that spousal similarity was higher for couples who were married longer. For example, spousal similarity for openness increased from r = .26 for newlyweds (less than 5 years of marriage) to r = .47 for couples married more than 40 years.

A decade later it is possible to build on Rammstedt and Schupp’s results because the SOEP has collected three more waves with personality assessments in 2009, 2013, and 2017. This makes it possible to examine spousal similarity over time and to separate spousal similarity in stable dispositions (traits) and in deviations from the typical level (states).

I start with simple correlations, separately for each of the four waves using all couples that were available at a specific wave. The most notable observation is that the correlations do not increase over time. In fact, they even show a slight trend to decrease. This provides strong evidence that spouses are not becoming more similar to each other over time. An introvert who marries an extravert does not become more extraverted as a result or vice versa.
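For readers who want to reproduce such an analysis, the per-wave computation is just a Pearson correlation between husbands' and wives' scale scores. A minimal sketch; the couple data below are made up, and actual SOEP variable names and scales differ.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data: (husband_score, wife_score) pairs for one trait
# in one wave. In the real analysis each wave has thousands of couples.
wave_2005 = [(3.1, 3.4), (2.2, 2.0), (4.0, 3.6), (2.8, 3.0), (3.5, 2.9)]
husbands = [h for h, w in wave_2005]
wives = [w for h, w in wave_2005]
print(round(pearson(husbands, wives), 2))
```

Repeating this for each trait and wave yields a table of spousal correlations whose trend over waves answers the convergence question.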

[Table: spousal correlations by trait for each wave — W1 (N = 6,263), W2 (N = 5,905), W3 (N = 5,404), W4 (N = 7,805)]

I repeated the analysis using only couples who stayed together and participated in all four waves. The sample size for this analysis was N = 1,860.


The correlations were not stronger and did not increase over time.

The next analysis examined correlations over time. If spousal similarity is driven by assortment on some stable trait, husbands’ scores in 2005 should still be correlated with wives’ scores in 2017 and vice versa. To ensure comparability for different time lags, I only used couples who stayed in the survey for all four waves (N = 1,860).

[Table: correlations of husbands' 2005 Big Five scores (Neuroticism, Extraversion, Openness, Agreeableness, Conscientiousness) with wives' scores in 2005, 2009, 2013, and 2017]

The results show more similarity on the same occasion (2005/2005) than across time. Across-time correlations are all below .2 and are decreasing. However, there are some small correlations of r = .1 for Openness, Agreeableness, and Conscientiousness, suggesting some spousal similarity in the stable trait variance. Another question is why spouses show similarity in the changing state variance.
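The logic of this across-time analysis can be illustrated with simulated data. If spousal similarity has a stable (trait) component and an occasion-specific (state) component, only the trait component survives a 12-year lag: the same-occasion correlation reflects both components, while the across-time correlation isolates the trait part. All loadings below are assumptions for illustration, not SOEP estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Stable traits: spouses share a little trait variance (assortative mating).
shared_trait = rng.normal(size=n)
h_trait = 0.3 * shared_trait + rng.normal(scale=0.7, size=n)
w_trait = 0.3 * shared_trait + rng.normal(scale=0.7, size=n)

def occasion(h_trait, w_trait):
    """One measurement occasion: trait plus a state component that is
    correlated within the occasion but independent across occasions."""
    shared_state = rng.normal(size=n)
    h = h_trait + 0.4 * shared_state + rng.normal(scale=0.6, size=n)
    w = w_trait + 0.4 * shared_state + rng.normal(scale=0.6, size=n)
    return h, w

h05, w05 = occasion(h_trait, w_trait)   # "2005" wave
h17, w17 = occasion(h_trait, w_trait)   # "2017" wave

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

same_occasion = r(h05, w05)  # trait similarity + state similarity
across_time = r(h05, w17)    # only the stable trait similarity survives
```

The simulated pattern (same-occasion correlation clearly larger than the across-time correlation) matches the pattern in the table above.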

There are two possible explanations for spousal similarity in personality state variance. One explanation is that spouses’ personality really changes in sync, just like their well-being changes in the same direction over time (Schimmack & Lucas, 2010). Another explanation is that spouses’ self-ratings are influenced by rating biases and that these rating biases are correlated (Anusic et al., 2009). To test these alternative hypotheses, I fitted a measurement model to the Big Five scales that distinguishes halo bias in personality ratings from actual variance in personality. I did this for the first and the last wave (2005, 2017) to separate similarity in the stable trait variance from similarity in state variance.

The key finding is that there is high spousal similarity in halo bias. Some couples are more likely to exaggerate their positive qualities than others. After removing this bias, there is relatively little spousal similarity for the actual trait variance.
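How correlated halo bias can masquerade as spousal similarity is easy to demonstrate with simulated data. In this sketch the halo component is known by construction; in the real analysis it has to be estimated with a latent-variable (measurement) model, and all loadings here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Couples share halo bias (both exaggerate their positive qualities)...
shared_halo = rng.normal(size=n)
h_halo = 0.6 * shared_halo + rng.normal(scale=0.5, size=n)
w_halo = 0.6 * shared_halo + rng.normal(scale=0.5, size=n)

# ...but have essentially independent true trait levels.
h_trait = rng.normal(size=n)
w_trait = rng.normal(size=n)

# Self-ratings mix true trait variance and halo bias.
h_rating = h_trait + h_halo
w_rating = w_trait + w_halo

def r(x, y):
    return np.corrcoef(x, y)[0, 1]

raw_similarity = r(h_rating, w_rating)  # inflated by correlated halo
debiased_similarity = r(h_rating - h_halo, w_rating - w_halo)  # close to zero
```

The raw ratings show substantial spousal similarity even though the true traits are uncorrelated, which is exactly the pattern the measurement model detects in the SOEP data.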

[Table: spousal similarity for the trait factor and the state factors in 2005 and 2017, separately for each Big Five factor and halo]

In conclusion, spouses are not very similar in their personality traits. This may explain why this topic has received so little attention in the scientific literature. Null-results are often considered uninteresting. However, these findings do raise some questions. Why don't extraverts marry extraverts, and why don't conscientious people marry conscientious people? Wouldn't they be happier with somebody who has a similar personality? Research with the SOEP data suggests that this is also not the case. Maybe the Big Five traits are not as important for marital satisfaction as we think. Maybe other traits are more important. Clearly, human mating is not random, but it is also not based on matching personality traits.

We don’t forget and until Bargh apologizes we will not forgive

John Bargh is a controversial social scientist with a knack for getting significant results when others cannot (Bargh in Bartlett, 2012). When somebody failed to replicate his most famous elderly-priming results (he published two exact replication studies, 2a and 2b, that were both successful, p < .05), he wrote a blog post. The blog post blew up in his face and he removed it. For a while, it looked as if this historic document was lost, but it has been shared online. Here is another link to it: Nothing in their heads

Personality x Situation Interactions: A Cautionary Note

Abstract: No robust and reliable interaction effects of the Big Five personality traits and unemployment on life-satisfaction in the German Socio-Economic Panel.

With the exception of the late Walter Mischel, Lee Ross, and Dick Nisbett, we are all interactionists (ok, maybe Costa & McCrae are guilty of dispositionism). As Lewin told everybody, behaviour is a function of the person and the situation, and the a priori probability that the interaction effect between the two is exactly zero (the nil-hypothesis) is pretty much zero. So, our journals should be filled with examples of personality x situation interactions. Right? But they are not. Every once in a while, when I try to update my lecture notes and look for good examples of a personality x situation interaction, I can't find any. One reason is of course the long history of studying situations and traits separately. However, experience-sampling studies emerged in the 1980s, and their data are ideally suited to look for interaction effects. Another problem is that interaction effects can be difficult to demonstrate because you need large samples to get significant results.

This time I had a solution to my problems. I have access to the German Socio-Economic Panel (SOEP) data. The SOEP has a large sample (N > 10,000), measured the Big Five four times over a 12-year period and many measures of situations like marriage, child birth, or unemployment. So, I could just run an analysis and find a personality x situation interaction. After all, in large samples, you always get p < .05. Right? If you think so, you might be interested to read on and find out what happened.

The Big Five were measured for the first time in 2005 (wave v). I picked unemployment and neuroticism as predictors because it is well-known that neuroticism is a personality predictor of life-satisfaction and unemployment is a situational predictor of life-satisfaction. It also made sense that neurotic people might respond more strongly to a negative life-event. However, contrary to these expectations, the interaction was far from significant (p = .5), while the main effects of unemployment (-1.5) and neuroticism (-.5) were highly significant. The effect of unemployment is equivalent to a change of three standard deviations in neuroticism.
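The model behind this analysis is an ordinary regression with a product term: life-satisfaction regressed on unemployment, neuroticism, and their interaction. Here is a minimal sketch with simulated data generated to mirror the reported result (main effects of -1.5 and -0.5, no true interaction); the baseline of 7 on a 0-10 life-satisfaction scale and the 10% unemployment rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000

neuro = rng.normal(size=n)                   # standardized neuroticism
unemp = (rng.random(n) < 0.1).astype(float)  # ~10% currently unemployed

# Life-satisfaction with main effects only (no true interaction).
ls = 7.0 - 1.5 * unemp - 0.5 * neuro + rng.normal(size=n)

# Design matrix: intercept, unemployment, neuroticism, product term.
X = np.column_stack([np.ones(n), unemp, neuro, unemp * neuro])
beta, *_ = np.linalg.lstsq(X, ls, rcond=None)

b_unemp, b_neuro, b_inter = beta[1], beta[2], beta[3]
```

With these coefficients, losing one's job lowers life-satisfaction as much as a three-standard-deviation increase in neuroticism (1.5 / 0.5 = 3), which is why the main effects dwarf any plausible interaction.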

Undeterred, I looked for interactions with the other Big Five dimensions. Surely, I could come up with an explanation for an interaction once I found one. To make things simple, I added all five interactions to the model and, hooray, a significant interaction with conscientiousness popped up, p = .02.

Was I the first to discover this? I quickly checked for articles and of course somebody else had beaten me to the punch. There it was. In 2010, Boyce, Wood, and Brown had used the SOEP data to show that conscientious people respond more strongly to the loss of a job.

Five years later, a follow-up article came to the same conclusion.

A bit skeptical of p-values that are only just significant, I examined whether the interaction effect could be replicated. I ran the same analysis with the 2009 data as I had with the 2005 data.

The effect size was cut in half and the p-value was no longer significant, p = .25. However, the results did replicate the finding that none of the other four Big Five dimensions moderated the effect of unemployment.

So, what about the 2013 wave? Again not significant, although the effect size is again negative.

And what happened in 2017? A significant effect, hooray again, but this time the effect is positive.

Maybe the analyses are just not powerful enough. To increase power, we can include prior life-satisfaction as a predictor variable to control for some of the stable trait variance in life-satisfaction judgments. We are now only trying to predict changes in life-satisfaction in response to unemployment. In addition, we can include prior unemployment to make sure that the effect of unemployment is not due to some stable third variable.
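The lagged model just adds two predictors to the design matrix: prior life-satisfaction (to absorb stable trait variance) and prior unemployment (to rule out stable third variables). A sketch with simulated data, again generated with no true interaction; all coefficients and rates are assumptions for illustration, not SOEP estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10000

stable = rng.normal(size=n)                        # stable trait variance in LS
cons = rng.normal(size=n)                          # standardized conscientiousness
prior_unemp = (rng.random(n) < 0.1).astype(float)
curr_unemp = (rng.random(n) < 0.1).astype(float)

prior_ls = 7.0 + stable - 1.5 * prior_unemp + rng.normal(size=n)
curr_ls = 7.0 + stable - 1.5 * curr_unemp + rng.normal(size=n)  # no interaction

# Predict current LS from prior LS, prior and current unemployment,
# conscientiousness, and the interaction of interest.
X = np.column_stack([
    np.ones(n), prior_ls, prior_unemp, curr_unemp,
    cons, curr_unemp * cons,
])
beta, *_ = np.linalg.lstsq(X, curr_ls, rcond=None)

b_curr_unemp = beta[3]    # recovers the negative effect of current unemployment
b_interaction = beta[5]   # stays near zero when no true interaction exists
```

Controlling for prior life-satisfaction shrinks the residual variance, so this design has more power to detect a conscientiousness x unemployment interaction if one exists.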

We see that it is current unemployment that has a negative effect on life-satisfaction. Prior unemployment actually has a positive effect, suggesting some adaptation to long-term unemployment. Most important, the interaction between conscientiousness and current unemployment is not significant, p = .68.

The interaction was also non-significant in 2013, p = .69.

And there was no significant interaction in 2017, p = .38.

I am sure that I am not the first to look at this, especially given two published articles that reported a significant interaction. However, I suspect that nobody thought about sharing these results because the norm in psychology is still to report significant results. However, the key finding here appears to be that the Big Five traits do not systematically interact with a situation in explaining an important outcome.

So, I am still looking for a good demonstration of a personality x situation interaction that I can use for my lecture in the fall. Meanwhile, I know better than to use the published studies as an example.

Open Letter about Systemic Racism to the Editor of SPPS

Dear Margo Monteith,

It is very disappointing that you are not willing to retract an openly racist article that was published in your journal Social Psychological and Personality Science (SPPS) when Simine Vazire was editor of the journal and Lee Jussim was the action editor of the article in question (Cesario, Johnson, & Terrill, 2019). I have repeatedly pleaded with you to retract the article that draws conclusions on the basis of false assumptions. I am even more stunned by your decision because you rejected my commentary on this racist article with the justification that a better criticism had been submitted. This criticism was just published (Ross et al., 2020). It makes the same observation that I made in my critique; that is, the conclusion that there is no racial bias in policing and the use of force rests entirely on an invalid assumption. The original authors simply assume that police officers only encounter violent criminals, or that they encounter only violent criminals when they use deadly force.

Maybe you are not watching the news, but the Black Lives Matter movement started because police often use deadly force against non-violent African Americans. In some cases, this is even documented on video. Please watch the murders of Tamir Rice, George Floyd, Philando Castile, and Eric Garner and then tell their families and friends that police only kill violent criminals. That is what SPPS is telling everybody with the mantle of scientific truth, but it is a blatantly false claim based on racist assumptions. So, why are you not retracting this offensive article?

Philando Castile:

Tamir Rice:

Eric Garner:

George Floyd:

So, why are you not retracting an article that makes an obviously false and offensive assumption? Do you think that a retraction would reflect badly on the reputation of your journal? In that case, you are mistaken. Research shows that journals that retract articles with false conclusions have higher impact factors and are more prestigious than journals that try to maintain a flawless image by avoiding retractions of bad science (Nature). So, your actions are not only offensive, but also hurt the reputation of SPPS and ultimately our science.

Your justification for not retracting the article is unconvincing.

“Just how to analyze data such as this is debated, mostly in criminology journals. (One can wonder what psychology was present in Cesario et al.’s study that led to publication in SPPS, but that’s another matter.) Cesario et al. made the important point that benchmarking with population data is problematic. Their methodology was imperfect. Ross et al. made important improvements. If one is interested in this question of police bias with benchmarking, the papers bring successive advances.”

Your response implies that you did not fully understand Ross et al.’s criticism of the offensive article. The whole approach of “benchmarking” is flawed. So, publishing an article that introduces a flawed statistical approach from criminology to psychology is dangerous. What if we would start using this approach to study other disparities? Ross et al. show that this would be extremely harmful to psychological science. It is important to retract an article that introduces this flawed statistical approach to psychologists. As an editor it is your responsibility to ensure that this does not happen.

It is particularly shocking and beyond comprehension that you resist retraction at the very same time that many universities and academics are keenly aware of the systemic racism in academia. This article about an issue that affects every African American was based on research funding to White academics, reviewed by White academics, approved by White academics, and now defended and not retracted by a White academic. How does your action promote diversity and inclusion? It is even more surprising that you seem to be blind to this systemic racism in the publication of this racist article given your research on prejudice and the funding you received to study these issues (CV). Can you at least acknowledge that it is very offensive to Black people to attribute their losses of lives entirely to violent crime?

Ulrich Schimmack

Systemic Racism at Michigan State University

This is how three professors at MSU talk about innocent Black people being killed by police (podcast transcript at 25 minutes and 40 seconds into the clip).

Their discussion of tragic deaths suggests that Black lives don’t matter to Joseph Cesario (MSU), Steve Hsu (MSU), and Corey Washington (MSU).

Here is what those rare events look like. I dare everybody to watch them and then reflect on the words of these privileged professors.

Philando Castile:

Tamir Rice:

Eric Garner:

George Floyd:

And yes, it doesn’t only happen to Black people, but contrary to the statistically flawed work by Cesario, young Black unarmed men are more often the targets of police brutality and the victims of lethal force errors.

See also:

When Right-Wing News Write About Race and Police

The right-wing magazine Quillette just published an article by John McWhorter, an associate professor in the linguistics department at Columbia University, with the title “Racist Police Violence Reconsidered.” Given his training in writing, he knows how to draw his readers in with an emotional story about a White victim of lethal use of force to make the point that police sometimes kill White people, too. This is followed by the statement that “plenty of evidence indicates, however, that racism is less important to understanding police behavior than is commonly supposed”.

In a scientific article, this would be the time to mention the scientific evidence that is supposed to support this claim. But McWhorter is no scientist. He is a writer and cannot be held to the scientific standards of criminologists and other social scientists. With one sentence, a fact has been created. The idea that police are racially biased and disproportionately kill African Americans is wrong. But why does everybody believe it to be true? McWhorter has a ready explanation for this. The biased liberal, reverse-racist media cover police brutality only when the officer is White and the victim is Black. “Had Tony Timpa been black, we would all likely know his name by now. Had George Floyd been white, his name would likely be a footnote, briefly reported in Minneapolis local news and quickly forgotten.”

Well trained in propaganda, McWhorter then presented more cases of White victims in equal numbers to Black victims. For every Black victim, there is a White victim in his narrative, which is based on his personal selection of cases. After creating the illusion that there is a White victim for every Black victim, he is ready to repeat his claim that we have been manipulated by the liberal media: “So, the perception that the police regularly kill black people under circumstances in which white people would be merely disciplined is in fact a misperception.”

But is it a misperception? That would require actual scientific information about the influence of race on lethal use of force by police officers in the US. This evidence is reviewed after the conclusion has already been offered that the common assumption of racial bias against African Americans is a misperception.

McWhorter next explains correctly that African Americans are a minority in the United States. If police were unbiased in the lethal use of force, we would expect a lot more victims to be White than Black. He then correctly states that “it remains true that black people are killed at a rate disproportionate to their percentage of the population.”

So, it is NOT a misperception that police kill disproportionately more African Americans. There is racial disparity in the use of force. This invalidates the claim that we all believe that racial bias exists because we have been manipulated by the liberal media.

McWhorter then makes a distinction between racial disparity and racial bias. “However, these figures are not necessarily evidence of police racism. According to the Washington Post‘s database, over 95 percent of the people fatally shot by police officers in 2019 were male, and no serious-minded person argues that this is evidence of systemic misandry. So what, then, accounts for the disproportionate representation of black men among those killed by cops?”

This is a controversial topic that has been examined in numerous articles by social scientists in scientific journals. But McWhorter does not cite these studies, presumably because he lacks the training to understand the sometimes complicated statistical methods that have been used in these articles.

Like a novelist, he creates facts with the stroke of a pen: “The socioeconomic gap between blacks and whites is doubtless an important contributing factor” and “This disparity in poverty rates means black people are also disproportionately represented in rates of violent crime.” Here we go again. Police are not racially biased. The real reason why they kill more Black people is because Black people are more criminal. Blame the victim. To give this biased narrative some credibility, McWhorter cites only one scientific article that supports his story: “Contrary to his expectations, Harvard economist Roland Fryer has found that while white men are actually more likely to be killed by cops,” as if an economist is more credible than criminologists or other scientists because he is at Harvard. That is not how science works. You also have to cite evidence that contradicts your claims (young unarmed non-suicidal male victims of fatal use of force are 13 times more likely to be Black than White).

In the end McWhorter softens his stance a bit. “This disparity cannot explain every fatal police shooting,” “This is not to say that race has nothing to do with policing issues in America.”   But these sentences are mere rhetorical devices to signal that the author is balanced and reasonable, when the truth is that the author is ignorant about the science on racial bias in policing, including use of force.

I have no reason to believe that McWhorter wrote this terrible article because he is motivated by racism, but it is not clear to me why McWhorter wrote such a biased article that is so hurtful to the many African Americans who are traumatized by the callous killing of innocent African Americans. All I can say is that McWhorter lacks the competence to write about this topic because he is either too lazy or not trained to follow the actual science on this topic. In Germany we say “Schuster, bleib bei deinen Leisten” (“Every man to his trade.”). Please follow this good advice, Dr. McWhorter.

Can We Measure Racism? Yes We Can

A famous quote states that something that cannot be measured does not exist. This is of course not true, but if we want to move from anecdotal evidence to scientific evidence and theories of racism, we need valid measures of racism.

Social psychology has a long history of developing measures of racism and today there are dozens of different measures of racism. Unfortunately, psychologists are better at developing new measures than at validating existing ones. This makes research on racism vulnerable to criticism that racism measures are invalid or biased (Feldman & Huddy, 2005; Zigerell, 2015).

Take the item “Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without special favors” as an example. The item is one of several items that is used to measure a form of racism called symbolic racism.

Feldman and Huddy (2005) argue that items like this one have two components. A purely racist component where White people do not see Black people as equal citizens and a purely ideological component that opposes policies that favor any particular group, even if this group is disadvantaged by a history of racism. Whether the latter component is itself racist or not is not the topic of this blog post. My focus is rather on the separation of the two components. How can we separate agreement to the item that is based on racism from endorsement of the item for purely political reasons?

One solution to this problem is to see how endorsement of items with political content is related to items that have no political content. Using a statistical method called factor analysis it is then possible to separate the racial and the ideological component and to examine how much political orientation is related to the two components.
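The logic of this separation can be shown with simulated data. In this sketch the two latent components are known by construction, so an ordinary regression can recover how much of the item reflects each one; in real data the components have to be estimated from multiple items with factor analysis. All loadings (0.5 on racism, 0.4 on ideology, and the .4 correlation between the components) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20000

# Latent components: racism and conservative ideology, moderately correlated.
ideology = rng.normal(size=n)
racism = 0.4 * ideology + rng.normal(scale=0.9, size=n)

# A symbolic-racism style item loads on BOTH latent components.
item = 0.5 * racism + 0.4 * ideology + rng.normal(size=n)

# With the component scores in hand, regression separates the two
# contributions even though the components are correlated.
X = np.column_stack([np.ones(n), racism, ideology])
beta, *_ = np.linalg.lstsq(X, item, rcond=None)
racism_loading, ideology_loading = beta[1], beta[2]
```

The recovered loadings show that endorsement of the item is driven by both components, which is why the item alone cannot tell us whether a respondent is racist, politically conservative, or both.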

Indirect Measures of Racism

The problem with direct measures of racism is that open admission of racial bias has become less acceptable over time. This makes it harder to measure racism with items like “What about having a close relative marry a Black person? Would you be very in favor of it happening, somewhat in favor, neither in favor nor opposed to it happening, somewhat opposed, or very opposed to it happening?” Respondents may be unwilling to report their true feelings about this issue, especially when the interviewer is African American (Schaeffer, 1980; Schimmack, 2020).

Modern psychological testing with computers has made it possible to avoid these problems by measuring racism with computerized tasks that rely on participants’ behavior in response to racial stimuli. There are several such tasks, including the evaluative priming task, the affective misattribution task, and the popular Implicit Association Test (IAT). Unfortunately, the IAT has become known as a measure of implicit bias or implicit racism that is distinct from racism that can be measured with self-report measures. I have argued that there is no evidence that people can hide their feelings towards African Americans from themselves. It is more useful to see these tasks as alternative measures of racism that are less susceptible to faking. This does not mean that these tasks are perfect measures of racism, because the use of computerized tasks creates new problems. Thus, there is no perfect measure of racism, but all valid measures of racism should be positively correlated with each other, and the shared variance among these measures is likely to reflect variation in racism. The interesting question is whether political orientation is related to the shared variance among a variety of direct and indirect racism measures.
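To make the IAT less abstract: its standard score is a standardized latency difference between the two sorting conditions. Here is a deliberately simplified sketch of the D-score idea (the published scoring algorithm of Greenwald et al., 2003, additionally handles error trials and very fast or slow responses, which this sketch omits); the latency values are made up for illustration.

```python
import numpy as np

def iat_d_score(compatible_rts, incompatible_rts):
    """Simplified IAT D score: difference between block mean latencies,
    divided by the pooled standard deviation of all latencies."""
    compatible_rts = np.asarray(compatible_rts, dtype=float)
    incompatible_rts = np.asarray(incompatible_rts, dtype=float)
    pooled_sd = np.std(
        np.concatenate([compatible_rts, incompatible_rts]), ddof=1
    )
    return (incompatible_rts.mean() - compatible_rts.mean()) / pooled_sd

# Example: slower responding in the incompatible block yields a positive D.
rng = np.random.default_rng(6)
compat = rng.normal(700, 100, size=100)    # latencies in ms
incompat = rng.normal(800, 100, size=100)
d = iat_d_score(compat, incompat)          # positive, roughly around 1
```

Because D is a within-person standardized difference, it is comparable across participants with different overall response speeds, which is one reason the measure became popular.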


The data come from a study by Bar-Anan and Vianello (2018). The data have also been used in my critique of the IAT as a measure of implicit bias (Schimmack, 2020). The study assessed political orientation and racism with multiple measures. Political orientation was also measured with the standard and the brief IAT. In addition, participants reported whether they voted Republican or Democrat. Only White participants who reported voting were included in the analysis.

Racism was measured with the standard IAT, the brief IAT, the evaluative priming task, the Affective Misattribution Task, a direct rating of preference for White or Black people, and the Modern Racism Scale. Like other measures that have been criticized, the Modern Racism Scale mixes racism and political ideology.

The interesting question is how much political orientation is related to the unique variance in the Modern Racism Scale that is not shared with other racism measures, and how much it is related to the shared variance with other racism measures.


The results show two clearly identified factors. The strong relationship between voting and the Republican factor (rep) shows that political orientation can be measured well with a direct question. In contrast, racism is more difficult to measure. The best measure in this study is the direct preference rating (r_att), which correlates .6 with the pro-White factor. But even this relationship implies that only about a third of the variance in the actual ratings (.6² = .36) reflects racism. The rest of the variance is measurement error. So, there is no gold standard or perfect way to measure racism; there are only multiple imperfect ways. The results also show that the controversial Modern Racism Scale (mrs) reflects both racism (.444) and political orientation (.329). This shows that Republicans score high on Modern Racism in part because they reject social policies that favor minority groups, independent of their attitudes towards Black Americans. However, the figure also shows that Republicans are more racist, as reflected in the relationship between the Republican and Racism factors (.416).

It is important to note that these results cannot be used to identify individuals or to claim that a particular Republican is a racist. The results do show, however, that people who vote Republican are more likely to score higher on a broad range of racism measures, whether these measures mention a political agenda or not.


Critics of racism research by social psychologists have argued that the research is biased because many social psychologists are liberal. The accusation is that social psychologists have created biased measures that conflate liberal policies with bigotry. Here I show that these critics have a valid point and that high scores on scales like the symbolic racism scale and the modern racism scale are influenced by attitudes towards egalitarian policies. However, I also showed that Republicans are more racist when racism is measured with a broad range of measures that have only racism as a common element.

Conservatives may be displeased by this finding, but recent events in 2020 have made it rather obvious that some Americans are openly racist and that these Americans are also openly supporting Trump. The real question for Republicans who oppose racism is how they can get rid of racism in their party.