
Police Shootings and Race in the United States

The goal of the social sciences and social psychology is to understand human behavior in the real world. Experimental social psychologists use laboratory experiments to study human behavior. The problem with these studies is that some human behaviors cannot be studied in the laboratory for ethical or practical reasons. Police shootings are one of them. In such cases, social scientists have to rely on observations of these behaviors in the real world. The problem is that it is much harder to draw causal inferences from these studies than from laboratory experiments.

A team of social psychologists examined whether police shootings in the United States are racially biased (i.e., whether victims of police shootings are more likely to be non-White, e.g., Black or Hispanic). This is an important political issue in the United States. The abstract of their article states their findings.

The abstract starts with a seemingly clear question: “Is there evidence of a Black-White disparity in death by police gunfire in the United States?” However, even this question is not clear, because it is not clear what we mean by disparity. Disparity can mean “a lack of equality” or “a lack of equality that is unfair” (Cambridge Dictionary).

There is no doubt that Black citizens of the United States are more likely to be killed by police gunfire than White citizens. The authors themselves confirmed this in their analysis: they found that the odds of being killed by police are three times higher for Black citizens than for White citizens.

The statistical relationship implies that race is a contributing causal factor to being killed by police. However, the statistical finding does not tell us why or how race influences police shootings. In psychological research this question is often framed as a search for mediators; that is, intervening variables that are related to race and to police shootings.

In the public debate about race and police shooting, two mediating factors are discussed. One potential mediator is racial bias that makes it more likely for a police officer to kill a Black suspect than a White suspect. Cases like the killing of Tamir Rice or Philando Castile are used as examples of innocent Black citizens being killed under circumstances that may have led to a different outcome if they had been White. Others argue that tragic accidents also happen with White suspects and that these cases are too rare to draw scientific conclusions about racial bias in police shootings.

Another potential mediator is that there is also a disparity between Black and White US citizens in violent crimes. This is the argument put forward by the authors.

When adjusting for crime, we find no systematic evidence of anti-Black disparities in fatal shootings, fatal shootings of unarmed citizens, or fatal shootings involving identification of harmless objects.

This statement implies that the authors conducted a mediation analysis, which uses statistical adjustment for a potential mediator to examine whether a mediator explains the relationship between two other variables.

In this case, racial differences in crime rates are the mediator and the claim is that once we take into account that Black citizens are more involved in crimes and involvement in crimes increases the risk of being killed by police, there are no additional racial disparities. If a potential mediator fully explains the relationship between two variables, we do not need to look for additional factors that may explain the racial disparity in police shootings.
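What statistical adjustment for a mediator means can be sketched with simulated data. The sketch below is purely illustrative and uses made-up variable names and effect sizes, not the article's data; it shows that when a mediator fully transmits a group difference, adjusting for the mediator drives the group coefficient to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical full mediation: the group difference in the outcome
# runs entirely through the mediator (no direct effect by construction).
group = rng.integers(0, 2, n)                     # 0/1 group indicator
mediator = 1.0 * group + rng.normal(0, 1, n)      # group -> mediator
outcome = 0.8 * mediator + rng.normal(0, 1, n)    # mediator -> outcome

def ols(predictors, y):
    """Least-squares coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    return np.linalg.lstsq(X, y, rcond=None)[0]

total = ols([group], outcome)[1]                  # unadjusted group effect (~0.8)
direct = ols([group, mediator], outcome)[1]       # adjusted group effect (~0.0)
print(round(total, 2), round(direct, 2))
```

In a real mediation analysis the key question is whether the adjusted (direct) effect shrinks toward zero; that is the inference the article's abstract invites readers to draw.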

Readers may be forgiven if they interpret the conclusion in the abstract as stating exactly that.

Exposure to police given crime rate differences likely accounts for the higher per capita rate of fatal police shootings for Blacks, at least when analyzing all shootings.

The problem with this article is that the authors are not examining the question that they state in the abstract. Instead, they conduct a number of hypothetical analyses that start with the premise that police officers only kill criminals. They then examine racial bias in police shootings under this assumption.

For example, in Table 1 they report that the NIBRS database recorded 135,068 severe violent crimes by Black suspects and 59,426 severe violent crimes by White suspects in the years 2015 and 2016. In the same years, 475 Black citizens and 1,168 White citizens were killed by police. If we assume that all of the individuals killed by police were suspected of a violent crime recorded in the NIBRS database, we see that White suspects are much more likely to be killed by police (1,168 / 59,426 = 197 out of 10,000) than Black suspects (475 / 135,068 = 35 out of 10,000). The odds ratio is 5.59, which means that for every Black suspect killed by police, more than five White suspects are killed. This is shown in Figure 1 of the article as the most extreme bias against White criminals. However, most other crime statistics also lead to the conclusion that White criminals are more likely to be shot by police than Black criminals.
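The arithmetic behind these numbers is a simple rate comparison. A minimal Python check of the figures quoted above from Table 1:

```python
# Recomputing the rates implied by Table 1 of the article
# (NIBRS severe violent crimes and police killings, 2015-2016).
black_crimes, white_crimes = 135_068, 59_426
black_killed, white_killed = 475, 1_168

# Killings per recorded crime, under the article's implicit assumption
# that everyone killed by police was a recorded violent-crime suspect.
black_rate = black_killed / black_crimes   # ~35 per 10,000
white_rate = white_killed / white_crimes   # ~197 per 10,000

odds_ratio = white_rate / black_rate       # ~5.59
print(round(black_rate * 10_000), round(white_rate * 10_000), round(odds_ratio, 2))
```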

This is a surprising finding, to say the least. While we started with the question why police officers in the United States are more likely to kill Black citizens than White citizens, we end with the conclusion that police officers only kill criminals and are more likely to kill White criminals than Black criminals. I hope I am not alone in noticing a logical inconsistency. If police do not shoot innocent citizens and shoot more White criminals than Black criminals, we should see that White US citizens are killed more often by police than Black citizens. But that is not the case. We started our investigation with the question why Black citizens are killed more often by police than White citizens. The authors' statistical analysis does not answer this question. Their calculations are purely hypothetical, and their conclusions suggest only that their assumptions are wrong.

The missing piece is information about the contribution of crime to the probability of being killed by police. Without this information it is simply impossible to examine to what extent racial differences in crime contribute to racial disparities in police shootings. Consequently, it is also impossible to say anything about other factors, such as racial bias, that may also contribute to racial disparities in police shootings. This means that the article makes no empirical contribution to the understanding of racial disparities in police shootings.

The fundamental problem of the article is that the authors think they can simply substitute populations. Rather than examining killings in the population of citizens, on which the statistic is based, they think they can replace it with another population, the population of criminals. But the death counts apply to the population of citizens, not to the population of criminals.

In this article, we approached the question of racial disparities in deadly force by starting with the widely used technique of benchmarking fatal shooting data on population proportions. We questioned the assumptions underlying this analysis and instead proposed a set of more appropriate benchmarks given a more complete understanding of the context of police shootings.

The authors talk about benchmarking and discuss the pros and cons of different benchmarks. However, the notion of a benchmark is misleading. We have a statistic about the number of police killings in the population of the United States. This is not a benchmark; it is a population. In this population, Black citizens are disproportionately more likely to be killed by police. That is a fact. It is also a fact that in the population of US citizens more crimes are committed by Black citizens (discussing the reasons for this is another topic that is beyond this criticism of the article). Again, this is not a benchmark; it is a population statistic. The authors now use the incident rates of crime to ask how many Black or White criminals are being shot by police. However, the population statistics do not provide that information.

We could also use other statistics that lead to different conclusions. For example, White US citizens own disproportionately more guns than Black citizens. If we used that statistic to “benchmark” police shootings, we would see a bias to shoot more Black gun owners than White gun owners. But we do not really see that in the data, because we have no information about the death rates of gun owners, just as the article provides no information about the death rates of criminals and innocent citizens. Thus, the fundamental flaw of the article is the idea that we can simply take two population statistics and compute conditional probabilities from them. This is simply not possible.
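The point that two population statistics cannot pin down a conditional probability can be shown with a toy example. The numbers below are entirely made up: two hypothetical joint distributions of (criminal, killed) share exactly the same marginal totals, yet imply different conditional probabilities, so the marginals alone cannot tell us which world we are in.

```python
# Two hypothetical joint distributions with IDENTICAL marginal totals
# (100 killed, 1,000 criminals, 10,000 citizens in both scenarios)
# but DIFFERENT conditional probabilities of being killed given crime.

# Scenario A: every killing involves a criminal.
a = {"crim_killed": 100, "crim_alive": 900, "innoc_killed": 0, "innoc_alive": 9000}
# Scenario B: half of all killings involve innocent citizens.
b = {"crim_killed": 50, "crim_alive": 950, "innoc_killed": 50, "innoc_alive": 8950}

for s in (a, b):
    killed = s["crim_killed"] + s["innoc_killed"]       # 100 in both scenarios
    criminals = s["crim_killed"] + s["crim_alive"]      # 1,000 in both scenarios
    p_killed_given_criminal = s["crim_killed"] / criminals
    print(killed, criminals, p_killed_given_criminal)
```

Knowing only the marginals (how many people were killed, how many crimes were committed) leaves P(killed | criminal) undetermined; the joint counts are the missing piece the article does not have.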

The authors caution readers that their results are not conclusive: “The current research is not the final answer to the question of race and police use of deadly force.” In fact, the results presented in this article do not even begin to address the question. The data simply provide no information about the causal factors that produce racial inequality in police shootings.

The authors then contradict themselves and reach a strong and false conclusion.

Yet it does provide perspective on how one should test for group disparities in behavioral outcomes and on whether claims of anti-Black disparity in fatal police shootings are as certain as often portrayed in the national media. When considering all fatal shootings, it is clear that systematic anti-Black disparity at the national level is not observed.

They are wrong on two counts. First, their analysis is statistically flawed and leads to internally inconsistent results: police only kill criminals and are more likely to kill White criminals, which does not explain why we see more Black victims of police shootings. Second, even if their study had shown that there is no evidence of racial inequality, we could not infer that racial biases do not exist. Absence of evidence is not the same as evidence of absence. Cases like the tragic death of Tamir Rice may be rare, and they may be too rare to be picked up in a statistic, but that does not mean they should be ignored.

The rest of the discussion section reflects the authors' personal views more than anything that can be learned from the results of this study. For example, the claim that better training will produce no notable improvements is pure speculation, and it ignores a literature on training in the use of force and its benefits for all citizens. The key to police training in shooting situations is for police officers to focus on relevant cues (e.g., weapons) and to ignore irrelevant factors such as race. Better training can reduce killings of Black and White citizens.

This suggests that department-wide attempts at reform through programs such as implicit bias training will have little to no effect on racial disparities in deadly force, insofar as
officers continue to be exposed after training to a world in which different racial groups are involved in criminal activity.

It is totally misleading to support this claim with trivial intervention studies with students.

This assessment is consistent with other evidence that the effects of such interventions are short lived (e.g., Lai, 2017).

And once more the authors attribute racial differences in police shootings to crime rates, ignoring that the influence of crime rates on shootings is their own assumption and not an empirical finding supported by their statistical analyses.

Note that this analysis does not blame unarmed individuals shot by police for their own behavior. Instead, it highlights the difficulty of eliminating errors under conditions of uncertainty when stereotypes may bias the decision-making process. This difficulty is amplified when the stereotype accurately reflects the conditional probabilities of crime across different racial groups.

As in many articles, the limitations section does acknowledge limitations, but the authors pretend that these limitations do not undermine their conclusions.

One potential flaw is if discretionary stops by police lead to a higher likelihood of being shot in a way not captured by our crime report data sets. If officers are more likely to stop and frisk a Black citizen, for example, then officers might be more likely to enter into a deadly force situation with Black citizens independent of any actual crime rate differences across races. Online Supplemental Material #5 presents some indirect data relevant to this possibility. Here, we simply note that the number of police shootings that start with truly discretionary stops of citizens who have not violated the law is low (~5%) and probably do not meaningfully impact the analyses.

There are about 1,000 police killings a year in the United States. If 5% of police killings start without any violation of the law, about 50 people are killed every year by mistake. This may not be a meaningful number to statisticians for their data analysis, but it is a meaningful number for the victims and their families. In no other Western country are citizens killed in such numbers by their police.

The final conclusion shows that the article lacks any substantial contribution.

At the national level, we find little evidence within these data for systematic anti-Black disparity in fatal police deadly force decisions. We do not discount the role race may play in individual police shootings; yet to draw on bias as the sole reason for population-level disparities is unfounded when considering the benchmarks presented here. We hope this research demonstrates the importance of unpacking the underlying assumptions inherent to using benchmarks to test for outcome disparities.

The authors continue their misguided argument that we should use crime rates rather than population proportions to examine racial bias. Once more, this is nonsense. It is a fact that Black citizens are more likely to be killed by police than White citizens. It is worthwhile to examine which causal factors contribute to this relationship, but the authors' approach cannot answer this question because they lack information about the contribution of crime rates to police shootings.

The statement that their study shows that racial bias of police officers is not the only reason is trivial and misleading. The authors imply that crime rates alone explain the racial disparity and even come to the conclusion that police are more likely to kill White suspects. In reality, crime rates and racial biases are likely both factors, but we need proper data to tease them apart, and this article does not do that.

I am sure that the authors truly believe that they made a valuable scientific contribution to an important social issue. However, I also strongly believe that they failed to do so. They start with the question “Is there evidence of a Black-White disparity in death by police gunfire in the United States?” The answer to their question is an unequivocal yes. The relevant statistics are the odds of being killed by police for Black and White US citizens, and these statistics show that Black citizens are at greater risk of being killed by police than White citizens. The next question is why this disparity exists. There will be no simple and easy answer to this question. This article suggests that a simple answer is that Black citizens are more likely to be criminals. This answer is not only too simple, it is also not supported by the authors' statistical analysis.

Scientists are human, and humans make mistakes. So, it is understandable that the authors made some mistakes in their reasoning. However, articles that are published in scientific journals are vetted by peer-review, and the authors thank several scientists for helpful comments. So, several social scientists were unable to realize that the statistical analyses are flawed even though they produced the stunning result that police officers are 5 times more likely to kill White criminals than Black criminals. Nobody seemed to notice that this doesn’t make any sense. I hope that the editor of the journal and the authors carefully examine my criticism of this article and take appropriate steps if my criticism is valid.

I also hope that other social scientists examine this issue and add to the debate. Thanks to the internet, science is now more open and we can use open discussion to fix mistakes in scientific articles much faster. Maybe the mistake is on my part. Maybe I am not understanding the authors’ analyses properly. I am also not a neutral observer living on planet Mars. I am married to an African American woman with an African American daughter and my son is half South-Asian. I care about their safety and I am concerned about racial bias. Fortunately, I live in Canada where police kill fewer citizens.

I welcome efforts to tackle these issues using data and the scientific method, but every scientific result needs to be scrutinized even after it passed peer-review. Just because something is published in a peer-reviewed journal doesn’t make it true. So, I invite everybody to comment on this article and my response. Together we should be able to figure out whether the authors’ statistical approach is valid or not.

Peer-Review is Censorship not Quality Control

The establishment in psychological science prides itself on its publications in peer-reviewed journals. However, it has been known for a long time that peer-review, especially at fancy journals with high rejection rates, is not based on the objective quality of an empirical contribution. Peer-review is mainly based on subjective criteria. Rather than making authors wait three months, editors should just accept or reject papers and state clearly that the reason for their decision is their subjective preference.

I resigned from the editorial board of JPSP-PPID after I received one of these action letters from JPSP.

The key finding of our study, based on 450 triads (student, mother, father) who reported on their personality and well-being in a round-robin design (several years of data collection, $100,000 in research funding), was that positive illusions about one's personality (self-enhancement) did not predict informant ratings of well-being. Surely, we can debate the implications of this finding, but it is rather interesting that positive illusions about the self do not seem to make individuals happier in ways that others can perceive. Or so I thought. Not interesting at all, because apparently self-ratings of well-being are perfectly valid indicators of well-being. So, if informants don't see that fools are happier, they are still happier, just in a way that others do not see. At least, that was the opinion of the editor, and as the editor had the power to decide what gets published in JPSP, the manuscript was rejected.

I am mainly posting the editorial letter here because I think the review process should be transparent and open. After all, these decisions influence what gets published when and where. If we pride ourselves on the quality of the review process, we shouldn't have a problem demonstrating this quality by making decision letters public. Here everybody can judge for themselves how good the quality of the peer-review process at JPSP is. That is called open science.

Manuscript No. PSP-P-2019-1535

An Integrated Social and Personality Psychological Model of Positive Illusions, Personality, and Wellbeing Journal of Personality and Social Psychology:  Personality Processes and Individual Differences

Dear Dr. Schimmack,

I have now received the reviewers’ comments on your manuscript. I appreciated the chance to read this paper. I read it myself prior to sending it out and again prior to reading the reviews. As you will see below, the reviewers and I found the topic and dataset to be interesting. However, based on their analysis and my own independent evaluation, I am sorry to say that I cannot accept your manuscript for publication in Journal of Personality and Social Psychology:  Personality Processes and Individual Differences.

The bottom line is that the strongest theoretical contribution the model would appear to produce is not currently justified in the paper and the empirical evidence presented regarding that contribution does not rise to the occasion to support publication in JPSP.

I’ll start by mentioning that both Reviewers commented on the lack of clarity of the presentation, and unfortunately, I agree. Reviewer 1 commented overall, “There were passages with a lack of clarity.” Just focusing on the Introduction, I read it multiple times to try to understand what you see as the central theoretical contribution. I found that entire literatures were overlooked (well-being) or mischaracterized (social psychological perspectives on well-being), and illogical arguments were advanced (p. 4). Terms were introduced without definition (e.g., hedonic balance) but later included in the statistical model, and – when later comparing the terms in the model to the literature reviewed – I found a lack of discussion of or justification for the paths that were actually tested and that seem to be at the heart of what you see as the main contribution to the literature (as indicated by the first paragraphs of the Discussion section). Echoing Reviewer 1’s more general point, Reviewer 2 commented specifically on the model, stating: “The authors should describe the statistical model in much more detail to make the statistical analyses easier to follow, more transparent, and replicable.” Again, I agree.

I bring that up at the outset of the letter because you will see me refer to lack of clarity here and there, below. But now I’d like to set aside writing to focus on the theoretical and empirical contribution, which are the heart of the matter.

Through a careful reading of the terms in the model, analyses, and the first paragraphs of the Discussion section, here is what I understand are the claims about the central theoretical and empirical contributions (by “central” I mean the contributions that would make this work cross the threshold for publication in JPSP:PPID): you believe (subjective) well-being has a “truth” to it in the way personality traits might, and a bias to it. You think the “truth” (public view) estimate of well-being is a more important outcome than the “bias” estimate. As such, you draw the conclusion that you have overturned a seminal paper and the field of social psychology’s perspective on well-being because the measure you care about, the “truth” estimate of well-being, does not correlate with self-ratings of positive illusions. (This conclusion appears to be drawn despite the fact that positive illusions about the self and self-reported well-being are indeed correlated, which replicates prior positive illusions literature.)

Given that the argument turns on this interesting idea about truth and bias estimates of well-being, I’ll focus on well-being. There is a huge literature on well-being. Since Schwarz and Strack (1999), to take that arbitrary year as a starting point, there have been more than 11,000 empirical articles with “wellbeing” (or well-being or well being) in the title, according to PsychInfo. The vast majority of them, I submit, take the subjective evaluation of one’s own life as a perfectly valid and perhaps the best way to assess one’s own evaluation of one’s life. So if you are staking the conclusion of your paper on the claim that in fact others’ agreement with a person about whether that person’s life is good is the best representation of one’s well-being, and researchers in the field should dismiss the part about the evaluation that is unique to the evaluator, then that needs to be heartily justified in the paper’s Introduction. The onus is on the authors to do that and I do not believe it is there.

Instead, from what I can tell you appear to be relying on an assumption that, because well-being is consistent with statistical properties of personality in that “wellbeing judgments show agreement between self-ratings and informant ratings (Schneider & Schimmack, 2009; Zou, Schimmack, & Gere, 2013) and are much more stable than the heuristics-and-bias perspective suggests (Schimmack & Oishi, 2005)” (p. 7), therefore the conceptual problem is the same as for measures of personality. It is not. It is of course well-established on theoretical grounds why personality traits are useful to assess from multiple perspectives. But for the question of well-being, this is literally about my subjective feeling about my life; on what grounds do others’ perspectives take a higher priority than the self’s? I agree that it is an interesting question to know if others can see my well-being the way that I do, but this so-called “truth” estimate speaks to quite a different research question than what most of the well-being research field would consider to be an important question. If you think it is important or even more important than the way it has been traditionally done (which I surmise you might, based on what appears to be the dismissal of 30 years of research on positive illusions and well-being in the Discussion), it is up to you to (a) define and measure well-being as it relates to the contemporary psychological literature, (b) explain why this subjective assessment should not be taken at face value but instead needs multi-rater reports to make accurate or meaningful inferences, then (c) explain why each of your predictors would map on to each of these two estimates (i.e., truth and bias) and (d) why those paths matter for the broader literature.

I do see where you talked about positive illusions and “positive beliefs” (which I think you equate with wellbeing but it was unclear) side by side in the introduction (e.g., p. 4), but not where you (1) recognized (a) positive illusions about personality and (b) wellbeing estimates as distinct constructs and (2) justified why one would be associated with the other.

If you make those arguments – situated in the contemporary literature on well-being, and reviewers for a future submission agree with the logic and potential theoretical contribution – the next hurdle of course is the empirical contribution. Assuming the models are correct (see both reviewers’ comments on this), this paper would make empirical contributions in its conceptual replications of prior findings and a few other interesting observations. But the biggest theoretical contribution you appear to want to claim is that “Overall, these results challenge Taylor and Brown’s seminal claim that mental health and wellbeing are rooted in positive illusions.” Yet, (a) you do present evidence that the link between positive illusions about the self and well-being as assessed by the self are correlated, as has been done previously in that literature, and (b) this conclusion appears to be drawn based on null effects using a measure that is not established (i.e., “truth”). (And please see Reviewer 1’s concerns about the cross-sectional nature of the findings as well as the fact that measures use few items.)

Overall, this dataset is rich and the idea of considering convergence and bias in well-being estimates is interesting. To produce a paper that will have a strong impact, I suggest you take a close look at your modeling approach (Reviewer 2), take a close look at your conceptual model itself (not the results) and map it on to the points in the literature that it most closely addresses (e.g., novel questions about separating well-being into truth and bias), and consider what additional evidence might bolster that theoretical or methodological contribution.

Additionally, Reviewer 1 commented on the framing of the paper, on antagonistic language, and on editorializing, and I agree on all fronts. The frame is much too broad, sets up a false dichotomy between social and personality psychology, and the evidence does not rise to the occasion to either (a) take down the paper the Introduction sets up as the foil (i.e., Taylor and Brown, 1988) or (b) allow personality psychologists to “win” the false competition between social and personality psychology about whether positive illusions contribute to well-being.

Other comments:

– Please justify the use of these two sets of life evaluations but not hedonic balance as indicators of well-being, based on contemporary literature and evidence on well-being and how these should relate to one another. (I note that, incidentally, Schwarz and Strack include happiness judgments in their review of well-being.)

– In the Method section, what was the timescale of the “hedonic balance” assessment? Was it “right now”? The past 24 hours? Two weeks?

– Both reviewers were experts in SEM methods and personality; please do take a close look at their methodological comments, which were quite thoughtful and helpful as I considered my decision.

– I had similar questions as Reviewer 1 regarding the fact that student gender was lumped together relative to mother and father reports, where gender is naturally separated. I agree that there is low statistical power to address this empirically but just wanted to let you know that this thought independently came up for two of us.

In closing, I would like to thank the reviewers for their constructive comments, and I look forward to reading more about this research in the future.

For your guidance, I have appended the reviewers’ comments, and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.


Sara Algoe, Ph.D.

Associate Editor

Journal of Personality and Social Psychology:  Personality Processes and Individual Differences

Personality and Life-Satisfaction in the SOEP

In 1980, Costa and McCrae proposed an influential model of well-being. Their seminal article has 1,400 citations in Web of Science so far.

The model assumed that personality traits are stable (during adulthood) and that stable internal dispositions are the major determinant of wellbeing.

The model has led to the development of a two-dimensional model of affect that assumes positive affect and negative affect are independent dimensions (Watson, Tellegen, & Clark, 1980). Accordingly, wellbeing is characterized by high PA and low NA. The model also implies that happy people are extroverted and emotionally stable (low neuroticism).

This model had a strong influence on wellbeing research in psychology for two decades (see Diener et al., 1999, Psych Bull, for a review).

The model implies that (a) life-satisfaction is stable over time and (b) personality traits should account for most of the stable variance in life-satisfaction. This follows directly from Costa and McCrae’s (1980) assumption that personality effects are stable, while environmental factors change over time.

“Few would argue against the position that, for normal people, the major determinant of momentary happiness is the specific situation in which the individual finds himself or herself. Social slights hurt our feelings, toothaches make us miserable, compliments raise our spirits, eating a good meal leaves us satisfied. The contribution of personality to any one of these feelings is doubtless small. Yet over time, the small but persistent effects of traits emerge as a systematic source of variation in happiness, whereas situational determinants that vary more or less randomly tend to cancel each other out (cf. Epstein, 1977).” (Costa and McCrae, 1980, p. 676)

Since 2000, well-being researchers have revised their views about the major determinants of well-being. Thanks to large panel studies (longitudinal studies with repeated measurements) like the SOEP, it became clear that environmental influences are more important than Costa and McCrae (1980) suggested. After separating variance in life-satisfaction into a stable component and a changing component, the changing component still had an annual stability of around .9 (Lucas & Donnellan, 2007; Schimmack, Schupp & Wagner, 2008; Schimmack & Lucas, 2010). For example, unemployment has been shown to influence well-being, and individuals who are unemployed are more likely to remain unemployed in the next year. Similarly, income is highly stable from year to year.

To complicate matters further, there is some controversy about the stability of personality traits, and it has been suggested that changes in well-being may produce personality changes (Scollon & Diener, 2006). Studies of the stability of personality and life-satisfaction partially resolve this question (Conley, 1984; Anusic & Schimmack, 2016). These studies show that personality is more stable than life-satisfaction judgments. This finding suggests that personality influences well-being. The reason is that a causal influence implies that changes in one variable produce changes in another variable. Thus, if life-satisfaction changes while personality remains stable, life-satisfaction cannot be a cause of personality. However, this line of reasoning is indirect. A better test of the relationship between personality and life-satisfaction requires longitudinal studies that measure both constructs repeatedly over time. Until recently, these kinds of studies were largely absent. Here I use the SOEP and MIDUS data to examine concurrent changes in personality and life-satisfaction.



The SOEP measured personality in 2005, 2009, and 2013 with the 15-item BFI-S. I developed a measurement model for these items (Schimmack, 2019). Here, the same measurement model was fitted to the data from all three waves, while imposing metric invariance. In addition, the 11-point life-satisfaction item was included as an indicator of well-being.


Longitudinal stability and change were modeled with Heise's (1969) autoregressive model, which separates variance into an occasion-specific error component and a state component. The state component changes gradually over time. This model does not include a trait component because four measurements are needed to identify the influence of stable factors. However, over a shorter time interval of a decade, an autoregressive model can approximate a trait model.

The personality model has seven factors. Five factors represent the Big Five personality traits. The other two factors reflect acquiescence bias and halo bias. Unfortunately, there are model identification problems when life-satisfaction judgments are regressed on all seven factors. Previous studies have shown that openness is a very weak predictor of life-satisfaction, while halo bias does predict life-satisfaction ratings (Kim, Schimmack, & Oishi, 2012). Thus, the relationship between openness and life-satisfaction was fixed to zero, while the relationship for halo was estimated freely.

Life-satisfaction at all time points was regressed on neuroticism (N), extraversion (E), agreeableness (A), conscientiousness (C), and halo (H). Costa and McCrae's (1980) model predicts that extraversion and neuroticism account for the lion's share of the correlation between personality and life-satisfaction, concurrently and over time. McCrae and Costa (1991) also found that agreeableness and conscientiousness added to the prediction of life-satisfaction. Thus, these personality traits should also predict life-satisfaction concurrently and over time.

The novel question is whether changes in personality also predict changes in life-satisfaction. To examine this question, the residual variances at time 2 and time 3 were allowed to predict life-satisfaction at time 2 and time 3, respectively.

The second novel question was whether personality accounts for all of the stability in life-satisfaction. If this were the case, the residuals of life-satisfaction should be fairly independent over time. That is, the variance in life-satisfaction that is not explained by personality at time 1 should not predict variance in life-satisfaction at time 2.

The syntax for this (somewhat complex) model and the complete results can be found on OSF. The model fit was acceptable, CFI = .964, RMSEA = .022, SRMR = .031.


Table 1 shows the reliability and stability estimates.

Life Satisfaction: .61 (reliability), .80 (4-year stability), .95 (annual stability)

The reliability estimate for the life-satisfaction ratings is consistent with estimates based on extensive multi-wave models (Lucas & Donnellan, 2007; Schimmack et al., 2008). The reliability estimates for personality are higher because they are based on three-item measures and a latent-variable model that reduces measurement error. The results also show that the stability of life-satisfaction is lower than the stability of personality traits, although the difference is small and conscientiousness is a notable exception.

Table 2 shows the effect sizes for the relationships of personality at time 1 with life-satisfaction at times 1 to 3. The pattern of relationships is consistent with previous studies. Neuroticism is the strongest Big Five predictor of life-satisfaction, with about 10% explained variance. Extraversion is a significant predictor, but only explains about 1% of the variance in life-satisfaction. Effects for agreeableness and conscientiousness are weak. Halo is a notable predictor that explains another 10% of the variance.

The second important finding is that personality traits measured at time 1 predict life-satisfaction with equal strength across the three time points covering eight years. This confirms the hypothesis that personality accounts for stability in life-satisfaction. However, all personality measures combined account for less than a quarter of the variance in life-satisfaction. Given the much higher stability of life-satisfaction (see Table 1), the Big Five are not sufficient to explain stability in life-satisfaction.


Table 3 shows the relationship between residual variances in personality and life-satisfaction at times 2 and 3. These results show whether changes in personality predict changes in well-being. The coefficients in Table 3 cannot be directly compared to those in Table 2 because they are standardized coefficients and the residual variance in personality is much smaller than the stable variance. However, the results do provide initial information about whether changes in personality can predict changes in life-satisfaction.


Results for neuroticism and extraversion trend in the expected direction but are weak at time 2. Results for agreeableness point towards a negative effect, which is difficult to interpret, and effects for conscientiousness are weak at times 2 and 3. The strongest effect was found for halo. When halo bias changes, life-satisfaction ratings change in the same direction. This is consistent with an evaluative-bias effect. Overall, these results are consistent with the view that personality is stable and mostly accounts for stable variance in life-satisfaction judgments.


Forty years ago, Costa and McCrae (1980) proposed that well-being is influenced by stable personality dispositions and that these stable dispositions account for stability in well-being. Forty years later, we can evaluate their theory with new and better data. The results confirm the prediction that neuroticism is a stable personality disposition that produces lasting individual differences in well-being. The results for extraversion are also consistent with the theory, although the effect size is weaker and extraversion accounts for only a small portion of the variance in life-satisfaction. The addition of agreeableness and conscientiousness as predictors of well-being is less supported by the data.

The most important limitation of Costa and McCrae's model is that neuroticism and extraversion explain only a portion of the stability in life-satisfaction. Life-satisfaction shows additional stability that is not explained by personality, or at least not by the Big Five. An important avenue for future research is to find additional predictors of stable variance in life-satisfaction. Costa and McCrae and subsequent researchers may have underestimated the stability and importance of environmental influences on well-being (Schimmack & Lucas, 2010).

Personality Change in the MIDUS

In 2000, Costa, Herbst, McCrae, and Siegler published the article “Personality at Midlife: Stability, Intrinsic Maturation, and Response to Life Events.” The article reported the biggest study of personality stability and change at that time.

Over 1,000 participants (N = 1,779) took the NEO, a measure of the Big Five personality traits, 9 years apart. Participants were 39 to 45 years old at time 1. The main finding was that mean levels of personality hardly changed. If anything, all scales except agreeableness showed a small decrease. This finding led to the conclusion that personality is largely stable in adulthood.

Six years later, Roberts, Walton, and Viechtbauer reported the results of a meta-analysis of personality change over the life course. The results of this meta-analysis were dramatically different. In particular, conscientiousness showed marked increases throughout adulthood. According to this meta-analysis, conscientiousness would still increase by about half a standard deviation from age 30 to age 75.

Sometimes, meta-analyses are considered superior to original studies because they incorporate all of the available evidence. However, meta-analyses are also problematic because they combine a heterogeneous set of studies. The main limitation of Roberts et al.'s (2006) meta-analysis was the lack of good data. Costa et al.'s (2000) article was by far the largest sample with adult (age > 30) participants. Other studies sometimes had samples of fewer than 100 participants or examined very brief time intervals that leave little time for changes in personality. For example, one study was based on 37 participants with a 2-year retest interval (Weinryb et al., 1992). Thus, the amount of (mean-level) change in personality in adulthood remains an open empirical question that can only be answered with better data.

Fortunately, longitudinal data from large samples are now available to shed new light on personality change in adulthood. A few days ago, I posted results based on three waves spanning 8 years of the German Socio-Economic Panel. The results showed mainly cohort effects and little evidence of personality change with age. The figure below shows the results for conscientiousness. Only the youngest cohort (on the right) shows some increase from 2005 to 2013.


Here I present the results of an analysis of the MIDUS data. To examine age and cohort effects, I fitted a measurement model (Schimmack, 2019) to the three waves of the MIDUS. I also divided the sample into three cohorts of 30-40 year olds (1965-75), 40-50 year olds (1955-65), and 50-60 year olds (1945-55) in 1995. The measurement model had metric and scalar invariance for all 9 groups (3 cohorts x 3 waves) and had acceptable fit to the data, CFI = .952, RMSEA = .027, SRMR = .055. The MPLUS syntax can be found on OSF. The sample sizes for the three cohorts were N = 1,625, N = 1,674, and N = 1,279, although not all participants completed all three waves. Results were similar when the data were analyzed with listwise deletion. The standardized means of the latent variables were centered so that all group means are deviations from the overall mean.



The results for conscientiousness are difficult to interpret. Unlike in the SOEP data, conscientiousness scores increase from Wave 1 to Wave 3 in all three cohorts. The effect size is modest for the 18-year interval but would roughly double over a longer period from age 30 to age 70. Thus, an exclusive focus on change over time would be consistent with Roberts et al.'s findings. However, the figure also shows that there are no cohort differences in conscientiousness. That is, 50-60 year olds in 1995 (cohort 1945-55) did not score higher than 40-50 year olds in 1995, although they are 10 years older. One possible explanation for this finding would be a cohort effect that offsets the age effect, but this cohort effect would imply that younger generations are more conscientious than older generations. The problem with this explanation is that there is no evidence or theory that would suggest such a cohort effect.

The alternative explanation would be period effects. Period effects would change conscientiousness scores of all cohorts in the same direction. However, there are also no theories or data to suggest that conscientiousness has increased from 1995 to 2009.

In conclusion, it remains unclear whether and how much conscientiousness levels increase with age. Although new and better data are available, the data are inconsistent and inconclusive.


The results for agreeableness are similar to those for conscientiousness. A focus on the longitudinal trends suggests that agreeableness increases with age, which mirrors Roberts et al.’s (2006) meta-analysis. This time, the oldest cohort also shows a pattern that is consistent with an age effect. However, other interpretations are possible. The SOEP data suggested a small cohort effect with younger cohorts being less agreeable. Thus, the differences between cohorts may not be age effects. The effect sizes over an 18-year interval are small, but might add up to the d = .4 effect size from age 30 to 75 suggested by Roberts et al.’s (2006) meta-analysis.

Roberts et al.’s (2006) meta-analysis also suggested that neuroticism decreases with age, while the SOEP data didn’t show an age-trend for neuroticism. The MIDUS data also show little evidence that neuroticism decreases with age. Longitudinal trends were only notable for two cohorts and the effect size of d = .2 over an 18-year period is small.



At least the results for openness are consistent with previous findings that openness is fairly stable during adulthood.


This is also the case for extraversion.


The bedrock of science is objective empirical observations that produce a consistent picture of a phenomenon. Obtaining such consistent evidence can be difficult. Studying personality change is difficult for many reasons. Following a large sample of participants over time is hard and costly. Even cross-sectional and longitudinal information in combination is often insufficient to disentangle age effects from period effects or cohort effects. It doesn't help that effect sizes are small. Even a moderate effect size of d = .5 over a period of 10 years implies only a tiny effect size of d = .05 over a one-year period. Moreover, personality measures have only modest validity and are influenced by systematic measurement error that can produce spurious evidence of personality change.

The study of mean differences also has the problem that many causal factors can explain a time-trend in the data at the mean level, and that mean level changes are most likely the aggregated effects of several causal factors at the individual level (e.g., work experiences or health problems may have opposite effects on conscientiousness). Thus, progress is more likely to be made by focusing on individuals’ trajectories rather than mean levels.

The broader implications of these findings are that there is no evidence that personality changes in substantial ways throughout adulthood. This conclusion is limited to the Big Five, although Costa and McCrae also found little evidence for age effects at the level of more specific personality traits. Of course, 20-year olds behave differently than 40-year olds, or 60-year olds. However, these changes in actual behaviors are more likely the result of changing life-circumstances than changes in personality traits.

The Hierarchy of Consistency Revisited

In 1984, James J. Conley published one of the most interesting studies of personality stability. However, this important article was published in Personality and Individual Differences and has been largely ignored. Even today, the article has only 184 citations in Web of Science. In contrast, the more recent meta-analysis of personality stability by Roberts and DelVecchio (2000) has 1,446 citations.

Sometimes more recent and more cited doesn't mean better. The biggest problem in studies of stability is that random and occasion-specific measurement error attenuates observed retest correlations. Thus, observed retest correlations are prone to underestimate the true stability of personality traits. With a single retest correlation, it is impossible to separate measurement error from real change. However, when more than two repeated measurements are available, it is possible to separate random measurement error from true change, using a statistical approach developed by Heise (1969).

The basic idea of Heise’s model is that change accumulates over time. Thus, if traits change from T1 to T2 and from T2 to T3, the trait changed even more from T1 to T3.

Without going into mathematical details, the observed retest correlation from T1 to T3 should match the product of the retest correlations from T1 to T2 and T2 to T3.

For example, if r12 = .8 and r23 = .8, r13 should be .8 * .8 = .64.

The same is also true if the retest correlations are not identical. Maybe more change occurred from T1 to T2 than from T2 to T3. The total stability is still a function of the product of the two partial stabilities. For example, r12 = .8 and r23 = .5 yields r13 = .8 * .5 = .4.

However, if there is random measurement error, the r13 correlation will be larger than the product of the r12 and r23 correlations. For example, with true stabilities s1 = .8 and s2 = .4 and a reliability of .8, we get observed correlations r12 = .8 * .8 = .64 and r23 = .4 * .8 = .32, so the product is .64 * .32 = .20, while the actual r13 correlation is (.8 * .4) * .8 = .256. Assuming that reliability is constant, we have three equations with three unknowns, and it is possible to solve the equations to estimate reliability.

(1) r12 = rel*s1; s1 = r12/rel
(2) r23 = rel*s2; s2 = r23/rel
(3) r13 = rel*s1*s2, rel = r13/(s1*s2)

rel = (r12*r23)/r13

With r12 = .64, r23 = .32, and r13 = .256, we get rel = (.64*.32)/.256 = .8.
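The three equations can be solved in a few lines. A minimal sketch of Heise's decomposition, using the numbers from the example above:

```python
# Heise (1969): decompose three retest correlations into a common
# reliability and two true-stability coefficients, assuming equal
# reliability at each occasion and an autoregressive structure.
def heise(r12, r23, r13):
    rel = (r12 * r23) / r13  # reliability
    s1 = r12 / rel           # true stability from T1 to T2
    s2 = r23 / rel           # true stability from T2 to T3
    return rel, s1, s2

rel, s1, s2 = heise(0.64, 0.32, 0.256)
print(round(rel, 2), round(s1, 2), round(s2, 2))  # 0.8 0.8 0.4
```

The function recovers the reliability of .8 and the true stabilities of .8 and .4 that generated the observed correlations.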

Heise's model is called an autoregressive model, which implies that retest correlations become smaller and smaller over time until they approach zero. However, if stability is high, this can take a long time. For example, Conley (1984) estimated that the annual stability of IQ tests is r = .99. With this high stability, the retest correlation over 40 years is still r = .67. Consistent with Conley's prediction, a study found a retest correlation from age 11 to age 70 of r = .67 (ref), which is even higher than predicted by Conley.
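The long-run implication is easy to verify: assuming a constant annual stability, the k-year retest correlation is simply the annual stability raised to the k-th power.

```python
# Retest correlation over k years implied by a constant annual
# stability under an autoregressive model: r_k = annual ** k.
annual = 0.99  # Conley's (1984) estimate for IQ
print(round(annual ** 40, 2))  # 40-year retest, about .67
print(round(annual ** 59, 2))  # age 11 to age 70
```

The 59-year prediction (about .55) is what makes the observed r = .67 from age 11 to 70 higher than the model expects.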

The Figure below shows Conley’s estimate for personality traits like extraversion and neuroticism. The figure shows that reliability varies across studies and instruments from as low as .4 to as high as .9. After correcting for unreliability, the estimated annual stability of personality traits is s = .98.

The figure also shows that most studies in this meta-analysis of retest correlations covered short time intervals, from a few months up to 10 years. Studies with retest intervals of 10 or more years are rare. As a result, Conley's estimates are not very precise.

To test Conley’s predictions, I used the three waves of the Midlife in the US study (MIDUS). Each wave was approximately 10 years apart with a total time span of 20 years. To analyze the data, I fitted a measurement model to the personality items in the MIDUS. The fit of the measurement model has been examined elsewhere (Schimmack, 2019). The measurement model was constrained for all three waves (see OSF for syntax). The model had acceptable overall fit, CFI = .963, RMSEA = .018, SRMR = .035 (see OSF for output).

The key findings are the retest correlations r12, r23, and r13 for the Big Five and two method factors: a factor for evaluative bias (halo) and a factor for acquiescence bias.


For all traits except acquiescence bias, the r13 correlation is lower than the r12 or r23 correlation, indicating some real change. However, for all traits, the r13 correlation is higher than the product of r12 and r23, indicating the presence of random measurement error or occasion specific variance.

The next table shows the decomposition of the retest-correlations into a reliability component and a stability component.

Reliability | 20-Year Stability | 1-Year Stability

The reliability estimates range from .84 to .92 for the Big Five scales. Reliability of the method factors is estimated to be lower. After correcting for unreliability, 20-year stability estimates increase from observed levels of .72 to .85 to estimated levels of .83 and higher. The implied annual stability estimates are above .99, which is higher than Conley's estimate of .98.
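The conversion from a 20-year stability to an implied annual stability follows the same autoregressive logic in reverse. A minimal sketch (the .90 value is illustrative, not an estimate from the MIDUS):

```python
# Implied annual stability from a disattenuated 20-year stability
# under the autoregressive model: s_1y = s_20y ** (1 / 20).
for s20 in (0.83, 0.90):
    print(s20, round(s20 ** (1 / 20), 3))
```

Even the lowest disattenuated 20-year stability of .83 implies an annual stability of about .991, above Conley's .98.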

Unfortunately, three time points are not enough to test the assumptions of Heise's model. Maybe reliability increases over time. Another possibility is that some of the variance in personality is influenced by stable factors that never change (e.g., genetic variance). In this case, retest correlations do not approach zero, but approach a level that is set by the influence of the stable factors.

Anusic and Schimmack's meta-analysis suggested that for the oldest age group, the amount of stable variance is 80%, and that this asymptote is reached very quickly (see picture). However, this model predicts that 10-year retest correlations are equivalent to 20-year retest correlations, which is not consistent with the results in Table 1. Thus, the MIDUS data suggest that the model in Figure 1 overestimates the amount of stable trait variance in personality. More data are needed to model the contribution of stable factors to the stability of personality traits. However, both models predict high stability of personality over a long period of 20 years.
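The contrast between the two models can be illustrated with a simple trait-plus-state mixture; the parameter values below are illustrative assumptions, not estimates from the MIDUS:

```python
# Retest correlation when a share p of the reliable variance is
# perfectly stable (trait) and the rest decays autoregressively
# (state): r_k = p + (1 - p) * s ** k.
def retest(p, s, years):
    return p + (1 - p) * s ** years

# With 80% stable variance, 10- and 20-year retests are nearly
# identical -- the prediction the MIDUS results contradict.
print(round(retest(0.8, 0.8, 10), 2))  # 0.82
print(round(retest(0.8, 0.8, 20), 2))  # 0.8
```

Because the state component decays quickly, almost all of the long-run correlation comes from the trait share p, so retest correlations flatten out at that asymptote instead of declining toward zero.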


Science can be hard. Astronomy required telescopes to study the universe. Psychologists need longitudinal studies to examine the stability of personality and personality development. The first telescopes were imperfect and led to false beliefs about canals and life on Mars. Similarly, longitudinal data are messy and provide imperfect glimpses into the stability of personality. However, the accumulating evidence shows impressive stability in personality differences. Many psychologists are dismayed by this finding because they have a fixation on disorders and negative traits. However, the Big Five traits are not disorders or undesirable traits. They are part of human diversity. When it comes to normal diversity, stability is actually desirable. Imagine you train for a job and after ten years of training you don't like it anymore. Imagine you marry a quiet introvert and five years later he is a wild party animal. Imagine you never know who you are because your personality is constantly changing. The grass on the other side of the fence is often greener, but self-acceptance and building on one's true strengths may be a better way to live a happy life than trying to change your personality to fit cultural norms or parental expectations. Maybe stability and predictability aren't so bad after all.

The results also have implications for research on personality change and development. If natural variation in the factors that influence personality produces only very small changes over periods of a few years, it will be difficult to study personality change. Moreover, small real changes will be contaminated with relatively large amounts of random measurement error. Good measurement models that can separate real change from noise are needed.


Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5, 11-25.

Heise, D. R. (1969). Separating reliability and stability in test-retest correlation. American Sociological Review, 34, 93-101.

Roberts, B. W., & DelVecchio, W. F. (2000). The rank-order consistency of personality traits from childhood to old age: A quantitative review of longitudinal studies. Psychological Bulletin, 126, 3–25.

Measuring Personality in the MIDUS

Although the replication crisis in psychology is far from over, a new crisis is emerging on the horizon: the validation crisis. Despite a proud tradition of psychological measurement, psychological science has ignored psychological measurement and treated sum scores of ratings or reaction times as valid without testing this assumption (Schimmack, 2019a, 2019b).

Even when psychometricians have examined the validity of psychological measures, these studies are often ignored. For example, there is ample evidence that self-ratings are influenced by a general evaluative bias or halo (Thorndike, 1920; Campbell & Fiske, 1959; Biesanz & West, 2004; DeYoung, 2006; Anusic et al., 2009; Kim et al., 2012). Yet psychometric studies of the Big Five tend to ignore this method factor (Zimprich, Allemand, & Lachman, 2012).

This is unfortunate because psychologists now have invaluable datasets that examine personality in large, nationally representative, longitudinal studies such as the German Socio-Economic Panel (Specht et al.) and the Midlife in the United States (MIDUS) study.

The aim of this blog post is to invite psychologists to take advantage of advances in psychometric methods when they analyze these datasets. Rather than computing sum scores with low reliability that are contaminated by method variance, it is preferable to use latent variable models that can test measurement invariance across samples and over time.

To examine age effects on personality in the MIDUS, I developed an open measurement model. Rather than arguing that this is the best measurement model, I consider it a starting point for further exploration. Exploring different measurement models and examining the theoretical consequences of different specifications is not validity hacking (v-hacking; cf. Schimmack, 2019c). Transparent, open debate about the specification of measurement models is open science and necessary for developing better measures.

Using a measurement model for the MIDUS is particularly important because the questionnaire has only a few items to represent some Big Five dimensions. Moreover, halo bias inflates factor loadings in models that do not control for it (EFA, or CFA without method factors), and the results overestimate the validity of Big Five scales in the MIDUS.

The final model had acceptable overall fit and modification indices suggested no major further revisions to the model, CFI = .958, RMSEA = .039, 90%CI = .037 to .040, SRMR = .035.

Table 1 shows the factor loadings of items and scale scores on the latent Big Five factors.


Results show several items with notable secondary loadings (e.g., warm), and some primary factor loadings were modest (e.g., curious). Nevertheless, 50% or more of the variance in sum scores can be attributed to the primary content of a scale, except for conscientiousness. All scales also had considerable halo variance. For conscientiousness, halo variance was nearly as high as conscientiousness variance. Given these results, it is preferable to examine substantive questions with the latent factors of a measurement model rather than with manifest scale scores.

Age and Personality

The MIDUS data are some of the best data to examine the influence of age on personality because longitudinal studies with large samples and long retest intervals are rare (see meta-analysis by Anusic & Schimmack, 2016).

Age effects can be examined cross-sectionally and longitudinally. The problem with cross-sectional studies is that age is confounded with cohort effects. The problem with longitudinal studies is that age is confounded with period effects. Stronger evidence for robust age effects is obtained in longitudinal cohort studies. The MIDUS data make it possible to compare participants who are 45 (40 to 50) to participants who are 55 (50 to 60) at time 1 and to compare their scores at time 1 to their scores at time 2. The older age group at time 1 corresponds to the younger age group at time 2 (age 50 to 60). Thus, these groups should be similar to each other, but differ from the younger group at time 1 and the older group at time 2, if age influences personality.

To test this hypothesis, I fitted a multi-group model to the MIDUS data at time 1 and time 2. The model assumed metric and scalar invariance for all four groups. This model had good fit to the data, CFI = .957, RMSEA = .026, SRMR = .047.

The means of the latent Big Five factors and the two method factors were centered at the overall mean of the four groups, so that mean differences are presented as deviations from 0 (rather than using one group as an arbitrary reference group).

The results show no notable age effects for extraversion or openness. Neuroticism shows a decreasing trend, with a standardized mean difference of .33 from age 40-50 to age 60-70. Agreeableness shows an even smaller increase of .21 standard deviations. The results for conscientiousness are difficult to interpret because the equivalent age groups differ more from each other than from the other age groups. Overall, these results suggest that mean levels of personality are fairly stable from age 40 to age 70.

The halo factor shows a trend towards increasing with age. However, the increase is also modest, d = .35. The largest effect is a decrease in acquiescence. This effect is mostly driven by a retest effect, suggesting that acquiescence bias decreases with repeated testing.

These results suggest that most changes in personality may occur during adolescence and early adulthood, but that mean levels of personality are fairly stable throughout midlife.

The model also provides information about the rank-order consistency of personality over a 10-year period. Consistent with meta-analytic evidence, retest correlations are high: neuroticism, r = .81; extraversion, r = .87; openness, r = .78; agreeableness, r = .84; and conscientiousness, r = .81. A novel finding is that halo bias is also stable over a 10-year period, r = .69, as is acquiescence bias, r = .57. Thus, even time-lagged correlations can be influenced by method factors, and it is necessary to control for halo bias in studies that rely on self-reports.

Gender and Personality

I also fitted a multiple-group model to the data with gender as between-group variable and time (T1 vs. T2). This model examines age differences for groups age 40-50 (T1) and age 50-60 (T2). The model with metric and scalar invariance had acceptable fit, CFI = .952, RMSEA = .027, SRMR = .051. As before, the means of the latent factors were transformed so that the overall mean was zero.

The main finding is a large difference between men and women’s agreeableness of nearly a full standard deviation. This difference was the same in both age groups. This finding is consistent with previous studies, including cross-cultural studies, suggesting that gender differences in agreeableness are robust and universal.

The results also showed consistent gender differences in neuroticism with an effect size of about 50% of a standard deviation. Again, the gender difference was observed in both age groups. This finding is also consistent with cross-cultural studies.

Hidden Invalidity of Personality Measures?

Sometimes journal articles have ironic titles. The article “Hidden invalidity among fifteen commonly used measures in social and personality psychology” (in press at AMPPS) is one of them. The authors (Ian Hussey & Sean Hughes) claim that personality psychologists engaged in validity hacking (v-hacking) and claimed validity for personality measures when actual validation studies show that these measures have poor validity. As it turns out, these claims are false, and the article is an example of invalidity hacking, where the authors ignore and hide evidence that contradicts their claims.

The authors focus on several aspects of validity. Many measures show good internal consistency and retest-reliability. The authors ignore convergent and discriminant validity as important criteria of construct validity (Campbell & Fiske, 1959). The claim that many personality measures are invalid is based on examination of structural validity and measurement invariance across age groups and genders.

“Yet, when validity was assessed comprehensively (via internal consistency, immediate and delayed test-retest reliability, factor structure, and measurement invariance for median age and gender) only 4% demonstrated good validity. Furthermore, the less commonly a test is reported in the literature, the more likely it was to be failed (e.g., measurement invariance). This suggests that the pattern of underreporting in the field may represent widespread hidden invalidity of the measures we use, and therefore pose a threat to many research findings. We highlight the degrees of freedom afforded to researchers in the assessment and reporting of structural validity. Similar to the better-known concept of p-hacking, we introduce the concept of validity hacking (v-hacking) and argue that it should be acknowledged and addressed.”

Structural validity is important when researchers rely on manifest scale scores to test theoretical predictions that hold at the level of unobserved constructs. For example, gender differences in agreeableness are assumed to exist at the level of the construct. If a measurement model is invalid, mean differences between men and women on an (invalid) agreeableness scale may not reveal the actual differences in agreeableness.
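To make this concrete, here is a minimal simulation (with made-up loadings and intercepts, not the BFI's actual parameters) of how scalar non-invariance, that is, a group-specific item intercept, distorts a sum-score comparison of two groups that differ on the latent trait:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
d_latent = 0.5  # true latent mean difference (hypothetical)

def simulate(latent_mean):
    """Four items with equal loadings and intercepts (hypothetical values)."""
    theta = rng.normal(latent_mean, 1.0, n)  # latent trait scores
    loadings = np.array([0.7, 0.7, 0.7, 0.7])
    return np.outer(theta, loadings) + rng.normal(0, 0.5, (n, 4))

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var() + b.var()) / 2)
    return (b.mean() - a.mean()) / pooled_sd

group1 = simulate(0.0)
group2 = simulate(d_latent)
group2[:, 0] += 0.6  # scalar non-invariance: one item's intercept shifts for group 2

# The sum-score difference now mixes the latent difference with the intercept shift
print(cohens_d(group1.sum(axis=1), group2.sum(axis=1)))
```

In this sketch the sum-score d comes out inflated relative to the latent d of .50; without the intercept shift the sum-score d would be slightly attenuated instead. The direction of the distortion depends on the sign of the non-invariant intercept, which is exactly why tests of scalar invariance matter for mean comparisons.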

The authors claim that “rigorous tests of validity are rarely conducted or reported” and that “many of the measures we use appear perfectly adequate on the surface and yet fall apart when subjected to more rigorous tests of validity beyond Cronbach’s α.” This claim is neither supported by citations nor consistent with the general practice of exploring the factor structure of items during the development of psychological measures. For example, the Big Five were not conceived theoretically, but found empirically by employing exploratory factor analysis (or principal component analysis). Thus, claims of widespread v-hacking by omitting structural analyses seem inconsistent with actual practices.

Based on a questionable description of the state of affairs, the authors suggest that they are the first to conduct empirical tests of structural validity.

“With this in mind, we examined the structural validity of fifteen well-known self-report measures that are often used in social and personality psychology using several best practices (see Table 1).”

The practice of presenting something as novel by omitting relevant prior studies has been called l-hacking (literature-review hacking). It also makes it unnecessary to compare results with prior results and to address potentially inconsistent findings.

This also allows the authors to make false claims about their data: “The sheer size of the sample involved (N = 81,986 individuals, N = 144,496 experimental sessions) allowed us to assess the psychometric properties of these measures with numbers that were far greater than those used in many earlier validation studies.” Contrary to this claim, Nye, Allemand, Gosling, and Roberts (2016) published a study of the structural validity of the same personality measure (BFI) with over 150,000 participants. Thus, their study was neither novel nor did it have a larger sample size than prior studies.

The authors also made important and questionable choices that highlight the problem of researchers’ degrees of freedom in validation studies. In this case, their choice to fit a simple-structure model to the data ensured that they would obtain relatively bad fit for scales that include reverse-scored items, which is a good practice to reduce the influence of acquiescence bias on scale scores. However, the presence of acquiescence bias will also produce weaker correlations between direct and reverse-scored items. This response style can be modeled by including a method factor in the measurement model. Prior articles showed that acquiescence bias is present and that including an acquiescence factor improves model fit (Anusic et al., 2009; Nye et al., 2016). The choice not to include a method factor contributed to the authors’ conclusion that Big Five scales are structurally invalid. Thus, the authors’ conclusion is based on their own choice of a poor measurement model rather than hidden invalidity of the BFI.

The authors justify their choice of a simple structure with the claim that “most researchers who use these scales simply calculate sum scores and rely on these in their subsequent analyses. In doing so, they are tacitly endorsing simple measurement models (with no cross-loadings or method factors).” This claim is plainly wrong. The only purpose of reverse-scored items is to reduce the influence of acquiescence bias on scale scores, because aggregation of direct and reverse-scored items cancels the bias that is common to both types of items. If researchers assumed that acquiescence bias was absent, there would be no need for reverse-scored items. Moreover, aggregation of items does not imply that all items are pure indicators of the latent construct or that there are no additional relationships among items (see Nye et al., 2016). The main rationale for summing items is that they all have moderate to high loadings on the primary factor. When this is the case, most of the variance in sum scores reflects the common primary factor (see, e.g., Schimmack, 2019, for an example).
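This logic is easy to verify with a small simulation (hypothetical loadings and a hypothetical amount of yes-saying, not estimates from any real scale): acquiescence adds the same positive component to direct and reverse-keyed raw responses, which weakens their negative correlation, but the bias cancels once reverse-keyed items are recoded and aggregated.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
trait = rng.normal(0, 1, n)   # the substantive trait
acq = rng.normal(0, 0.5, n)   # acquiescence (yes-saying), same sign for every item

# Raw responses: one direct and one reverse-keyed item (hypothetical loadings)
direct  =  0.7 * trait + acq + rng.normal(0, 0.5, n)
reverse = -0.7 * trait + acq + rng.normal(0, 0.5, n)

# Acquiescence pushes the item correlation toward zero
r_raw = np.corrcoef(direct, reverse)[0, 1]
r_noacq = np.corrcoef(direct - acq, reverse - acq)[0, 1]
print(r_raw, r_noacq)  # r_raw is much weaker than r_noacq

# Reverse-score, then aggregate: the shared acquiescence component cancels
score = direct - reverse
r_score_acq = np.corrcoef(score, acq)[0, 1]
print(r_score_acq)  # essentially zero
```

In a CFA, this shared component is what an acquiescence method factor absorbs; a simple-structure model without it has to misfit, because it forces the attenuated item correlation to be reproduced by the trait factor alone.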

The authors also developed their own coding scheme to classify a scale’s fit to the data as good, mixed, or poor based on four fit indices. A scale was said to have good fit if CFI was above .95, TLI was above .95, RMSEA was below .06, and SRMR was below .09. That is, to have good fit, a scale had to meet all four criteria. A scale was said to have poor fit if it met none of the four criteria. All other possibilities were considered mixed fit. Only Conscientiousness met all four criteria. Agreeableness met 3 out of 4 (RMSEA = .063). Extraversion met 3 out of 4 (RMSEA = .075). Neuroticism met 3 out of 4 (RMSEA = .065). And Openness met 1 out of 4 (SRMR = .060), but was misclassified as poor rather than mixed. Thus, although the authors fitted a highly implausible simple-structure model, fit indices suggested that a single-factor model fitted the data reasonably well. Experienced SEM researchers would also wonder about the classification of Openness as poor fit given that CFI was .933 and RMSEA was .069.
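The coding scheme is simple enough to write out. A sketch (using the four cut-offs as described above; classify_fit is my own illustrative function, not code from the article):

```python
def classify_fit(cfi, tli, rmsea, srmr):
    """Classify model fit as 'good', 'mixed', or 'poor' from four indices.

    Cut-offs as described in the text: CFI >= .95, TLI >= .95,
    RMSEA <= .06, SRMR <= .09. Good = all four met, poor = none met,
    mixed = anything in between.
    """
    met = [cfi >= 0.95, tli >= 0.95, rmsea <= 0.06, srmr <= 0.09]
    if all(met):
        return "good"
    if not any(met):
        return "poor"
    return "mixed"

# Openness reportedly met only the SRMR criterion (SRMR = .060), so the
# scheme should yield 'mixed', not the 'poor' label assigned in the article
# (the TLI value here is a placeholder below the cut-off).
print(classify_fit(cfi=0.933, tli=0.93, rmsea=0.069, srmr=0.060))  # mixed
```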

More important than meeting conventional cut-off values is to examine problems with a measurement model. In this case, one obvious problem is the lack of a method factor for acquiescence bias; or the presence of substantive variance that reflects lower-order traits (facets).

It is instructive to compare these results to Nye et al.’s (2016) prior results on structural validity. They found slightly worse fit for the simple-structure model, but they also showed that model fit improved when they modeled the presence of lower-order factors or acquiescence bias (2 factors, pos./neg.) in the data. An even better model fit would have been obtained by modeling facets and acquiescence bias in a single model (Schimmack, 2019).

In short, the problem with the Big Five Inventory is not that it has poor validity as a measure of the Big Five. Poor fit of a simple-structure simply shows that other content and method factors contribute to variance in Big Five scales. A proper assessment of validity would require quantifying how much of the variance in Big Five scales can be attributed to the variance in the intended construct. That is, how much of the variance in extraversion scores on the BFI reflects actual variation in extraversion? This fundamental question was not addressed in the “hidden invalidity” article.
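The required variance decomposition is straightforward once a measurement model with method factors is in hand. A sketch with hypothetical standardized loadings (not estimates for the BFI, and assuming orthogonal trait and method factors): the valid share of a sum score’s variance is the squared loading on the intended trait factor.

```python
# Hypothetical standardized loadings of an extraversion sum score
trait_loading = 0.85  # intended construct (extraversion)
halo_loading = 0.20   # evaluative bias method factor
acq_loading = 0.10    # acquiescence method factor

valid = trait_loading ** 2         # share of variance that is construct-valid
halo = halo_loading ** 2
acq = acq_loading ** 2
residual = 1 - valid - halo - acq  # item-specific content and random error

print(f"valid={valid:.2f} halo={halo:.2f} acquiescence={acq:.2f} residual={residual:.2f}")
```

With these made-up numbers, about 72% of the sum-score variance would reflect the intended construct. Reporting this kind of decomposition, rather than a global fit verdict, is what answering the validity question would require.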

The “hidden invalidity” article also examined measurement invariance across age groups (median split) and the two largest gender groups (male, female). The actual results are only reported in a Supplement. Inspecting the Supplement reveals hidden validity: Big Five measures passed most tests of metric and scalar invariance by the authors’ own criteria.

Scale | Group | Model | Fit change (four indices as reported) | Verdict
Big 5 Inventory – A | age | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – A | age | metric | 0.020, 0.044, -0.013, 0.001 | Passed
Big 5 Inventory – A | age | scalar | -0.013, 0.000, 0.000, 0.002 | Passed
Big 5 Inventory – A | sex | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – A | sex | metric | 0.023, 0.046, -0.014, 0.001 | Passed
Big 5 Inventory – A | sex | scalar | -0.014, -0.003, 0.001, 0.003 | Passed
Big 5 Inventory – C | age | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – C | age | metric | 0.029, 0.054, -0.017, 0.001 | Passed
Big 5 Inventory – C | age | scalar | -0.013, -0.003, 0.001, 0.003 | Passed
Big 5 Inventory – C | sex | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – C | sex | metric | 0.031, 0.055, -0.018, 0.002 | Passed
Big 5 Inventory – C | sex | scalar | -0.005, 0.006, -0.002, 0.001 | Passed
Big 5 Inventory – E | age | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – E | age | metric | 0.045, 0.081, -0.032, 0.001 | Passed
Big 5 Inventory – E | age | scalar | -0.013, -0.001, 0.000, 0.003 | Passed
Big 5 Inventory – E | sex | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – E | sex | metric | 0.042, 0.078, -0.030, 0.002 | Passed
Big 5 Inventory – E | sex | scalar | -0.007, 0.007, -0.003, 0.001 | Passed
Big 5 Inventory – N | age | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – N | age | metric | 0.026, 0.054, -0.020, 0.002 | Passed
Big 5 Inventory – N | age | scalar | -0.022, -0.010, 0.004, 0.005 | Failed
Big 5 Inventory – N | sex | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – N | sex | metric | 0.032, 0.061, -0.022, 0.001 | Passed
Big 5 Inventory – N | sex | scalar | -0.011, 0.001, -0.001, 0.003 | Passed
Big 5 Inventory – O | age | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – O | age | metric | 0.036, 0.068, -0.014, 0.001 | Passed
Big 5 Inventory – O | age | scalar | -0.041, -0.024, 0.005, 0.006 | Failed
Big 5 Inventory – O | sex | configural | NA, NA, NA, NA | Poor
Big 5 Inventory – O | sex | metric | 0.035, 0.065, -0.014, 0.002 | Passed
Big 5 Inventory – O | sex | scalar | -0.043, -0.028, 0.006, 0.006 | Failed

However, readers of the article don’t get to see this evidence. Instead they are presented with a table that suggests Big Five measures lack measurement invariance.

Aside from the misleading presentation of the results, the results are not very informative because they don’t reveal whether deviations from a simple-structure pose a serious threat to the validity of Big Five scales. Unfortunately, the authors’ data are currently not available to examine this question.

Own Investigation

Incidentally, I had just posted a blog post about measurement models of Big Five data (Schimmack, 2019), using open data from another study (Beck, Condon, & Jackson, 2019): a large online dataset with the IPIP-100 items. I showed that it is possible to fit a measurement model to the IPIP-100. To achieve model fit, the model included secondary loadings, some correlated residuals, and method factors for acquiescence bias and evaluative (halo) bias. These results show that a reasonable measurement model can fit Big Five data, as demonstrated in several previous studies (Anusic et al., 2009; Nye et al., 2016).

Here, I examine measurement invariance for gender and age groups. I also modified and improved the measurement model, by using several of the rejected IPIP-100 items as indicators of the halo factor. Item analysis showed that the items “quick to understand things,” “carry conversations to a higher level,” “take charge,” “try to avoid complex people,” “wait for others to lead the way,” “will not probe deeply into a subject” loaded more highly on the halo factor than on the intended Big Five factor. This makes these items ideal candidates for the construction of a manifest measure of evaluative bias.

The sample comprised 9,309 Canadians, 140,479 US Americans, 5,804 Britons, and 5,091 Australians between the ages of 14 and 60. Data were analyzed with MPLUS 8.2 using robust maximum likelihood (the complete syntax is available on OSF). The final model met the standard criteria for acceptable fit (CFI = .965, RMSEA = .015, SRMR = .032).

Table 1. Factor Loadings and Item Intercepts for Men (First) and Women (Second)

[Table not legibly reproduced: it reports, for each item and each scale sum score, loadings on the five primary factors and on the two method factors (halo, acquiescence), shown separately for men and women, along with item intercepts; see the MPLUS output on OSF for the full table.]

The factor loadings show that items load on their primary factors and that these loadings are consistent for men and women. Secondary loadings tended to be weak, although even the small loadings were highly significant and consistent across both genders; so were loadings on the two method factors. The results for the sum scores show that most of the variance in sum scores was explained by the primary factor, with loadings ranging from .71 to .89.

Item-intercepts show the deviation from the middle of the scale in standardized units (standardized mean differences from 3.5). The assumption of equal item-intercepts was relaxed for four items (#3, #40, #43, #64), but even for these items the standardized mean differences were small. The largest difference was observed for following a schedule (M = 0.15, F = 0.38). Constraining these coefficients would reduce fit, but it would have a negligible effect on gender differences on the Big Five traits.

Table 2 and Figure 1 show the standardized mean differences between men and women for latent variables and for sum scores. The results for sum scores were based on the estimated means and variances in the Tech4 output of MPLUS (see output file on OSF).


Given the high degree of measurement invariance and the fairly high correlations between latent scores and sum scores, the results are very similar and replicate previous findings that most gender differences are small, but that women score higher on neuroticism and agreeableness. These results show that these differences cannot be attributed to hidden invalidity of Big Five measures. In addition, the results show a small difference in evaluative bias: men are more likely to describe their personality in an overly positive way. However, given the effect size and the modest contribution of halo bias to sum scores, this has only a small effect on mean differences in scales. Along with unreliability, it attenuates the gender difference in agreeableness from d = .80 to d = .56.


Hussey and Hughes claim that personality psychologists were hiding invalidity of personality measures by not reporting tests of structural validity. They also claim that personality measures fail tests of structural validity. The first claim is false because personality psychologists have examined factor structures and measurement invariance for the Big Five (e.g., Anusic et al., 2009; Nye et al., 2016). Thus, Hussey and Hughes misrepresent the literature and fail to cite relevant work. The second claim is inconsistent with Nye et al.’s results and with my new examination of structural invariance in personality ratings. Thus, Hussey and Hughes’s article does not make a contribution to the advancement of psychological science. Rather, it is an example of poor scholarship, in which the authors make strong claims (validity hacking) on weak evidence.

The substantive conclusion is that men and women have similar measurement models of personality and that it is possible to use sum scores to compare them. Thus, past results based on sum scores reflect valid personality differences. This is not surprising because men and women speak the same language and are able to communicate with each other about the personality traits of men and women. There is also no evidence to suggest that the memory-retrieval processes underlying personality ratings differ between men and women. Thus, there is no reason to expect a lack of measurement invariance in personality ratings.

A more important question is whether gender differences in self-ratings reflect actual differences in personality. One threat to validity could be social-comparison processes, where women compare themselves with other women and men compare themselves with other men. However, social comparison would attenuate gender differences and cannot explain the moderate to large differences in neuroticism and agreeableness. Nevertheless, future research should examine gender differences using measures of actual behavior and informant ratings. Although sum scores are mostly valid, it is preferable to use latent variable models for these studies because latent variable models make it possible to test assumptions that are merely taken for granted in studies with sum scores.