Does PNAS article show there is no racial bias in police shootings?

Politics in the United States is extremely divisive and filled with false claims based on fake facts. Ideally, social scientists would bring some clarity to these toxic debates by informing US citizens and politicians with objective and unbiased facts. However, social scientists often fail to do so for two reasons. First, they often lack the proper data to provide valuable scientific input into these debates. Second, when the data do not provide clear answers, social scientists’ inferences are shaped as much by their preexisting beliefs as by the data, if not more. It is therefore no surprise that the general public increasingly ignores scientists because it does not trust them to be objective.

Unfortunately, the causes of police killings in the United States are one of these topics. While a few facts are known and not disputed, these facts do not explain why Black US citizens are killed by police more often than White citizens. While it is plausible that multiple factors contribute to this sad statistic, the debate is shaped by two groups: those who blame a White police force and those who blame Black criminals.

On September 26, the House Committee on the Judiciary held an Oversight Hearing on Policing Practices. In this meeting, an article in the prestigious journal Proceedings of the National Academy of Sciences (PNAS) was referenced by Heather Mac Donald, who works for the conservative think tank Manhattan Institute, as evidence that crime is the single factor that explains racial disparities in police shootings.

The Manhattan Institute posted a transcript of her testimony before the committee. Her claim is clear. She not only claims that crime explains the higher rate of Black citizens being killed, she even claims that taking crime into account shows a bias of the police force to kill disproportionally FEWER Black citizens than White citizens.


Heather Mac Donald is not a social scientist and nobody should expect her to be an expert in logistic regression. This is the job of scientists: authors, reviewers, and editors. The question is whether they did their job correctly and whether their analyses support the claim that, after taking population ratios and crime rates into account, police officers in the United States are LESS likely to shoot a Black citizen than a White citizen.

The abstract of the article summarizes three findings.

1. As the proportion of Black or Hispanic officers in a FOIS increases, a person shot is more likely to be Black or Hispanic than White.

In plain English, in counties with proportionally more Black citizens, proportionally more Black people are being shot. For example, the proportion of Black people killed in Georgia or Florida is greater than the proportion of Black people killed in Wyoming or Vermont. You do not need a degree in statistics to realize that this tells us only that police cannot shoot Black people if there are no Black people. This result tells us nothing about the reasons why proportionally more Black people than White people are killed in places where Black and White people live.

2. Race-specific county-level violent crime strongly predicts the race of the civilian shot.

Police do not shoot and kill citizens at random. Most, not all, police shootings occur when officers are attacked and they are justified to defend themselves with lethal force. When police officers in Wyoming or Vermont are attacked, it is highly likely that the attacker is White. In Georgia or Florida, the chance that the attacker is Black is higher. Once more this statistical fact does not tell us why Black citizens in Georgia or Florida or other states with a large Black population are killed proportionally more often than White citizens in these states.

3. The key finding that seems to address racial disparities in police killings is that “although we find no overall evidence of anti-Black or anti-Hispanic disparities in fatal shootings, when focusing on different subtypes of shootings (e.g., unarmed shootings or ‘suicide by cop’), data are too uncertain to draw firm conclusions”.

First, it is important to realize that the authors do not state that they have conclusive evidence that there is no racial bias in police shootings. In fact, they clearly state that their data are inconclusive for shootings of unarmed citizens. It is a clear misrepresentation of this article to claim that it provides conclusive evidence that crime is the sole factor that contributes to racial disparity in police shootings. Thus, Heather Mac Donald lied under oath and misrepresented the article.

Second, the abstract misstates the actual findings reported in the article when the authors claim that they “find no overall evidence of anti-Black or anti-Hispanic disparities in fatal shootings”. The problem is that the design of the study is unable to examine this question. To see this, it is necessary to look at the actual statistical analyses more carefully. The study examines a different question: which characteristics of a victim make it more or less likely that the victim is Black rather than White. For example, an effect of age could show that young Black citizens are proportionally more likely to be killed than young White citizens, while older Black men are proportionally less likely to be shot than older White men. This would provide some interesting insights into the causal factors that lead to police shootings, but it doesn’t change anything about the proportions of Black and White citizens being shot by police.

We can illustrate this using the authors’ own data that they shared (unfortunately, they did not share information about officers to fully reproduce their results). However, they did find a significant effect for age. To make it easier to interpret the effect, I divided victims into those under 30 and those 30 and above. This produces a simple 2 x 2 table.

An inspection of the cell frequencies shows that the group with the highest frequency are older White victims. This is only surprising if we ignore the base rates of these groups in the general population. Older White citizens are more likely to be victims of police shootings because there are more of them in the population. As this analysis does not examine proportions in the population this information is irrelevant.

It is also not informative that there are about two times more White victims (476) than Black victims (235). Again, we would expect more White victims simply because more US citizens are White.

The meaningful information is provided by the odds of being a Black or White victim in the two age groups. Here we see that the odds of a victim being Black are much lower for older victims (122/355 = .34) than for younger victims (113/121 = .93). Comparing the proportions of Black victims in the two groups (113/234 = .48 vs. 122/477 = .26), we see that young victims are 1.89 times more likely to be Black than old victims; expressed as an odds ratio, the difference is 2.72. This shows that young Black men are disproportionally more likely to be the victims of police shootings than young White men. Consistent with this finding, the article states that “Older civilians were 1.85 times less likely (OR = 0.54 [0.45, 0.66]) to be Black than White”.
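The four cell counts quoted above are enough to redo this arithmetic. The following R sketch rebuilds the 2 x 2 table from those counts and computes the odds within each age group, the ratio of the proportions of Black victims, and the odds ratio. The object names are illustrative and are not taken from the shared data file.

```r
# Cell counts quoted in the text (victims by age group and race).
counts <- matrix(c(113, 121,
                   122, 355),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(age  = c("under30", "30plus"),
                                 race = c("Black", "White")))

# Odds of a victim being Black rather than White within each age group.
odds_black <- counts[, "Black"] / counts[, "White"]

# Proportion of Black victims within each age group.
prop_black <- counts[, "Black"] / rowSums(counts)

# Young vs. old: ratio of proportions and ratio of odds.
risk_ratio <- prop_black[["under30"]] / prop_black[["30plus"]]
odds_ratio <- odds_black[["under30"]] / odds_black[["30plus"]]

round(c(risk_ratio = risk_ratio, odds_ratio = odds_ratio), 2)
```

With these counts, the ratio of proportions comes out at about 1.89 and the odds ratio at about 2.72, so the figure of 1.89 corresponds to the ratio of proportions.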

In Table 2, the age effect remains significant after controlling for many variables, including rates of homicides committed by Black citizens. Thus, the authors found that young Black citizens are killed more frequently by police than young White men, even when they attempted to control statistically for the fact that young Black men are disproportionally involved in criminal activities. This finding is not surprising to critics who claim that there is a racial bias in the police force that has resulted in deaths of innocent young Black men. It is actually exactly what one would expect if racial bias plays a role in police shootings.

Although this finding is statistically significant and the authors actually mention it when they report the results in Table 1, they never comment on this finding again in their article. This is extremely surprising because it is common practice to highlight statistically significant results and to discuss their theoretical implications. Here, the implications are straightforward. Racial bias does not target all Black citizens equally. Young Black men (only 10 Black and 25 White victims were female) are disproportionally more likely to be shot by police even after controlling for several other variables.

Thus, while the authors’ attempt to look for predictors of victims’ race provides some interesting insights into the characteristics of Black victims, these analyses do not address the question of why Black citizens are more likely to be shot than White citizens. It is therefore unclear how the authors can state “We find no evidence of anti-Black or anti-Hispanic disparities across shootings” (p. 15877) or “When considering all FOIS in 2015, we did not find anti-Black or anti-Hispanic disparity” (p. 15880).

Surely, they are not trying to say that they didn’t find evidence for it because their analysis didn’t examine this question. In fact, their claims are based on the results in Table 3. Based on these results, the authors come to the conclusion that “controlling for predictors at the civilian, officer, and county levels,” a victim is more than 6 times more likely to be White than Black. This makes absolutely no sense if the authors did, indeed, center continuous variables and effect code nominal variables, as they state.

The whole point of centering and effect coding is to keep the intercept of an analysis interpretable and consistent with the odds in a model without predictor variables. To use age again as an example, the odds of a victim being Black are .49. Adding age as a predictor shows us how the odds change within the two age groups, but this does not change the overall odds. However, if we do not center the continuous age variable or do not take the different frequencies of young (234) and old (477) victims into account, the intercept is no longer interpretable as a measure of racial disparities.


To illustrate this, here are the results of several logistic regression analyses with age as a predictor variable.

First, I used raw age as a predictor.
summary(glm(race ~ pc$age,family=binomial(link="logit")))

The intercept changes from -.71 to 1.08. As these values are log-odds, we need to transform them to get the odds, which are .49 (235/476) and 2.94. The reason is that the intercept is now a prediction of the racial disparity at age 0, which would suggest that police officers are 3 times more likely to kill a Black newborn than a White newborn. This prediction is totally unrealistic because there are fortunately very few victims younger than 15 years of age. In short, this analysis changes the intercept, but the results no longer tell us anything about racial disparities in general because the intercept is about a very small, and in this case non-existent, subgroup.
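The conversion from log-odds to odds can be checked in a few lines. This sketch only uses the two intercepts and the victim counts quoted in the text; the variable names are illustrative.

```r
# Intercepts reported in the text, on the log-odds scale.
b0_null    <- -0.71  # model without predictors
b0_raw_age <-  1.08  # model with raw (uncentered) age

# exp() turns log-odds into odds of a victim being Black rather than White.
odds_null    <- exp(b0_null)
odds_raw_age <- exp(b0_raw_age)

# The odds without predictors should match the raw counts:
# 235 Black victims and 476 White victims.
odds_counts <- 235 / 476

round(c(odds_null = odds_null, odds_counts = odds_counts,
        odds_raw_age = odds_raw_age), 2)
```

The first two values agree at about .49, while the intercept of the raw-age model corresponds to odds of about 2.94 for a hypothetical victim of age 0.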

We can avoid this problem by centering or standardizing the predictor variable. Now a value of 0 corresponds to the average age.

age.centered = pc$age - mean(pc$age)
summary(glm(race ~ age.centered,family=binomial(link="logit")))

The age effect remains the same, but now the intercept is close to the log-odds in the total sample [disclaimer: I don’t know why it changed from -.71 to -.78; any suggestions are welcome].
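One general property of logistic regression may be relevant to a shift of this kind, although whether it accounts for this particular change is not verified here. With a centered predictor, the intercept is the log-odds at the average age, whereas a model without predictors returns the log of the overall odds. Because the logistic function is nonlinear, the two need not coincide. The sketch below uses made-up values for the intercept (a), the slope (b), and a centered predictor (x) purely for illustration.

```r
# Hypothetical values, chosen only to illustrate the point.
a <- -0.78      # intercept: log-odds at the mean of the predictor
b <- 0.5        # slope on the log-odds scale
x <- c(-1, 1)   # centered predictor (mean 0), two equally sized groups

# Predicted probability in each group, then the overall (marginal) probability.
p_group    <- plogis(a + b * x)
p_marginal <- mean(p_group)

# Log-odds implied by the marginal probability vs. log-odds at the mean.
marginal_logodds <- qlogis(p_marginal)
c(at_mean = a, marginal = marginal_logodds)
```

In this toy setup the marginal log-odds are closer to zero than the intercept at the mean, which is the same direction as the difference between -.71 and -.78.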

This is also true when we split age into young (< 30) and old (30 or older) groups.

When the groups are dummy coded (< 30 = 0, 30+ = 1), the intercept changes and now reflects the odds in the younger group coded as zero, where victims are more likely to be Black than in the total sample.

summary(glm(race ~ (pc$age >= 30),family=binomial(link="logit")))

However, with effect coding the intercept hardly changes.

summary(glm(race ~ scale(as.numeric(pc$age >= 30)),family=binomial(link="logit")))
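The effect of the coding scheme on the intercept can be checked without the shared data file by rebuilding one row per victim from the four cell counts quoted earlier (113 Black and 121 White victims under 30; 122 Black and 355 White victims aged 30 or older). This is a sketch under that assumption; the variable names black and old are illustrative and do not come from the authors' data set.

```r
# Rebuild one row per victim from the four cell counts quoted in the text.
n     <- c(113, 121, 122, 355)
black <- rep(c(1, 0, 1, 0), times = n)   # 1 = Black victim, 0 = White victim
old   <- rep(c(0, 0, 1, 1), times = n)   # 1 = aged 30 or older, 0 = under 30

# Model without predictors, for comparison.
fit_null <- glm(black ~ 1, family = binomial(link = "logit"))

# Dummy coding: the intercept is the log-odds in the group coded 0 (under 30).
fit_dummy <- glm(black ~ old, family = binomial(link = "logit"))

# Standardized predictor: the intercept is the log-odds at the mean of the
# predictor, which stays close to the value from the model without predictors.
old_z     <- as.numeric(scale(old))
fit_scale <- glm(black ~ old_z, family = binomial(link = "logit"))

round(c(null  = unname(coef(fit_null)[1]),
        dummy = unname(coef(fit_dummy)[1]),
        scale = unname(coef(fit_scale)[1])), 2)
```

The three intercepts should be roughly -0.71, -0.07, and -0.74. Note that scale() standardizes the predictor; strict -1/+1 effect coding would instead place the intercept at the unweighted average of the two group log-odds.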

Thus, it makes no sense when the authors claim that they centered continuous variables and effect coded nominal variables and the intercept changed from exp(-.71) = .49 to exp(-1.90) = .15, which they report in Table 3. Something went wrong in their analyses.

Even if this result were correct, the interpretation of this result as a measure of racial disparities is wrong. One factor that is omitted from the analysis is the proportion of White citizens in the counties. It doesn’t take a rocket scientist to realize that counties with a larger White population are more likely to have White victims. The authors do not take this simple fact into account, although they did have a measure of the population size in their data set. We can create a measure of the proportion of Black and White citizens and center the predictor so that the intercept reflects a population with equal proportions of Black and White citizens.
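One way to build such a predictor, offered as an assumption rather than the exact variable used in this analysis, is the log of the Black-to-White population ratio in each county. The log ratio equals 0 when both groups are the same size, so the intercept then refers to a county with equal proportions. The function name and the example population counts below are hypothetical.

```r
# Hypothetical helper: log of the Black-to-White population ratio.
# 0 means equal population sizes; negative values mean more White residents.
log_pop_ratio <- function(black_pop, white_pop) {
  log(black_pop / white_pop)
}

# Made-up county populations, for illustration only.
black_pop <- c(5000, 20000, 1000)
white_pop <- c(5000, 80000, 99000)

log_pop_ratio(black_pop, white_pop)
```

The first made-up county has equal populations and therefore a value of 0, while the other two have negative values because White residents outnumber Black residents.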

When we use this variable as a predictor, the surprising finding that police officers are much more likely to shoot and kill White citizens disappears. The odds change from 0.49 to exp(-.04) = .96, and the 95%CI includes 1, 95%CI = 0.78 to 1.19.

This finding may be taken as evidence that there is little racial disparity after taking population proportions into account. However, this ignores the age effect that was found earlier. When age is included as a predictor, we now see that young Black men are disproportionally likely to be killed, while the reverse is true for older victims. One reason for this could be that criminals are at a higher risk of being killed. Even if White criminals are not killed in their youth, they are likely to be killed at an older age. As Black criminals are killed at a younger age, there are fewer Black criminals who get killed at an older age. Importantly, this argument does not imply that all victims of police shootings are criminals. The bias to kill Black citizens at a younger age also affects innocent Black citizens, as the age effect remained significant after controlling for crime rates.

The racial disparity for young citizens becomes even larger when homicide rates are included using the same approach. I also excluded counties with a ratio greater than 100:1 for population or homicide rates.

summary(glm(race ~ (pc$age >= 30) + PopRatio + HomRatio,family=binomial(link="logit")))

The intercept of 0.65 implies that young (< 30) victims of police shootings are about two times more likely to be Black than White when we adjust the risk for the proportions of Black and White citizens and for homicide rates. The significant age effect shows again that this risk reverses for older citizens. As we are adjusting for homicide rates, this suggests that older White citizens are at an increased risk of being killed by police. This is an interesting observation because much of the debate has been about young Black men who were innocent. According to these analyses, there should also be cases of older White men who are innocent victims of police killings. Looking for examples of these cases and creating more awareness of them does not undermine the concerns of the Black Lives Matter movement. Police killings are not a zero-sum game. The goal should be to work towards reducing the loss of Black, Blue (police), White, and all other lives.

Scientific studies can help to do that when authors analyze and interpret the data correctly. Unfortunately, this is not what happened in this case. Fortunately, the authors shared (some of) their data and it was possible to put their analyses under the microscope. The results show that their key conclusions are not supported by their data. First, there is no disparity that leads to the killing of more White citizens than Black or Hispanic citizens by police. This claim is simply false. Second, the authors have an unscientific aversion to taking population rates into account. In counties with a mostly White population, crime is mostly committed by White citizens, and police are more likely to encounter and kill White criminals. It is not a mistake to include population rates in statistical analyses. It is a mistake not to do so. Third, the authors ignored a key finding of their own analysis, namely that age is a significant predictor of the race of victims of police shootings. Consistent with the Black Lives Matter claim, their data show that police disproportionally shoot young Black men. This bias is offset to some extent by the opposite bias in older age groups, presumably because Black men have already been killed, which reduces the at-risk population of Black citizens in this age group.

In conclusion, the published article already failed to show that there is no racial disparity in police shootings, but it was easily misunderstood as providing evidence for this claim. A closer inspection of the article shows even more problems with the article, which means this article should not be used to support empirical claims about police shootings. Ideally, the article would be retracted. At a minimum, PNAS should publish a notice of concern.

Poverty Explains Racial Bias in Police Shootings

Statistics show that Black US citizens are disproportionally more likely to be killed by police than White US citizens. Cesario, Johnson, and Terrill (2019) estimated that the odds of being killed by police are 2.5 times higher for Black citizens than for White citizens. To my knowledge, no social scientist has disputed this statistical fact.

However, social scientists disagree about the explanation for this finding. Some social scientists argue that racial bias is at least a contributing factor to the disparity in police killings. Others deny that racial bias is a factor and point out that Black citizens are killed in proportion to their involvement in crime.

Cesario et al. write “when adjusting for crime, we find no systematic evidence of anti-Black disparities in fatal shootings, fatal shootings of unarmed citizens, or fatal shootings involving misidentification of harmless objects” (p. 586).

They argue that criminals are more likely to encounter police and that “exposure to police accounts for the racial disparities in fatal shootings observed at the population level” (p. 591).

They also argue that the data are strong enough to rule out racial bias as a contributing factor that influences police shootings in addition to disproportionate involvement in criminal activities.

None of their tests “provided evidence of systematic anti-Black disparity. Moreover, the CDC data (as well as the evidence discussed in Online Supplemental Material #2) provide a very strong test of whether biased policing accounts for these results” (p. 591).

“When considering all fatal shootings, it is clear that systematic anti-Black disparity at the national level is not observed” (p. 591).

The authors also point out that their analyses are not conclusive, but recommend their statistical approach for future investigations of this topic.

“The current research is not the final answer to the question of race and police use of deadly force. Yet it does provide perspective on how one should test for group disparities in behavioral outcomes and on whether claims of anti-Black disparity in fatal police shootings are as certain as often portrayed in the national media” (p. 591).

Here I follow the authors’ advice and use their statistical approach to demonstrate that crime rates do not account for racial disparities in police killings. Instead, poverty is a much more likely cause of racial disparities in police killings.

Imagine that a cop stops a car on a country road for speeding. In scenario A, the car is a brand new, grey Lincoln, and the driver is neat and wearing a suit. In scenario B, the car is an old van from the 1990s, and the driver is unkempt and wearing an undershirt and dirty jeans. Which of these scenarios is more likely to end with the driver of the vehicle being killed? Importantly, I argue that it doesn’t matter whether the driver is Black, White, or Hispanic. What matters is that they fit the stereotype of a poor person, who looks more like a potential criminal.

The poverty hypothesis explains the disproportionate rate of police killings of Black people by the fact that Black US citizens are more likely to be poor, because a long history of slavery and discrimination continues to produce racial inequalities in opportunities and wealth. According to this hypothesis, the racial disparities in police killings should shrink or be eliminated when we use poverty rates rather than population proportions as a benchmark for police killings (Cesario et al., 2019).

I obtained poverty rates in the United States from the Kaiser Family Foundation website (KFF).

In absolute numbers, there are more White citizens who are poor than Black citizens. However, proportional to their representation in the population, Black citizens are 2.5 times more likely to be poor than White citizens.

These numbers imply that there are approximately 40 million Black citizens and 180 million White citizens.

Based on Cesario et al’s (2019) statistics in Table 1, there are on average 255 Black citizens and 526 White citizens that are killed by police in a given year.

We can now use this information to compute the odds of being killed, the odds of being poor, and the odds of being killed given being poor, assuming that police predominantly kill poor people.

First, we see again that Black citizens are about two times more likely to be killed by police than White citizens (Total OR(B/W) = 2.29). This is close to the ratio of the poverty rates of Black and White citizens (.20/.08 = 2.5).

More important, the rate of getting killed by police for poor Black citizens, 3.34 out of 100,000, is similar to the rate for poor White citizens, 3.64 out of 100,000. The ratio of these rates is close to 1 and no longer shows a racial bias for Black citizens to be killed more often by police, OR(B/W) = 0.92. In fact, there is a small bias for White citizens to be more likely to be killed. This might be explained by the fact that White US citizens are more likely to own a gun than Black citizens, and owning a gun may increase the chances of a police encounter going wrong (Gramlich, 2018).
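The two ratios discussed here follow directly from the figures quoted in the text. The sketch below simply divides the Black figure by the White figure in each case; the variable names are illustrative.

```r
# Figures quoted in the text.
poverty_rate_black <- 0.20          # share of Black citizens who are poor
poverty_rate_white <- 0.08          # share of White citizens who are poor
killed_per_100k_poor_black <- 3.34  # killed by police per 100,000 poor Black citizens
killed_per_100k_poor_white <- 3.64  # killed by police per 100,000 poor White citizens

# Ratio of poverty rates (Black / White).
poverty_ratio <- poverty_rate_black / poverty_rate_white

# Ratio of police killings per 100,000 poor citizens (Black / White).
killing_ratio_poor <- killed_per_100k_poor_black / killed_per_100k_poor_white

round(c(poverty_ratio = poverty_ratio,
        killing_ratio_poor = killing_ratio_poor), 2)
```

The poverty ratio is 2.5 and the ratio of killing rates among poor citizens is about 0.92.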

The present results are much more likely to account for the racial bias in police killings than Cesario et al.’s (2019) analyses that suggested crime is a key factor. The crime hypothesis makes the unrealistic assumption that only criminals get killed by police. However, it is known that innocent US citizens are sometimes killed by accident in police encounters. It is also not clear how police could avoid such accidents because they cannot always know whether they are encountering a criminal or not. In these situations of uncertainty, police officers may rely on cues that are partially valid indicators such as race or appearance. The present results suggest that cues of poverty play a more important role than race. As a result, poor White citizens are also more likely to be killed than middle-class and well-off citizens.

Cesario et al.’s (2019) analyses also produced some surprising and implausible results. For example, when using reported violent crimes, Black citizens have a higher absolute number of severe crimes (67,534 reported crimes in a year) than White citizens (29,713). Using these numbers as benchmarks for police shootings leads to the conclusion that police officers are 5 times more likely to kill a White criminal than a Black criminal, OR(B/W) = 0.21.

According to this analysis, police should have killed 1,195 Black criminals, given the fact that they killed 526 White criminals and that there are 2.3 times more Black criminals than White criminals. Thus, the fact that they only killed 252 Black criminals shows that police disproportionally kill White criminals. Cesario et al. (2019) offer no explanation for this finding. They are satisfied with the fact that their analyses show no bias to kill more Black citizens.
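The crime-benchmark arithmetic can be reproduced from the numbers quoted in the two preceding paragraphs. This is a minimal sketch; the variable names are illustrative, and the count of 252 Black victims is the one used in this paragraph.

```r
# Figures quoted in the text.
crimes_black <- 67534   # reported violent crimes per year, Black suspects
crimes_white <- 29713   # reported violent crimes per year, White suspects
killed_black <- 252     # Black citizens killed by police per year
killed_white <- 526     # White citizens killed by police per year

# Ratio of reported crimes (Black / White), about 2.3.
crime_ratio <- crimes_black / crimes_white

# Killings of Black citizens expected if police killed in proportion to crime.
expected_black <- killed_white * crime_ratio

# Ratio of killings per reported crime (Black / White).
or_bw <- (killed_black / crimes_black) / (killed_white / crimes_white)

round(c(crime_ratio = crime_ratio, expected_black = expected_black,
        or_bw = or_bw), 2)
```

The expected number comes out at roughly 1,195.5, which matches the figure of 1,195 up to rounding, and the ratio is about 0.21.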

The reason for the unexplained White bias in police killings is that it is simply wrong to use crime rates as the determinant of police shootings. Another injustice in the United States is that Black victims of crime are much less likely to receive help from the police than White victims (Washington Post). For example, the Washington Post estimated that every year 2,600 murders go without an arrest of a suspect. It is much more likely that the victim of an unsolved murder is Black (1,860) than White (740), OR(B/W) = 2.5. Thus, one reason why police officers are less likely to kill Black criminals than White criminals is that they are much less likely to arrest Black criminals who murdered a Black citizen. This means that crime rates are a poor benchmark for encounters with the police, because it is more likely that a Black criminal gets killed by another Black criminal than that he is arrested by a White police officer. It also means that innocent, poor Black citizens face two injustices. They are more likely to be mistaken for a criminal and killed by police, and they do not receive help from police when they are the victim of a crime.


I welcome Cesario et al.’s (2019) initiative to examine the causes of racial disparities in police shootings. I also agree with them that we need to use proper benchmarks to understand these racial disparities. However, I disagree with their choice of crime statistics to benchmark police shootings. The use of crime statistics is problematic for several reasons. First, police do not always know whether they encounter a criminal or not and sometimes shoot innocent people. The use of crime statistics doesn’t allow for innocent victims of police shootings and makes it impossible to examine racial bias in the killing of innocent citizens. Second, crime statistics are a poor indicator of police encounters because there exist racial disparities in the investigation of crimes with Black and White victims. I show that poverty is a much better benchmark that accounts for racial disparities in police shootings. Using poverty, there is only a relatively small bias that police officers are more likely to shoot White poor citizens than Black poor citizens, and this bias may be explained by the higher rate of gun-ownership by White citizens.


My new finding that poverty rather than criminality accounts for racial disparities in police shootings has important implications for public policy.

Cesario et al. (2019) suggest that their findings imply that implicit bias training will have little effect on police killings.

“This suggests that department-wide attempts at reform through programs such as implicit bias training will have little to no effect on racial disparities in deadly force, insofar as officers continue to be exposed after training to a world in which different racial groups are involved in criminal activity to different degrees” (p.

This conclusion is based on their view that police only kill criminals during lawful arrests and that killings of violent criminals are an unavoidable consequence of having to arrest these criminals.

However, the present results lead to a different conclusion. Although some killings by police are unavoidable, others can be avoided because not all victims of police shootings are violent criminals. The new insight is that the bias is not limited to Black people, but also includes poor White people. I see no reason why better training could not reduce the number of killings of poor Americans.

The public debate about police killings also ignores other ways to reduce police killings. The main reason for the high prevalence of police killings in the United States is the country’s gun laws. This will not change any time soon. Thus, all citizens of the United States, even those who do not own guns, need to be aware that many US citizens are armed. A police officer who makes 20 traffic stops a day is likely to encounter at least five drivers who own a gun and maybe a couple of drivers who have a gun in their car. Anybody who encounters a police officer needs to understand that the officer has to assume they might be carrying a gun. This means citizens need to be trained how to signal to a police officer that they do not own a gun or pose no threat to the police officer’s life in any other way. Innocent until proven guilty applies in court, but it doesn’t apply when police encounter citizens. You are a potential suspect until officers can be sure that you are not a threat to them. This is the price US citizens pay for the right to bear arms. Even if you do not exercise this right, it is your right, and you have to pay the price for it. Every year, 50 police officers get killed. Every day they take a risk when they put on their uniform to do their job. Help them to do their job and make sure that you and they walk away safe and sound from the encounter. It is unfair that poor US citizens have to work harder to convince the police that they are not a threat to their lives, and better communication, contact, and training can help to make encounters between police and civilians better and safer.

In conclusion, my analysis of police shootings shows that racial bias in police shootings is a symptom of a greater bias against poor people. Unlike race, poverty is not genetically determined. Social reforms can reduce poverty and the stigma of poverty, and sensitivity training can be used to avoid killing of innocent poor people by police.

Police Shootings and Race in the United States

The goal of social sciences and social psychology is to understand human behavior in the real world. Experimental social psychologists use laboratory experiments to study human behavior. The problem with these studies is that some human behaviors cannot be studied in the laboratory for ethical or practical reasons. Police shootings are one of them. In this case, social scientists have to rely on observations of these behaviours in the real world. The problem is that it is much harder to draw causal inferences from these studies than from laboratory experiments.

A team of social psychologists examined whether police shootings in the United States are racially biased, that is, whether victims of police shootings are disproportionally likely to be Black or Hispanic rather than White. This is an important political issue in the United States. The abstract of their article states their findings.

The abstract starts with a seemingly clear question. “Is there evidence of a Black-White disparity in death by police gunfire in the United States?” However, even this question is not clear because it is not clear what we mean by disparity. Disparity can mean a lack of equality or a lack of equality that is unfair (Cambridge Dictionary).

There is no doubt that Black citizens of the United States are more likely to be killed by police gunfire than White citizens. The authors themselves confirmed this in their analysis. They find that the odds of being killed by police are three times higher for Black citizens than for White citizens.

The statistical relationship implies that race is a contributing causal factor to being killed by police. However, the statistical finding does not tell us why or how race influences police shootings. In psychological research this question is often framed as a search for mediators; that is, intervening variables that are related to race and to police shootings.

In the public debate about race and police shooting, two mediating factors are discussed. One potential mediator is racial bias that makes it more likely for a police officer to kill a Black suspect than a White suspect. Cases like the killing of Tamir Rice or Philando Castile are used as examples of innocent Black citizens being killed under circumstances that may have led to a different outcome if they had been White. Others argue that tragic accidents also happen with White suspects and that these cases are too rare to draw scientific conclusions about racial bias in police shootings.

Another potential mediator is that there is also a disparity between Black and White US citizens in violent crimes. This is the argument put forward by the authors.

When adjusting for crime, we find no systematic evidence of anti-Black disparities in fatal shootings, fatal shootings of unarmed citizens, or fatal shootings involving identification of harmless objects.

This statement implies that the authors conducted a mediation analysis, which uses statistical adjustment for a potential mediator to examine whether a mediator explains the relationship between two other variables.

In this case, racial differences in crime rates are the mediator and the claim is that once we take into account that Black citizens are more involved in crimes and involvement in crimes increases the risk of being killed by police, there are no additional racial disparities. If a potential mediator fully explains the relationship between two variables, we do not need to look for additional factors that may explain the racial disparity in police shootings.
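The logic of statistical adjustment for a mediator can be illustrated with a small simulation. The sketch below uses simulated data only: the variables `x`, `m`, and `y`, the effect sizes, and the sample size are hypothetical choices of mine, and nothing here comes from the article's data or reproduces the authors' analysis. It shows what full mediation would look like: `x` predicts `y` on its own, but the effect of `x` is close to zero once `m` is in the regression.

```python
import random

def cov(a, b):
    """Population covariance of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / len(a)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 20_000

# Simulated data with FULL mediation: x affects y only through m.
x = [rng.choice((0.0, 1.0)) for _ in range(n)]  # group indicator
m = [xi + rng.gauss(0, 1) for xi in x]          # mediator depends on x
y = [0.5 * mi + rng.gauss(0, 1) for mi in m]    # outcome depends on m only

# Total effect: slope of y on x alone.
total_effect = cov(x, y) / cov(x, x)

# Direct effect: slope of y on x when m is also in the regression
# (closed-form OLS solution for two predictors).
denominator = cov(x, x) * cov(m, m) - cov(x, m) ** 2
direct_effect = (cov(x, y) * cov(m, m) - cov(m, y) * cov(x, m)) / denominator

print(f"total effect:  {total_effect:.2f}")
print(f"direct effect: {direct_effect:.2f}")
```

Note that this kind of analysis requires all three variables to be measured on the same individuals, which is exactly the information a comparison of separate population statistics does not provide.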

Readers may be forgiven if they interpret the conclusion in the abstract as stating exactly that.

Exposure to police given crime rate differences likely accounts for the higher per capita rate of fatal police shootings for Blacks, at least when analyzing all shootings.

The problem with this article is that the authors are not examining the question that they are stating in the abstract. Instead they are conducting a number of hypothetical analyses that start with the premises that police officers only kill criminals. They then examine racial bias in police shootings under this assumption.

For example, in Table 1 they report that the NIBRS database recorded 135,068 severe violent crimes by Black suspects and 59,426 violent crimes by White suspects in the years 2015 and 2016. In the same years, 475 Black citizens and 1,168 White citizens were killed by police. If we assume that all of those individuals killed by police were suspected of a violent crime recorded in the NIBRS database, we see that White suspects are much more likely to be killed by police (1,168 / 59,426 = 197 out of 10,000) than Black suspects (475 / 135,068 = 35 out of 10,000). The odds ratio is 5.59, which means that, per recorded crime, police kill over 5 White suspects for every Black suspect. This is shown in Figure 1 of the article as the most extreme bias against White criminals. However, most other crime statistics also lead to the conclusion that White criminals are more likely to be shot by police than Black criminals.
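The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the calculation described above, using the counts quoted from Table 1; the variable names are my own, and the ratio is computed the way the example computes it, as the ratio of the two rates.

```python
# Counts quoted above from Table 1 (NIBRS, 2015 and 2016).
crimes = {"Black": 135_068, "White": 59_426}
killed = {"Black": 475, "White": 1_168}

# Killings per 10,000 recorded crimes, under the benchmark's assumption
# that every person killed was a suspect in one of these crimes.
rate_per_10k = {group: 10_000 * killed[group] / crimes[group] for group in crimes}

# Ratio of the White rate to the Black rate.
ratio = rate_per_10k["White"] / rate_per_10k["Black"]

for group, rate in rate_per_10k.items():
    print(f"{group}: {rate:.0f} per 10,000")
print(f"White/Black ratio: {ratio:.2f}")
```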

This is a surprising finding, to say the least. While we started with the question of why police officers in the United States are more likely to kill Black citizens than White citizens, we end with the conclusion that police officers only kill criminals and are more likely to kill White criminals than Black criminals. I hope I am not alone in noticing a logical inconsistency. If police don’t shoot innocent citizens and they shoot more White criminals than Black criminals, we should see that White US citizens are killed more often by police than Black citizens. But that is not the case. We started our investigation with the question of why Black citizens are killed more often by police than White citizens. The authors’ statistical analysis does not answer this question. Their calculations are purely hypothetical and their conclusions suggest only that their assumptions are wrong.

The missing piece is information about the contribution of crime to the probability of being killed by police. Without this information it is simply impossible to examine to what extent racial differences in crime contribute to racial disparities in police shootings. It is therefore also impossible to say anything about other factors, such as racial bias, that may also contribute to racial disparities in police shootings. This means that this article makes no empirical contribution to the understanding of racial disparities in police shootings.

The fundamental problem of the article is that the authors think they can simply substitute populations. Rather than examining killings in the population of citizens, which the statistic is based on, they think they can replace it by another population, the population of criminals. But, the death counts apply to the population of citizens and not to the population of criminals.

In this article, we approached the question of racial disparities in deadly force by starting with the widely used technique of benchmarking fatal shooting data on population proportions. We questioned the assumptions underlying this analysis and instead proposed a set of more appropriate benchmarks given a more complete understanding of the context of police shootings.

The authors talk about benchmarking and discuss the pros and cons of different benchmarks. However, the notion of a benchmark is misleading. We have a statistic about the number of police killings in the population of the United States. This is not a benchmark, it is a population. In this population, Black citizens are disproportionally more likely to get killed by police. That is a fact. It is also a fact that in the population of US citizens more crimes are being committed by Black citizens (discussing the reasons for this is another topic that is beyond this criticism of the article). Again, this is not a benchmark, it is a population statistic. The authors now use the incidence rates of crime to ask how many Black or White criminals are being shot by police. However, the population statistics do not provide that information. We could also use other statistics that lead to different conclusions. For example, White US citizens own disproportionally more guns than Black citizens. If we used that to “benchmark” police shootings, we would see a bias to shoot more Black gun owners than White gun owners. But we don’t really see that in the data because we have no information about the death rates of gun owners, just as the article does not provide information about the death rates of criminals and innocent citizens. Thus, the fundamental flaw of the article is the idea that we can simply take two population statistics and compute conditional probabilities from these statistics. This is simply not possible.
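A toy example shows why two population statistics cannot pin down a conditional probability. The numbers below are entirely hypothetical and chosen only for illustration; they are not taken from the article or from any real data set. Both scenarios have the same number of offenders and the same number of people killed, yet the death rate among offenders differs because the overlap between the two groups differs.

```python
# Hypothetical population (illustrative numbers only, not real data).
POPULATION = 1_000_000
OFFENDERS = 10_000  # first population statistic
KILLED = 100        # second population statistic

def death_rates(killed_offenders: int) -> tuple[float, float]:
    """Return (P(killed | offender), P(killed | non-offender)) for a given overlap."""
    if not 0 <= killed_offenders <= min(KILLED, OFFENDERS):
        raise ValueError("overlap is inconsistent with the two totals")
    killed_others = KILLED - killed_offenders
    return (killed_offenders / OFFENDERS,
            killed_others / (POPULATION - OFFENDERS))

# Scenario A: every person killed was an offender.
# Scenario B: only half of the people killed were offenders.
# The two totals (OFFENDERS and KILLED) are identical in both scenarios.
scenario_a = death_rates(100)
scenario_b = death_rates(50)

print("Scenario A:", scenario_a)
print("Scenario B:", scenario_b)
```

The overlap (`killed_offenders`) is the quantity that would have to be observed, and it cannot be derived from the two totals alone.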

The authors caution readers that their results are not conclusive: “The current research is not the final answer to the question of race and police use of deadly force.” In fact, the results presented in this article do not even begin to address the question. The data simply provide no information about the causal factors that produce racial inequality in police shootings.

The authors then contradict themselves and reach a strong and false conclusion.

Yet it does provide perspective on how one should test for group disparities in behavioral outcomes and on whether claims of anti-Black disparity in fatal police shootings are as certain as often portrayed in the national media. When considering all fatal shootings, it is clear that systematic anti-Black disparity at the national level is not observed.

They are wrong on two counts. First, their analysis is statistically flawed and leads to internally inconsistent results. Police only kill criminals and are more likely to kill White criminals, which does not explain why we see more Black victims of police shootings. Second, even if their study had shown that there is no evidence of racial inequality, we cannot infer that racial biases do not exist. Absence of evidence is not the same as evidence of absence. Cases like the tragic death of Tamir Rice may be rare, and they may be too rare to be picked up in a statistic, but that doesn’t mean they should be ignored.

The rest of the discussion section reflects the authors’ personal views more than anything that can be learned from the results of this study. For example, the claim that better training will produce no notable improvements is pure speculation, and ignores a literature on training in the use of force and its benefits for all citizens. The key to police training in shooting situations is for police officers to focus on relevant cues (e.g., weapons) and to ignore irrelevant factors such as race. Better training can reduce killings of Black and White citizens.

This suggests that department-wide attempts at reform through programs such as implicit bias training will have little to no effect on racial disparities in deadly force, insofar as officers continue to be exposed after training to a world in which different racial groups are involved in criminal activity.

It is totally misleading to support this claim with trivial intervention studies with students.

This assessment is consistent with other evidence that the effects of such interventions are short lived (e.g., Lai, 2017).

And once more the authors attribute racial differences in police shootings to crime rates and they ignore that the influence of crime rates on shootings is their own assumption and not an empirical finding that is supported by their statistical analyses.

Note that this analysis does not blame unarmed individuals shot by police for their own behavior. Instead, it highlights the difficulty of eliminating errors under conditions of uncertainty when stereotypes may bias the decision-making process. This difficulty is amplified when the stereotype accurately reflects the conditional probabilities of crime across different racial groups.

As in many articles, the limitations section is not really a limitations section; the authors pretend that these limitations do not undermine their conclusions.

One potential flaw is if discretionary stops by police lead to a higher likelihood of being shot in a way not captured by our crime report data sets. If officers are more likely to stop and frisk a Black citizen, for example, then officers might be more likely to enter into a deadly force situation with Black citizens independent of any actual crime rate differences across races. Online Supplemental Material #5 presents some indirect data relevant to this possibility. Here, we simply note that the number of police shootings that start with truly discretionary stops of citizens who have not violated the law is low (~5%) and probably do not meaningfully impact the analyses.

There are about 1,000 police killings a year in the United States. If 5% of police killings started without any violation of the law, this means 50 people are killed every year by mistake. This may not be a meaningful number to statisticians for their data analysis, but it is a meaningful number for the victims and their families. In no other Western country are citizens killed in such numbers by their police.

The final conclusion shows that the article lacks any substantial contribution.

At the national level, we find little evidence within these data for systematic anti-Black disparity in fatal police deadly force decisions. We do not discount the role race may play in individual police shootings; yet to draw on bias as the sole reason for population-level disparities is unfounded when considering the benchmarks presented here. We hope this research demonstrates the importance of unpacking the underlying assumptions inherent to using benchmarks to test for outcome disparities.

The authors continue their misguided argument that we should use crime rates rather than population to examine racial bias. Once more, this is nonsense. It is a fact that Black citizens are more likely to be killed by police than White citizens. It is worthwhile to examine which causal factors contribute to this relationship, but the authors’ approach cannot answer this question because they lack information about the contribution of crime rates to police shootings.

The statement that their study shows that racial bias of police officers is not the only reason is trivial and misleading. The authors imply that crime rates alone explain the racial disparity and even come to the conclusion that police are more likely to kill White suspects. In reality, crime rates and racial biases are both likely to be factors, but we need proper data to tease apart those factors and this article does not do this.

I am sure that the authors truly believe that they made a valuable scientific contribution to an important social issue. However, I also strongly believe that they failed to do so. They start with the question “Is there evidence of a Black-White disparity in death by police gunfire in the United States?” The answer to their question is an unequivocal yes. The relevant statistics are the odds of being killed by police for Black and White US citizens, and these statistics show that Black citizens are at greater risk of being killed by police than White citizens. The next question is why this disparity exists. There will be no simple and easy answer to this question. This article suggests that a simple answer is that Black citizens are more likely to be criminals. This answer is not only too simple, it is also not supported by the authors’ statistical analysis.

Scientists are human, and humans make mistakes. So, it is understandable that the authors made some mistakes in their reasoning. However, articles that are published in scientific journals are vetted by peer-review, and the authors thank several scientists for helpful comments. So, several social scientists were unable to realize that the statistical analyses are flawed even though they produced the stunning result that police officers are 5 times more likely to kill White criminals than Black criminals. Nobody seemed to notice that this doesn’t make any sense. I hope that the editor of the journal and the authors carefully examine my criticism of this article and take appropriate steps if my criticism is valid.

I also hope that other social scientists examine this issue and add to the debate. Thanks to the internet, science is now more open and we can use open discussion to fix mistakes in scientific articles much faster. Maybe the mistake is on my part. Maybe I am not understanding the authors’ analyses properly. I am also not a neutral observer living on planet Mars. I am married to an African American woman with an African American daughter and my son is half South-Asian. I care about their safety and I am concerned about racial bias. Fortunately, I live in Canada where police kill fewer citizens.

I welcome efforts to tackle these issues using data and the scientific method, but every scientific result needs to be scrutinized even after it passed peer-review. Just because something is published in a peer-reviewed journal doesn’t make it true. So, I invite everybody to comment on this article and my response. Together we should be able to figure out whether the authors’ statistical approach is valid or not.

Open Communication about the invalidity of the race IAT

In the old days, most scientific communication occurred behind closed doors, when reviewers provided anonymous peer-reviews that determined the fate of manuscripts. In the old days, rejected manuscripts would not be able to contribute to scientific communications because nobody would know about them.

All of this has changed with the birth of open science. Now authors can share manuscripts on pre-print servers and researchers can discuss merits of these manuscripts on social media. The benefit of this open scientific communication is that more people can join in and contribute to the communication.

Yoav Bar-Anan co-authored an article with Brian Nosek titled “Scientific Utopia: I. Opening Scientific Communication.” In this spirit of openness, I would like to have an open scientific communication with Yoav and his co-author Michelangelo Vianello about their 2018 article “A Multi-Method Multi-Trait Test of the Dual-Attitude Perspective.”

I have criticized their model in an in-press article in Perspectives on Psychological Science (Schimmack, 2019). In a commentary, Yoav and Michelangelo argue that their model is “compatible with the logic of an MTMM investigation” (Campbell & Fiske, 1959). They argue that it is important to have multiple traits to identify method variance in a matrix with multiple measures of multiple traits. They then propose that I lost the ability to identify method variance by examining one attitude (i.e., race, self-esteem, political orientation) at a time. They also point out that I did not include all measures, that I included the Modern Racism Scale as an indicator of political orientation, and that I did not provide a reason for these choices. While this is true, Yoav and Michelangelo had access to the data and could have tested whether these choices made any difference. They did not. This is obvious for the Modern Racism Scale, which can be eliminated from the measurement model without any changes in the overall model.

To cut to the chase, the main source of disagreement is the modelling of method variance in the multi-trait-multi-method data set. The issue is clear when we examine the original model published in Bar-Anan and Vianello (2018).

In this model, method variance in IATs and related tasks like the Brief IAT is modelled with the INDIRECT METHOD factor. The model assumes that all of the method variance that is present in implicit measures is shared across attitude domains and across all implicit measures. The only way for this model to allow for different amounts of method variance in different implicit measures is by assigning different loadings to the various methods. Moreover, the loadings provide information about the nature of the shared variance and the amount of method variance in the various methods. Although this is valuable and important information, the authors never discuss this information and its implications.

Many of these loadings are very small. For example, the loadings of the race IAT and the brief race IAT are .11 and .02. In other words, the correlation between these two measures is inflated by .11 * .02 = .0022 points. This means that the correlation of r = .52 between these two measures is r = .5178 after we remove the influence of method variance.
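The size of this adjustment follows from the tracing rule for a shared factor: with standardized loadings, the correlation that two measures owe to a common factor is the product of their loadings on it. A minimal sketch of the arithmetic, using the loadings and the correlation reported above (the variable names are mine):

```python
loading_race_iat = 0.11        # loading on the method factor
loading_brief_race_iat = 0.02  # loading on the method factor
observed_r = 0.52              # correlation between the two measures

# Correlation attributable to the shared method factor.
method_share = loading_race_iat * loading_brief_race_iat

# Correlation left once the shared method variance is removed.
adjusted_r = observed_r - method_share

print(f"method share: {method_share:.4f}")
print(f"adjusted r:   {adjusted_r:.4f}")
```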

It makes absolutely no sense to accuse me of separating the models, when there is no evidence of implicit method variance that is shared across attitudes. The remaining parameter estimates are not affected if a factor with low loadings is removed from a model.

Here I show that examining one attitude at a time produces essentially the same results as the full model. I focus on the most controversial IAT: the race IAT. After all, there is general agreement that there is little evidence of discriminant validity for political orientation (r = .91, in the Figure above), and there is little evidence for any validity in the self-esteem IAT based on several other investigations of this topic with a multi-method approach (Bosson et al., 2000; Falk et al., 2015).

Model 1 is based on Yoav and Michelangelo’s model that assumes that there is practically no method variance in IAT-variants. Thus, we can fit a simple dual-attitude model to the data. In this model, contact is regressed onto implicit and explicit attitude factors to see the unique contribution of the two factors without making causal assumptions. The model has acceptable fit, CFI = .952, RMSEA = .013.

The correlation between the two factors is .66, while it is r = .69 in the full model in Figure 1. The loading of the race IAT on the implicit factor is .66, while it is .62 in the full model in Figure 1. Thus, as expected based on the low loadings on the IMPLICIT METHOD factor, the results are no different when the model is fitted only to the measure of racial attitudes.

Model 2 makes the assumption that IAT-variants share method variance. Adding the method factor to the model increased model fit, CFI = .973, RMSEA = .010. As the models are nested, it is also possible to compare model fit with a chi-square test. With five degrees of freedom difference, chi-square changed from 167.19 to 112.32. Thus, the model comparison favours the model with a method factor.
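The chi-square difference test for these nested models can be sketched as follows. The two chi-square values and the five degrees of freedom are the ones reported above; the helper function `chi2_sf_5df` is my own and uses a closed-form expression that holds only for five degrees of freedom, so a statistics library would be the better choice for general use.

```python
import math

def chi2_sf_5df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 5 degrees of freedom.

    Closed form valid only for df = 5 (assumption of this sketch).
    """
    tail = math.erfc(math.sqrt(x / 2))
    correction = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return tail + correction

chi2_without_method = 167.19  # Model 1, no method factor
chi2_with_method = 112.32     # Model 2, with method factor
df_difference = 5

delta_chi2 = chi2_without_method - chi2_with_method
p_value = chi2_sf_5df(delta_chi2)

print(f"delta chi-square = {delta_chi2:.2f}, df = {df_difference}")
print(f"p = {p_value:.2e}")
```

A difference this large relative to five degrees of freedom corresponds to a p-value far below .001, which is why the comparison favours the model with the method factor.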

The main difference between the models is that the evidence is less supportive of a dual-attitude model in the model with a method factor, and that the amount of valid variance in the race IAT decreases from .66^2 = 44% to .47^2 = 22%.

In sum, the 2018 article made strong claims about the race IAT. These claims were based on a model that implied that there is no systematic measurement error in IAT scores. I showed that this assumption is false and that a model with a method factor for IATs and IAT-variants fits the data better than a model without such a factor. It also makes no theoretical sense to postulate that there is no systematic method variance in IATs, when several previous studies have demonstrated that attitudes are only one source of variance in IAT scores (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007).

How is it possible that the race IAT and other IATs are widely used in psychological research and on public websites to provide individuals with false feedback about their hidden attitudes without any evidence of its validity as an individual difference measure of hidden attitudes that influence behaviour outside of awareness?

The answer is that most of these studies assumed that the IAT is valid rather than testing its validity. Another reason is that psychological research is focused on providing evidence that confirms theories rather than subjecting theories to empirical tests that they may fail. Finally, psychologists ignore effect sizes. As a result, the finding that IAT scores have incremental predictive validity of less than 4% variance in a criterion is celebrated as evidence for the validity of IATs, but even this small estimate is based on underpowered studies and may shrink in replication studies (cf. Kurdi et al., 2019).

It is understandable that proponents of the IAT respond with defiant defensiveness to my critique of the IAT. However, I am not the first to question the validity of the IAT, but these criticisms were ignored. At least Banaji and Greenwald recognized in 2013 that they do “not have the luxury of believing that what appears true and valid now will always appear so” (p. xv). It is time to face the facts. It may be painful to accept that the IAT is not what it was promised to be 21 years ago, but that is what the current evidence suggests. There is nothing wrong with my models and their interpretation, and it is time to tell visitors of the Project Implicit website that they should not attach any meaning to their IAT scores. A more productive way to counter my criticism of the IAT would be to conduct a proper validation study with multiple methods and validation criteria that are predicted to be uniquely related to IAT scores in a preregistered study.


Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences? Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

Klauer, K. C., Voss, A., Schmitz, F., & Teige-Mocigemba, S. (2007). Process components of the Implicit Association Test: A diffusion-model analysis. Journal of Personality and Social Psychology, 93, 353–368.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586.

The Diminishing Utility of Replication Studies In Social Psychology

Dorothy Bishop writes on her blog:

“As was evident from my questions after the talk, I was less enthused by the idea of doing a large, replication of Darryl Bem’s studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn’t it be better to investigate a hypothesis about something that doesn’t contradict the laws of physics?”

I think she makes a valid and important point. Bem’s (2011) article highlighted everything that was wrong with the research practices in social psychology. Other articles in JPSP are equally incredible, but this was ignored because naive readers found the claims more plausible (e.g., blood glucose is the energy for will power). We know now that none of these published results provide empirical evidence because the results were obtained with questionable research practices (Schimmack, 2014; Schimmack, 2018). It is also clear that these were not isolated incidents, but that hiding results that do not support a theory was (and still is) a common practice in social psychology (John et al., 2012; Schimmack, 2019).

A large attempt at estimating the replicability of social psychology revealed that only 25% of published significant results could be replicated (OSC). The rate for between-subject experiments was even lower. Thus, the a-priori probability (base rate) that a randomly drawn study from social psychology will produce a significant result in a replication attempt is well below 50%. In other words, a replication failure is the more likely outcome.

The low success rate of these replication studies was a shock. However, it is sometimes falsely implied that the low replicability of results in social psychology was not recognized earlier because nobody conducted replication studies. This is simply wrong. In fact, social psychology is one of the disciplines in psychology that required researchers to conduct multiple studies that showed the same effect to ensure that a result was not a false positive result. Bem had to present 9 studies with significant results to publish his crazy claims about extrasensory perception (Schimmack, 2012). Most of the studies that failed to replicate in the OSC replication project were taken from multiple-study articles that reported several successful demonstrations of an effect. Thus, the problem in social psychology was not that nobody conducted replication studies. The problem was that social psychologists only reported replication studies that were successful.

The proper analysis of the problem also suggests a different solution. If we pretend that nobody did replication studies, it may seem useful to start doing replication studies. However, if social psychologists conducted replication studies, but did not report replication failures, the solution is simply to demand that social psychologists report all of their results honestly. This demand is so obvious that undergraduate students are surprised when I tell them that this is not the way social psychologists conduct their research.

In sum, it has become apparent that questionable research practices undermine the credibility of the empirical results in social psychology journals, and that the majority of published results cannot be replicated. Thus, social psychology lacks a solid empirical foundation.

What Next?

It is implied by information theory that little information is gained by conducting actual replication studies in social psychology because a failure to replicate the original result is likely and uninformative. In fact, social psychologists have responded to replication failures by claiming that these studies were poorly conducted and do not invalidate the original claims. Thus, replication studies are both costly and have not advanced theory development in social psychology. More replication studies are unlikely to change this.
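The information-theoretic point can be made concrete with Shannon's self-information, -log2(p), which measures how surprising an outcome is. The sketch below takes the 25% replication rate quoted above as the probability of a successful replication; treating it as a fixed base rate for any new replication study is my simplification.

```python
import math

def surprisal_bits(p: float) -> float:
    """Self-information of an outcome with probability p, in bits."""
    return -math.log2(p)

p_success = 0.25           # replication rate quoted above
p_failure = 1 - p_success  # a failure is the expected outcome

info_failure = surprisal_bits(p_failure)
info_success = surprisal_bits(p_success)

print(f"failure: {info_failure:.2f} bits")
print(f"success: {info_success:.2f} bits")
```

The expected outcome, a failure, carries well under half a bit of information, whereas the unlikely success would carry two bits.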

A better solution to the replication crisis in social psychology is to characterize research in social psychology from Festinger’s classic small-sample, between-subject study in 1957 to research in 2017 as exploratory and hypothesis-generating research. As Bem suggested to his colleagues, this was a period of adventure and exploration where it was ok to “err on the side of discovery” (i.e., publish false positive results, like Bem’s precognition for erotica). Lots of interesting discoveries were made during this period; it is just not clear which of these findings can be replicated and what they tell us about social behavior.

Thus, new studies in social psychology should not try to replicate old studies. For example, nobody should try to replicate Devine’s subliminal priming study with racial primes with computers and software from the 1980s (Devine, 1989). Instead, prominent theoretical predictions should be tested with the best research methods that are currently available. Thus, the way forward is not to do more replication studies, but rather to use open science (a.k.a. honest science) that uses experiments to subject theories to empirical tests that may also falsify a theory (e.g., subliminal racial stimuli have no influence on behavior). The main shift that is required is to get away from research that can only confirm theories and to allow for empirical data to falsify theories.

This was exactly the intent of Danny Kahneman’s letter, when he challenged social priming researchers to respond to criticism of their work by going into their labs and to demonstrate that these effects can be replicated across many labs.

Kahneman makes it clear that the onus of replication is on the original researchers who want others to believe their claims. The response to this letter speaks volumes. Not only did social psychologists fail to provide new and credible evidence that their results can be replicated, they also demonstrated defiant denial in the face of replication failures by others. The defiant denial by prominent social psychologists (e.g., Baumeister, 2019) makes it clear that they will not be convinced by empirical evidence, while others who can look at the evidence objectively do not need more evidence to realize that the social psychological literature is a train-wreck (Schimmack, 2017; Kahneman, 2017). Thus, I suggest that young social psychologists search the train wreck for survivors, but do not waste their time and resources on replication studies that are likely to fail.

A simple guide through the wreckage of social psychology is to distrust any significant result with a p-value greater than .01 (Schimmack, 2019). Prediction markets also suggest that readers are able to distinguish credible and incredible results (Atlantic). Thus, I recommend building on studies that are credible and staying clear of sexy findings that are unlikely to replicate. As Danny Kahneman pointed out, young social psychologists who work in questionable areas face a dilemma. Either they have to replicate the questionable methods that were used to get the original results, which is increasingly considered unethical, or they end up with results that are not very informative. On the positive side, the replication crisis implies that there are many important topics in social psychology that need to be studied properly with the scientific method. Addressing these important questions may be the best way to rescue social psychology.

Confirmation Bias is Everywhere: Serotonin and the Meta-Trait of Stability

Most psychologists have at least a vague understanding of the scientific method. Somewhere they probably heard about Popper and the idea that empirical data can be used to test theories. As all theories are false, these tests should at some point lead to an empirical outcome that is inconsistent with a theory. This outcome is not a failure. It is an expected outcome of good science. It also does not mean that the theory was bad. Rather it was a temporary theory that is now modified or replaced by a better theory. And so, science makes progress….

However, psychologists do not use the scientific method popperly. Null-hypothesis significance testing adds some confusion here. After all, psychologists publish over 90% successful rejections of the nil-hypothesis. Doesn’t that show they are good Popperians? The answer is no because the nil-hypothesis is not predicted by a theory. The nil-hypothesis is only useful to reject it to claim that there is a predicted relationship between two variables. Thus, psychology journals are filled with over 90% reports of findings that confirm theoretical predictions. While this may look like a major success, it actually shows a major problem. Psychologists never publish results that disconfirm a theoretical prediction. As a result, there is never a need to develop better theories. Thus, a root evil that prevents psychology from being a real science is verificationism.

The need to provide evidence for, rather than against, a theory led to the use of questionable research practices. Questionable research practices are used to report results that confirm theoretical predictions. For example, researchers may simply not report results of studies that did not reject the nil-hypothesis. Other practices can help to produce significant results by inflating the risk of a false positive result. The use of QRPs explains why psychology journals have been publishing over 90% results that confirm theoretical predictions for 60 years (Sterling, 1959). Only recently, it has become more acceptable to report studies that failed to support a theoretical prediction and question the validity of a theory. However, these studies are still a small minority. Thus, psychological science suffers from confirmation bias.

Structural Equation Modelling

Multivariate, correlational studies are different from univariate experiments. In a univariate experiment, a result is either significant or not. Thus, only tampering with the evidence can produce confirmation bias. In multivariate statistics, data are analyzed with complex statistical tools that provide researchers with flexibility in their data analysis. Thus, it is not necessary to alter the data to produce confirmatory results. Sometimes it is sufficient to analyze the data in a way that confirms a theoretical prediction without showing that alternative models fit the data equally well or better.

It is also easier to combat confirmation bias in multivariate research by fitting alternative models to the same data. Model comparison also avoids the problem of significance testing, where non-significant results are considered inconclusive, while significant results are used to confirm and cement a theory. In SEM, statistical inferences work the other way around. A model with good fit (non-significant chi-square or acceptable fit) is a possible model that can explain the data, while a model with significant deviation from the data is rejected. The reason is that the significance test (or model fit) is used to test an actual theoretical model rather than the nil-hypothesis. This forces researchers to specify an actual set of predictions and subject them to an empirical test. Thus, SEM is ideally suited to test theories popperly.

Confirmation Bias in SEM Research

Although SEM is ideally suited to test competing theories against each other, psychology journals are not used to model comparisons and tend to publish SEM research in the same flawed confirmatory way as other research is conducted and reported. For example, an article in Psychological Science this year published an investigation of the structure of personality and the hypothesis that several personality traits are linked to a bio-marker (Wright et al., 2019).

Their preferred model assumes that the Big Five traits neuroticism, agreeableness, and conscientiousness are not independent, but systematically linked by a higher-order trait called alpha or stability (Digman, 1997; DeYoung, 2007). In their model, the stability factor is linked to a marker of the serotonin (5-HT) prolactin response. This model implies that all three traits are related to the biomarker as there are indirect paths from all three traits to the biomarker that are “mediated” by the stability factor (for technical reasons the path goes from stability to the biomarker, but theoretically, we would expect the relationship to go the other way from a neurological mechanism to behaviour).

Thanks to the new world of open science, the authors shared actual MPLUS outputs of their models on OSF ( ). All the outputs also included the covariance matrix among the predictor variables, which made it possible to fit alternative models to the data.

Alternative Models

Another source of confirmation bias in psychology is that literature reviews fail to mention evidence that contradicts the theory that authors try to confirm. This is pervasive and by no means a specific criticism of the authors. Contrary to the claims in the article, the existence of a meta-trait of stability is actually controversial. Digman (1997) reported some SEM results that were false and could not be reproduced (cf. Anusic et al., 2009). Moreover, alpha could not be identified when the Big Five were modelled as latent factors (Anusic et al., 2009). This led me to propose that meta-traits may be an artifact of using impure Big Five scales as indicators of the Big Five. For example, if some agreeableness items have negative secondary loadings on neuroticism, the agreeableness scale is contaminated with valid variance in neuroticism. Thus, we would observe a negative correlation between neuroticism and agreeableness even across raters (e.g., self-ratings of neuroticism and informant ratings of agreeableness). Here I fitted a model with secondary loadings and independent Big Five factors to the data. I also examined the prediction that the biomarker is related to all three Big Five traits. The alternative model had acceptable fit, CFI = .976, RMSEA = .056.

The main finding of this model is that the biomarker shows only a significant relationship with conscientiousness, while the relationship with agreeableness trended in the right direction, but was not significant (p = .089) and the relationship for neuroticism was even weaker (p = .474). Aside from the question about significance, we also have to take effect sizes into account. Given the parameter estimates, the biomarker would produce very small correlations among the Big Five traits (e.g., r(A,C) = .19 * .10 = .019). Thus, even if these relationships were significant, they would not provide compelling evidence that a source of shared variance among the three traits has been identified.
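The arithmetic behind this estimate can be sketched in a few lines of Python. The sketch assumes the standard path-tracing rule that a single common cause induces a correlation equal to the product of the two standardized paths. The values .19 and .10 are the ones used above; the function name is only illustrative.

```python
def implied_correlation(path_a: float, path_b: float) -> float:
    """Correlation induced between two traits by one common cause,
    given the standardized paths from the cause to each trait."""
    return path_a * path_b


# Values taken from the example in the text: r(A,C) = .19 * .10
r_ac = implied_correlation(0.19, 0.10)
print(f"implied r(A,C) = {r_ac:.3f}")
```

A correlation of this size would account for well under 1% of shared variance between the two traits.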

The next model shows that the authors’ model ignored the stronger relationship between conscientiousness and the biomarker. When this relationship is added to the model, there is no significant relationship between the stability factor and the biomarker.

Thus, the main original finding of this study was that a serotonin related bio-marker was significantly related to conscientiousness, but not significantly related to neuroticism. This finding is inconsistent with theories that link neuroticism to serotonin, and evidence that serotonin reuptake inhibitors reduce neuroticism (at least in depressed patients). However, such results are difficult to publish because a single study with a non-significant result does not provide sufficient evidence to falsify a theory. Yet fitting data to a theory only leads to confirmation bias.

The good news is that the authors were able to publish the results of an impressive study and that their data are openly available and can provide credible information for meta-analytic evaluations of structural models of personality, while the results of this study alone are inconclusive and compatible with many different theories of personality.

One way to take more advantage of these data would be to share the covariance matrix of items to model personality structure with a proper measurement model of the Big Five traits and to avoid the problem of contaminated scale scores, which is the best practice for the use of structural equation models. These models provide no evidence for Digman’s meta-traits (Schimmack, 2019a, Schimmack, 2019b).

In conclusion, the main point of this post is that (a) SEM can be used to test and falsify models, (b) SEM can be used to realize that data are consistent with multiple models and that better data are needed to find the better model, (c) studies of Big Five factors require a measurement model with Big Five factors and cannot rely on messy scale scores as indicators of the Big Five, and (d) personality psychologists need better training in the use of SEM.

32 Personality Types

Personality psychology is dominated by dimensional models of personality (Funder, 2019). There is a good reason for this. Most personality characteristics vary along a continuum like height rather than being categorical like eye color. Thus, a system of personality types requires some arbitrary decisions about a cutoff point. For example, a taxonomy of body types could do a median split on height and weight to assign people to the tall-heavy or the tall-light type.

However, a couple of moderately influential articles have suggested that there are three personality types (Asendorpf et al., 2001; Robins et al., 1996).

The notion that there are only three personality types is puzzling. The dominant framework in personality psychology is the Big Five model that conceptualizes personality traits as five independent continuous dimensions. If we were to create personality types by splitting each dimension at the median, it would create 32 personality types, where individuals are either above or below the median on neuroticism, extraversion, openness, agreeableness, and conscientiousness. If these five dimensions were perfectly independent of each other, we would see that individuals are equally likely to be assigned to one of the 32 types. There is no obvious way to reduce these 32 types to just 3.

Figure 1. Lowercase letters = below median, capital letters = above median
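To make the count concrete, here is a minimal Python sketch that enumerates the types produced by a median split on each of the five dimensions. It uses the same notation as the figure (lowercase for below the median, uppercase for above); the function name is my own and only for illustration.

```python
from itertools import product

TRAITS = "NEOAC"  # neuroticism, extraversion, openness, agreeableness, conscientiousness


def all_types(traits: str = TRAITS) -> list[str]:
    """Every high/low combination: uppercase = above median, lowercase = below."""
    poles = [(t.lower(), t) for t in traits]
    return ["".join(combo) for combo in product(*poles)]


types = all_types()
print(len(types))             # 2 ** 5 = 32
print(types[0], types[-1])    # neoac NEOAC
print(1 / len(types))         # expected share per type if the dimensions were independent
```

With perfectly independent dimensions, each type would be expected to hold about 3% of the sample.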

So, how did Robins et al. (1996) come to the conclusion that there are only three personality types? The data were Q-sorts. A Q-sort is similar to personality ratings on a series of attributes. The main difference is that the sorting task imposes a constraint on the scores that can be given to an individual. As a result, all individuals have the same overall mean across items. That is, nobody could be above average on all attributes. This kind of data is known as ipsative data. An alternative way to obtain ipsative data would be to subtract the overall mean of ratings from individual ratings. Although the distinction between ipsative and non-ipsative data is technically important, it has no implications for the broader understanding of Robins et al.’s work. The study could also have used ratings.

Robins et al. then performed a factor analysis. However, this factor analysis is different from a typical factor analysis that relies on correlations among items. Rather, the data matrix is transposed and the factor analysis is run on participants. With N = 300, there are three hundred variables and factor analysis is used to reduce this set of variables to a smaller set of factors, while minimizing the loss of information.

Everybody knows that the number of factors in a factor analysis is arbitrary and that a smaller number of factors implies a loss of information.

“Empirical research on personality typologies has been hampered by the lack of clear criteria for determining the number of types in a given sample. Thus, the costs and benefits of having a large number of types must be weighed against those of having relatively few types” (Robins et al., 1996).

The authors do not report Eigenvalues or other indicators of how much variance their three factor solution explained.

The three types are described in terms of the most and least descriptive items. Type 1 can be identified by high conscientiousness (“He is determined in what he does”), high extraversion (“He is energetic and full of life”), low neuroticism (reversed: “When he is under stress, he gives up and backs off”), high agreeableness (“He is open and straightforward”), and high openness (“He has a way with words”). In short, Type 1 is everybody’s dream child; a little Superman in the making.

Type 2 is characterized by high neuroticism (“He gets nervous in uncertain situations”), introversion (reversed: “He tries to be the center of attention”), low openness (reversed: “He has a way with words”), but high agreeableness (“He is considerate and thoughtful of other people”). Conscientiousness doesn’t define this type one way or the other.

Type 3 is characterized by low neuroticism (“He is calm and relaxed; easy going”), high extraversion (“He tries to be the center of attention”), low conscientiousness (reversed: “He plans things ahead; he thinks before he does something”) and low agreeableness (“He is stubborn”).

The main problem with this approach is that these personality profiles are not types. Take Profile 1 for example. While some participants’ profiles correlated highly positively with Profile 1, other participants’ profiles correlated highly negatively with Profile 1. What personality type are they? We might say that they are the opposite of Superman, but that would imply that we need another personality type for the Anti-Supermans. The problem doesn’t end here. As there are three profiles, each individual is identified by their correlations with all three profiles. Thus, we end up with eight different types depending on whether the correlations with the three profiles are positive or negative.

In short, profiles are not types. Thus, the claim that there are only three personality types is fundamentally flawed because the authors confused profiles with types. Even the claim that there are only 8 types would rest on the arbitrary choice of extracting only three factors. Four factors would have produced 16 types and five factors would have produced 32 types, just as the Big Five model predicted.

Asendorpf et al. (2001) also found three profiles that they considered to be similar to those found by Robins et al. (1996). Moreover, they examined profiles in a sample of adults with a Big Five questionnaire (i.e., the NEO-FFI). Importantly, Asendorpf et al. (2001) use better terminology and refer to profiles as prototypes rather than types.

The notion of a prototype is that there are no clear defining features that determine class membership. For example, on average mammals are heavier than birds. So we can distinguish birds and mammals by their prototypical weight (how close their weight is to the average weight of a bird or mammal) rather than on the basis of a defining feature (lays eggs, has a uterus). Figure 2 shows the prototypical Big Five profile for the three groups of participants, when participants were assigned to three groups.

The problem is once more that the grouping into three groups is arbitrary. Clearly there are individuals with high scores on agreeableness and on openness, but this variation in personality was not used to create the three groups. Based on this figure, groupings are based on low N and high C, high N and low E, and low C. It is not clear what we should do with individuals who do not match any of these prototypical profiles. What type are individuals who are high in N and high in C?

In sum, a closer inspection of studies of personality types suggests that these studies failed to address the question. Searching for prototypical item-profiles is not the same thing as searching for personality types. In addition, the question may not be a good question. If personality attributes vary mostly quantitatively and if the number of personality traits is large, the number of personality types is infinite. Every individual is unique.

Are Some Personality Types More Common Than Others?

As noted above, the number of personality types that are theoretically possible is determined by the number of attributes and the levels of each attribute. If we describe personality with the Big Five and limit the levels to being above or below the median, we have 32 theoretical patterns. However, this does not mean that we actually observe all patterns. Maybe some types never occur or are at least rare. The absence of some personality types could provide some interesting insights into the structure of personality. For example, high conscientiousness might suppress neuroticism and we would see very few individuals who are high in C and high in N (Digman, 1997). However, when C is low, we could see equal numbers of individuals with high N and low N because conscientiousness only inhibits high N, while low conscientiousness does not lead to high N. It is impossible to examine such patterns with bivariate correlations (Feger, 1988).

A simple way to examine this question is to count the frequencies of personality types (Anusic & Schimmack, unpublished manuscript that was killed in peer-review). Here, I present the results of this analysis based on Sam Gosling’s large internet survey with millions of visitors who completed the BFI (John, Naumann, & Soto, 2008).
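The counting procedure itself is simple and can be sketched in Python as follows. The respondents and scores below are made up for illustration and are not the survey data. Treating scores exactly at the median as "low" is an arbitrary choice of this sketch.

```python
from collections import Counter
from statistics import median

TRAITS = "NEOAC"


def assign_types(scores):
    """Label each respondent by a median split on every trait.

    scores: list of dicts mapping a trait letter to a scale score.
    Uppercase = above the sample median, lowercase = at or below it.
    """
    medians = {t: median(s[t] for s in scores) for t in TRAITS}
    return [
        "".join(t if s[t] > medians[t] else t.lower() for t in TRAITS)
        for s in scores
    ]


# Hypothetical respondents (made-up scores on a 1-5 scale).
sample = [
    {"N": 2.0, "E": 4.0, "O": 4.5, "A": 4.2, "C": 4.1},
    {"N": 4.1, "E": 2.2, "O": 2.5, "A": 2.9, "C": 2.4},
    {"N": 1.8, "E": 3.9, "O": 4.1, "A": 4.4, "C": 4.6},
    {"N": 3.9, "E": 2.5, "O": 3.0, "A": 3.1, "C": 2.0},
]

counts = Counter(assign_types(sample))
for label, n in counts.most_common():
    print(label, n / len(sample))  # relative frequency of each observed type
```

With a large sample, the relative frequencies can be compared against the 1/32 share expected if the five dimensions were independent.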

Figure 3 simply shows the relative frequencies of the 32 personality types.

Figure 4 shows the results only for US residents. The results are very similar to those for the total sample.

The most notable finding is that the types nEOAC and Neoac are more frequent than all other types. These types are evaluatively positive or negative. However, it is important to realize that these types are not real personality types. Other research has demonstrated that the evaluative dimension in self-ratings of personality is mostly a rating or a perception bias (Anusic et al., 2009). Thus, individuals with a nEOAC profile do not have a better personality. Whether they simply rate themselves (other-deception) or actually see themselves (self-deception) as better than they are is currently unknown.

The next two types with above average frequency are nEoAC and NeOac. A simple explanation for this pattern is that openness is not highly evaluative and so some people will not inflate their openness scores, while they are still responding in a desirable way on the other four traits.

The third complementary pair are the neoAC and the NEOac types. This pattern can also be explained with rating biases because some people do not consider openness and extraversion desirable; so they will only show bias on neuroticism, agreeableness and conscientiousness. These people were called “Saints” by Paulhus and John (1998).

In short, one plausible explanation of the results is that all 32 personality types that can be created by combining high and low scores on the Big Five exist. Some types are more frequent than others, but at least some of this variation is explained by rating biases rather than by actual differences in personality.


The main contribution of this new look at personality types is to clarify some confusion about the notion of personality types. Previous researchers used the term types for prototypical personality profiles. This is unfortunate because it led to the misleading impression that there are only three personality types. You are either resilient, over-controlled, or under-controlled. In fact, even three profiles create more than three types. Moreover, the profiles are based on exploratory factor analyses of personality ratings and it is not clear why there are only three profiles. Big Five theory would predict five profiles where each profile is defined by items belonging to one of the Big Five factors. It is not clear why profile analyses yielded only three factors. One explanation could be that the item set did not capture some personality dimensions. For example, Robins et al.’s (1996) Q-sort did not seem to include many openness items.

Based on Big Five theory, one would expect 32 personality types that are about equally frequent. An analysis of a large data set showed that all 32 types exist, which is consistent with the idea that the Big Five are fairly independent dimensions that can occur in any combination. However, some types were more frequent than others. The most frequent combination was either desirable (nEOAC) or undesirable (Neoac). This finding is consistent with previous evidence that personality ratings are influenced by a general evaluative bias (Anusic et al., 2009). Additional types with higher frequencies can be attributed to variations in desirability. Openness and extraversion are not as desirable, on average, as low neuroticism and high agreeableness and conscientiousness. Thus, the patterns nEoAC and neoAC may also reflect desirability rather than actual personality structure. Multi-method studies or low evaluative items would be needed to examine this question.


Personality psychologists are frustrated that they have discovered the Big Five factors and created a scientific model of personality, but in applied settings the Myers-Briggs Type Indicator (MBTI) dominates personality assessment (Funder, 2019).

One possible reason is that the MBTI provides simple information about personality by classifying individuals into 16 types. These 16 types are defined by being high or low on four dimensions.

There is no reason why personality psychologists could not provide simplified feedback about personality using a median split on the Big Five and assigning individuals to the 32 types that can be created by the Big Five factors. For example, I would be the NEOac type. Instead of using lowercase letters and capitals, one could also use letters for both poles of the dimension, neurotic (N) vs. stable (S), extraverted (E) vs. introverted (I), variable (V) versus regular (R), agreeable (A) vs. dominant (D), and conscientious (C) vs. laid back (L). This would make me an NEVDL type. My son would be an SIRAC.

I see no reason why individuals would prefer Myers-Briggs types over Big Five types, given that the Big Five types are based on a well-established scientific theory. I believe the main problem in giving individuals feedback with Big Five scores is that many people do not think in terms of dimensions.

The main problem might be that we are assigning individuals to types even when their scores are close to the median and their classification is arbitrary. For example, I am not very high on E or low on C and it is not clear whether I am really an NEVDL or an NIVDC type. One possibility would be to use only scores that are one standard deviation above or below the mean or median. This would make me an N-VD- type.
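A minimal Python sketch of this more cautious rule is shown below. It assumes that scores are available as standardized (z) scores, uses the letter scheme proposed above, and marks any dimension within one standard deviation of the mean with a dash. The z-scores in the example are hypothetical and chosen only to reproduce the N-VD- pattern.

```python
# (low-pole letter, high-pole letter) for each Big Five dimension.
POLES = {
    "neuroticism": ("S", "N"),
    "extraversion": ("I", "E"),
    "openness": ("R", "V"),
    "agreeableness": ("D", "A"),
    "conscientiousness": ("L", "C"),
}


def cautious_type(z_scores: dict, cutoff: float = 1.0) -> str:
    """Assign a letter only when a z-score is more than `cutoff` SDs
    from the mean; otherwise leave the position as '-'."""
    letters = []
    for trait, (low, high) in POLES.items():
        z = z_scores[trait]
        if z > cutoff:
            letters.append(high)
        elif z < -cutoff:
            letters.append(low)
        else:
            letters.append("-")
    return "".join(letters)


# Hypothetical z-scores (made up for illustration).
hypothetical = {
    "neuroticism": 1.4,
    "extraversion": 0.3,
    "openness": 1.2,
    "agreeableness": -1.5,
    "conscientiousness": -0.4,
}
label = cautious_type(hypothetical)
print(label)  # N-VD-
```

The cutoff is a parameter, so a stricter or more lenient threshold than one standard deviation can be tried without changing the rest of the code.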

To conclude, research on personality types has not made much progress for a good reason. The number of personality types depends on the number of attributes that are being considered and it is no longer an empirical question which types exist. With fairly independent dimensions all types exist and the number of types increases exponentially with the number of attributes. The Big Five are widely considered the optimal trade-off between accuracy and complexity. Thus, they provide an appealing basis for the creation of personality types and a viable alternative to the Myers-Briggs Type Indicator.

If you want to know what type you are, you can take the BFI online ( ). It provides feedback about your personality in terms of percentiles. To create your personality type, you only have to convert the percentiles into letters.

Negative Emotionality: P < 50 = S, P > 50 = N
Extraversion: P < 50 = I, P > 50 = E
Open-Mindedness: P < 50 = R, P > 50 = V
Agreeableness: P < 50 = D, P > 50 = A
Conscientiousness: P < 50 = L, P > 50 = C
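The conversion rules above can also be written as a short Python function. This is only a sketch: the scale names and letters follow the rules above, the example percentiles are made up, and because the rules do not cover a percentile of exactly 50, the sketch marks that case with a dash (my assumption).

```python
# (scale, letter if P < 50, letter if P > 50), following the rules above.
LETTERS = [
    ("Negative Emotionality", "S", "N"),
    ("Extraversion", "I", "E"),
    ("Open-Mindedness", "R", "V"),
    ("Agreeableness", "D", "A"),
    ("Conscientiousness", "L", "C"),
]


def personality_type(percentiles: dict) -> str:
    """Convert percentile feedback into a five-letter type.
    A percentile of exactly 50 is marked '-' (an assumption of this sketch)."""
    letters = []
    for scale, low, high in LETTERS:
        p = percentiles[scale]
        if p < 50:
            letters.append(low)
        elif p > 50:
            letters.append(high)
        else:
            letters.append("-")
    return "".join(letters)


# Hypothetical percentiles (made up for illustration).
example = {
    "Negative Emotionality": 70,
    "Extraversion": 55,
    "Open-Mindedness": 80,
    "Agreeableness": 30,
    "Conscientiousness": 40,
}
my_type = personality_type(example)
print(my_type)  # NEVDL
```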

However, keep in mind that your ratings and those of the comparison group are influenced by desirability.

If you are an NIRDL, you may have a bias to rate yourself as less desirable than you actually are.

If you are an SEVAC, you may have a tendency to overrate your desirability.