“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).
DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal. The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores. The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests. A description of the new method will be published when extensive simulation studies are completed.
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one. Unless power is very high, some of these z-scores will not be statistically significant (z .05 two-tailed). If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient. The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
5. MOST VIEWED POST (with comment by Noble Laureate Daniel Kahneman)
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking fast and slow.” The results suggest that many of the cited findings are difficult to replicate.
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance. This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting. After correcting for these effects, the stereotype-threat effect was negligible. This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat. These results show that the R-Index can warn readers and researchers that reported results are too good to be true.
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect). They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist. This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1). As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2). A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
9. Hidden figures: Replication failures in the stereotype threat literature. A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published. Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.
10. My journey towards estimation of replicability. In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.
Abstract: I examined spousal similarity in personality using 4-waves of data over a 12-year period in the German Socio-Economic Panel. There is very little spousal similarity in actual personality traits like the Big Five. However, there is a high similarity in the halo rating bias between spouses.
Spousal similarity in personality is an interesting topic for several reasons. First, there are conflicting folk ideas about spousal similarity. One saying assumes that “birds of the same feather flock together;” another says that “opposites attract.” Second, there is large interest in the characteristics people find attractive in a mate. Do extraverts find other extraverts more attractive? Would assertive (low agreeableness) individuals prefer a mate who is as assertive as they are or rather somebody who is submissive (high agreeableness)? Third, we might wonder whether spouses become more similar to each other over time. Finally, twin studies of heritability make the assumption that mating is random; an assumption that can be questionable.
Given so many reasons to study spousal similarity in personality, it is surprising how little attention this topic has received. A literature search retrieved only a few articles with few citations: Watson, Beer, McDade-Montez (2014) [20 citations], Humbad, Donnellan, Iacono McGue, & Burt (2010) [30 citations], Rammstedt & Schupp (2008) [25 citations]. One possible explanation for this lack of interest could be that spouses are not similar in personality traits. It is well-known that psychology has a bias against null-results; that is, the lack of statistical relationships. Another possibility is that spousal similarity is small and difficult to detect in small convenience samples that are typical in psychology. In support of the latter explanation, two of the three studies had large samples and did report spousal similarity in personality.
Humbad et al. (2010) found rather small correlations between husbands’ and wives’ personality scores in a sample of 1,296 married couples. With the exception of traditionalism, r = .49, all correlations were below r = .2, and the median correlation was r = .11. They also found that spousal similarity did not change over time, suggesting that the little similarity there is can be attributed to assortative mating (marrying somebody with similar traits).
Rammstedt and Schupp (2008) used data from the German Socio-Economic Panel (SOEP), an annual survey of representative household samples. In 2005, the SOEP included for the first time a short 15-item measure of the Big Five personality traits. The sample included 6,909 couples. This study produced several correlations greater than r = .2, for agreeableness, r = .25, conscientiousness, r = .31, and openness, r = .33. The lowest correlation was obtained for extraversion, r = .10. A cross-sectional analysis with length of marriage showed that spousal similarity was higher for couples who were married longer. For example, spousal similarity for openness increased from r = .26 for newlyweds (less than 5 years of marriage) to r = .47 for couples married more than 40 years.
A decade later it is possible to build on Rammstedt and Schupp’s results because the SOEP has collected three more waves with personality assessments in 2009, 2013, and 2017. This makes it possible to examine spousal similarity over time and to separate spousal similarity in stable dispositions (traits) and in deviations from the typical level (states).
I start with simple correlations, separately for each of the four waves using all couples that were available at a specific wave. The most notable observation is that the correlations do not increase over time. In fact, they even show a slight trend to decrease. This provides strong evidence that spouses are not becoming more similar to each other over time. An introvert who marries an extravert does not become more extraverted as a result or vice versa.
W1 (N = 6263)
W2 (N = 5905)
W3 (N = 5404)
W4 (N = 7805)
I repeated the analysis using only couples who stayed together and participated in all four waves. The sample size for this analysis was N = 1,860.
The correlations were not stronger and did not increase over time.
The next analysis examined correlations over time. If spousal similarity is driven by assortment on some stable trait, husbands’ scores in 2005 should still be correlated with wives’ scores in 2017 and vice versa. To ensure comparability for different time lags, I only used couples who stayed in the survey for all four waves (N = 1,860).
The results show more similarity on the same occasion (2005/2005) than across time. Across-time correlations are all below .2 and are decreasing. However, there are some small correlations of r = .1 for Openness, Agreeableness, and Conscientiousness, suggesting some spousal similarity in the stable trait variance. Another question is why spouses show similarity in the changing state variance.
There are two possible explanations for spousal similarity in personality state variance. One explanation is that spouses’ personality really changes in sync, just like their well-being changes in the same direction over time (Schimmack & Lucas, 2010). Another explanation is that spouses’ self-ratings are influenced by rating biases and that these rating biases are correlated (Anusic et al., 2009). To test these alternative hypotheses, I fitted a measurement model to the Big Five scales that distinguishes halo bias in personality ratings from actual variance in personality. I did this for the first and the last wave (2005, 2017) to separate similarity in the stable trait variance from similarity in state variance.
The key finding is that there is high spousal similarity in halo bias. Some couples are more likely to exaggerate their positive qualities than others. After removing this bias, there is relatively little spousal similarity for the actual trait variance.
In conclusion, spouses are not very similar in their personality traits. This may explain why this topic has received so little attention in the scientific literature. Null-results are often considered uninteresting. However, these findings do raise some questions. Why don’t extraverts marry extraverts or why don’t conscientious people not marry conscientious people. Wouldn’t they be happier with somebody who is similar in their personality? Research with the SOEP data suggests that that is also not the case. Maybe the Big Five traits are not as important for marital satisfaction as we think. Maybe other traits are more important. Clearly, human mating is not random, but it is also not based on matching personality traits.
John Bargh is a controversial social scientists with a knack of getting significant results when others cannot (Bargh in Bartlett, 2012). When somebody failed to replicate his most famous elderly-priming results (he published two exact replication studies, 2a and 2b, that were both successful, p < .05), he wrote a blog post. The blog-post blew up in his face and he removed it. For a while, it looked as if this historic document was lost, but it has been shared online. Here is another link to it : Nothing in their heads
Abstract: No robust and reliable interaction effects of the Big Five personality traits and unemployment on life-satisfaction in the German Socio-Economic Panel.
With the exception of late Walter Mischel, Lee Ross, and Dick Nisbett, we are all interactionists (ok, maybe Costa & Mcrae are guilty of dispositionism). As Lewin told every body in 1934, behaviour is a function of the person and the situation, and the a priori probability that the interaction effect between the two is zero (the nil-hypothesis is false) is pretty much zero. So, our journals should be filled with examples of personality x situation interactions. Right? But they are not. Every once in a while when I try to update my lecture notes and look for good examples of a personality x situation interaction I can’t find good examples. One reason is of course the long history of studying situations and traits separately. However, experience sampling studies emerged in the 1980s and the data are ideally suited to look for interaction effects. Another problem is that interaction effects can be difficult to demonstrate because you need large samples to get significant results.
This time I had a solution to my problems. I have access to the German Socio-Economic Panel (SOEP) data. The SOEP has a large sample (N > 10,000), measured the Big Five four times over a 12-year period and many measures of situations like marriage, child birth, or unemployment. So, I could just run an analysis and find a personality x situation interaction. After all, in large samples, you always get p < .05. Right? If you think so, you might be interested to read on and find out what happened.
The Big Five were measure the first time in 2005 (wave v). I picked unemployment and neuroticism as predictors because it is well-known that neuroticism is a personality predictor of life-satisfaction and unemployment is a situational predictor of life-satisfaction. It also made sense that neurotic people might respond more strongly to a negative life-event. However, contrary to these expectations, the interaction was far from significant (p = .5), while the main effects of unemployment (-1.5) and neuroticism (-.5) were highly significant. The effect of unemployment is equivalent to a change by three standard deviations in neuroticism.
Undeterred, I looked for interactions with the other Big Five dimensions. Surely, I would find an explanation for the interaction when I found one. To make things simple, I added all five interactions to the model and, hooray, a significant interaction with conscientiousness popped up, p = .02.
Was I the first to discover this? I quickly checked for articles and of course somebody else had beat me to the punch. There it was. In 2010, Boyce, Wood, and Brown had used the SOEP data to show that conscientious people respond more strongly to the loss of a job.
Five years later, a follow-up article came to the same conclusion.
A bit skeptical of p-values that are just below .02, I examined whether the interaction effect can be replicated. I ran the same analysis as I did with the 2005 data in 2009.
The effect size was cut in half and the p-value was no longer significant, p = .25. However, the results did replicate that none of the other four Big Five dimensions moderated the effect of unemployment.
So, what about the 2013 wave? Again not significant, although the effect size is again negative.
And what happened in 2017? A significant effect, hooray again, but this time the effect is positive.
Maybe the analyses are just not powerful enough. To increase power, we can include prior life-satisfaction as a predictor variable to control for some of the stable trait variance in life-satisfaction judgments. We are now only trying to predict changes in life-satisfaction in response to unemployment. In addition, we can include prior unemployment to make sure that the effect of unemployment is not due to some stable third variable.
We see that it is current unemployment that has a negative effect on life-satisfaction. Prior unemployment actually has a positive effect, suggesting some adaptation to long-term unemployment. Most important, the interaction between conscientiousness and current unemployment is not significant, p = .68.
The interaction was also non-significant in 2013, p = .69.
And there was no significant interaction in 2017, p = .38.
I am sure that I am not the first to look at this, especially given two published articles that reported a significant interaction. However, I suspect that nobody thought about sharing these results because the norm in psychology is still to report significant results. However, the key finding here appears to be that the Big Five traits do not systematically interact with a situation in explaining an important outcome.
So, I am still looking for a good demonstration of a personality x situation interaction that I can use for my lecture in the fall. Meanwhile, I know better than to use the published studies as an example.
it is very disappointing that you are not willing to retract an openly racist article that was published in your journal Social Psychological and Personality Science (SPPS) when Simine Varzire was editor of the journal and Lee Jussim was the action editor of the article in question (Cesario, Johnson, & Terrill, 2019). I have repeatedly pleaded with you to retract the article that draws conclusions on the basis of false assumptions. I am even more stunned by your decision because you rejected my commentary on this racist article with the justification that a better criticism was submitted. This criticism was just published (Ross et al., 2020). It makes the same observation that I made in my critique; that is, the conclusion that there is no racial bias in policing and the use of force rests entirely on an invalid assumption. The original authors simply assume that police officers only encounter violent criminals or that they only encounter violent criminals when they use deadly force.
Maybe you are not watching the news, but the Black Lives Matter movement started because police often use deadly force against non-violent African Americans. In some cases, this is even documented on video. Please watch the murder of Tamir Rice, George Floyd, Philando Castile, and Eric Garner and then tell their families and friends that police only kills violent criminals. That is what SPPS is telling everybody with the mantel of scientific truth, but is a blatantly false claim based on racists assumptions. So, why are you not retracting this offensive article?
So, why are you not retracting an article that makes an obviously false and offensive assumption? Do you think that a retraction would look badly on the reputation of your journal? In that case, you are mistaken. Research shows that journals that retract articles with false conclusions have higher impact factors and are more prestigious than journals that try to maintain a flawless image by avoiding retractions of bad science (Nature). So, your actions are not only offensive, but also hurt the reputation of SPPS and ultimately our science.
Your justification for not retracting the article is unconvincing.
“Just how to analyze data such as this is debated, mostly in criminology journals. (One can wonder what psychology was present in Cesario et al.’s study that led to publication in SPPS, but that’s another matter.) Cesario et al. made the important point that benchmarking with population data is problematic. Their methodology was imperfect. Ross et al. made important improvements. If one is interested in this question of police bias with benchmarking, the papers bring successive advances. ”
Your response implies that you did not fully understand Ross et al.’s criticism of the offensive article. The whole approach of “benchmarking” is flawed. So, publishing an article that introduces a flawed statistical approach from criminology to psychology is dangerous. What if we would start using this approach to study other disparities? Ross et al. show that this would be extremely harmful to psychological science. It is important to retract an article that introduces this flawed statistical approach to psychologists. As an editor it is your responsibility to ensure that this does not happen.
It is particular shocking and beyond comprehension that you resist retraction at the very same time many universities and academics are keenly aware of the systemic racism in academia. This article about an issue that affects every African American was based on research funding to White academics, reviewed by White academics, approved by White academics, and now defended and not retracted by a White academic. How does your action promote diversity and inclusion? It is even more surprising that you seem to be blind to this systemic racism in the publication of this racist article given your research on prejudice and the funding you received to study these issues (CV). Can you at least acknowledge that it is very offensive to Black people to attribute their losses of lives entirely to violent crime?
Academia is very slow in correcting itself. This is typically not a problem in psychological science because many articles do not have immediate real world consequences. However, when they do, it is important to correct mistakes as quickly as possible. The question whether (If there is any doubt about it) or how much racial bias in policing contributes to the racial disparity in victims of lethal use of force is one of them. While millions of Americans are demonstrating in the streets to support the Black Lives Matter movement, academics are slow to act and to show support for racial equality.
In 2019, the journal Social Psychological and Personality Science (SPPS) published an article by Cesario et al. with the controversial claim that there is no evidence that racial bias contributes to racial disparities in lethal use of force. The article even came to the opposite conclusion that police offers have a bias to shoot more White people than Black people. The article was edited and approved for publication by Lee Jussim, who is know for tirades against liberal-bias in academia. I cannot speak for him and he has repeatedly denied an opportunity to explain his decision. So, I have no evidence to disprove the hypothesis that he accepted the article because the conclusion fitted his conservative anti-anti-racism world-view. This would explain why he overlooked glaring mistakes in the article.
The main problem with this article is that it is unscientific. It is actually one of the worst articles I have ever seen and trust me, I have read and critiqued a lot of bad science. Don’t take my word for it. Aside from myself, SPPS received two other independent criticism of the article. My critique was rejected with the argument that one of the other criticisms was superior. After reading it, I agreed. It is a meticulous, scientific take-down of the garbage that Lee Jussim accepted for publication. I was happy that others agreed with me and made the point more clearly than I could. I was waiting patiently for it to be published. Then George Floyd was murdered on camera and the issue of racial bias in policing led to massive protests and swift actions.
During this time everybody was looking for the science on racial bias in policing. I know because my blog-posts about Cesarios’s fake science received a lot of views. The problem was that Cesario’s crappy science was published in prestigious, peer-reviewed journals, which made him the White expert on racial bias in policing. He happily responded to interview requests and presented his work as telling the true scientific story. The take down of his SPPS article that undercut his racist narrative was still not published.
On May 29, I emailed the current editor of SPPS to ask when the critique would be published.
“Dear. Dr. Monteith, given recent events, I am wondering where we are with the response to the SPPS article that makes false claims about lethal use of force against Black Americans. Is there a preprint of the response or anything that can be shared in public? “
Margo Monteith emailed me that there is no problem with sharing the article.
“I don’t see a problem with Cody putting his article online; SAGE has agreed that it will be an open access article (and they will feature on the SPPS website). I am only posting the main points to honor the request not to publish the entire article. “
I was waiting for it to be published by SPPS, but it is still not published, so I [edited on 6/19/20] shared it on June 17. It actually was published today on June 19th (pdf). ] Everybody needs to know that there is no scientific credibility to Ceario’s claims.
However, publishing a correction is not enough. Cesario and racists ideologists like Heather MacDonald will continue to use the published articles to make false claims in public. We cannot allow this. The critic of Cesario’s article is strong enough to show that the conclusions rest entirely on racists assumptions. In short, Cesario et al simply assume that police only kill violent criminals to end up with their conclusion that given crime rates, police are too soft on violent Black criminals. The problem with this racist conclusion is clear. The assumption that police only use lethal force against known violent criminals is plain wrong and we have many videos of innocent Black victims killed by police to prove it. If you draw conclusions from a false premise, your conclusions are false. It is as simple as that. The assumption is nothing but a racist stereotype about Black people. This racist assumption should never have been published in a scientific journal. The only way to rectify the mistake is to retract the article so that Cesario can no longer use the mantel of science to spread racist stereotypes about African Americans.
Please read the rebuttal (sorry, it is a bit statistics heavy, but you can get the main points without the formulas). If you agree that the original article is flawed, I ask you to show your support with BLM and your commitment to racial equality and let SPPS know that you think the original article needs to be retracted.
And yes, it doesn’t only happen to Black people, but contrary to the statistically flawed work by Cesario, Young Black unarmed men are more often the target of police brutality and the victims of lethal force errors (https://www.pnas.org/content/117/3/1263.short).
Some human behaviors attract more attention than others. Homicides are rare, but very salient human behaviors. Governments investigate and keep records of homicides and social scientists have developed theories of homicides.
In the 1960s, social scientists suggested that inequality can lead to more violence. One simple reason is that the rewards for poor people to commit violent crimes increase with greater inequality in wealth (Becker, 1968).
Cross-national studies confirm that societies with more income inequality have higher homicide rates (Avison & Loring, 1986; Blau & Blau, 1982; Chamlin & Cochran, 2006; Corcoran & Stark, 2020; Fajnzylber, Lederman & Loayza, 2002; Krahn et al., 1986; Pratt & Godsey, 2003; Pridemore, 2008).
A recent article in Psychological Science replicated this finding (Clark, Winegard, Beardslee, Baumeister, & Shariff, 2020). However, the main focus of the article was on personality attributes as a predictor of violence. The authors main claim was that religious people are less likely to commit crimes and that among non-religious individuals those with lower intelligence would be more likely to commit homicides.
A fundamental problem with this article is that the authors relied on an article by a known White-supremacist, Richard Lynn, to measure national differences in intelligence (Lynn & Meisenberg, 2010). This article with the title “National IQs calculated and validated for 108 nations” claims that the values used by Clark et al. (2020) do reflect actual differences in intelligence. The problem is that the article contains no evidence to support this claim. In fact, the authors reveal their racist ideology when they claim that a correlation between their scores and skin color of r = -.9 validates their measure as a measure of intelligence. This is not how scientific validation works. This is how racists abuse science to justify their racist ideology.
The article also makes a common mistake to impose a preferred causal interpretation on a correlation. Lynn and Meisenberg (2010) find that their scores correlate nearly perfectly with educational attainment. They interpret this as evidence that intelligence causes educational attainment and totally ignore the plausible alternative explanation that education influences performance on logical problems. This has important implications for Clark et al.’s (2020) article because the authors buy into Lynn and Meisenberg’s racist interpretation of the correlation between performance on logic problems and educational attainment. An alternative interpretation of their finding would be that religion interacts with education. In nations with low levels of formal education, religion provides a moral code that prevents homicides. In countries with more education, other forms of ethics can take the place of religion. High levels of homicides would be observed in countries where neither religion nor education teach a moral code.
Aside from this fundamental flaw in Clark et al.’s (2020) article, closer inspection of their data shows that they overlooked confounding factors and that their critical interaction is no longer significant when these factors are included in the regression model. In fact, financial and racial inequality are much better predictors of national differences in violence than religion and the questionable measure of intelligence. Below I present the statistical results that support this conclusion that invalidate Clark et al’s (2020) racist conclusions.
Not long ago, religion was a part of life in most countries. Only over the past century, some countries became more secular. Even today, most countries are very religious. Figure 1 shows the distribution of religiosity based on the Relig_ARDA variable in Clark et al.’s dataset. This skewed distribution can create problems when a variable is used in a regression model, especially if the variable is multiplied with another variable to test interaction effects.
It is common practice to transform variables to create a more desirable distribution for the purpose of statistical analysis. To do so, I reversed the item to measure atheism and then log-transformed the variable. To include countries that scored 100% on religiosity, I added 0.001 to all atheism scores before I carried out the log transformation. The distribution of log-atheism is less skewed.
The distribution of homicides (rates per 100,000 inhabitants) is also skewed.
Because homicide rates are right-skewed, a direct log-transformation can be applied to get a more desirable distribution. To include nations with a value of 1, I added a value of 1 before the log-transformation. The resulting distribution for log-homicides is more desirable.
The controversial IQ variable did not require a transformation.
The next figure shows a plot of homicides as a function of the questionable intelligence (QIM). There is a visible negative correlation. However, the plot also highlights countries in Latin America and the United States. These countries have in common that they were established by decimating the indigenous population and bringing slaves from Africa to work for the European colonialists. It is notable that nations with a history of slavery have higher homicide rates than other nations. Thus, aside from economic inequality, racial inequality may be another factor that contributes to violence even though slavery ended over 100 years ago, while racial inequality persists until today. Former slave countries also tend to score lower on the QIM measure. Thus, slavery may partially account for the correlation between QIM and homicide rates.
The next plot shows homicide rates as a function of atheism. A value of 0 would mean the country it totally atheistic, while more negative values show increasing levels of religion. There is no strong relationship between religion and homicide rates. This replicates the results in the original article by Clark et al. Remember that their key finding was a interaction between QIM and religion. However, the plot also shows a clear distinction between less religious countries. Former slave countries are low in religion and have high homicide rates, while other countries (mainly in Europe) are low in religion and have low homicide rates.
To examine the unique contribution of different variables to the prediction of homicide rates, I conducted several regression analyses. I started with the QIM x religion interaction to see whether the interaction is robust to transformations of the predictor variables. The results clearly show the interaction and main effects for QIM and religion (t-values > 2 are significant at p < .05).
Next I added slavery as a predictor variable.
The interaction is no longer significant. This shows that the interaction emerged because former slave countries tend to score low on QIM and religion.
I then added the GINI coefficient, the most widely used measure of income inequality, to the model. Income inequality was an additional predictor. The QIM x religion interaction remained non-significant.
I then added GDP to the model. Countries wealth is strongly related to many positive indicators. Given the skewed distribution, I used log-GDP as a predictor, which is also the most common way economists use GDP.
GDP is another significant predictor, while the QIM x religion interaction remains non-significant. Meanwhile, the strong relationship between QIM and homicide rates has decreased from b = -.71 without controls to b = -.25 with controls. However, it is still significant. As noted earlier, QIM may reflect education and Clark et al. (2020) included a measure of educational attainment in their dataset. It correlates r = .68 with QIM. I therefore substituted QIM with education.
However, education did not predict homicide rates. Thus, QIM scores capture something about nations that the education measure does not capture.
We can compare the social justice variables (slavery, GDP, GINI) with the personal-attribute (atheist, QIM) variables. A model with the social justice variables explains 62% of the variation in homicide rates across nations.
The personal-attribute model explains only 40% of the variance.
As these predictors overlap, the personal-attributes add only 3% additional variance to the variance that is explained by slavery, income inequality, and wealth.
Replicating Slavery’s Effect in the United States
The United States provide another opportunity to test the hypothesis that a legacy of slavery and racial inequality is associated with higher levels of homicides. I downloaded statistics about homicides (homicide stats). In addition, I used a measure of urbanization to predict homicides (urbanization). I also added a measure of income inequality (GINI). I classified states that fought for the confederacy as slave states (civil war facts). Results were similar for different years in which homicide rates were available from 1996 to 2018. So, I used the latest data.
In a model with all predictor variables, slavery was the only significant predictor. Income inequality showed a trend, and urbanization was not a unique predictor. When urbanization was removed from the model, the effect of income inequality was a bit stronger.
Overall, these results are consistent with the cross-national data and suggest that a history of slavery and persistent racial inequality create social conditions that lead to more violence and homicides. These results are consistent with recent concerns that systemic racism contributes to killing of civilians by civilians and police officers who historically had the role to enforce racial inequality.
Clark et al.’s (2020) article is flawed in numerous ways. Ideally, the authors would have the decency to retract it. The main flaw is the use of a measure with questionable validity and to never question the validity of the measure. This flaw is not unique to this article. It is a fundamental flaw that has also led to a large literature on implicit bias based on an invalid measure. The uncritical use of measures has to stop. A science without valid measures is not a science and statistical results that are obtained with invalid measures are not scientific results.
A second flaw of the article is that psychologists are trained to conduct randomized laboratory experiments. Random assignment makes it easy to interpret statistically significant results. Unless something went really wrong or sampling error produced a false result, a statistically significant result means that the experimental manipulation influenced the dependent variable. Causality is built into the design. However, things are very different when we look at naturally occurring covariation because everything is correlated with everything. Observed relationships may not be causal and they can be produced by variables that were not measured. The only way to deal with this uncertainty is to carefully test competing theories. It is also necessary to be careful in the interpretation of results. Clark et al. (2020) failed to do so and make overly strong statements based on their correlational findings.
“Many scholars have argued that religion reduces violent behavior within human social groups. Here, we tested whether intelligence moderates this relationship. We hypothesized that religion would have greater utility for regulating violent behavior among societies with relatively lower average IQs than among societies with relatively more cognitively gifted citizens. Two studies supported this hypothesis”
This statement would be fine if they had conducted an experiment, but of course, it is impossible to conduct an experiment to examine this question. This also means it is no longer possible to use evidence as support for a hypothesis. Correlational evidence simply cannot verify a hypothesis. It can only falsify wrong theories. Clark et al. (2020) failed to acknowledge competing theories of homicides and to test their theory against competing theories.
The last meta-scientific observation is that all conclusions in science rests on a combination of data and assumptions. When the same data lead to different conclusions, like they did here, we get insights into researchers’ assumptions. Clark et al.’s (2020) assumptions were (a) there are notable difference in intelligence between nations, (b) these differences are measured with high validity by Lynn and Weisenberg’s (2010) questionable IQ scores, and homicides are caused by internal dispositions like being an atheist with low intelligence. Given Lynn and Weisenberg’s finding that their questionable measure correlates highly with skin tone, they also implicitly share the racist assumption that dark skinned people are more violent because they are less intelligent. The present blog post shows that an entirely different story fits the data. Homicides are caused by injustice such as unfair distributions of wealth and discrimination and prejudice based on skin color. I am not saying that my interpretation of the data is correct because I am aware that alternative explanations are possible. However, I rather have a liberal/egalitarian bias than a racist bias.
The right-wing magazine Quilette just published an article by John McWhorter, who is an associate professor in the linguistics department at Columbia University with the title “Racist Police Violence Reconsidered” Given his training in writing, he knows how to draw his readers in with an emotional story about a White victim of lethal use of force to make the point that police sometimes kill White people, too. This is followed by the statement that “plenty of evidence indicates, however, that racism is less important to understanding police behavior than is commonly supposed”.
In a scientific article, this would be the time to mention the scientific evidence that is supposed to support this claim. But McWhorter is no scientist. He is a writer and cannot be held to the scientific standards of criminologists and other social scientists. With one sentence, a fact has been created. The idea that police is racially biased and kills disproportionately African Americans is wrong. But why does everybody believe it to be true? McWhorter has a ready explanation for this. The biased liberal, reverse-racist media cover police brutality only when the officer is White and the victim is Black. “Had Tony Timpa been black, we would all likely know his name by now. Had George Floyd been white, his name would likely be a footnote, briefly reported in Minneapolis local news and quickly forgotten.”
Well trained in propaganda, McWhorter then presented more cases of White victims in equal numbers to Black people. For every Black victim, there is a White victim in his narrative that is based on his personal selection of cases. After creating the illusion that there is a White victim for every Black victim, he is ready to repeat his claim that we have been manipulated by the liberal media. “So, the perception that the police regularly kill black people under circumstances in which white people would be merely disciplined is in fact a misperception.”
But is it a misperception? That would require actual scientific information about the influence of race on lethal use of force by police officers in the US. This evidence is reviewed after the conclusion has already been offered that the common assumption of racial bias against African Americans is a misperception.
McWhorter next explains correctly that African Americans are a minority in the United States. If police were unbiased in the lethal use of force, we would expect a lot more victims to be White than Black. He then correctly states that ” it remains true that black people are killed at a rate disproportionate to their percentage of the population.”
So, it is NOT a misperception that police kill disproportionately more African Americans. There is racial disparity in the use of force. This invalidates the claim that we all believe that racial bias exists because we have been manipulated by the liberal media.
McWhorter then makes a distinction between racial disparity and racial bias. “However, these figures are not necessarily evidence of police racism. According to the Washington Post‘s database, over 95 percent of the people fatally shot by police officers in 2019 were male, and no serious-minded person argues that this is evidence of systemic misandry. So what, then, accounts for the disproportionate representation of black men among those killed by cops?”
This is a controversial topic that has been examined in numerous articles by social scientists in scientific journals. But McWhorter does not cite these studies, presumably because he lacks the training to understand the sometimes complicated statistical methods that have been used in these articles.
Like a novelist, he creates facts with the stroke of a pen. “The socioeconomic gap between blacks and whites is doubtless an important contributing factor.” and “This disparity in poverty rates means black people are also disproportionately presented in violent rates of violent crime” Here we go again. Police is not racially biased. The real reason why they kill more Black people is because Black people are more criminal. Blame the victim. To give this biased narrative some credibility, McWhorter cites only one scientific article that supports his story. “Contrary to his expectations, Harvard economist Roland Fryer has found that while white men are actually more likely to be killed by cops,” as if an economist is more credible than criminologists or other scientists because he is at Harvard. That is not how science works. You also have to cite evidence that contradicts your claims (Young unarmed nonsuicidal male victims of fatal use of force are 13 times more likely to be Black than White).
In the end McWhorter softens his stance a bit. “This disparity cannot explain every fatal police shooting,” “This is not to say that race has nothing to do with policing issues in America.” But these sentences are mere rhetorical devices to signal that the author is balanced and reasonable, when the truth is that the author is ignorant about the science on racial bias in policing, including use of force.
I have no reason to believe that McWhorter wrote this terrible article because he is motivated by racism, but it is not clear to me why McWhorter wrote such a biased article that is so hurtful to many African Americans who are traumatized by the callus killing of innocent African Americans. All I can say is that McWhorter lacks the competence to write about this topic because he is either too lazy or not trained to follow the actual science on this topic. In Germany we say “Schuster blieb bei Deinen Leisten” (“Every man to his trade.”) Please follow this good advice, Dr. McWhorter.
A famous quote states that something that cannot be measured does not exist. This is of course not true, but if we want to move from anecdotal evidence to scientific evidence and theories of racism, we need valid measures of racism.
Social psychology has a long history of developing measures of racism and today there are dozens of different measures of racism. Unfortunately, psychologists are better at developing new measures than at validating existing ones. This makes research on racism vulnerable to criticism that racism measures are invalid or biased (Feldman & Huddy, 2005; Zigerell, 2015).
Take the item “Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without special favors” as an example. The item is one of several items that is used to measure a form of racism called symbolic racism.
Feldman and Huddy (2005) argue that items like this one have two components. A purely racist component where White people do not see Black people as equal citizens and a purely ideological component that opposes policies that favor any particular group, even if this group is disadvantaged by a history of racism. Whether the latter component is itself racist or not is not the topic of this blog post. My focus is rather on the separation of the two components. How can we separate agreement to the item that is based on racism from endorsement of the item for purely political reasons?
One solution to this problem is to see how endorsement of items with political content is related to items that have no political content. Using a statistical method called factor analysis it is then possible to separate the racial and the ideological component and to examine how much political orientation is related to the two components.
Indirect Measures of Racism
The problem with direct measures of racism is that open admission of racial bias has become less acceptable over time. This makes it harder to measure racism with items like “What about having a close relative marry a Black person? Would you be very in favor of it happening, somewhat in favor, neither in favor nor opposed to it happening, somewhat opposed, or very opposed to it happening?” Respondents may be unwilling to report their true feelings about this issue, especially when the interviewer is African American (Schaeffer, 1980; Schimmack, 2020).
Modern psychological testing with computers has made it possible to avoid these problems by measuring racism with computerized tasks that rely on participants’ behavior in response to racial stimuli. There are several tasks such as the evaluative priming task, the affective misattribution task and the popular Implicit Association Task (IAT). Unfortunately, the IAT has become known as a measure of implicit bias or implicit racism that is distinct from racism that can be measured with self-report measures. I have argued that there is no evidence that people can hide their feelings towards African Americans from themselves. It is more useful to see these tasks as alternative measures of racism that are less susceptible to fake responding. This does not mean that these tasks are perfect measures of racism because the use of computerized tasks creates new problems. Thus, there is no perfect measure of racism, but all valid measures of racism should be positively correlated with each other and the shared variance among these measures is likely to reflect variation in racism. The interesting question is whether political orientation is related to the shared variance among a variety of direct and indirect racism measures.
The data come from a study by Bar-Anan and Vianello (2018). The data have also been used in my critique of the IAT as a measure of implicit bias (Schimmack, 2020). The study assessed political orientation and racism with multiple measures. Political orientation was also measured with the standard and the brief IAT. In addition, participants reported whether they voted Republican or Democrat. Only White participants who reported voting were included in the analysis.
Racism was measured with the standard IAT, the brief IAT, the evaluative priming task, the Affective Missattribution Task, a direct rating of preference for White or Black people, and the Modern Racism Scale. Like other measures that have been criticized, the Modern Racism scale mixes racism and political ideology.
The interesting question is how much political orientation is related to the unique variance in the modern racism scale that is not shared with other racism measures and how much it is related to the shard variance with other racism measures.
The results show two clearly identified factors. The strong relationship between voting and the Republican factor (rep) shows that political orientation can be measured well with a direct question. In contrast, racism is more difficult to measure. The best measure in this study would be the direct preference rating (r_att) that is related .6 with the pro-White factor. But even this relationship implies that only about a third of the variance in the actual ratings reflects racism. The rest of the variance is measurement error. So, there is no gold-standard or perfect way to measure racism. There are only multiple imperfect ways. The results also show that the controversial Modern Racism Scale (mrs) reflects both racism (.444) and political orientation (.329). This shows that Republicans score high on Modern Racism in part because they reject social policies that favor minority groups independent of their attitudes towards Black Americans. However, the figure also shows that Republicans are more racist, as reflected in the relationship between the Republican and Racism factors (.416).
It is important that these results cannot be used to identify individuals or to claim that a particular Republican is a racist. The results do show however, that people who vote Republican are more likely to score higher on a broad range of racism measures whether they mention a political agenda or not.
Critics of racism research by social psychologists have argued that the research is biased because many social psychologists are liberal. The accusation is that social psychologists have created biased measures that conflate liberal policies with bigotry. Here I show that these critics have a valid point and that high scores on scales like the symbolic racism scale and the modern racism scale are influenced by attitudes towards egalitarian policies. However, I also showed that Republicans are more racist when racism is measured with a broad range of measures that have only racism as a common element.
Conservatives may be displeased by this finding, but recent events in 2020 have made it rather obvious that some Americans are openly racist and that these Americans are also openly supporting Trump. The real question for Republicans who oppose racism is how they can get rid of racism in their party.
Are Republicans more racist? I am not the only one who is asking themselves this question. In 2014, Nate Silver and Allison McCann examined this question using data from the General Social Survey (538.2014). The political landscape has changed since Trump was elected in 2016 and it is interesting to reexamine this question with the latest data.
Like Silver and McCann, I limit the analysis to White respondents because Black voters are much more likely to support the Democratic party. Thus, a trivial reason why Republicans are more likely to be racist is that they are more likely to be White. The more interesting question is whether there is an additional difference between White Republicans and White Democrats.
The General Social Survey (GSS) asks two questions about political orientation. One question asks about party identification. Responses range from strong democrat to strong republican. To identify conservative Republicans, I classified respondents who responded with “independent – close to Republican”, “not very strong Republican”, and “strong republican” as Republicans.
Another question asks respondents to locate their political orientation on a liberal versus conservative dimension on a 7-point scale ranging from extremely liberal to extremely conservative. To identify conservatives, I combined the three response categories slightly conservative, conservative, and extremely conservative.
Figure 1 shows that responses to these two questions are not perfectly aligned. There are conservative Democrats and liberal Republicans. There are several ways to deal with this inconsistency. I chose to contrast conservative Republicans with the other three groups (liberal Republicans, conservative Democrats, & liberal Democrats).
The GSS also asked about voting behavior in past elections. In the 2018 survey, most conservative Republicans voted for Trump.
Racism is a loaded word with no clear meaning. While the majority of Democrats considers Donald Trump to be a racist, most Republicans disagree. Yet, we need a term to describe people who do not see or treat Black and White people as equal. Given the history of the United States, it is difficult for a White American to live up to the ideal of racial equality. However, there is variation in the degree to which White Americans treat Black and White people as equal or not. Thus, the question is not whether somebody is a racist or not, but the degree to which White’s treat people equally or unequally.
The GSS includes several questions that measure racism. Some of these questions were discontinued and others were added in later years. I found five questions that were asked in later years that measure racism.
1. Feeling Thermometer. Feeling thermometers are a widely used measure of racism. One question is about feelings towards Black people. Another question is about feelings towards White people. The difference between these two ratings indicates how much people have more positive feelings towards White people than towards Black people. Feelings are assessed on a 9-point scale ranging from 1 = not at all close to 9 = very close. The difference score can range from -8 to + 8.
The Figure shows that most respondents indicate that they make no difference in their feelings based on race, while some participants report feeling closer to White people.
2. Interracial Marriages The United States have a long history of racism that is reflected in laws against interracial marriages (Loving vs. Virginia). Even today, some Americans do not approve of interracial marriages. The GSS asks respondents “What about having a close relative marry a Black person? Would you be very in favor of it happening, somewhat in favor, neither in favor nor opposed to it happening, somewhat opposed, or very opposed to it happening?” The Figure shows the distribution of the responses before and after 2016. The results show a decrease in opposition to mixed marriages, but even today many US American are against somebody in their family marrying an African American.
3. Like other Immigrants The third item asks respondents how much they agree with an item that ignores the special history of slavery, discrimination, and police brutality that African Americans are subjected to. “Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without special favors.” Responses are made on a 5-point scale ranging from 1=strongly disagree to 5=strongly agree. The Figure shows that even today a majority of White Americans agree with this statement.
4. Opposing Affirmative Action One question in the GSS asks about respondents’ attitudes towards affirmative actions.
“Some people say that because of past discrimination, blacks should be given preference in hiring and promotion. Others say that such preference in hiring and promotion of blacks is wrong because it discriminates against whites. What about your opinion — are you for or against preferential hiring and promotion of blacks?” Responses are favor strongly, favor, oppose, and oppose strongly.
It is important to point out that opposition to affirmative actions may be based on other consideration than racism. Nevertheless, the item can be used to measure racism because racists are much more likely to agree with the item than respondents who strive towards racial equality.
The advantage of this item for measuring racism is that it provides opportunity for plausible deniability. It is possible to endorse the item without revealing that the response is motivated by racism. This may explain the high proportion of Americans who express strong opposition to affirmative actions.
5. Spending on Improving the conditions of Blacks The fifth question is part of a series of questions about public spending. One question asks whether respondents think the government is spending too little or too much on “improving the conditions of Blacks.” As with the previous item, respondents may answer “too much” for several reasons. For example, some may oppose taxation and government spending on anything. Thus, responding “too much” does not directly imply that the response is racially motivated, but individuals who are more racists are more likely to answer “too much.”
A Measurement Model of Racism
Responses to each of the five items can be influenced by many factors (causes) other than racism. Thus, we cannot simply look at individual items to determine how racist somebody is. Psychologists have developed measurement models to examine whether responses to items are driven by a common cause. Finding evidence for a common cause does not conclusively show that the cause is racism, but it makes it more likely.
The Figure shows the measurement model of racism. In this model racism is an attribute (trait) that varies across individuals. Individuals with high racism are more likely to respond to the five items in ways that express a dislike of African Americans. The numbers next to the arrows show how much a specific item reflects racism. Interestingly, the feeling thermometer is a relatively weak measure of racism. One concern could be that the racism factor is dominated by issues related to money, but the opposition to a black family member shows that that racism is not limited to equity issues.
Another concern is that responses to a survey are different from actual traits. Some respondents may downplay their racism because they are aware that racism is offensive. The GSS makes it possible to test this hypothesis because respondents were sometimes interviewed by Black interviewers. The next figure shows that White respondents respond to the question in a less racist manner when they are interviewed by a black interviewer. This effect is general across all five items, but especially strong for the feeling thermometer and the spending question.
In short, measuring racism is difficult and no single question provides a clear and unambiguous measure of racism. However, there is clear evidence that people respond to the five questions in a consistent manner that is more or less racist, suggesting that the factor underlying their responses can be used as a measure of racism.
Political Orientation and Racism
The next analysis examined differences between conservative republicans and other Americans. The results show that conservative republicans are more racist (.769 standard deviations). In addition, they oppose affirmative actions and spending on Black issues. This shows that politics also influence responses to the items independent of racism, but even when we allow for such extra relationships, there is a strong relationship between identifying as a conservative republican and racism. In psychology a difference of .8 standard deviations is considered a large effect. In intelligence testing, a standard deviation is 15, and a difference of .769 would be a difference by 12 IQ points.
Political Orientation and Racism Over Time
The first figures showed that racism was lower in 2016 and 2018 than in the years before. To examine whether changes in racism differ as a function of political orientation, I added time to the model and examined different time trends for conservative republicans and others. I used presidents to measure time using the 1990s as the reference period and counted 2016 as a Trump year, although he was only elected at the end of the year to have two years of observations (2016, 2018). The results of the model showed a significant interaction between political orientation and time. While racism levels remained nearly unchanged for conservative republicans, they decreased for other Americans.
The figure shows a dramatic difference in racism during the Trump years. The difference of 1.4 standard deviations corresponds to a difference of 21 points in IQ testing. Interestingly, the difference is not due to increasing racism among conservative republicans, but rather decreasing racism among Americans.
It is tempting to suggest that this large difference in racism explains the much stronger response to George Floyd’s murder by a police officer, while similar events in previous years did not spark the widespread protests the US have seen in the last weeks. The data suggest that America as a whole is moving towards more racial equality and that many White Americans empathize with African Americans. However, conservative republicans are an exception. Many of them are opposed to racial equality.
Psychological research over the past two decades has focused a lot on implicit racism. The idea is that individuals can hide their racism from themselves. The notion of unconscious bias or racism is controversial and there is no solid scientific evidence for implicit racism. Most people know whether they would welcome a Black family member or not. A more common form of racism is a conscious choice to ignore the racism of others, whether it is a colleague, a friend, a family member or a party member. You know they are racist, but you don’t say anything about it. Most people, including conservative republicans, are not surprised by the finding that conservative republicans are more likely to be racist. We all know examples of racist conservative republicans and the Republican party carefully avoids alienating racist voters. The current Republican party suffers from complicit racism. Some party strategists even admit that complicit racism was used to win majorities. However, most Americans are increasingly appalled by the complicit racism that is inconsistent with modern American values. Just like the civil war and the civil rights movement, America today is still struggling with the “birth defect of slavery” (Condoleezza Rice).
Political Science and Political Bias
It is not hard to figure out my personal political views and I have no reason to hide them. I am not a fan of conservative republicans and their current leader. This may have influenced the way I analyzed and presented the results. Fortunately, the data are openly available and conservatives are welcome to analyze the data and present their findings. In this spirit, I consider this blog post as a thesis that is awaiting a critical examination and response. However, is it really in doubt that respondents who voted for Trump are more likely to oppose affirmative action and to object to a Black family member? Many probably think that these results merely show the obvious.