Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with sampling error, replicability refers to the probability that a study with a significant result produces a significant result again in an exact replication study of the first study using the same sample size and significance criterion (Schimmack, 2017).
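Under this definition, the replicability of a single study equals its statistical power given the true effect. The following is a minimal sketch, assuming a two-sided z-test and a known expected z-score (noncentrality). In practice that value is unknown and has to be estimated; the function name `replication_probability` is made up for this illustration and is not taken from any package.

```python
from statistics import NormalDist


def replication_probability(true_z: float, alpha: float = 0.05) -> float:
    """Probability that an exact replication is significant (two-sided z-test).

    true_z is the expected z-score (noncentrality) of the study. It is an
    assumed input for illustration; real studies do not reveal it directly.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    upper = 1 - std_normal.cdf(z_crit - true_z)  # significant, same sign as true_z
    lower = std_normal.cdf(-z_crit - true_z)     # significant, opposite sign
    return upper + lower
```

With a true effect of zero, the function returns the significance criterion itself (.05). A study whose expected z-score sits exactly at the critical value has only about a 50% chance of replicating.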

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2020): Z-Curve.2.0 (with R-package) 




  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presents the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index


The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
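The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observed power is computed by treating each observed z-score as if it were the true effect, and inflation is taken to be the success rate (share of significant results) minus median observed power. The post gives only the top-level formula, so that definition of inflation and the names `observed_power` and `r_index` are assumptions of this sketch.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def observed_power(z: float, alpha: float = 0.05) -> float:
    """Two-sided power of a z-test if the observed |z| were the true effect."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z = abs(z)
    return (1 - _STD_NORMAL.cdf(z_crit - z)) + _STD_NORMAL.cdf(-z_crit - z)


def r_index(z_scores, alpha: float = 0.05):
    """Return (median observed power, inflation, R-Index) for a set of z-scores."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    median_power = median(observed_power(z, alpha) for z in z_scores)
    # Assumed definition: inflation = success rate - median observed power.
    success_rate = sum(abs(z) > z_crit for z in z_scores) / len(z_scores)
    inflation = success_rate - median_power
    return median_power, inflation, median_power - inflation
```

For example, five just-significant results (all z = 2.0) have a 100% success rate but a median observed power of only about .52, so the R-Index falls close to zero.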


4.  The Test of Insufficient Variance (TIVA)


The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
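The steps described above can be sketched with the Python standard library alone. This is a hedged illustration, not the blog's own implementation: it assumes the sample variance of k z-scores multiplied by k - 1 follows a chi-square distribution with k - 1 degrees of freedom when the true variance is 1. The chi-square CDF is hand-rolled from a series expansion so that no external package is needed; `chi2_cdf` and `tiva` are names chosen for this sketch.

```python
import math
from statistics import variance


def chi2_cdf(x: float, df: int) -> float:
    """Chi-square CDF via the series for the regularized lower incomplete gamma.

    Intended for the small to moderate x values that arise in a left-tailed test.
    """
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total and n < 100_000:
        n += 1
        term *= half_x / (a + n)
        total += term
    log_prefactor = -half_x + a * math.log(half_x) - math.lgamma(a + 1)
    return min(1.0, total * math.exp(log_prefactor))


def tiva(z_scores):
    """Return (observed variance, left-tailed p-value) against an expected variance of 1."""
    k = len(z_scores)
    if k < 2:
        raise ValueError("TIVA needs at least two z-scores")
    observed_var = variance(z_scores)   # sample variance with k - 1 in the denominator
    statistic = (k - 1) * observed_var  # divided by the expected variance of 1
    return observed_var, chi2_cdf(statistic, k - 1)
```

A small p-value means the z-scores vary less than sampling error alone would produce, which is the pattern expected when non-significant results are missing.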

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.

Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing


Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.

A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.

In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

Lies, Damn Lies, and Experiments on Attitude Ratings

Ten years ago, social psychology had a once-in-a-lifetime opportunity to realize that most of their research is bullshit. Their esteemed colleague Daryl Bem published a hoax article about extrasensory perception in their esteemed Journal of Personality and Social Psychology. The editors felt compelled to write a soul-searching editorial about research practices in their field that could produce such nonsense results. However, 10 years later social psychologists continue to use the same questionable practices to publish bullshit results in JPSP. Moreover, they are willfully ignorant of any criticism of their field, which is producing mostly pseudo-scientific garbage. Just now, Wegener and Petty, two social psychologists at Ohio State University, wrote an article that downplays the importance of replication failures in social psychology. At the same time, they published a JPSP article that shows they haven’t learned anything from 10 years of discussion about research practices in psychology. I treat the first author as an innocent victim who is being trained in the dark art of research practices that have given us social priming, ego-depletion, and time-reversed sexual arousal.

The authors report seven studies. We don’t know how many other studies were run. The seven studies are standard experiments with one or two (2 x 2) experimental manipulations between subjects. The studies are quick online studies with Mturk samples. The main goal was to show that some experimental manipulations influence some ratings that are supposed to measure attitudes. Any causal effect on these measures is interpreted as a change in attitudes.

The problem for the authors is that their experimental manipulations have small effects on the attitude measures. So, individually, Studies 1-6 would not show any effects. At no point did they consider this a problem and increase sample sizes. However, they were able to fix the problem by combining studies that were similar enough into one dataset. This was also done by Bem to produce significant results for time-reversed causality. It is not a good practice, but that doesn’t bother editors and reviewers at the top journal of social psychology. After all, they all do not know how to do science.

So, let’s forget about the questionable studies 1-6 and focus on the preregistered replication study with 555 Mturk workers (Study 7). The authors analyze their data with a mediation model and find statistically significant indirect effects. The problem with this approach is that mediation no longer has the internal validity of an experiment. Spurious relationships between mediators and the DV can inflate these indirect effects. So, it is also important to demonstrate that there is an effect by showing that the manipulation changed the DV (Baron & Kenny, 1986). The authors do not report this analysis. The authors also do not provide information about standardized effect sizes to evaluate the practical significance of their manipulation. However, the authors did provide covariance matrices in a supplement and I was able to run the analyses to get this information.

Here are the results.

The main effect for the bias manipulation is d = -.04, p = .38, 95% CI = [-.12, .05].

The main effect for the untrustworthiness manipulation is d = .01, p = .75, 95% CI = [-.07, .10].

Neither effect is significant. Moreover, the effect sizes are so small and, thanks to the large sample size, the confidence intervals are so narrow that we can reject the hypothesis that the manipulations have at least a small effect, d = .2.
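The logic of rejecting "at least a small effect" can be written as a simple interval check: if the whole confidence interval lies inside the band from -.2 to +.2, effects as large as d = .2 in either direction are ruled out. The sketch below uses the two intervals reported above. The helper name `excludes_small_effect` is hypothetical, and a formal equivalence test would typically use a 90% interval, which makes this 95% check the more conservative of the two.

```python
def excludes_small_effect(ci_low: float, ci_high: float,
                          smallest_effect: float = 0.2) -> bool:
    """True if the whole CI lies strictly inside (-smallest_effect, +smallest_effect)."""
    return -smallest_effect < ci_low and ci_high < smallest_effect


# 95% confidence intervals reported for Study 7
bias = excludes_small_effect(-0.12, 0.05)
untrustworthiness = excludes_small_effect(-0.07, 0.10)
```

Both checks come out true for the reported intervals, which is the basis for the claim that even a small effect can be rejected.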

So, here we see the total failure of social psychology to understand what they are doing and their inability to make a real contribution to the understanding of attitudes and attitude change. This didn’t stop Rich Petty from co-authoring an article about psychology’s contribution to addressing the Covid-19 pandemic. Now, it would be unfair to blame 150,000 deaths on social psychology, but it is a fact that 40 years of trivial experiments have done little to help us change attitudes like attitudes towards wearing masks in the real world.

I can only warn young, idealistic students against considering social psychology as a career path. I speak from experience. I was a young idealistic student eager to learn about social psychology in the 1990s. If I could go back in time, I would have done something else with my life. In 2010, I thought social psychology might actually change for the better, but in 2020 it is clear that most psychologists want to continue with their trivial experiments that tell us nothing about social behaviour. If you just can’t help it and want to study social phenomena, I recommend personality psychology or other social sciences.

Statues are Falling, but Intelligence Researchers Cling to Their Racist Past

Psychology wants to be a science. Unfortunately, respect and reputations need to be earned. Just putting the name science in your department name or in the title of your journals doesn’t make you a science. A decade ago, social psychologists were shocked to find out that for years one of their colleagues had just made up data and nobody had noticed it. Then, another social psychologist proved physics wrong and claimed to have evidence of time-reversed causality in a study with erotic pictures and undergraduate students. This also turned out to be a hoax. Over the past decade, psychology has tried to gain respect by doing more replication studies of classic findings (that often fail), starting to preregister studies (which medicine implemented years ago), and in general analyzing and reporting their results more honestly. However, another crisis in psychology is that most measures in psychology are used without evidence that they measure what they are supposed to measure. Imagine a real science where scientists first ensure that their measurement instruments work and then use them to study distant planets or microorganisms. Not so psychology. Psychologists have found a way around proper measurement called operationalism. Rather than trying to find measures for constructs, constructs are defined by the measures. What is happiness? While philosophers have tried hard to answer this question, psychologists cannot be bothered to spend time thinking about it. Happiness is whatever your rating on a happiness self-report measure measures.

The same cheap trick has been used by intelligence researchers to make claims about human intelligence. They developed a series of tasks and performance on these tasks is used to create a score. These scores could be given a name like “score that reflects performance on a series of tasks some White men (yes, I am a White male myself) find interesting,” but then nobody would care about these scores. So, they decided to call it intelligence. If pressed to define intelligence, they usually do not have a good answer to this question, but they also don’t feel the need to give an answer because intelligence is just a term for the test. However, the choice of the term is not an accident. It is supposed to sound as if the test measures something that corresponds to the everyday term intelligence to make the test more interesting. However, it is possible that the test is not the best measure of what we normally mean by intelligence. For example, performance on intelligence tests correlates only about r = .3 with self-ratings or ratings by close friends and family members of intelligence. While there can be measurement error in self-ratings, there can also be measurement error in intelligence tests. Although intelligence researchers are considered to be intelligent, they rarely consider this possibility. After all, their main objective is to use these tests and to see how they relate to other measures.

Confusing labels for tests are annoying, but hardly worth a long blog post. However, some racist intelligence researchers use the label to make claims about intelligence and skin color (Lynn & Meisenberg, 2010). Moreover, the authors even use their racist preconception that dark-skinned people are less intelligent to claim that intelligence tests measure intelligence BECAUSE performance on these tests correlates with skin color.

You don’t have to be a rocket scientist to realize that this is a circular argument. Intelligence tests are valid because they confirm a racist stereotype. This is not how real science works, but this doesn’t bother intelligence researchers. The questionable article has been cited 80 times.

I only came across this nonsense because a recent article used national IQ scores to make an argument about intelligence and homicides. After concerns about the science were raised, the authors retracted their article pointing to problems in the measurement of national differences in IQ. The editor of this journal, Psychological Science, wrote an editorial with “A Call for Greater Sensitivity in the Wake of a Publication Controversy.”

Greater sensitivity also means cleaning the journals of unscientific and hurtful claims that serve no scientific purpose. In this spirit, I asked the current editor of Intelligence in an email on June 15th to retract Lynn and Meisenberg’s offensive article. Today, I received the response that the journal is not going to retract the article.

Richard Haier (Emeritus, Editor in Chief) Decision Letter

This decision just shows the unwillingness among psychologists to take responsibility for a lot of bad science that is published in their journals. This is unfortunate because it shows the low motivation to change and improve psychology. It is often said that science is the superior method to gain knowledge because science is self-correcting. However, scientists often stand in the way of correction, and the process of self-correction is best measured in decades or centuries. Max Planck famously observed that scientific self-correction often requires the demise of the old guard. However, it is also important not to hire new scientists who continue to abuse the freedom and resources awarded to scientists to spread racist ideology. Meanwhile, it is best to be careful and to distrust any claims about group differences in intelligence because intelligence researchers are not willing to clean up their act.

A Hierarchical Factor Analysis of Openness to Experience

In this blog post I report the results of a hierarchical factor analysis of 16 primary openness to experience factors. The data were obtained and made public by Christensen, Cotter, and Silvia (2019). The dataset contains correlations for 138 openness items taken from four different Big Five measures (NEO-PI3, HEXACO, BFAS, & Woo). The sample size was N = 802.

The authors used network analysis to examine the relationship among the items. In the network graph, the authors identified 10 clusters (communities) of items. Some of these clusters combine overlapping constructs in different questionnaires. For example, aesthetic appreciation is represented in all four questionnaires.

This is a good first step, but Figure 1 leaves many questions unanswered. Mainly, it does not provide quantitative information about the relationship of the clusters to each other. The main reason is that network analysis does not have a representation of the forces that bind items within a cluster together. This information was presented in a traditional correlation table based on sum scores of items. The problem with sum scores is that correlations between sum scores can be distorted by secondary loadings. Moreover, there is no formal test that 10 clusters provide an accurate representation of item-relationships. As a result, there is no test of this model against other plausible models. The advantage of structural equation modeling with latent variables is that it is possible to represent unobserved constructs like Openness and to test the fit of a model to the data.

Despite the advantages of structural equation modeling (SEM), many researchers are reluctant to use it for a number of unfortunate reasons. First, structural equation modeling has been called Confirmatory Factor Analysis (CFA). This has led to the misperception that SEM can only be used to test theoretical models. However, it is not clear how one would derive a theoretical model that perfectly fits data without exploration. I use SEM to explore the structure of openness without an a priori theoretical model. This is no more exploratory than visual inspection of a network representation of a correlation matrix. There is no good term for this use of SEM because the term exploratory factor analysis is used for a different mathematical model. So, I simply call it SEM.

Another reason why SEM may not be used is that model fit can show that a specified model does not fit the data. It can be time consuming and require thought to create a model that actually fits the data. In contrast, EFA and network models always provide a solution even if the solution is suboptimal. This makes SEM harder to use than other exploratory methods. However, with some openness to new ideas and persistence, it is also always possible to find a fitting model with SEM. This does not mean it is the correct model, but it is also possible to compare models to each other with fit indices.

SEM is a very flexible tool and its capabilities have often not been fully recognized. While higher-order or two-level models are fairly common, models with more than two levels are rare, but can be easily fit to data that have a hierarchical structure. This is a useful feature of SEM because theoretical models have postulated that personality is hierarchically structured with several levels: The global level, aspects, facets, and even more specific traits called nuances below facets. However, nobody has attempted to fit a hierarchical model to see whether Openness has an aspect, a facet, and a nuance level. Christensen et al.’s data seemed ideally suited to examine this question.

One limitation of SEM is that modeling becomes increasingly difficult as the number of items increases. On the other hand, three items per construct are sufficient to create a measurement model at the lowest level in the hierarchy. I therefore first conducted simple CFA analyses of items belonging to the same scale and retained items with high loadings on the primary factor and no notable residual correlations with other items. I did not use the 20 aspect items because they were not designed to measure clean facets of Openness. This way, I only needed to fit a total of 48 items for the 16 primary scales of Openness in the three questionnaires:

NEO: Artistic, Ideas, Fantasy, Feeling, Active, Values
HEXACO: Artistic, Inquisitive, Creative, Unconventional
Woo: Artistic, Culture, Tolerance, Creative, Depth, Intellect

Exploratory analysis showed that the creative scales in the HEXACO and Woo measures did not have unique variance and could be represented by a single primary factor. This was also the case for the artistic construct in the HEXACO and Woo measures. However, the NEO artistic items showed some unique variance and were modeled as a distinct construct, although this could just be some systematic method variance in the NEO items.

The final model (MPLUS syntax) had reasonably good fit to the data, RMSEA = .042, CFI = .903. This fit was obtained after exploratory analyses of the data and simply shows that it was possible to find a model that fits the data. A truly confirmatory test would require new data and fit is expected to decrease because the model may have overfitted the data. To obtain good model fit it was necessary to include secondary loadings of items. Cross-validation can be used to confirm that these secondary loadings are robust. All of this is not particularly important because the model is exploratory and provides a first attempt at fitting a hierarchical factor model to the Openness domain.

In Figure 2, the boxes represent primary factors that represent the shared variance among three items. The first noteworthy difference from the network model is that there are 14 primary constructs compared to 10 clusters in the network model. However, NEO-Artistic (N-Artistic) is strongly related to the W/H-Artistic factor and could be combined while allowing some systematic measurement error in the NEO items. So, conceptually, there are only 13 distinct constructs. This still leaves three more constructs than the network analysis identified. The reason for this discrepancy is that there is no strict criterion for deciding at which point a cluster may reflect two related sub-clusters.

Figure 2 shows a hierarchy with four levels. For example, creativity (W/H-Creative) is linked to Openness through an unmeasured facet (Facet-2) and artistic (W/H-Artistic). This also means that creativity is only weakly linked to Openness, as the indirect path is the product of the three links, .9 * .7 * .5 ≈ .3. This means that Openness explains only about 9% of the variance in the creativity factor.
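The path-tracing arithmetic in this example is easy to verify. The sketch below multiplies the three standardized links and squares the result to get the share of variance explained. The function name `indirect_effect` is chosen for this illustration, and the post does not say which loading belongs to which link, so only the product is meaningful here.

```python
import math


def indirect_effect(loadings) -> float:
    """Product of standardized loadings along a chain of paths."""
    return math.prod(loadings)


path = indirect_effect([0.9, 0.7, 0.5])  # Openness -> ... -> W/H-Creative
variance_explained = path ** 2           # share of variance in the creativity factor
```

Unrounded, the product is .315 and its square is just under .10. The 9% figure follows from squaring the rounded value of .3.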

In factor analysis it is common to use loadings greater than .6 for markers that can be used to measure a construct and to interpret its meaning. I highlighted constructs that are related .6 or higher with the Openness factor. The most notable marker is the NEO-Ideas factor with a direct loading of .9. This suggests that the core feature of Openness is to be open to new ideas. Other markers are Woo’s curiosity factor and, mediated by the Facet-2 factor, the HEXACO inquisitive factor. So, core features of Openness are being open to new ideas, being curious, and being inquisitive. Although these labels sound very similar, the actual constructs are not redundant. The other indicators that meet the .6 threshold are artistic and unconventional.

Other primary factors differ greatly in their relatedness to the Openness factor. Openness to Feeling’s relationship is particularly weak, .4 * .4 = .16, and suggests that openness to feelings is not a feature of Openness or that the NEO-Feelings items are poor measures of this construct.

Finally, it is noteworthy that the model provides no support for the Big Five Aspects Model, which postulates a level with two factors between Openness and the Openness facets. It is particularly troubling that the intellect aspect is most strongly related to Woo’s intellectual efficiency factor (W-Intellect, effect size r = .6), and only weakly related to the ideas factor (N-Ideas, r = .2) and the curiosity factor (W-Curious, r = .2). As Figure 2 shows, (self-rated) intellectual abilities are a distinct facet and not a broader aspect with several subordinate facets. The Openness aspect is most strongly related to artistic (W/H-Artistic, r = .4), with weaker relationships to feelings, fantasy, and ideas (all r = .2). The problem with the development of the Big Five Aspects Model was that it relied on Exploratory Factor Analysis, which is unable to test hierarchical structures in data. Future research on hierarchical structures of personality should use Hierarchical Factor Analysis.

In conclusion, SEM is capable of fitting hierarchical models to data. It is therefore ideally suited to test hierarchical models of personality. Why is nobody doing this? Orthodoxy has relegated SEM to confirmatory analysis of models that never fit the data because we need to explore before we can build theories. It requires high openness to new ideas, unconventionality, curiosity, and inquisitiveness to break with conventions and to use SEM as a flexible and powerful statistical tool for data exploration.

Open SOEP: Spousal Similarity in Personality

Abstract: I examined spousal similarity in personality using four waves of data over a 12-year period in the German Socio-Economic Panel. There is very little spousal similarity in actual personality traits like the Big Five. However, there is a high similarity in the halo rating bias between spouses.

Spousal similarity in personality is an interesting topic for several reasons. First, there are conflicting folk ideas about spousal similarity. One saying assumes that “birds of a feather flock together;” another says that “opposites attract.” Second, there is large interest in the characteristics people find attractive in a mate. Do extraverts find other extraverts more attractive? Would assertive (low agreeableness) individuals prefer a mate who is as assertive as they are or rather somebody who is submissive (high agreeableness)? Third, we might wonder whether spouses become more similar to each other over time. Finally, twin studies of heritability make the assumption that mating is random, an assumption that can be questioned.

Given so many reasons to study spousal similarity in personality, it is surprising how little attention this topic has received. A literature search retrieved only a few articles with few citations: Watson, Beer, & McDade-Montez (2014) [20 citations], Humbad, Donnellan, Iacono, McGue, & Burt (2010) [30 citations], Rammstedt & Schupp (2008) [25 citations]. One possible explanation for this lack of interest could be that spouses are not similar in personality traits. It is well-known that psychology has a bias against null-results; that is, the lack of statistical relationships. Another possibility is that spousal similarity is small and difficult to detect in the small convenience samples that are typical in psychology. In support of the latter explanation, two of the three studies had large samples and did report spousal similarity in personality.

Humbad et al. (2010) found rather small correlations between husbands’ and wives’ personality scores in a sample of 1,296 married couples. With the exception of traditionalism, r = .49, all correlations were below r = .2, and the median correlation was r = .11. They also found that spousal similarity did not change over time, suggesting that the little similarity there is can be attributed to assortative mating (marrying somebody with similar traits).

Rammstedt and Schupp (2008) used data from the German Socio-Economic Panel (SOEP), an annual survey of representative household samples. In 2005, the SOEP included for the first time a short 15-item measure of the Big Five personality traits. The sample included 6,909 couples. This study produced several correlations greater than r = .2, for agreeableness, r = .25, conscientiousness, r = .31, and openness, r = .33. The lowest correlation was obtained for extraversion, r = .10. A cross-sectional analysis with length of marriage showed that spousal similarity was higher for couples who were married longer. For example, spousal similarity for openness increased from r = .26 for newlyweds (less than 5 years of marriage) to r = .47 for couples married more than 40 years.

A decade later it is possible to build on Rammstedt and Schupp’s results because the SOEP has collected three more waves with personality assessments in 2009, 2013, and 2017. This makes it possible to examine spousal similarity over time and to separate spousal similarity in stable dispositions (traits) and in deviations from the typical level (states).

I start with simple correlations, separately for each of the four waves using all couples that were available at a specific wave. The most notable observation is that the correlations do not increase over time. In fact, they even show a slight trend to decrease. This provides strong evidence that spouses are not becoming more similar to each other over time. An introvert who marries an extravert does not become more extraverted as a result or vice versa.

Trait | W1 (N = 6263) | W2 (N = 5905) | W3 (N = 5404) | W4 (N = 7805)

I repeated the analysis using only couples who stayed together and participated in all four waves. The sample size for this analysis was N = 1,860.


The correlations were not stronger and did not increase over time.

The next analysis examined correlations over time. If spousal similarity is driven by assortment on some stable trait, husbands’ scores in 2005 should still be correlated with wives’ scores in 2017 and vice versa. To ensure comparability for different time lags, I only used couples who stayed in the survey for all four waves (N = 1,860).

Trait | 2005 Trait | 2009 Trait | 2013 Trait | 2017 Trait
2005 Neuroticism.
2005 Extraversion.040-.02-.02
2005 Openness.
2005 Agreeableness.
2005 Conscientiousness.

The results show more similarity on the same occasion (2005/2005) than across time. Across-time correlations are all below .2 and are decreasing. However, there are some small correlations of r = .1 for Openness, Agreeableness, and Conscientiousness, suggesting some spousal similarity in the stable trait variance. Another question is why spouses show similarity in the changing state variance.

There are two possible explanations for spousal similarity in personality state variance. One explanation is that spouses’ personality really changes in sync, just like their well-being changes in the same direction over time (Schimmack & Lucas, 2010). Another explanation is that spouses’ self-ratings are influenced by rating biases and that these rating biases are correlated (Anusic et al., 2009). To test these alternative hypotheses, I fitted a measurement model to the Big Five scales that distinguishes halo bias in personality ratings from actual variance in personality. I did this for the first and the last wave (2005, 2017) to separate similarity in the stable trait variance from similarity in state variance.

The key finding is that there is high spousal similarity in halo bias. Some couples are more likely to exaggerate their positive qualities than others. After removing this bias, there is relatively little spousal similarity for the actual trait variance.

Factor    Trait    State 2005    State 2017

In conclusion, spouses are not very similar in their personality traits. This may explain why this topic has received so little attention in the scientific literature. Null-results are often considered uninteresting. However, these findings do raise some questions. Why don’t extraverts marry extraverts, and why don’t conscientious people marry conscientious people? Wouldn’t they be happier with somebody who is similar in their personality? Research with the SOEP data suggests that this is also not the case. Maybe the Big Five traits are not as important for marital satisfaction as we think. Maybe other traits are more important. Clearly, human mating is not random, but it is also not based on matching personality traits.

We don’t forget and until Bargh apologizes we will not forgive

John Bargh is a controversial social scientist with a knack for getting significant results when others cannot (Bargh in Bartlett, 2012). When somebody failed to replicate his most famous elderly-priming results (he published two exact replication studies, 2a and 2b, that were both successful, p < .05), he wrote a blog post. The blog post blew up in his face and he removed it. For a while, it looked as if this historic document was lost, but it has been shared online. Here is another link to it: Nothing in their heads

Personality x Situation Interactions: A Cautionary Note

Abstract: No robust and reliable interaction effects of the Big Five personality traits and unemployment on life-satisfaction in the German Socio-Economic Panel.

With the exception of the late Walter Mischel, Lee Ross, and Dick Nisbett, we are all interactionists (ok, maybe Costa & McCrae are guilty of dispositionism). As Lewin told everybody in 1934, behaviour is a function of the person and the situation, and the a priori probability that the interaction effect between the two is exactly zero (i.e., that the nil-hypothesis is true) is pretty much zero. So, our journals should be filled with examples of personality x situation interactions. Right? But they are not. Every once in a while, when I try to update my lecture notes and look for a good example of a personality x situation interaction, I can’t find one. One reason is of course the long history of studying situations and traits separately. However, experience sampling studies emerged in the 1980s and the data are ideally suited to look for interaction effects. Another problem is that interaction effects can be difficult to demonstrate because you need large samples to get significant results.

This time I had a solution to my problems. I have access to the German Socio-Economic Panel (SOEP) data. The SOEP has a large sample (N > 10,000), measured the Big Five four times over a 12-year period, and includes many measures of situations like marriage, child birth, or unemployment. So, I could just run an analysis and find a personality x situation interaction. After all, in large samples, you always get p < .05. Right? If you think so, you might be interested to read on and find out what happened.

The Big Five were measured for the first time in 2005 (wave v). I picked unemployment and neuroticism as predictors because it is well-known that neuroticism is a personality predictor of life-satisfaction and unemployment is a situational predictor of life-satisfaction. It also made sense that neurotic people might respond more strongly to a negative life-event. However, contrary to these expectations, the interaction was far from significant (p = .5), while the main effects of unemployment (-1.5) and neuroticism (-.5) were highly significant. The effect of unemployment is equivalent to a change by three standard deviations in neuroticism.
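For readers who want to see what such a moderated regression looks like, here is a minimal sketch with simulated data (not the SOEP data). It assumes that neuroticism is standardized, so that its coefficient is a per-SD effect, and it uses a hypothetical unemployment rate of 8% and a hypothetical residual SD. The main effects are set to the values reported above and the true interaction is set to zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

neuro = rng.normal(0, 1, n)                   # standardized neuroticism (assumption)
unemp = (rng.random(n) < 0.08).astype(float)  # hypothetical 8% unemployment rate
# Simulated outcome: reported main effects, no true interaction.
life_sat = 7.0 - 1.5 * unemp - 0.5 * neuro + rng.normal(0, 1.5, n)

# Design matrix: intercept, both main effects, and their product term.
X = np.column_stack([np.ones(n), unemp, neuro, unemp * neuro])
beta, *_ = np.linalg.lstsq(X, life_sat, rcond=None)

# Classical OLS standard errors and t-ratios.
resid = life_sat - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t = beta / se

names = ["intercept", "unemployed", "neuroticism", "interaction"]
for name, b, s, tv in zip(names, beta, se, t):
    print(f"{name:12s} b = {b:6.2f}  se = {s:.3f}  t = {tv:6.2f}")

# Number of neuroticism SDs that match the unemployment effect.
sd_equivalent = beta[1] / beta[2]
print(f"unemployment effect in neuroticism SDs: {sd_equivalent:.1f}")
```

With the reported coefficients the equivalence is simply -1.5 / -.5 = 3 standard deviations, provided the neuroticism coefficient is expressed per standard deviation.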

Undeterred, I looked for interactions with the other Big Five dimensions. Surely, I would find an explanation for the interaction when I found one. To make things simple, I added all five interactions to the model and, hooray, a significant interaction with conscientiousness popped up, p = .02.

Was I the first to discover this? I quickly checked for articles and of course somebody else had beaten me to the punch. There it was. In 2010, Boyce, Wood, and Brown had used the SOEP data to show that conscientious people respond more strongly to the loss of a job.

Five years later, a follow-up article came to the same conclusion.

A bit skeptical of p-values that are just below .02, I examined whether the interaction effect can be replicated. I ran the same analysis that I had run with the 2005 data on the 2009 data.
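One way to put a number on this skepticism is the chance of at least one significant result when five interactions are tested at once. The sketch below assumes that the five tests are independent and that all five null-hypotheses are true, which is a simplification because the Big Five scales are correlated.

```python
alpha = 0.05   # conventional significance criterion
n_tests = 5    # one interaction term per Big Five trait

# Probability of at least one p < alpha if all five null-hypotheses are true
# and the tests are independent (simplifying assumption).
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(f"{p_at_least_one:.3f}")  # 0.226

# Bonferroni-adjusted criterion for five tests.
bonferroni_alpha = alpha / n_tests
print(bonferroni_alpha)
```

Under these assumptions, a single p = .02 among five tests would not pass a Bonferroni-adjusted criterion of .01.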

The effect size was cut in half and the p-value was no longer significant, p = .25. However, the results did replicate that none of the other four Big Five dimensions moderated the effect of unemployment.

So, what about the 2013 wave? Again not significant, although the effect size is again negative.

And what happened in 2017? A significant effect, hooray again, but this time the effect is positive.

Maybe the analyses are just not powerful enough. To increase power, we can include prior life-satisfaction as a predictor variable to control for some of the stable trait variance in life-satisfaction judgments. We are now only trying to predict changes in life-satisfaction in response to unemployment. In addition, we can include prior unemployment to make sure that the effect of unemployment is not due to some stable third variable.
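The gain in power from controlling for prior life-satisfaction can be sketched with simulated two-wave data (again not the SOEP data). All effect sizes, the unemployment rates, and the amount of stable trait variance are hypothetical. The true interaction is zero, and the point of the sketch is only that the standard error of the interaction term shrinks once stable variance is controlled.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and classical standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(2009)
n = 5_000

stable = rng.normal(0, 1.0, n)  # stable trait variance in life-satisfaction
consc = rng.normal(0, 1.0, n)   # standardized conscientiousness (assumption)
unemp_prior = (rng.random(n) < 0.08).astype(float)
# Unemployment persists for some people (hypothetical rates).
unemp_now = np.where(
    unemp_prior == 1, rng.random(n) < 0.5, rng.random(n) < 0.05
).astype(float)

ls_prior = 7 + stable - 1.0 * unemp_prior + rng.normal(0, 0.7, n)
ls_now = 7 + stable - 1.0 * unemp_now + rng.normal(0, 0.7, n)  # no true interaction

ones = np.ones(n)
interaction = unemp_now * consc
X_basic = np.column_stack([ones, unemp_now, consc, interaction])
X_lagged = np.column_stack(
    [ones, unemp_now, consc, interaction, ls_prior, unemp_prior]
)

beta_basic, se_basic = ols(X_basic, ls_now)
beta_lagged, se_lagged = ols(X_lagged, ls_now)

print(f"interaction SE without lagged predictors: {se_basic[3]:.3f}")
print(f"interaction SE with lagged predictors:    {se_lagged[3]:.3f}")
```

A smaller standard error means that an interaction of a given size is easier to detect, which is the sense in which the lagged model is more powerful.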

We see that it is current unemployment that has a negative effect on life-satisfaction. Prior unemployment actually has a positive effect, suggesting some adaptation to long-term unemployment. Most important, the interaction between conscientiousness and current unemployment is not significant, p = .68.

The interaction was also non-significant in 2013, p = .69.

And there was no significant interaction in 2017, p = .38.

I am sure that I am not the first to look at this, especially given two published articles that reported a significant interaction. However, I suspect that nobody thought about sharing these results because the norm in psychology is still to report significant results. The key finding here appears to be that the Big Five traits do not systematically interact with a situation in explaining an important outcome.

So, I am still looking for a good demonstration of a personality x situation interaction that I can use for my lecture in the fall. Meanwhile, I know better than to use the published studies as an example.

Open Letter about Systemic Racism to the Editor of SPPS

Dear Margo Monteith,

it is very disappointing that you are not willing to retract an openly racist article that was published in your journal Social Psychological and Personality Science (SPPS) when Simine Vazire was editor of the journal and Lee Jussim was the action editor of the article in question (Cesario, Johnson, & Terrill, 2019). I have repeatedly pleaded with you to retract the article, which draws conclusions on the basis of false assumptions. I am even more stunned by your decision because you rejected my commentary on this racist article with the justification that a better criticism was submitted. This criticism was just published (Ross et al., 2020). It makes the same observation that I made in my critique; that is, the conclusion that there is no racial bias in policing and the use of force rests entirely on an invalid assumption. The original authors simply assume that police officers only encounter violent criminals or that they only encounter violent criminals when they use deadly force.

Maybe you are not watching the news, but the Black Lives Matter movement started because police often use deadly force against non-violent African Americans. In some cases, this is even documented on video. Please watch the murders of Tamir Rice, George Floyd, Philando Castile, and Eric Garner and then tell their families and friends that police only kill violent criminals. That is what SPPS is telling everybody with the mantle of scientific truth, but it is a blatantly false claim based on racist assumptions. So, why are you not retracting this offensive article?

Philando Castile: https://www.cnn.com/videos/us/2017/06/22/philando-castile-facebook-and-dashcam-full-mashup-video-ctn.cnn

Tamir Rice: https://www.theguardian.com/us-news/video/2014/nov/26/cleveland-video-tamir-rice-shooting-police

Eric Garner: https://www.theguardian.com/us-news/video/2014/dec/04/i-cant-breathe-eric-garner-chokehold-death-video

George Floyd:

So, why are you not retracting an article that makes an obviously false and offensive assumption? Do you think that a retraction would reflect badly on the reputation of your journal? In that case, you are mistaken. Research shows that journals that retract articles with false conclusions have higher impact factors and are more prestigious than journals that try to maintain a flawless image by avoiding retractions of bad science (Nature). So, your actions are not only offensive, but also hurt the reputation of SPPS and ultimately our science.

Your justification for not retracting the article is unconvincing.

“Just how to analyze data such as this is debated, mostly in criminology journals. (One can wonder what psychology was present in Cesario et al.’s study that led to publication in SPPS, but that’s another matter.) Cesario et al. made the important point that benchmarking with population data is problematic. Their methodology was imperfect. Ross et al. made important improvements. If one is interested in this question of police bias with benchmarking, the papers bring successive advances.”

Your response implies that you did not fully understand Ross et al.’s criticism of the offensive article. The whole approach of “benchmarking” is flawed. So, publishing an article that introduces a flawed statistical approach from criminology to psychology is dangerous. What if we started using this approach to study other disparities? Ross et al. show that this would be extremely harmful to psychological science. It is important to retract an article that introduces this flawed statistical approach to psychologists. As an editor, it is your responsibility to ensure that this does not happen.

It is particularly shocking and beyond comprehension that you resist retraction at the very time when many universities and academics are keenly aware of the systemic racism in academia. This article about an issue that affects every African American was based on research funding to White academics, reviewed by White academics, approved by White academics, and now defended and not retracted by a White academic. How does your action promote diversity and inclusion? It is even more surprising that you seem to be blind to this systemic racism in the publication of this racist article given your research on prejudice and the funding you received to study these issues (CV). Can you at least acknowledge that it is very offensive to Black people to attribute their losses of lives entirely to violent crime?

Ulrich Schimmack

SPPS needs to retract Cesario’s False Claims about Racial Bias in Police Shootings

Academia is very slow in correcting itself. This is typically not a problem in psychological science because many articles do not have immediate real-world consequences. However, when they do, it is important to correct mistakes as quickly as possible. The question of whether (if there is any doubt about it) or how much racial bias in policing contributes to the racial disparity in victims of lethal use of force is one such case. While millions of Americans are demonstrating in the streets to support the Black Lives Matter movement, academics are slow to act and to show support for racial equality.

In 2019, the journal Social Psychological and Personality Science (SPPS) published an article by Cesario et al. with the controversial claim that there is no evidence that racial bias contributes to racial disparities in lethal use of force. The article even came to the opposite conclusion that police officers have a bias to shoot more White people than Black people. The article was edited and approved for publication by Lee Jussim, who is known for tirades against liberal bias in academia. I cannot speak for him and he has repeatedly declined opportunities to explain his decision. So, I have no evidence to disprove the hypothesis that he accepted the article because the conclusion fitted his conservative anti-anti-racism world-view. This would explain why he overlooked glaring mistakes in the article.

The main problem with this article is that it is unscientific. It is actually one of the worst articles I have ever seen and trust me, I have read and critiqued a lot of bad science. Don’t take my word for it. Aside from my own, SPPS received two other independent criticisms of the article. My critique was rejected with the argument that one of the other criticisms was superior. After reading it, I agreed. It is a meticulous, scientific takedown of the garbage that Lee Jussim accepted for publication. I was happy that others agreed with me and made the point more clearly than I could. I was waiting patiently for it to be published. Then George Floyd was murdered on camera and the issue of racial bias in policing led to massive protests and swift actions.

During this time everybody was looking for the science on racial bias in policing. I know because my blog posts about Cesario’s fake science received a lot of views. The problem was that Cesario’s crappy science was published in prestigious, peer-reviewed journals, which made him the White expert on racial bias in policing. He happily responded to interview requests and presented his work as telling the true scientific story. The takedown of his SPPS article that undercut his racist narrative was still not published.

On May 29, I emailed the current editor of SPPS to ask when the critique would be published.

“Dear Dr. Monteith, given recent events, I am wondering where we are with the response to the SPPS article that makes false claims about lethal use of force against Black Americans. Is there a preprint of the response or anything that can be shared in public?”

Margo Monteith emailed me that there is no problem with sharing the article.

“I don’t see a problem with Cody putting his article online; SAGE has agreed that it will be an open access article (and they will feature on the SPPS website). I am only posting the main points to honor the request not to publish the entire article.”

I was waiting for it to be published by SPPS, but it was still not published, so I shared it on June 17. [Edited on 6/19/20: It actually was published today, on June 19th (pdf).] Everybody needs to know that there is no scientific credibility to Cesario’s claims.

However, publishing a correction is not enough. Cesario and racist ideologues like Heather Mac Donald will continue to use the published articles to make false claims in public. We cannot allow this. The critique of Cesario’s article is strong enough to show that the conclusions rest entirely on racist assumptions. In short, Cesario et al. simply assume that police only kill violent criminals to end up with their conclusion that, given crime rates, police are too soft on violent Black criminals. The problem with this racist conclusion is clear. The assumption that police only use lethal force against known violent criminals is plain wrong and we have many videos of innocent Black victims killed by police to prove it. If you draw conclusions from a false premise, your conclusions are false. It is as simple as that. The assumption is nothing but a racist stereotype about Black people. This racist assumption should never have been published in a scientific journal. The only way to rectify the mistake is to retract the article so that Cesario can no longer use the mantle of science to spread racist stereotypes about African Americans.

Please read the rebuttal (sorry, it is a bit statistics-heavy, but you can get the main points without the formulas). If you agree that the original article is flawed, I ask you to show your support for BLM and your commitment to racial equality and let SPPS know that you think the original article needs to be retracted.

Systemic Racism at Michigan State University

This is how three professors at MSU talk about innocent Black people being killed by police (podcast transcript at 25 minutes and 40 seconds into the clip).

Their discussion of tragic deaths suggests that Black lives don’t matter to Joseph Cesario (MSU), Steve Hsu (MSU), and Corey Washington (MSU).

Here is what those rare events look like. I dare everybody to watch them and then reflect on the words of these privileged professors.

Philando Castile: https://www.cnn.com/videos/us/2017/06/22/philando-castile-facebook-and-dashcam-full-mashup-video-ctn.cnn

Tamir Rice: https://www.theguardian.com/us-news/video/2014/nov/26/cleveland-video-tamir-rice-shooting-police

Eric Garner: https://www.theguardian.com/us-news/video/2014/dec/04/i-cant-breathe-eric-garner-chokehold-death-video

George Floyd:

And yes, it doesn’t only happen to Black people, but contrary to the statistically flawed work by Cesario, young unarmed Black men are more often the target of police brutality and the victims of lethal force errors (https://www.pnas.org/content/117/3/1263.short).

See also: