Dr. Ulrich Schimmack’s Blog about Replicability

“For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY
In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study, using the same sample size and significance criterion.
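As a rough illustration of this definition, the following simulation (the effect size, sample size, and code are my own hypothetical choices, not from any particular study) estimates replicability as the proportion of exact replications that are significant again, given that the original study was significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, alpha, sims = 0.4, 50, 0.05, 5000   # true effect, per-group n, criterion

def is_significant():
    """Run one two-sample study and test it at the chosen alpha."""
    treatment = rng.normal(d, 1, n)
    control = rng.normal(0, 1, n)
    return stats.ttest_ind(treatment, control).pvalue < alpha

originals = np.array([is_significant() for _ in range(sims)])
replications = np.array([is_significant() for _ in range(sims)])

# Replicability: among significant originals, how often is an exact
# replication (same effect, n, and alpha) significant again?
replicability = replications[originals].mean()
print(f"replicability = {replicability:.2f}")
```

Because the original and the replication are independent and identically designed, replicability here simply equals the power of the design (about .5 for these particular values).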

BLOGS BY YEAR:  2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2019): 
Why Ioannidis’s Claim “Most published research findings are false” is false



  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal.  The post explains the new method for estimating observed power based on the distribution of test statistics converted into absolute z-scores.  The method has since been developed further to estimate power for a wider range of z-scores with a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.
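The conversion step described here can be sketched as follows (a minimal sketch; the function name and example values are mine): a reported test statistic is mapped to its two-sided p-value and then to the absolute z-score with the same p-value.

```python
from scipy import stats

def t_to_z(t, df):
    """Convert a t-statistic to the absolute z-score with the same
    two-sided p-value -- the conversion step of the z-curve approach."""
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value of the t-test
    return stats.norm.isf(p / 2)     # absolute z with the same p-value

# Example: t(30) = 2.5 corresponds to a z-score of roughly 2.4
print(round(t_to_z(2.5, 30), 2))
```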


3. An Introduction to the R-Index


The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.


4.  The Test of Insufficient Variance (TIVA)


The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After test results are converted into z-scores, the z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, i.e., p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
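A minimal sketch of TIVA in Python (my own illustration; the three just-significant p-values are hypothetical inputs, not from a specific paper):

```python
import numpy as np
from scipy import stats

def tiva(p_values):
    """Test of Insufficient Variance: compare the variance of z-scores
    against the expected variance of 1 with a left-tailed chi-square test."""
    z = stats.norm.isf(np.asarray(p_values) / 2)  # two-sided p -> |z|
    k = len(z)
    var = z.var(ddof=1)                           # observed variance
    # Under H0 (variance = 1), (k - 1) * var follows chi2 with k - 1 df;
    # a small left-tail p indicates insufficient variance.
    p_left = stats.chi2.cdf((k - 1) * var, df=k - 1)
    return var, p_left

var, p = tiva([0.044, 0.018, 0.027])   # three just-significant results
print(f"observed variance = {var:.3f}, TIVA p = {p:.3f}")
```

With three just-significant p-values the z-scores all cluster around 2.2, so the observed variance falls far below 1 and the left-tailed test flags the set as too homogeneous.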

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words

Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been made to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors), using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing


Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature

A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and of claims about the importance of stereotype threat in explaining gender differences in mathematics.

10. My journey towards estimation of replicability

In this blog post I explain how I became interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

The Demise of the Solo Experiment

Wegner’s article “The Premature Demise of the Solo Experiment” in PSPB (1992) is an interesting document for meta-psychologists. It provides some insight into the thinking of leading social psychologists at the time; not only the author, but reviewers and the editor who found this article worthy of publishing, and numerous colleagues who emailed Wegner with approving comments.

The article starts with the observation that in the 1990s social psychology journals increasingly demanded that articles contain more than one study. Wegner viewed this preference for multiple-study articles as a bias rather than a justified demand for stronger evidence.

“it has become evident that a tremendous bias against the ‘solo’ experiment exists that guides both editors and reviewers” (p. 504).

The idea of bias is based on the assumption that rejecting a null-hypothesis with a long-run error probability of 5% is good enough to publish exciting new ideas and give birth to wonderful novel theories. Demanding even just one replication of a finding would create a much greater burden without any novel insights, just to lower this probability to 0.25%.

But let us just think a moment about the demise of the solo experiment. Here we have a case in which skepticism has so overcome the love of ideas that we seem to have squared the probability of error we are willing to allow. Once, p < .05 was enough. Now, however, we must prove things twice. The multiple experiment ethic has surreptitiously changed alpha to .0025 or below.

That’s right. The move from solo experiments to multiple-study articles shifted the type-I error probability. Even a pair of studies reduced the type-I error probability more than the highly cited and controversial call to move alpha from .05 to .005: a pair of studies with p < .05 implies a joint error probability of .05 × .05 = .0025, half of .005.
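The arithmetic behind this point is simply that independent significance tests multiply error probabilities (a trivial sketch, assuming independent studies and honest reporting):

```python
# Joint type-I error probability of requiring k independent results with p < .05
alpha = 0.05
for k in (1, 2, 3):
    print(k, round(alpha ** k, 6))   # .05, .0025, .000125
```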

Wegner also explains why journals started demanding multiple studies.

After all, the statistical reasons for multiple experiments are obvious-what better protection of the truth than that each article contain its own replication? (p. 505)

Thus, concerns about replicability in social psychology were prominent in the early 1990s, twenty years before the replication crisis. And demanding replication studies was considered a solution to this problem. If researchers were able to replicate their findings, ideally with different methods, stimuli, and dependent variables, the results were considered robust and generalizable. So much for the claim that psychologists did not value or conduct replication studies before the open science movement was born in the early 2010s.

Wegner also reports on his experience of attempting to replicate his perfectly good first study.

“Sometimes it works wonderfully… more often than not, however, we find the second experiment is harder to do than the first… Even if we do the exact same experiment again” (p. 506).

He even cheerfully acknowledges that the first results are difficult to replicate because they were obtained with some good fortune.

“Doing it again, we will be less likely to find the same thing even if it is true, because the error variance regresses our effects to the mean. So we must add more subjects right off the bat. The joy of discovery we felt on bumbling into the first study is soon replaced by the strain of collecting an all new and expanded set of data to fend off the pointers [pointers = method-terrorists]” (p. 506).

Wegner even thinks that publishing these replication studies is pointless because readers expect the replication study to work. Sure, if the first study worked, so will the second.

This is something of a nuisance in light of the reception that our second experiment will likely get. Readers who see us replicate our own findings roll their eyes and say “Sure,” and we wonder why we’ve even gone to the trouble.

However, he fails to examine more carefully why a successful replication study receives only a shoulder-shrug from readers. After all, his own experience was that it was quite difficult to get these replication studies to work. Doesn’t this mean readers should be at the edge of their seats, wondering whether the original result was a false positive or whether it can actually be replicated? Isn’t the second study the real confirmatory test where the rubber hits the road? Insiders of course know that this is not the case. The second study works because it would not have been included in the multiple-study article if it hadn’t worked. That is, after all, how the field operated. Everybody had the same problems getting studies to work that Wegner describes, but many found a way to get enough studies to work to meet the demands of the editor. The number of studies was just a test of the persistence of a researcher, not a test of a theory. And that is what Wegner rightfully criticized. What is the point of producing a set of studies with p < .05, if more studies do not strengthen the evidence for a claim? We might as well publish a single finding and then move on to find more interesting ideas and publish them with p-values less than .05. Even 9 studies with p < .05 don’t mean that people can foresee the future (Bem, 2011), but it is surely an interesting idea.

Wegner also comments on the nature of replication studies that are now known as conceptual replication studies. The justification for conceptual replication studies is that they address limitations that are unavoidable in a single study. For example, including a manipulation check may introduce biases, but without one, it is not clear whether a manipulation worked. So, ideally the effect could be demonstrated with and without a manipulation check. However, this is not how conceptual replication studies are conducted.

We must engage in a very delicate “tuning” process to dial in a second experiment that is both sufficiently distant from and sufficiently similar to the original. This tuning requires a whole set of considerations and skills that have nothing to do with conducting an experiment. We are not trained in multi experiment design, only experimental design, and this enterprise is therefore largely one of imitation, inspiration, and luck.

So, to replicate original results that were obtained with a healthy dose of luck, more luck is needed to find a condition that works, or one simply has to try often enough until luck strikes again.

Given the negative attitude towards rigor, Wegner and colleagues also used a number of tricks to make replication studies work.

“Some of us use tricks to disguise our solos. We run “two experiments” in the same session with the same subjects and write them up separately. Or we run what should rightfully be one experiment as several parts, analyzing each separately and writing it up in bite-sized pieces as a multi experiment. Many times, we even hobble the first experiment as a way of making sure there will be something useful to do when we run another” (p. 506).

If you think this sounds like charlatans who enjoy pretending to be scientists, your impression is rather accurate, because the past decade has shown that many of these internal replications in multiple-study articles were obtained with tricks and provide no real empirical test of the hypotheses; the p-values are just for show so that it looks like science, but it isn’t.

My own view on this issue is that the multiple-study format was a bad fix for a real problem. The real problem was that it was all too easy to get p < .05 in a single study and make grand claims about the causes of human behavior. Multiple-study articles didn’t solve this problem because researchers found ways to get significant results again and again even when their claims were false.

The failure of multiple-study articles to fix psychology has some interesting lessons for current attempts to improve psychology. Badges for data sharing and preregistration will not improve psychology if they are gamed the way psychologists gamed the multiple-study format. Ultimately, science can only advance if results are reported honestly and if results are able to falsify theoretical predictions. Psychology will only become a science when brilliant novel ideas can be proven false and scientific rigor is prized as much as the creation of interesting ideas. Coming up with interesting ideas is philosophy. Psychology emerged as a distinct discipline in order to subject those theories to empirical tests. After a century of pretending to do so, it is high time to do so for real.

Young Unarmed Non-Suicidal Male Victims of Fatal Use of Force are Thirteen Times more Likely to be Black than White

Rickard Carlsson and I think that Johnson, Tress, Burke, and Cesario misinterpreted the results of their regression models when they concluded that there is no evidence of racial bias against Black civilians when police use fatal force. We show with their own data that there is a dramatic racial disparity for young unarmed males. We are pleased that PNAS made us revise our original comment (posted here) and accepted our revised comment (posted below) for publication. The 500-word limit made it impossible to say everything we wanted to say, but we think we got the main point across. The original PNAS article makes claims that are not supported by the data.

PNAS Letter (Accepted for publication, Dec/5/2019)

Ulrich Schimmack
University of Toronto Mississauga

Rickard Carlsson
Linnaeus University

A recent PNAS article reported “no evidence of anti-Black or anti-Hispanic disparities across [fatal] shootings” by police officers (1). This claim is based on the results of a regression model that suggested “a person fatally shot by police was 6.67 times less [italics added] likely (OR = 0.15 [0.09, 0.27]) to be Black than White” (1).  The article also claims the results “do not depend on which predictors are used” (p. 15881). These claims are misleading because the reported results apply only to a subset of victims and do not control for the fact that we would expect a higher number of White victims simply because the majority of US citizens are White.

The published odds ratio of 0.15 is based on a regression model that made the intercept correspond to a county with 4 times more White (50%) than Black (12%) citizens.  In addition, the intercept of the model corresponds to a county where White homicide rates equal (a) Black homicide rates and (b) Hispanic homicide rates, and where victims are (c) of average age (36.71 years), and White and Black victims are equally likely to (d) have mental health problems, (e) be suicidal, (f) be armed, and (g) be attacking an officer.  We found that including suicidal as a predictor had the strongest effect on the intercept, which doubled the odds of the victim being White (OR = .24 vs. .49). In contrast, adjusting only for differences in Black and White homicide rates left the intercept unchanged (OR = .48 vs. .49).  Thus, the main contribution of the regression analysis is to show that the odds of a victim being White double when the percentage of suicidal victims increases from 11% in the actual population to 50% in a hypothetical population. The fact that older suicidal victims are disproportionately more likely to be White shows that not all victims of lethal use of force are violent criminals.

Although use of force against citizens who suffer from mental health problems is important, another group of interest is young, unarmed, mentally healthy (non-suicidal) men.  To examine racial disparities in this group, we specified an alternative model that focused on young (age 20), unarmed male victims who showed no signs of mental health problems and were not suicidal, in a county with equal proportions of Black and White citizens. The intercept of this model suggested that victims with these characteristics are 13.67 times more likely to be Black than White, 95%CI = 6.65, 28.13 (https://osf.io/hm6f2/). The stark contrast between the published finding and our finding contradicts the authors’ claim that their results hold across subgroups of victims. Contrary to this claim, their data are entirely consistent with the public perception that young male victims of fatal use of force are disproportionally Black. Importantly, neither the original finding nor ours addresses the causes of racial disparities among victims of deadly use of force. Our results merely confirm other recent findings that racial disparities exist and that they are particularly large for young males (2).


1. Johnson, D. J., Tress, T., Burke, N., Taylor, C., & Cesario, J. (2019). Officer characteristics and racial disparities in fatal officer-involved shootings. Proceedings of the National Academy of Sciences, 116(32), 15877–15882.

2. Edwards, F., Lee, H., Esposito, M. (2019). Risk of being killed by police use of force in the United States by age, race-ethnicity, and sex. Proceedings of the National Academy of Sciences, 116(34), 16793-16798. doi: 10.1073/pnas.1821204116

Christopher J. Bryan claims replicators p-hack to get non-significant results. I claim he p-hacked his original results.

Open draft for a response article to be submitted to PNAS (1,200 words; commentaries/letters are only allowed 500 words). Co-authors are welcome. Please indicate intent and make contributions in the comments section. This attack on valuable replication work needs a response.

Draft 12/2/19


Bryan, Walton, Rogers and Dweck reported three studies that suggested a slight change in message wording can have dramatic effects on voter turnout (1). Gerber, Huber, Biggers, and Hendry reported a failure to replicate this result (2). Bryan, Yeager, and O’Brien reanalyzed Gerber et al.’s data and found a significant result consistent with the original results (3). Based on this finding, Bryan et al. (2019) make two claims that go beyond the question about ways to increase voter turnout.  First, Bryan et al. accuse Gerber et al. (2016) of exploiting replicators’ degrees of freedom to produce a non-significant result.  Others have called this practice reverse p-hacking (4). Second, they claim that many replicators may engage in deceptive practices to produce non-significant results because these results are deemed easier to publish.  We take issue with these claims about the intentions and practices of researchers who conduct replication studies. Moreover, we present evidence that Bryan et al.’s (2011) results are likely to be biased by the exploitation of researchers’ degrees of freedom. This conclusion is consistent with widespread evidence that social psychologists in 2011 were abusing statistical methods to inflate effect sizes in order to publish eye-catching results that often do not replicate (5). We argue that only a pre-registered replication study with high precision will settle the dispute about the influence of subtle linguistic cues on voter turnout.

Bryan et al. (2011)

Study 1 used a very small sample size of n = 16 participants in each condition. After transforming the dependent variable, a t-test produced a just significant result (p < .05 & p > .005), p = .044.  Study 2 had 88 participants, but power was reduced because the outcome variable was dichotomous. A chi-square test produced again a just significant result, p = .018.  Study 3 increased sample size considerably (N = 214), which should also increase power and produce a smaller p-value if the population effect size is the same as in Study 2. However, the observed effect size was weaker and the result was again just significant, p = .027.  In the wake of the replication crisis, awareness has increased that sampling error produces large variability in p-values and that a string of just-significant p-values is unlikely to occur by chance. Thus, the results reported by Bryan et al. (2011) suggest that researchers’ degrees of freedom were used to produce significant results (6).  For example, converted into observed power, the p-values imply 52%, 66%, and 60% power, respectively.  It is unlikely that three studies with average power of 60% can produce three significant results; the expected value is only 1.8 significant results.  These calculations are conservative because questionable research practices inflate estimates of observed power.  The replication index (R-Index) corrects for this bias by subtracting the inflation rate from the estimate of observed power (7). With 60% mean observed power and a 100% success rate, the inflation rate is 40 percentage points, and the R-Index is 60% – 40% = 20%. Simulations show that an R-Index of 20% is obtained when the null-hypothesis is true. Thus, the published results provide no empirical evidence that subtle linguistic cues influence voter turnout because the published results are incredible.
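The observed-power and R-Index calculations in this paragraph can be reproduced as follows (a sketch of the computation; the function names are mine):

```python
import statistics
from scipy import stats

def observed_power(p, alpha=0.05):
    """Power implied by a two-sided p-value at the given alpha."""
    z = stats.norm.isf(p / 2)                 # |z| of the observed result
    return stats.norm.sf(stats.norm.isf(alpha / 2) - z)

def r_index(p_values, alpha=0.05):
    """R-Index = median observed power - inflation,
    where inflation = success rate - median observed power."""
    powers = [observed_power(p, alpha) for p in p_values]
    median_power = statistics.median(powers)
    success_rate = sum(p < alpha for p in p_values) / len(p_values)
    return median_power - (success_rate - median_power)

p_values = [0.044, 0.018, 0.027]   # Bryan et al. (2011), Studies 1-3
# Observed power per study (matches the 52%, 66%, 60% reported above)
print([round(observed_power(p), 2) for p in p_values])
print(round(r_index(p_values), 2))   # about .20
```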

Gerber et al. (2016)

Gerber et al. conducted a conceptual replication study with a much larger sample (n = 2,236 in the noun condition, n = 2,232 in the verb condition).  Their effect was in the same direction, but much weaker and not statistically significant, 95%CI = -1.8 to 3.8. They also noted that the original studies were conducted on the day before elections or in the morning of election day, limited their analysis to the day of elections, and reported a non-significant result for this analysis as well. Gerber et al. discuss various reasons for their replication failure that assume the original results are credible (e.g., internet vs. phone contact).  They even consider the possibility that their finding could be a type-II error, although this implies that the population effect size is much smaller than the estimates in Bryan et al.’s (2011) studies.

Bryan et al. (2019)

Bryan et al. (2019) noted that Gerber et al. never reported the results of a simple comparison of the two linguistic conditions after limiting the sample to participants who were contacted on the day before elections. When they conducted this analysis with a one-sided test and alpha = .05, they obtained a significant result, p = .036. They consider these results a successful replication, and they allege that Gerber et al. intentionally did not report this result. We do not know why Gerber et al. (2016) did not report this result, but we are skeptical that it can be considered a successful replication for several reasons. First, adding another just significant result to a series of just significant results makes the evidence weaker, not stronger (5).  The reason is that a credible set of studies with modest power should contain some non-significant results. The absence of such non-significant results undermines the trustworthiness of the reported results.  The maximum probability of obtaining a just significant result (p between .05 and .005) is 33%.  The probability of this outcome in four out of four studies is just .33^4 = .012.  Thus, even if we consider Gerber et al.’s study a successful replication, the results do not provide strong support for the hypothesis that subtle linguistic manipulations have a strong effect on voter turnout.  Another problem with Bryan et al.’s conclusions is that they put too much weight on the point estimates of effect sizes: “In sum, the evidence across the expanded set of model specifications that includes the main analytical choices by Gerber et al. supports a substantial and robust effect consistent with the original finding by Bryan et al.” (p. 6). This claim ignores that just significant p-values imply that the corresponding confidence intervals barely exclude an effect size of zero (i.e., p = .05 implies that 0 is the lower bound of the 95%CI).  Thus, each result individually cannot be used to claim that the population effect size is large. It is also not possible to use standard meta-analysis to reduce sampling error because there is evidence of selection bias.  In short, the reanalysis found a significant result with a one-sided test for a subset of the data.  This finding is noteworthy, but hardly a smoking gun that justifies claims that reverse p-hacking was used to hide a robust effect.
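The 33% figure can be verified numerically: the probability of a two-sided p-value between .05 and .005 depends on the true mean (noncentrality) of the test statistic, and it never exceeds about one third (a sketch; the grid search is my own illustration):

```python
import numpy as np
from scipy import stats

z_05 = stats.norm.isf(0.05 / 2)    # 1.96
z_005 = stats.norm.isf(0.005 / 2)  # 2.81

# Probability of a just-significant result (.005 < p < .05) as a function
# of the true mean of the z-statistic (upper tail only; the lower-tail
# contribution is negligible near the maximum).
ncp = np.linspace(0, 5, 501)
p_just = stats.norm.sf(z_05 - ncp) - stats.norm.sf(z_005 - ncp)

max_p = p_just.max()
print(round(max_p, 2), round(max_p ** 4, 3))  # about 0.33 and 0.012
```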

Broader Implications for the Replication Movement

Bryan et al. (2019) generalize from their finding of a one-sided significant p-value in a conceptual replication study to replication studies in general.  Many of these generalizations are invalid because Bryan et al. do not differentiate between different types of replication studies. First, there are registered replication reports (6). Registered replication reports are approved before data are collected and are assured publication independent of the study outcome.  Thus, Bryan et al.’s claim that replicators use researchers’ degrees of freedom to produce null results because they are easier to publish does not apply to these replication studies.  Nevertheless, registered replication reports have shaken the foundations of social psychology by failing to replicate ego depletion or facial feedback effects.  Moreover, these replication failures were predicted by incredible p-values in the original articles.  In contrast, bias tests fail to show reverse p-hacking in replication studies. Readers of Bryan et al. (2019) should therefore simply ignore their speculations about the motives and practices of researchers who conduct replication studies. Our advice for Bryan et al. (2019) is to demonstrate that subtle linguistic cues can influence voter turnout with a preregistered replication report. The 2020 elections are just around the corner. Good luck, you guys.


(1) C. J. Bryan, G. M. Walton, T. Rogers, C. S. Dweck, Motivating voter turnout by invoking the self. Proc. Natl. Acad. Sci. U.S.A. 108, 12653–12656 (2011).

(2) A. S. Gerber, G. A. Huber, D. R. Biggers, D. J. Hendry, A field experiment shows that subtle linguistic cues might not affect voter behavior. Proc. Natl. Acad. Sci. U.S.A. 113, 7112–7117 (2016).

(3) C. J. Bryan, D. S. Yeager, J. M. O’Brien, Replicator degrees of freedom allow publication of misleading failures to replicate. Proc. Natl. Acad. Sci. U.S.A. 116 (2019).

(4) F. Strack, Reflection on the smiling registered replication report. Perspectives on Psychological Science, 11, 929-930 (2016).

(5) G. Francis, The frequency of excess success for articles in Psychological Science. Psychonomic Bulletin & Review, 21, 1180-1187 (2014)

(6) U. Schimmack, The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566 (2012).

(7) U. Schimmack, A Revised Introduction to the R-Index.  https://replicationindex.com/2016/01/31/a-revised-introduction-to-the-r-index/

Racial Bias as a Trait

Prejudice is an important topic in psychology that can be examined from various perspectives. Nevertheless, prejudice is typically studied by social psychologists. As a result, research has focused on social cognitive processes that are activated in response to racial stimuli (e.g., pictures of African Americans) and on experimental manipulations of the situation (e.g., race of experimenter). Other research has focused on cognitive processes that can lead to the formation of racial bias (e.g., the minimal group paradigm). Sometimes this work has been based on a model of prejudice that assumes racial bias is a common attribute of all people (Devine, 1989) and that individuals differ only in their willingness or ability to act on their racial biases.

An alternative view is that racial biases vary across individuals and are shaped by experiences with out-group members. The most prominent theory is contact theory, which postulates that contact with out-group members reduces racial bias. In social psychology, individual differences in racial biases are typically called attitudes, where attitudes are broad dispositions to respond to a class of attitude objects in a consistent manner. For example, individuals with positive attitudes towards African Americans are more likely to have positive thoughts, feelings, and behaviors in interactions with African Americans.

The notion of attitudes as general dispositions shows that attitudes play the same role in social psychology that traits play in personality psychology. For example, extraversion is a general disposition to have more positive thoughts and feelings and to engage more in social interactions. One important research question in personality psychology concerns the causes of variation in personality. Why are some people more extraverted than others? A related question is how stable personality traits are. If the causes of extraversion are environmental factors, extraversion should change when the environment changes. If the causes of extraversion are within the person (e.g., early childhood experiences, genetic differences), extraversion should be stable. Thus, the stability of personality traits over time is an empirical question that can only be answered in longitudinal studies that measure personality repeatedly. A meta-analysis shows that the Big Five personality traits are highly stable over time (Anusic & Schimmack, 2016).

In comparison, the stability of attitudes has received relatively little attention in social psychology because stable individual differences are often neglected in social cognitive models of attitudes. This is unfortunate because the origins of racial bias are important to the understanding of racial bias and to design interventions that help individuals to reduce their racial biases.

How stable are racial biases?

The lack of data has not stopped social psychologists from speculating about the stability of racial biases. “It’s not as malleable as mood and not as reliable as a personality trait. It’s in between the two–a blend of both a trait and a state characteristic” (Nosek in Azar, 2008). In 2019, Nosek was less certain about the stability of racial biases. “One is does that mean we have have some degree of trait variance because there is some stability over time and what is the rest? Is the rest error or is it state variance in some way, right. Some variation that is meaningful variation that is sensitive to the context of measurement. Surely it is some of both, but we don’t know how much” (The Psychology Podcast, 2019).

Other social psychologists have made stronger claims about the stability of racial bias. Payne argued that racial bias is a state because implicit bias measures show higher internal consistency than retest correlations (Payne, 2017). However, the comparison of internal consistency and retest correlations is problematic because situational factors may simply produce situation-specific measurement errors rather than reflecting real changes in the underlying trait; a problem that is well recognized in personality psychology. To examine this question more thoroughly, it is necessary to obtain multiple retests and decompose the variances into trait, state, and error variances (Anusic & Schimmack, 2016). Even this approach cannot distinguish between state variance and systematic measurement error, which requires multi-method data (Schimmack, 2019).
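The logic of a state-trait-error decomposition can be illustrated with a small simulation (a sketch with made-up variance components, not estimates from any actual study). An observed score is modelled as trait + state + error; because occasion-specific state and random error are both uncorrelated across occasions, a single retest correlation cannot tell them apart, which is why multiple retests and multiple methods are needed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical variance components (chosen for illustration only):
# a stable trait plus occasion-specific state and error.
trait = rng.normal(0, 1.0, n)            # trait variance = 1.00
state_t1 = rng.normal(0, 0.7, n)         # state variance = 0.49 per occasion
state_t2 = rng.normal(0, 0.7, n)
error_t1 = rng.normal(0, 0.7, n)         # error variance = 0.49 per occasion
error_t2 = rng.normal(0, 0.7, n)

score_t1 = trait + state_t1 + error_t1   # observed scores at two occasions
score_t2 = trait + state_t2 + error_t2

# With independent states, a single retest correlation reflects only the
# trait share: 1.00 / (1.00 + 0.49 + 0.49) ~ .51. State and error are
# indistinguishable from one retest alone.
r12 = np.corrcoef(score_t1, score_t2)[0, 1]
print(round(r12, 2))
```

With several retests, an autocorrelated state component would produce retest correlations that decay with the time lag, which is the pattern a state-trait-error model exploits.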

A Longitudinal Multi-Method Study of Racial Bias

A recent article reported the results of an impressive longitudinal study of racial bias with over 3,000 medical students who completed measures of racial bias and inter-group contact three times over a period of six years (first year of medical school, fourth year of medical school, second year of residency) (Onyeador et al., 2019). I used the openly shared data to fit a multi-method state-trait-error model to the data (https://osf.io/78cqx/).

The model integrates several theoretical assumptions that are consistent with previous research (Schimmack, 2019). First, the model assumes that explicit ratings of racial bias (feeling thermometer) and implicit measures of racial bias (Implicit Association Test) are complementary measures of individual differences in racial bias. Second, the model assumes that one source of variance in racial bias is a stable trait. Third, the model assumes that racial bias differs across racial groups, in that Black individuals have more favorable attitudes towards Black people than members of other groups. Fourth, the model assumes that contact is negatively correlated with racial bias, without making a strong causal assumption about the direction of this relationship. The model also assumes that Black individuals have more contact with Black individuals and that contact partially explains why Black individuals show less racial bias.

The new hypotheses that could be explored with these data concern the presence of state variance in racial bias. First, state variance should produce correlations between the occasion-specific variances of the two methods. That is, after statistically removing trait variance, residual state variance in feeling-thermometer scores should be correlated with residual variance in IAT scores. For example, as medical students interact more with Black staff and patients during residency, their racial biases could change, and this change would show up in both explicit ratings and IAT scores. Second, state variance should be somewhat stable over shorter time intervals because environments themselves tend to be stable over such intervals.

The model in Figure 1 met standard criteria of model fit, CFI = .997, RMSEA = .016.

Describing the model from left to right, race (0 = Black, 1 = White) has the expected relationship with quantity of contact (quant1) in year 1 (reflecting everyday interactions with Black individuals) and with the racial bias (att) factor. In addition, more contact is related to less pro-White bias (-.28). The attitude factor is a stronger predictor of the explicit trait factor (.78; ft; White feeling-thermometer minus Black feeling-thermometer) than of the implicit trait factor (.60; iat). The loadings of the explicit trait factor on the measures at the three occasions (.58-.63) suggest that about one-third of the variance in these measures is trait variance. The same is true for individual IATs (.59-.62). The effect of the attitude factor on individual IATs (.60 * .60 = .36; .36^2 = .13) suggests that less than 20% of the variance in an individual IAT reflects racial bias. This estimate is consistent with the results from multi-method studies (Schimmack, 2019). However, these results suggest that the amount of valid trait variance can increase to up to 36% by aggregating scores from several IATs. In sum, these results provide first evidence that racial bias is stable over a period of six years and that both explicit ratings and implicit ratings capture trait variance in racial bias.
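The variance arithmetic behind these numbers can be spelled out explicitly (a sketch using only the standardized path coefficients quoted above):

```python
# Path arithmetic for the reported model coefficients.
att_to_iat_trait = 0.60   # attitude factor -> implicit trait factor
iat_trait_to_iat = 0.60   # implicit trait factor -> single IAT administration

# The standardized loading of the attitude factor on one IAT is the
# product of the two paths along the way:
loading_single = att_to_iat_trait * iat_trait_to_iat   # 0.36

# Squaring a standardized loading gives the share of explained variance:
valid_var_single = loading_single ** 2                 # ~.13, i.e. ~13%

# Aggregating many IATs averages away occasion-specific residuals, so the
# ceiling on valid variance is set by the trait-level path alone:
valid_var_aggregate = att_to_iat_trait ** 2            # .36, i.e. 36%

print(round(valid_var_single, 2), round(valid_var_aggregate, 2))
```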

Turning to the bottom part of the model, there is weak evidence that the residual variances (that are not trait variance) in explicit and implicit ratings are correlated. Although the correlation of r = .06 at time 1 is statistically significant, the correlations at time 2 (r = .03) and time 3 (r = .00) are not. This finding suggests that most of the residual variance is method-specific measurement error rather than state variance in racial bias. There is some evidence that the explicit ratings capture more than occasion-specific measurement error, because state variance at time 1 predicts state variance at time 2 (r = .25), and the same holds from time 2 to time 3 (r = .20). This is not the case for the IAT scores. Finally, contact with Black medical staff at time 2 is a weak but significant predictor of explicit measures of racial bias at times 2 and 3, but it does not predict IAT scores at times 2 and 3. These findings do not support the hypothesis that changes in racial bias measures reflect real changes in racial biases.

The results are consistent with the only other multi-method longitudinal study of racial bias, which covered a brief period of three months. In that study, even implicit measures showed no convergent validity for the state (non-trait) variance on the same occasion (Cunningham, Preacher, & Banaji, 1995).


Examining predictors of individual differences in racial bias is important for understanding the origins of racial biases and for developing interventions that help individuals reduce them. Examining the stability of racial bias in longitudinal studies shows that these biases are stable dispositions and that there is little evidence that they change with changing life experiences. One explanation is that only close contact may be able to shift attitudes and that few people have close relationships with outgroup members. Thus, stable environments may contribute to the stability of racial bias.

Given the trait-like nature of racial bias, interventions that target attitudes and general dispositions may be relatively ineffective, as Onyeador et al.’s (2019) article suggested. Thus, it may be more effective to target and assess actual behaviors in diversity training. Expecting diversity training to change general dispositions may be misguided and lead to false conclusions about the effectiveness of diversity training programs.

Anti-Black Bias on the IAT predicts Pro-Black Bias in Behavior

Over 20 years ago, Anthony Greenwald and colleagues introduced the Implicit Association Test (IAT) as a measure of individual differences in implicit bias (Greenwald et al., 1998). The assumption underlying the IAT is that individuals can harbour unconscious, automatic, hidden, or implicit racial biases. These implicit biases are distinct from explicit biases. Somebody could be consciously unbiased while their unconscious is prejudiced. Theoretically, the opposite would also be possible, but taking IAT scores at face value, the unconscious is more prejudiced than conscious reports of attitudes imply. It is also assumed that these implicit attitudes can influence behavior in ways that bypass conscious control. As a result, implicit bias in attitudes leads to implicit bias in behavior.

The problem with this simple model of implicit bias is that it lacks scientific support. In a recent review of validation studies, I found no scientific evidence that the IAT measures hidden or implicit biases outside of people’s awareness (Schimmack, 2019a). Rather, it seems to be a messy measure of consciously accessible attitudes.

Another contentious issue is the predictive validity of IAT scores. It is commonly implied that IAT scores predict bias in actual behavior. This prediction is so straightforward that the IAT is routinely used in implicit bias training (e.g., at my university) with the assumption that individuals who show bias on the IAT are likely to show anti-Black bias in actual behavior.

Even though the link between IAT scores and actual behavior is crucial for the use of the IAT in implicit bias training, this important question has been examined in relatively few studies, and many of these studies had serious methodological limitations (Schimmack, 2019b).

To make things even more confusing, a couple of papers even suggested that White individuals’ unconscious is not always biased against Black people: “An unintentional, robust, and replicable pro-Black bias in social judgment” (Axt, Ebersole, & Nosek, 2016; Axt, 2017).

I used the open data from these two articles to examine more closely the relationship between scores on the attitude measures (the Brief Implicit Association Test and a direct explicit rating on a 7-point scale) and performance on a task in which participants had to accept or reject 60 applicants to an academic honor society. Along with pictures of applicants, participants were given information about academic performance. These data were analyzed with signal-detection theory to obtain a measure of bias. A pro-White bias would be reflected in a lower admission standard for White applicants than for Black applicants. However, despite pro-White attitudes, participants showed a pro-Black bias in their admissions to the honor society.
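In this kind of design, signal-detection bias is typically quantified with the criterion c, computed from acceptance rates for qualified applicants (“hits”) and unqualified applicants (“false alarms”). A sketch with made-up acceptance rates (not Axt et al.’s actual numbers) shows how a lower criterion for Black applicants registers as pro-Black bias:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # probit transform

def sdt_criterion(hit_rate, false_alarm_rate):
    """Signal-detection criterion c; a lower c means a more lenient standard."""
    return -0.5 * (z(hit_rate) + z(false_alarm_rate))

# Hypothetical acceptance data, for illustration only: hits are qualified
# applicants accepted, false alarms are unqualified applicants accepted.
c_white = sdt_criterion(hit_rate=0.70, false_alarm_rate=0.20)
c_black = sdt_criterion(hit_rate=0.75, false_alarm_rate=0.25)

# A negative difference (c_black - c_white) means a lower (more lenient)
# admission standard for Black applicants, i.e., pro-Black bias in decisions.
print(round(c_black - c_white, 2))
```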

Figure 1 shows the results for the Brief IAT. The blue lines mark scores of 0 (no bias) on both tasks. The decreasing red line shows the linear relationship between BIAT scores on the x-axis and bias in admission decisions on the y-axis. The decreasing trend shows that, as expected, respondents with more pro-White bias on the BIAT are less likely to accept Black applicants. However, the figure also shows that participants with no bias on the BIAT have a bias to select more Black than White applicants. Most importantly, the vertical red line marks the behavior of participants with the average score on the BIAT. Even though these participants are considered to have a moderate pro-White bias, they show a pro-Black bias in their acceptance rates. Thus, there is no evidence that IAT scores predict discriminatory behavior. In fact, even the most extreme IAT scores fail to identify participants who discriminate against Black applicants.

A similar picture emerges for the explicit ratings of racial attitudes.

The next analysis examines the convergent and predictive validity of the BIAT in a latent variable model (Schimmack, 2019). In this model, the BIAT and the explicit measure are treated as complementary measures of a single attitude for two reasons. First, multi-method studies fail to show that the IAT and explicit measures tap different attitudes (Schimmack, 2019a). Second, it is impossible to model systematic method variance in the BIAT in studies that use only a single implicit measure of attitudes.

The model also includes a group variable that distinguishes the convenience samples in Axt et al.’s studies (2016) and the sample of educators in Axt (2017). The grouping variable is coded with 1 for educators and 0 for the comparison samples.

The model meets standard criteria of model fit, CFI = .996, RMSEA = .002.

Figure 3 shows the y-standardized results so that relationships with the group variable can be interpreted as Cohen’s d effect sizes. The results show a notable difference (d = -.59) in attitudes between the two samples, with less pro-White attitudes among educators. In addition, educators have a small bias to favor Black applicants in their acceptance decisions (d = .19).

The model also shows that racial attitudes influence acceptance decisions with a moderate effect size, r = -.398. Finally, the model shows that the BIAT and the single-item explicit rating have modest validity as measures of racial attitudes (r = .392 and r = .429, respectively). The results for the BIAT are consistent with other estimates that a single IAT has no more than 20% (.392^2 = 15%) valid variance. Thus, the results here are entirely consistent with the view that explicit and implicit measures tap a single attitude and that there is no need to postulate hidden, unconscious attitudes that can have an independent influence on behavior.

Based on their results, Axt et al. (2016) caution readers that the relationship between attitudes and behaviors is more complex than the common narrative of implicit bias assumes.

The authors “suggest that the prevailing emphasis on pro-White biases in judgment and behavior in the existing literature would improve by refining the theoretical understanding of under what conditions behavior favoring dominant or minority groups will occur.” (p. 33).


For two decades, the developers of the IAT have argued that the IAT measures a distinct type of attitudes that reside in individuals’ unconscious and can influence behavior in ways that bypass conscious control. As a result, even individuals who aim to be unbiased might exhibit prejudice in their behavior. Moreover, the finding that the majority of White people show a pro-White bias in their IAT scores was used to explain why discrimination and prejudice persist. This narrative is at the core of implicit bias training.

The problem with this story is that it is not supported by scientific evidence. First, there is no evidence that IAT scores reflect some form of unconscious or implicit bias. Rather, IAT scores seem to tap the same cognitive and affective processes that influence explicit ratings. Second, there is no evidence that processes that influence IAT scores can bypass conscious control of behavior. Third, there is no evidence that a pro-White bias in attitudes automatically produces a pro-White bias in actual behaviors. Not even Freud assumed that unconscious processes would have this effect on behavior. In fact, he postulated that various defense mechanisms may prevent individuals from acting on their undesirable impulses. Thus, the prediction that attitudes are sufficient to predict behavior is too simplistic.

Axt et al. (2016) speculate that “bias correction can occur automatically and without awareness” (p. 32). While this is an intriguing hypothesis, there is little evidence for such smart automatic control processes. This model also implies that it is impossible to predict actual behaviors from attitudes because correction processes can alter the influence of attitudes on behavior. It follows that only studies of actual behavior can reveal the ability of IAT scores to predict actual behavior. For example, only studies of actual behavior can demonstrate whether police officers with pro-White IAT scores show racial bias in the use of force. The problem is that 20 years of IAT research have uncovered no robust evidence that IAT scores actually predict important real-world behaviors (Schimmack, 2019b).

In conclusion, the results of Axt’s studies suggest that the use of the IAT in implicit bias training needs to be reconsidered. Not only are test scores highly variable and often provide false information about individuals’ attitudes; they also do not predict actual discriminatory behavior. It is wrong to assume that individuals who show a pro-White bias on the IAT are bound to act on these attitudes and discriminate against Black people or other minorities. Therefore, the focus on attitudes in implicit bias training may be misguided. It may be more productive to focus on factors that do influence actual behaviors and to provide individuals with clear guidelines that help them act in accordance with these norms. The belief that this is not sufficient rests on an unsupported model of unconscious forces that can bypass awareness.

This conclusion is not totally new. In 2008, Blanton criticized the use of the IAT in applied settings (IAT: Fad or fabulous?)

“There’s not a single study showing that above and below that cutoff people differ in any way based on that score,” says Blanton.

And Brian Nosek agreed.

Guilty as charged, says the University of Virginia’s Brian Nosek, PhD, an IAT developer.

However, this admission of guilt has not changed behavior. Nosek and other IAT proponents continue to support Project Implicit, which has provided millions of visitors with false information about their attitudes or mental-health issues based on a test with poor psychometric properties. A true admission of guilt would be to stop this unscientific and unethical practice.


Axt, J.R. (2017). An unintentional pro-Black bias in judgement among educators. British Journal of Educational Psychology, 87, 408-421.

Axt, J.R., Ebersole, C.R., & Nosek, B.A. (2016). An unintentional, robust, and replicable pro-Black bias in social judgment. Social Cognition, 34, 1-39.

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

Schimmack, U. (2019a). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. https://doi.org/10.1177/1745691619863798

Schimmack, U. (2019b). The race IAT: A case study of the validity crisis in psychology.

Are Positive Illusions Really Good for You?

With 4,366 citations in WebOfScience, Taylor and Brown’s article “Illusion and well-being: A social psychological perspective on mental health” is one of the most cited articles in social psychology.

The key premise of the article is that human information processing is faulty and that mistakes are not random; rather, human information processing is systematically biased.

Taylor and Brown (1988) quote Fiske and Taylor’s (1984) book about social cognition to support this assumption: “Instead of a naïve scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (p. 88).

Thirty years later, a different picture emerges. First, evidence has accumulated that human information processing is not as faulty as social psychologists assumed in the early 1980s. For example, personality psychologists have shown that self-ratings of personality have some validity (Funder, 1995). Second, it has also become apparent that social psychologists themselves have acted like charlatans in their research articles when they used questionable research practices to make unfounded claims about human behavior. For example, Bem (2011) used these methods to show that extrasensory perception is real. This turned out to be a false claim based on shoddy use of the scientific method.

Of course, a literature with thousands of citations also has produced a mountain of new evidence. This might suggest that Taylor and Brown’s claims have been subjected to rigorous tests. However, this is actually not the case. Most studies that examined the benefits of positive illusions relied on self-ratings of well-being, mental-health, or adjustment to demonstrate that positive illusions are beneficial. The problem is evident. When self-ratings are used to measure the predictor and the criterion, shared method variance alone is sufficient to produce a positive correlation. The vast majority of self-enhancement studies relied on this flawed method to examine the benefits of positive illusions (see meta-analysis by Dufner, Gebauer, & Sedikides, 2019).
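The shared-method-variance problem is easy to demonstrate in a simulation (illustrative, with arbitrary unit variances): even when the true constructs are uncorrelated, two self-report measures correlate because they share the same rating bias, while an informant report of the outcome does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Two constructs that are truly uncorrelated in this simulation:
true_self_enh = rng.normal(size=n)
true_wellbeing = rng.normal(size=n)

# Both are measured by SELF-report, so both share the same rating bias:
method_bias = rng.normal(size=n)
sr_self_enh = true_self_enh + method_bias + rng.normal(size=n)
sr_wellbeing = true_wellbeing + method_bias + rng.normal(size=n)

# An informant rating of well-being does not share the self-report bias:
ir_wellbeing = true_wellbeing + rng.normal(size=n)

r_shared = np.corrcoef(sr_self_enh, sr_wellbeing)[0, 1]  # ~1/3 from bias alone
r_hetero = np.corrcoef(sr_self_enh, ir_wellbeing)[0, 1]  # ~0
print(round(r_shared, 2), round(r_hetero, 2))
```

This is exactly why informant-rated outcomes are the critical test of the positive-illusions hypothesis.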

However, there have been a few attempts to demonstrate that positive illusions about the self predict well-being when well-being is measured by informant ratings, which reduces the influence of shared method variance. The most prominent example is Taylor et al.’s (2003) article “Portrait of the self-enhancer: Well adjusted and well liked or maladjusted and friendless.”
[Sadly, this was published in the Personality section of JPSP]

The abstract gives the impression that the results clearly favored Taylor’s positive illusions model. However, a closer inspection of reality shows that the abstract is itself illusory and disconnected from reality.

First, the study had a small sample size (N = 92). Second, informant ratings were available for only about half of these participants: ratings were obtained from a single friend, and only 55 participants identified a friend who provided them. Even in 2003, it was common to use larger samples and more informants to measure well-being (e.g., Schimmack & Diener, 2003). Moreover, friends are not as good as family members at reporting on well-being (Schneider & Schimmack, 2009). It only attests to Taylor’s social power that such a crappy, underpowered study was published in JPSP.

The results showed no significant correlations between various measures of positive illusions (self-enhancement) and peer-ratings of mental health (last row).

Thus, the study provided no evidence for the claim in the abstract that positive illusions about the self predict well-being or mental health without the confound of shared method variance.


Dufner, Gebauer, Sedikides, and Denissen (2019) conducted a meta-analysis of the literature. The abstract gives the impression that there is a clear positive effect of positive illusions on well-being.

Not surprisingly, studies that used self-ratings of adjustment/well-being/mental health showed positive associations. The more interesting question is how self-enhancement measures are related to non-self-report measures of well-being. Table 3 shows that the meta-analysis identified 22 studies with an informant rating of well-being and that these studies showed a small positive relationship, r = .12.

I was surprised that the authors found 22 studies because my own literature search had uncovered fewer. So, I took a closer look at the 22 studies included in the meta-analysis (see Appendix).

Many of the studies relied on measures of socially desirable responding (the Marlowe-Crowne Social Desirability Scale, the Balanced Inventory of Desirable Responding) as a measure of positive illusions. The problem with these studies is that social desirability scales also contain a notable portion of real personality variance, so they do not conclusively demonstrate that illusions are related to informant ratings of adjustment. Paulhus’s studies are problematic because adjustment ratings were based on first impressions in a zero-acquaintance setting, and the results changed over time: self-enhancers were perceived as better adjusted in the beginning, but as less adjusted later on. The problem here is that well-being ratings in this context have low validity. Finally, most studies were underpowered given the estimated population effect size of r = .12. The only reasonably powered study, by Church et al. with 900 participants, produced a correlation of r = .17 with an unweighted measure and r = .08 with a weighted measure. Overall, these studies do not provide clear evidence that positive illusions about the self have positive effects; if anything, they show that any beneficial effects would be small.

New Evidence

In a forthcoming JRP article, Hyunji Kim and I present the most comprehensive test of Taylor’s positive-illusion hypothesis to date (Schimmack & Kim, 2019). We collected data from 458 triads (students with both biological parents living together). We estimated separate models for students, mothers, and fathers as targets. In each model, targets’ self-ratings of the Big Five were modelled with the halo-alpha-beta model, in which the halo factor represents positive illusions about the self (Anusic et al., 2009). The halo factor was then allowed to predict the variance shared among the well-being ratings of all three raters, and well-being ratings were based on three indicators (global life-satisfaction, average domain satisfaction, and hedonic balance; cf. Zou, Schimmack, & Gere, 2013).

The structural equation model is shown in Figure 1. The complete data, MPLUS syntax and output files and a preprint of the article are available on OSF ( https://osf.io/6z34w/).

The key findings are reported in Table 6. There were no significant relationships between self-rated halo bias and the shared variance among ratings of well-being across the three raters. Although this finding does not prove that positive illusions are not beneficial, it suggests that such benefits are difficult to demonstrate even in studies reasonably powered to detect moderate effect sizes.

The study did replicate the much stronger relationships with self-ratings of well-being. However, this finding raises the question of whether positive illusions are beneficial only in ways that are not visible to close others or whether these relationships simply reflect shared method variance.


Over 30 years ago, Taylor and Brown made the controversial proposal that humans benefit from distorted perceptions of reality. Only this year, a meta-analysis claimed that there is strong evidence to support this claim. I argue that the evidence in support of the illusion model is itself illusory because it rests on studies that relate self-ratings to self-ratings. Given the pervasive influence of rating biases on self-ratings, shared method variance alone is sufficient to explain positive correlations in these studies (Campbell & Fiske, 1959). Only a few studies have attempted to address this problem by using informant ratings of well-being as an outcome measure. These studies tend to find weak relationships that are often not significant. Thus, there is currently no scientific evidence to support Taylor and Brown’s social psychological perspective on mental health. Rather, the literature on positive illusions provides further evidence that social and personality psychologists have been unable to subject the positive illusions hypothesis to a rigorous test. To make progress in the study of well-being it is important to move beyond the use of self-ratings to reduce the influence of method variance that can produce spurious correlations among self-report measures.


Appendix: Studies with informant-rated well-being in the Dufner et al. (2019) meta-analysis. SR = correlation of the self-enhancement (SE) measure with self-reported adjustment; IR = correlation with informant-reported adjustment.

| # | Title | Study | Informants | Source | SE measure | Outcome measure | N | SR | IR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Do Chinese Self-Enhance or Self-Efface? It’s a Matter of Domain | 1 |  | Table 4 | helpfulness | neuroticism | 130 | .48 | .01 |
| 2 | How self-enhancers adapt well to loss: the mediational role of loneliness and social functioning | 1 |  |  | BIDR-SD | SR symptoms (reversed) / IR mental health | 57 | .24 | .34 |
| 3 | Portrait of the self-enhancer: Well adjusted and well liked or maladjusted and friendless? | 1 |  |  |  |  |  |  |  |
| 4 | Social Desirability Scales: More Substance Than Style | 1 |  | Table 2 | MCSD | depression (reversed) | 215 | .49 | .31 |
| 5 | Substance and bias in social desirability responding | 1 | 2 friends | Table 2 | SDE | neuroticism (reversed) | 67 | .39 | .26 |
| 6 | Interpersonal and intrapsychic adaptiveness of trait self-enhancement: A mixed blessing | 1a | zero-acquaintance | Table 2, Time 1 | trait SE | adjustment | 124 | NA | .36 |
| 6 |  | 1b | zero-acquaintance | Table 2, Time 2 | trait SE | adjustment | 124 | NA | -.11 |
| 6 |  | 2 | zero-acquaintance | Table 4, Time 1 | trait SE | adjustment | 89 | NA | .35 |
| 6 |  | 2 | zero-acquaintance | Table 4, Time 2 | trait SE | adjustment | 89 | NA | -.22 |
| 7 | A test of the construct validity of the Five-Factor Narcissism Inventory | 1 | 1 peer | Table 1 | FFNI Vulnerability | neuroticism | 287 | .50 | .33 |
| 8 | Moderators of the adaptiveness of self-enhancement: Operationalization, motivational domain, adjustment facet, and evaluator | 1 | 3 peers/family members |  | self-residuals | adjustment | 123 | .22 | -.20 |
| 9 | Grandiose and Vulnerable Narcissism: A Nomological Network Analysis | 1 |  |  |  |  |  | NA | NA |
| 10 | Socially desirable responding in personality assessment: Still more substance than style | 1a | 1 roommate | Table 1 | MCSD | neuroticism (reversed) | 128 | .41 | .06 |
| 10 |  | 1b | parents | Table 1 | MCSD | neuroticism (reversed) | 128 | .41 | .09 |
| 11 | Two faces of human happiness: Explicit and implicit life-satisfaction | 1a | 1 peer | Table 1 | BIDR-SD | PANAS | 159 | .45 | .17 |
| 11 |  | 1b | 1 peer | Table 1 | BIDR-SD | LS | 159 | .36 | -.03 |
| 12 | Socially desirable responding in personality assessment: Not necessarily faking and not necessarily substance | 1 | 1 roommate | Table 2 | BIDR-SD | neuroticism (reversed) | 602 | .26 | .02 |
| 13 | Depression and the chronic pain experience | 1 | none |  | MCSD |  |  | NA | NA |
| 14 | Trait self-enhancement as a buffer against potentially traumatic events: A prospective study | 1 | friends | Table 5 | BIDR-SD | mental health | 32 | NA | -.01 |
| 15 | Big Tales and Cool Heads: Academic Exaggeration Is Related to Cardiac Vagal Reactivity | 1 |  |  |  |  | 62 | NA | NA |
| 16 | Are Actual and Perceived Intellectual Self-Enhancers Evaluated Differently by Social Perceivers? | 1 | 1 friend | Table 1, above diagonal | SE intelligence | neuroticism (reversed) | 337 | .17 | .15 |
| 16 |  | 3 | zero-acquaintance | Table 1, below diagonal | SE intelligence | neuroticism (reversed) | 183 | .19 | .38 |
| 17 | Response artifacts in the measurement of subjective well-being | 1 | 7 friends/family | Table 1 | MCSD | LS | 108 | .30 | .36 |
| 18 | A Four-Culture Study of Self-Enhancement and Adjustment Using the | 1a | 6 friends/family | Table 6, SRM unweighted | SRM | LS | 900 | .53 | .17 |
| 18 |  | 1b | 6 friends/family | Table 6, SRM weighted | SRM | LS | 900 | .49 | .08 |
| 19 | You Probably Think This Paper’s About You: Narcissists’ Perceptions of Their Personality and Reputation | 1 |  |  |  |  |  | NA | NA |
| 20 | What Does the Narcissistic Personality Inventory Really Measure? | 4 | roommates |  | NPI-Grandiose | college adjustment | 200 | .48 | .27 |
| 21 | Self-enhancement as a buffer against extreme adversity: Civil war in Bosnia and traumatic loss in the United States | 1 | mental health experts |  | self-peer discrepancy | adjustment difficulties (reversed) | 78 | .47 | .27 |
| 21 |  | 2 | mental health experts | Table 2, 25 months | BIDR-SD | self distress / MHE PTSD | 74 | .30 | .35 |
| 22 | Self-enhancement among high-exposure survivors of the September 11th terrorist attack: Resilience or social maladjustment | 1 | friend/family |  | BIDR-SD | self depression (18 months) / mental health | 45 | .29 | .33 |
| 23 | Decomposing a Sense of Superiority: The Differential Social Impact of Self-Regard and Regard for Others | 1 | zero-acquaintance |  | SRM | neuroticism (reversed) | 235 | NA | .02 |
| 24 | Personality, Emotionality, and Risk Prediction | 1 |  |  |  |  | 94 | NA | NA |
| 24 |  | 2 |  |  |  |  | 119 | NA | NA |
| 25 | Social desirability scales as moderator and suppressor variables | 1 |  |  | MCSD |  | 300 | NA | NA |

Personality, Partnership and Well-Being

Personality psychologists have been successful in promoting the Big Five personality factors as a scientific model of personality. Short scales have been developed that make it possible to include Big Five measures in studies with large nationally representative samples. These data have been used to examine the influence of personality on well-being in married couples (Dyrenforth et al., 2010).

The inclusion of partners’ personality in studies of well-being has produced two findings. First, being married to somebody with a desirable personality (low neuroticism, high extraversion, openness, agreeableness, and conscientiousness) is associated with higher well-being. Second, similarity in personality is not a predictor of higher well-being.

A recent JPSP article mostly replicated these results (van Scheppingen, Chopik, Bleidorn, & Denissen, 2019): “Similar to previous studies using difference scores and profile correlations, results from response surface analyses indicated that personality similarity explained a small amount of variance in well-being as compared with the amount of variance explained by linear actor and partner effects” (p. e51).

Unfortunately, personality psychologists have made little progress in the measurement of the Big Five and continue to use observed scale scores as if they were nearly perfect measures of personality traits. This practice is problematic because numerous studies have demonstrated that a large portion of the variance in Big Five scale scores is measurement error. Moreover, systematic rating biases have been shown to contaminate Big Five scale scores.

Anusic et al. (2009) showed how at least some of the systematic measurement errors can be removed from Big Five scale scores by means of structural equation modelling. In a structural equation model, the shared variance due to evaluative biases can be modelled with a halo factor, while the residual variance is treated as a more valid measure of the Big Five traits.
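The logic of this halo correction can be illustrated with a small simulation. This is a hypothetical sketch with illustrative loadings, not the actual model: two uncorrelated true traits plus a shared evaluative-bias factor produce correlated observed scale scores, and removing the halo variance recovers the near-zero true correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Uncorrelated "true" trait scores for two desirable traits
# (simulated data; all loadings here are illustrative only).
extraversion = rng.normal(size=n)
agreeableness = rng.normal(size=n)

# A shared evaluative-bias (halo) factor loads positively on both ratings.
halo = rng.normal(size=n)
obs_e = extraversion + 0.6 * halo + 0.5 * rng.normal(size=n)
obs_a = agreeableness + 0.6 * halo + 0.5 * rng.normal(size=n)

# Observed scale scores correlate even though the true traits do not.
r_true = np.corrcoef(extraversion, agreeableness)[0, 1]
r_obs = np.corrcoef(obs_e, obs_a)[0, 1]

# Removing the halo variance (here we know the true loadings) brings the
# correlation back toward the near-zero true value, which is what the
# halo factor in the structural equation model accomplishes.
res_e = obs_e - 0.6 * halo
res_a = obs_a - 0.6 * halo
r_partial = np.corrcoef(res_e, res_a)[0, 1]
print(round(r_true, 2), round(r_obs, 2), round(r_partial, 2))
```

In a real analysis the halo loadings are of course estimated from the data rather than known, but the mechanics are the same: shared evaluative variance goes to the halo factor, and the residuals are the more valid trait measures.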

The availability of partner data makes it possible to examine whether the halo biases of husbands and wives are correlated. It is also possible to see whether halo bias of a partner has positive effects on well-being. As halo bias in self-ratings is often considered a measure of self-enhancement, it is possible that partners who enhance their own personality have a negative effect on their partner’s well-being. Alternatively, partners who enhance themselves are also more likely to have a positive perception of their partner (Kim et al., 2012), which could increase well-being. An interesting question is how much a partner’s actual personality influences well-being after halo bias is removed from partner ratings of personality.

It was easy to test these hypotheses with the correlations reported in Table 1 of van Scheppingen et al.’s article, which is based on N = 4,464 couples in the Health and Retirement Study. Because information about standard deviations was not provided, all SDs were set to 1. However, the actual SDs of Big Five traits tend to be similar, so this is a reasonable approximation.
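Setting all SDs to 1 amounts to analyzing the correlation matrix as if it were a covariance matrix. A minimal numpy sketch (the correlations below are made-up stand-ins for Table 1, not the published values):

```python
import numpy as np

# Hypothetical 3-variable correlation matrix standing in for Table 1.
R = np.array([[ 1.00,  0.32, -0.25],
              [ 0.32,  1.00, -0.10],
              [-0.25, -0.10,  1.00]])

# A covariance matrix is D @ R @ D with D = diag(SDs).
# With all SDs set to 1, the covariance matrix equals the
# correlation matrix, which is the approximation used here.
sds = np.ones(3)
D = np.diag(sds)
cov = D @ R @ D
print(np.allclose(cov, R))  # True
```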

I fitted the Halo-Alpha-Beta model to the data, but as with other datasets alpha could not be identified. Instead, a positive correlation between agreeableness and extraversion was present in this Big Five measure, which may reflect some secondary loadings that could be modelled with items as indicators. I allowed for the two halo factors to be correlated and I allowed well-being to be predicted by actor-halo and partner-halo. I also allowed for spousal similarity for each Big Five dimension. Finally, well-being was influenced by self-neuroticism and partner-neuroticism because neuroticism is the strongest predictor of well-being. This model had acceptable fit, CFI = .981, RMSEA = .038.
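Readers who want to sanity-check a fit index such as the reported RMSEA = .038 can do so by hand. One common point-estimate formula is sqrt(max(chi2 − df, 0) / (df × (N − 1))); the chi-square and degrees of freedom below are hypothetical values chosen only to illustrate the arithmetic at N = 4,464:

```python
import math

def rmsea(chi2, df, n):
    """RMSEA point estimate from a model chi-square.
    One common formula: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Hypothetical chi2 and df chosen to land near the reported .038 at N = 4,464.
print(round(rmsea(372.2, 50, 4464), 3))  # 0.038

# A model whose chi-square does not exceed its df has RMSEA = 0.
print(rmsea(50, 50, 1000))  # 0.0
```

The formula makes clear why large-sample models can fit "acceptably" despite a highly significant chi-square: the sample size sits in the denominator.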

Figure 1 shows the model and the standardized parameter estimates.

The main finding is that self-halo is the strongest predictor of self-rated well-being. This finding replicates Kim et al.’s (2012) results. True neuroticism (tna; i.e., variance in neuroticism ratings without halo bias) is the second strongest predictor. The third strongest predictor is partner’s true neuroticism, although it explains less than 1% of the variance in well-being. The model also shows a positive correlation between partners’ halo factors, r = .32. This is the first demonstration that spouses’ halos are positively correlated. More research is needed to examine whether this is a robust finding and what factors contribute to spousal similarity in halo. This correlation has implications for spousal similarity in actual personality traits. After removing shared halo variance, spousal similarity is only notable for openness, r = .19, and neuroticism, r = .13.

The key implication of this model is that actual personality traits, at least those measured with the Big Five, have a relatively small effect on well-being. The only trait with a notable contribution is neuroticism, but partner’s neuroticism explains less than 1% of the variance in well-being. An open question is whether the effect of self-halo should be considered a true effect on well-being or whether it simply reflects shared method variance (Schimmack & Kim, in press).
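The "less than 1% of the variance" claim follows directly from the standardized path: with standardized variables, a coefficient b uniquely accounts for roughly b squared of the outcome variance (exactly so when predictors are uncorrelated). A one-line check with a hypothetical coefficient under .10 (the actual estimate is in Figure 1):

```python
# Hypothetical standardized path from partner's true neuroticism to
# well-being; any coefficient below .10 explains under 1% of variance.
b_partner_neuroticism = -0.09
share = b_partner_neuroticism ** 2
print(f"{share:.1%}")  # 0.8%
```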

It is well-known that well-being is relatively stable over extended periods of time (Anusic & Schimmack, 2016; Schimmack & Oishi, 2005) and that spouses have similar levels of well-being (Schimmack & Lucas, 2010). The present results suggest that the Big Five personality traits account only for a small portion of the stable variance that is shared between spouses. This finding should stimulate research that looks beyond the Big Five to study well-being in married couples. This blog post shows the utility of structural equation modelling to do so.