Dan Ariely and the Credibility of (Social) Psychological Science

It was relatively quiet on academic twitter when most academics were enjoying the last weeks of summer before the start of a new, new-normal semester. This changed on August 17, when the datacolada crew published a new blog post that revealed fraud in a study of dishonesty (http://datacolada.org/98). Suddenly, the integrity of social psychology was once again discussed on twitter, in several newspaper articles, and an article in Science magazine (O’Grady, 2021). The discovery of fraud in one dataset raises questions about other studies in articles published by the same researcher as well as in social psychology in general (“some researchers are calling Ariely’s large body of work into question”; O’Grady, 2021).

The brouhaha about the discovery of fraud is understandable because fraud is widely considered an unethical behavior that violates standards of academic integrity that may end a career (e.g., Stapel). However, there are many other reasons to be suspect of the credibility of Dan Ariely’s published results and those by many other social psychologists. Over the past decade, strong scientific evidence has accumulated that social psychologists’ research practices were inadequate and often failed to produce solid empirical findings that can inform theories of human behavior, including dishonest ones.

Arguably, the most damaging finding for social psychology was the finding that only 25% of published results could be replicated in a direct attempt to reproduce original findings (Open Science Collaboration, 2015). With such a low base-rate of successful replications, all published results in social psychology journals are likely to fail to replicate. The rational response to this discovery is to not trust anything that is published in social psychology journals unless there is evidence that a finding is replicable. Based on this logic, the discovery of fraud in a study published in 2012 is of little significance. Even without fraud, many findings are questionable.

Questionable Research Practices

The idealistic model of a scientist assumes that scientists test predictions by collecting data and then let the data decide whether the prediction was true or false. Articles are written to follow this script with an introduction that makes predictions, a results section that tests these predictions, and a conclusion that takes the results into account. This format makes articles look like they follow the ideal model of science, but it only covers up the fact that actual science is produced in a very different way; at least in social psychology before 2012. Either predictions are made after the results are known (Kerr, 1998) or the results are selected to fit the predictions (Simmons, Nelson, & Simonsohn, 2011).

This explains why most articles in social psychology support authors’ predictions (Sterling, 1959; Sterling et al., 1995; Motyl et al., 2017). This high success rate is not the result of brilliant scientists and deep insights into human behaviors. Instead, it is explained by selection for (statistical) significance. That is, when a result produces a statistically significant result that can be used to claim support for a prediction, researchers write a manuscript and submit it for publication. However, when the result is not significant, they do not write a manuscript. In addition, researchers will analyze their data in multiple ways. If they find one way that supports their predictions, they will report this analysis, and not mention that other ways failed to show the effect. Selection for significance has many names such as publication bias, questionable research practices, or p-hacking. Excessive use of these practices makes it easy to provide evidence for false predictions (Simmons, Nelson, & Simonsohn, 2011). Thus, the end-result of using questionable practices and fraud can be the same; published results are falsely used to support claims as scientifically proven or validated, when they actually have not been subjected to a real empirical test.

Although questionable practices and fraud have the same effect, scientists make a hard distinction between fraud and QRPs. While fraud is generally considered to be dishonest and punished with retractions of articles or even job losses, QRPs are tolerated. This leads to the false impression that articles that have not been retracted provide credible evidence and can be used to make scientific arguments (studies show ….). However, QRPs are much more prevalent than outright fraud and account for the majority of replication failures, but do not result in retractions (John, Loewenstein, & Prelec, 2012; Schimmack, 2021).

The good news is that the use of QRPs is detectable even when original data are not available, whereas fraud typically requires access to the original data to reveal unusual patterns. Over the past decade, my collaborators and I have worked on developing statistical tools that can reveal selection for significance (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020; Schimmack, 2012). I used the most advanced version of these methods, z-curve.2.0, to examine the credibility of results published in Dan Ariely’s articles.


To examine the credibility of results published in Dan Ariely’s articles I followed the same approach that I used for other social psychologists (Replicability Audits). I selected articles based on authors’ H-Index in WebOfKnowledge. At the time of coding, Dan Ariely had an H-Index of 47; that is, he published 47 articles that were cited at least 47 times. I also included the 48th article that was cited 47 times. I focus on the highly cited articles because dishonest reporting of results is more harmful, if the work is highly cited. Just like a falling tree may not make a sound if nobody is around, untrustworthy results in an article that is not cited have no real effect.

For all empirical articles, I picked the most important statistical test per study. The coding of focal results is important because authors may publish non-significant results when they made no prediction. They may also publish a non-significant result when they predict no effect. However, most claims are based on demonstrating a statistically significant result. The focus on a single result is needed to ensure statistical independence which is an assumption made by the statistical model. When multiple focal tests are available, I pick the first one unless another one is theoretically more important (e.g., featured in the abstract). Although this coding is subjective, other researchers including Dan Ariely can do their own coding and verify my results.

Thirty-one of the 48 articles reported at least one empirical study. As some articles reported more than one study, the total number of studies was k = 97. Most of the results were reported with test-statistics like t, F, or chi-square values. These values were first converted into two-sided p-values and then into absolute z-scores. 92 of these z-scores were statistically significant and used for a z-curve analysis.

Z-Curve Results

The key results of the z-curve analysis are captured in Figure 1.

Figure 1

Visual inspection of the z-curve plot shows clear evidence of selection for significance. While a large number of z-scores are just statistically significant (z > 1.96 equals p < .05), there are very few z-scores that are just shy of significance (z < 1.96). Moreover, the few z-scores that do not meet the standard of significance were all interpreted as sufficient evidence for a prediction. Thus, Dan Ariely’s observed success rate is 100% or 95% if only p-values below .05 are counted. As pointed out in the introduction, this is not a unique feature of Dan Ariely’s articles, but a general finding in social psychology.

A formal test of selection for significance compares the observed discovery rate (95% z-scores greater than 1.96) to the expected discovery rate that is predicted by the statistical model. The prediction of the z-curve model is illustrated by the blue curve. Based on the distribution of significant z-scores, the model expected a lot more non-significant results. The estimated expected discovery rate is only 15%. Even though this is just an estimate, the 95% confidence interval around this estimate ranges from 5% to only 31%. Thus, the observed discovery rate is clearly much much higher than one could expect. In short, we have strong evidence that Dan Ariely and his co-authors used questionable practices to report more successes than their actual studies produced.

Although these results cast a shadow over Dan Ariely’s articles, there is a silver lining. It is unlikely that the large pile of just significant results was obtained by outright fraud; not impossible, but unlikely. The reason is that QRPs are bound to produce just significant results, but fraud can produce extremely high z-scores. The fraudulent study that was flagged by datacolada has a z-score of 11, which is virtually impossible to produce with QRPs (Simmons et al., 2001). Thus, while we can disregard many of the results in Ariely’s articles, he does not have to fear to lose his job (unless more fraud is uncovered by data detectives). Ariely is also in good company. The expected discovery rate for John A. Bargh is 15% (Bargh Audit) and the one for Roy F. Baumester is 11% (Baumeister Audit).

The z-curve plot also shows some z-scores greater than 3 or even greater than 4. These z-scores are more likely to reveal true findings (unless they were obtained with fraud) because (a) it gets harder to produce high z-scores with QRPs and replication studies show higher success rates for original studies with strong evidence (Schimmack, 2021). The problem is to find a reasonable criterion to distinguish between questionable results and credible results.

Z-curve make it possible to do so because the EDR estimates can be used to estimate the false discovery risk (Schimmack & Bartos, 2021). As shown in Figure 1, with an EDR of 15% and a significance criterion of alpha = .05, the false discovery risk is 30%. That is, up to 30% of results with p-values below .05 could be false positive results. The false discovery risk can be reduced by lowering alpha. Figure 2 shows the results for alpha = .01. The estimated false discovery risk is now below 5%. This large reduction in the FDR was achieved by treating the pile of just significant results as no longer significant (i.e., it is now on the left side of the vertical red line that reflects significance with alpha = .01, z = 2.58).

With the new significance criterion only 51 of the 97 tests are significant (53%). Thus, it is not necessary to throw away all of Ariely’s published results. About half of his published results might have produced some real evidence. Of course, this assumes that z-scores greater than 2.58 are based on real data. Any investigation should therefore focus on results with p-values below .01.

The final information that is provided by a z-curve analysis is the probability that a replication study with the same sample size produces a statistically significant result. This probability is called the expected replication rate (ERR). Figure 1 shows an ERR of 52% with alpha = 5%, but it includes all of the just significant results. Figure 2 excludes these studies, but uses alpha = 1%. Figure 3 estimates the ERR only for studies that had a p-value below .01 but using alpha = .05 to evaluate the outcome of a replication study.

Figur e3

In Figure 3 only z-scores greater than 2.58 (p = .01; on the right side of the dotted blue line) are used to fit the model using alpha = .05 (the red vertical line at 1.96) as criterion for significance. The estimated replication rate is 85%. Thus, we would predict mostly successful replication outcomes with alpha = .05, if these original studies were replicated and if the original studies were based on real data.


The discovery of a fraudulent dataset in a study on dishonesty has raised new questions about the credibility of social psychology. Meanwhile, the much bigger problem of selection for significance is neglected. Rather than treating studies as credible unless they are retracted, it is time to distrust studies unless there is evidence to trust them. Z-curve provides one way to assure readers that findings can be trusted by keeping the false discovery risk at a reasonably low level, say below 5%. Applying this methods to Ariely’s most cited articles showed that nearly half of Ariely’s published results can be discarded because they entail a high false positive risk. This is also true for many other findings in social psychology, but social psychologists try to pretend that the use of questionable practices was harmless and can be ignored. Instead, undergraduate students, readers of popular psychology books, and policy makers may be better off by ignoring social psychology until social psychologists report all of their results honestly and subject their theories to real empirical tests that may fail. That is, if social psychology wants to be a science, social psychologists have to act like scientists.

The Myth of Lifelong Personality Development

The German term for development is Entwicklung and evokes the image of a blossom slowly unwrapping its petals. This process has a start and a finish. At some point the blossom is fully open. Similarly, human development has a clear start with conception and usually an end when an individual becomes an adult. Not surprisingly, developmental psychology initially focused on the first two decades of a human life.

At some point, developmental psychologists also started to examine the influence of age at the end of life. Here, the focus was on successful aging in the face of biological decline. The idea of development at the beginning of life and decline at the end of life is consistent with the circle of life that is observed in nature.

In contrast to the circular conception of life, some developmental psychologists propose that that some psychological processes continue to develop throughout adulthood. The idea of life-long development or growth makes the most sense for psychological processes that depend on learning. Over the life course, individuals acquire knowledge and skills. Although practice or the lack thereof may influence performance, individuals with a lot of experience are able to build on their past experiences.

Personality psychologists have divergent views about the development of personality. Some assume that personality is like many other biological traits. They develop during childhood when the brain establishes connections. However, when this process is completed, personality remains fairly stable. Moreover, new experiences may still change neural patterns and personality, but these changes will be idiosyncratic and differ from person to person. These theories do not predict a uniform increase in some personality traits during adulthood.

An alternative view is that we can distinguish between immature and mature personalities and that personality changes towards a goal of the completely mature personality, akin to the completely unfolded blossom. Moreover, this process of personality development or maturation does not end at the end of childhood. Rather, it is a lifelong process that continuous over the adult life-span. Accordingly, personality becomes more mature as individuals are getting older.

What is a Mature Personality?

The notion of personality development during adulthood implies that some personality traits are more mature than others. After all, developmental processes have an end goal and the end goal is the mature state of being.

However, it is difficult to combine the concepts of personality and development because personality implies variation across individuals, just like there is variation across different types of flowers in terms of the number, shape, and color of petals. Should we say that a blossom with more petals is a better blossom? Which shape or color would reflect a better blossom? The answer is that there is no optimal blossom. All blossoms are mature when they are completely unfolded, but this mature state can look every different for different flowers.

Some personality psychologists have not really solved this problem, but rather used the notion of personality development as a label for any personality changes irrespective of direction. “The term ‘personality development’, as used in this paper, is mute with regard to direction
of change. This means that personality development is not necessarily positive change due to functional adjustment, growth or maturation” (Specht et al., 2014, p. 217). While it is annoying that researchers may falsely use the term development when they mean change, it does absolve the researchers from specifying a developmental theory of personality development.

However, others take the notion of a mature personality more seriously (e.g., Hogan & Roberts, 2004, see also Specht et al., 2014). Accordingly, “a mature person from the observer’s viewpoint would be agreeable (supportive and warm), emotionally stable (consistent and positive), and conscientious (honoring commitments and playing by the rules)” (Hogan & Roberts, 2008, p. 9). According to this conception of a mature personality, the goal of personality development is to achieve a low level of neuroticism and high levels of agreeableness and conscientiousness.

Another problem for personality development theories is the existence of variation in mature traits in adulthood. If agreeableness, conscientiousness, and emotional stability are so useful in adult life, it is not clear why some individuals are biologically disposed to have low levels of these traits. The main explanation for variability in traits is that there are trade-offs and that neither extreme is optimal. For example, too much conscientiousness may lead to over-regulated behaviors that are not adaptive when life changes and being too agreeable makes individuals vulnerable to exploitation. In contrast, developmental theories imply that individuals with high levels of neuroticism and low levels of agreeableness or conscientiousness are not fully developed and would have to explain why some individuals do to achieve maturity.

Developmental processes also tend to have a specified time for the process to be completed. For example, flowers blossom at a specified time of year that is optimal for pollination. In humans, sexual development is completed by the end of adolescence to enable reproduction. So, it is reasonable to ask why development of personality should not also have a normal time of completion. If maturity is required to take on the tasks of an adult, including having children and taking care of them, the process should be completed during early adulthood, so that these trait are fully developed when they are needed. It would therefore make sense to assume that most of the development is completed by age 20 or at least age 30, as proposed by Costa and McCrae (cf. Specht et al., 2014). It is not clear why maturation would still occur in middle age or old age.

One possible explanation for late development could be that some individuals have a delayed or “arrested” development. Maybe some environmental factors impede the normal process of development, but the causal forces persist and can still produce the normative change later in adulthood. Another possibility is that personality development is triggered by environmental events. Maybe having children or getting married are life events that trigger personality development in the same way men’s testosterone levels appear to decrease when they enter long-term relationships and have children.

In short, a theory of lifelong development faces some theoretical challenges and alternative predictions about personality in adulthood are possible.

Empirical Claims

Wrzus and Roberts (2017) claim that agreeableness, conscientiousness, and emotional stability increase from young to middle adulthood citing Roberts et al. (2006), Roberts & Mroczek (2008), and Lucas and Donnellan (2011). They also propose that these changes co-occur with life transitions citing Bleidorn (2012, 2015), Le Donnellan, & Conger (2014), Lodi Smith & Roberts (2012), Specht, Egloff, and Schmukle (2011) and Zimmermann and Neyer (2013). A causal role of life events is implied by the claim that mean levels of the traits decrease in old age (Berg & Johansson, 2014; Kandler, Kornadt, Hagemeyer, & Neyer, 2015; Lucas & Donnellan, 2011; Mottus, Johnson, Starr, & Neyer, 2012). Focusing on work experiences, Asselmann and Specht (2020) propose that conscientiousness increases when people enter the workforce and decreases again at the time of retirement.

A recent review article by Costa, McCrae, and Lockenhoff (2019) also suggests that neuroticism decreases and agreeableness and conscientiousness increase over the adult life-span. However, they also point out that these age-trends are “modest.” They suggest that traits change by about one T-score per decade, which is a standardized mean difference of less than .2 standard deviations per decade. However, this effect size implies that changes may be as large as 1 standard deviation from age 20 to age 70.

More recently, Graham et al. (2020) summarized the literature with the claim that “during the emerging adult and midlife years, agreeableness, conscientiousness, openness, and extraversion tend to increase and neuroticism tends to decrease” (p. 303). However, when they conducted an integrated analysis of 16 longitudinal studies, the results were rather different. Most importantly, agreeableness did not increase. The combined effect was b = .02, with a 95%CI that included zero, b = -.02 to .07. Despite the lack of evidence that agreeableness increases with age during adulthood, the authors “tentatively suggest that agreeableness may increase over time” (p. 312).

The results for conscientiousness are even more damaging for the maturation theory. Here most datasets show a decrease in conscientiousness and the average effect size is statistically significant, b = -.05, 95%CI = -.09 to -.02. However, the effect size is small, suggesting that there is no notable age trend in conscientiousness.

The only trait that showed the predicted age-trend was neuroticism, but the effect size was again small and the upper bound of the 95%CI was close to zero, b = -.05, 95%CI = -.09 to -.01.

In sum, recent evidence from several longitudinal studies challenges the claim that personality develops during adulthood. However, longitudinal studies are often limited by rather short time-intervals of a few years up to one decade. If effect sizes over one decade are small, they can be easily masked by method artifacts (Costa et al., 2019). Although cross-sectional studies have their own problem, they have the advantage that it is much easier to cover the full age-range of adulthood. The key problem in cross-sectional studies is that age-effects can be confounded with cohort effects. However, when multiple cross-sectional studies from different survey years are available, it is possible to separate cohort effects and age-effects. (Fosse & Winship, 2019).

Model Predictions

The maturity model also makes some predictions about age-trends for other constructs. One prediction is that well-being should increase as personality becomes more mature because numerous meta-analyses suggest that emotional stability, agreeableness, and conscientiousness predict higher well-being (Anglim et al., 2020). That being said, falsification of this prediction does not invalidate the maturity model. It is possible that other factors lower well-being in middle age or that higher maturity does not cause higher well-being. However, if the maturity model correctly predicts age effects on well-being, it would strengthen the model. I therefore tested age-effects on well-being and examined whether they are explained by personality development.

Statistical Analysis

Fosse and Winship (2019) noted that “despite the existence of hundreds, if not thousands, of articles and dozens of books, there is little agreement on how to adequately analyze age, period, and cohort data” (p. 468). This is also true for studies of personality development. Many of these studies fail to take cohort effects into account or ignore inconsistencies between cross-sectional and longitudinal results.

Fosse and Winship point out that that there is an identification problem when cohort, period, and age effects are linear, but not if the trends have different distributions. For example, if age effects are non-linear, it is possible to distinguish between linear cohort effects, linear period effects, and non-linear age effects. As maturation is expected to produce stronger effects during early adulthood than in middle and may actually show a decline in older age, it is plausible to expect a non-linear age effect. Thus, I examined age-effects in the German Socio-Economic Panel using a statistical model that examines non-linear age effects, while controlling for linear cohort and linear period effects.

Moreover, I included measures of marital status and work status to examine whether age effects are at least partially explained by these life experiences. The inclusion of these measures can also help with model identification (Fosse & Winship, 2019). For example, work and marriage have well-known age-effects. Thus, any age-effects on personality that are mediated by age are easily distinguished from cohort or period effects.

Measurement of Personality

Another limitation of many previous studies is the use of sum scores as measures of personality traits. It is well-known that these sum scores are biased by response styles (Anusic et al., 2009). Moreover, sum scores are influenced by the specific items that were selected to measure the Big Five traits and specific items can have their own age effects (Costa et al., 2019; Terracciano, McCrae, Brant, & Costa, 2005). Using a latent variable approach, it is possible to correct for random and systematic measurement errors and age effects on individual items. I therefore used a measurement model of personality that corrects for acquiescence and halo biases (Anusic et al., 2009). The specification of the model and detailed results can be found on OSF (https://osf.io/vpcfd/).

A model that assumed only age effects did not fit the data as well as a model that also allowed for cohort and period effects, chi2(df = 211) = 6651, CFI = .974, RMSEA = .021 vs. chi2(df = 201) = 5866, CFI = .977, RMSEA = .020, respectively. This finding shows that age-effects are confounded with other effects in models that do not specify cohort or period effects.

Figure 1 shows the age effects for the Big Five traits.

The results do not support the maturation model. The most inconsistent finding is a strong negative effect of age on agreeableness. However, other traits also did not show a continuous trend throughout adulthood. Conscientiousness increased from age 17 to 35, but remained unchanged afterwards, whereas Openness decreased slightly until age 30 and then increased continuously.

To examine the robustness of these results, I conducted sensitivity analyses with varying controls. The results for agreeableness are shown in Figure 2.

All models show a decreasing trend, but the effect sizes vary. No controls, controlling for either cohort effects or time effects produces a decreasing age trend, but the effect size is small as most scores deviate less than .2 standard deviations from the mean (i.e., zero). However, controlling for time and cohort effects results in the strong decrease observed in Figure 1. Controlling for halo bias makes only a small difference. It is possible that the model that corrects for cohort and time effects overcorrects because it is difficult to distinguish age and time effects. However, none of these results are consistent with the predictions of the maturation model that agreeableness increases throughout adulthood.

Figure 3 takes a closer look at Neuroticism. Inconsistent with the maturation model, most models show a weak increase in neuroticism. The only model that shows a weak decrease controls for cohort effects only. One possible explanation for this finding is that it is difficult to distinguish between non-linear and linear age effects and that the negative time effect is actually an age effect. Even if this were true, the effect size of age is small.

The results for conscientiousness are most consistent with the maturation hypothesis. All models show a big increase from age 17 to age 20, and still a substantial increase from age 20 to age 35. At this point, conscientiousness levels remain fairly stable or decrease in the model that controls only for cohort effects. Although these results are most consistent with the maturation model, they do not support the prediction of a continuous process throughout adulthood. The increase is limited to early adulthood and is stronger at the beginning of adulthood, which is consistent with biological models of development (Costa et al., 2019).

Although not central to the maturation model, I also examined the influence of controls on age-effects for Extraversion and Openness.

Extraversion shows a very small increase over time in the model without controls and the model that controls only for period (time) effects. However, this trend turns negative in models that control for cohort effects. However, all effect sizes are small.

Openness shows different results for models that control for cohort effects or not. Without taking cohort effects into account, openness appears to decrease. However, after taking cohort effects into account, openness stays relatively unchanged until age 30 and then increases gradually. These results suggest that previous cross-section studies may have falsely interpreted cohort effects as age-effects and that openness does not decrease with age.

Work and Marriage as Mediators

Personality psychologists have focussed on two theories to explain increases in conscientiousness during early adulthood. Some personality psychologists assume that it reflects the end stage of a biological process that increases self-regulation throughout childhood and adolescence (Costa & McCrae, 2006; Costa et al., 2019). The process is assumed to be complete by age 30. The present results suggest that it may be a bit later at age 35. The alternative theory is the social roles influence personality (Roberts, Wood, & Smith, 2005). A key prediction of the social investment theory is that personality development occurs when adults take on important social roles such as working full time, entering long-term romantic relationships (marriage), or parenting.

The SOEP makes it possible to test the social investment theory because it included questions about work and marital status. Most young adults start working full-time during their 20s, suggesting that work experiences may produce the increase in conscientiousness during this period. In Germany, marriage occurs later when individuals are in their 30s. Therefore marriage provides a particularly interesting test of the social investment theory because marriage occurs when biological maturation is mostly complete.

Figure 7 shows the age effect for work status. The age effect is clearly visible for all models and only slightly influenced by controlling for cohort or time effects.

Figure 8 shows the figure for marital status with cohabitating participants counted as married. The figure confirms that most Germans enter long-term relationships in their 30s.

To examine the contribution of work and marriage to the development of conscientiousness, I included marriage and work as predictors of conscientiousness. In this model the age-effects on conscientiousness can be decomposed into (a) an effect mediated by work (age -> work -> C), (b) an effect mediated by marriage (age -> married -> C), and an effect of age that is mediated by unmeasured variables (e.g., biological processes). Results are similar for the various models and I present the results for the model that controls for cohort and time effects.

The results show no effect of marriage; that is the effect size for the indirect effect is close to zero, but both work and unmeasured mediators contribute to the total age effect. The unmeasured mediators produce a step increase in the early 20s. This finding is consistent with a biological maturation hypothesis. Moreover, the unmeasured mediators produce a gradual decline over the life span with a surprising uptick at the end. This trajectory may be a sign of cognitive decline. The work effect increases much more gradually and is consistent with the social-role theory. Accordingly, the decrease in conscientiousness after age 55 is related to retirement. The negative effect of retirement on conscientiousness raises some interesting theoretical questions about the definition of personality. Does retirement really alter personality or does it merely alter situational factors that influence conscientious behaviors? To separate these hypotheses, it would be important to examine behaviors outside of work, but the trait measure that was used in this study does not provide information about the consistency of behaviors across different situations.

The key finding is that the data are consistent with two theories that are often treated as mutually exclusive and competing hypotheses. The present results suggest that biological processes and social roles contribute to the development of conscientiousness during early adulthood. However, there is no evidence that this process continuous in middle or late adulthood and role effects tend to disappear as soon as individuals are retiring.

Personality Development and Well-Being

One view of personality assumes that variation is personality is normal and that no personality trait is better than another. In contrast, the maturation model implies that some traits are more desirable, if only because they are instrumental to fulfill roles of adult life like working or maintaining relationships (McCrea & Costa, 1991). Accordingly, more mature individuals should have higher well-being. While meta-analyses suggest that this is the case, they often do not control for rating biases. When rating biases are taken into account, the positive effects of agreeableness and conscientiousness are not always found and are small (Schimmack, Schupp, & Wagner, 2008; Schimmack & Kim, 2020).

Another problem for the maturation theory is that well-being tends to decrease from early to middle adulthood when maturation should produce benefits. However, it is possible that other factors explain this decrease in well-being and maturation buffers these negative effects. To test this hypothesis, I added life-satisfaction to the model and examined mediators of age-effects on life-satisfaction.

An inspection of the direct relationships of personality traits and life-satisfaction confirmed that life-satisfaction ratings are most strongly influenced by neuroticism, b = -.37, se = .01. Response styles also had notable effects; halo b = .15, se = .01, acquiescence, b = .19, se = .01. The effects of the remaining Big Five traits were weak: E b = .078, se = .01, A = .07, se = .01, C = .02, se = .005, O = .07, se = .01. The weak effect of conscientiousness makes it unlikely that age-effects on conscientiousness contribute to age-effects on life-satisfaction.

The next figure shows the age-effect for life-satisfaction. The total effect is rather flat and shows only an increase in the 60s.

The mostly stable level of life-satisfaction masks two opposing trends. As individuals enter the workforce and get married, life-satisfaction actually increases. The positive trajectory for work reverses when individuals retire, while the positive effect of marriage remains. However, the positive effects of work and marriage are undone by unexplained factors that decrease well-being until age 50, when a rebound is observed. Neuroticism is not a substantial mediator because there are no notable age-effects on neuroticism. Conscientiousness is not a notable mediator because it does not predict life-satisfaction.

The main insight from these findings is that achieving major milestones of adult life is associated with increased well-being, but that these positive effects are not explained by personality development.


Narrative reviews claim that personality develops steadily through adulthood. For example, in a just published review of the literature Roberts and Yoon claim that “agreeableness, conscientiousness, and emotional stability show increases steadily through midlife” (p. 10). Roberts and Yoon also claim that “forming serious partnerships is associated with decreases in neuroticism and increases in conscientiousness” (p. 11). The problem with these broad and vague statements is that they ignores inconsistencies across cross-sectional and longitudinal analyses (Lucas & Donnellan, 2011), inconsistencies across populations (Graham et al., 2020), and effect sizes (Costa et al., 2019).

The present results challenge this simplistic story of personality development. First, only conscientiousness shows a notable increase from late adolescence to middle age and most of the change occurs during early adulthood before the age of 35. Second, formation of long-term relationships had no effect on neuroticism or conscientiousness. Participation in the labor force did increase conscientiousness, but these gains were lost when older individuals retired. If conscientiousness were a sign of maturity, it is not clear why it would decrease after it was acquired. In short, the story of life-long development is not based on scientific facts.

The notion of personality development is also problematic from a theoretical perspective. It implies that some personality traits are better, more mature, than others. This has led to calls for interventions to help people to become more mature (Bleidorn et al., 2019). However, this proposal imposes values and implicitly devalues individuals with the wrong traits. An alternative view treats personality as variation without value judgment. Accordingly, it may be justified to help individuals to change their personality if they want to change their personality, just like gender changes are now considered a personal choice without imposing gender norms on individuals. However, it would be wrong to subject individuals to programs that aim to change their personality, just like it is now considered wrong to subject individuals to interventions that target their sexual orientation. Even if individuals want to change, it is not clear how much personality can be changed. Thus, another goal should be to help individuals with different personality traits to feel good about themselves and to live fulfilling lives that allow them to express their authentic personality. The rather weak relationships between many personality traits and well-being suggests that it is possible to have high well-being with a variety of personalities. The main exception is neuroticism, which has a strong negative effect on well-being. However, the question here is how much of this relationship is driven by mood disorders rather than normal variation in personality. The effect may also be moderated by social factors that create stress and anxiety.

In conclusion, the notion of personality development lacks clear theoretical foundations and empirical support. While there are some relatively small mean level changes in personality over the life span, they are relatively trivial compared to the large stable variance in personality traits across individuals. Rather than considering this variation as arrested forms of development, it should be celebrate as diversity that enriches everybody’s life.

Conflict of Interest: My views may be biased by my (immature) personality (high N, low A, low C).

P.S. I asked Brent W. Roberts for comments, but he declined the opportunity. Please share your comments in the comment section.

Most published results in medical journals are not false

Peer Reviewed by Editors of Biostatistics “You have produced a nicely written paper that seems to be mathematically correct and I enjoyed reading” (Professor Dimitris Rizopoulos & Professor Sherri Rose)

Estimating the false discovery risk in medical journals

Ulrich Schimmack
Department of Psychology, University of Toronto Mississauga 3359 Mississauga Road N. Mississauga, Ontario Canada ulrich.schimmack@utoronto.ca

Frantisek Bartos
Department of Psychology, University of Amsterdam;
Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic


Jager and Leek (2014) proposed an empirical method to estimate the false discovery rate in top medical journals and found a false discovery rate of 14%. Their work received several critical com- ments and has had relatively little impact on meta-scientific discussions about medical research. We build on Jager and Leek’s work and present a new way  to estimate the false discovery risk.  Our results closely reproduce their original finding with a false discovery rate of 13%. In addition, our method shows clear evidence of selection bias in medical journals, but the expected discovery rate is 30%, much higher than we would expect if most published results were false. Our results provide further evidence that meta-science needs to be built on solid empirical foundations.

Keywords: False discovery rate; Meta-analysis; Science-wise false discovery rate; Significance testing; Statistical powe


The successful development of vaccines against Covid-19 provides a vivid example of a scientific success story. At the same time, many sciences are facing a crisis of confidence in published results. The influential article “Why most published research findings are false” suggested that many published significant results are false discoveries (Ioannidis, 2005). One limitation of Ioannidis’s article was the reliance on a variety of unproven assumptions. For  example, Ioannidis assumed  that only 1 out of 11 exploratory epidemiological studies tests a true hypothesis. To address this limitation, Jager and Leek (2014) developed a statistical model to estimate the percentage of false- positive results in a set of significant p-values. They applied their model to 5,322 p-values from medical journals and found that only 14% of the significant results may be false-positives. This     is a sizeable percentage, but it is much lower than the false-positive rates predicted by Ioannidis. Although Jager and Leek’s article was based on actual data, the article had a relatively week impact on discussions about false-positive risks. So far, the article has received only 73 citations   in WebOfScience. In comparison, Ioannidis’s purely theoretical article has been cited 518 times in 2020 alone. We believe that Jager and Leek’s article deserves a second look and that discussions about the credibility of published results benefit from empirical investigations

Estimating the False Discovery Risk

To  estimate the false discovery rate, Jager and Leek developed a model with two populations   of studies. One population includes studies in which the null-hypothesis is true (H0). The other population includes studies in which the null-hypothesis is false; that is, the alternative hypothesis is true (H1). The model assumes that the observed distribution of significant p-values is a mixture of these two populations.

One problem for this model is that it can be difficult to distinguish between studies in which H0 is true and studies in which H1 is true, but it was tested with low statistical power. Furthermore,  the distinction between the point-zero null-hypothesis, the nil-hypothesis (Cohen, 1994), and alternative hypotheses with very small effect sizes is rather arbitrary. Many effect sizes may not   be exactly zero but too small to have practical significance. This makes it difficult to distinguish clearly between the two populations of studies and estimates based on models that assume distinct populations may be unreliable.

To avoid the distinction between two  populations of  p-values, we  distinguish between the  false discovery rate and the false discovery risk. The false discovery risk does not aim to estimate the actual rate of H0  among significant p-values. Rather, it provides an estimate of the worst-case scenario with the highest possible amount of false-positive results. To estimate the false discovery risk, we take advantage of Soric’s (1989) insight that the maximum false discovery rate is limited by statistical power to detect true effects. When power is 100%, all non-significant results are produced by testing false hypotheses (H0). As this scenario maximizes the number of non-significant H0, it also maximizes the number of significant H0 tests and the false discovery rate.  Soric  showed  that  the  maximum  false  discovery  rate  is  a  direct  function  of  the  discovery rate. For example, if 100 studies produce 30 significant results, the discovery rate is 30%. And when the discovery rate is 30%, the maximum false discovery risk with α = 5% is 0.12. In general, the false discovery risk is a simple transformation of the discovery rate, such as

false discovery risk = (1/discovery rate 1) × (α/(1 − α)).

Our suggestion to estimate the false discovery risk rather than the actual false discovery rate addresses concerns about Jager and Leek’s two-population model that were raised in several commentaries (Gelman and O’Rourke, 2014; Benjamini and Hechtlinger, 2014; Ioannidis, 2005; Goodman, 2014).

If all conducted hypothesis tests were reported, the false discovery risk could be determined simply by computing the percentage of significant results. However, it is well-known that journals are more likely to publish significant results than non-significant results. This selection bias renders  the  observed  discovery  rate  in  journals  uninformative  (Bartoˇs  and  Schimmack,  2021; Brunner and Schimmack, 2020). Thus, a major challenge for any empirical estimates of the false discovery risk is to take selection bias into account.

Biostatistics published several commentaries to Jager and Leek’s article. A commentary by Ioannidis (2014) may have contributed to the low impact of Jager and Leek’s article. Ioannidis claims that Jager and Leek’s  results  can  be  ignored  because  they  used  automatic  extraction of p-values, a wrong method, and unreliable data. We address these concerns by  means of a new extraction method, a new estimation method, and new simulation studies that evaluate the performance of Jager and Leek’s original method and a new method. To  foreshadow the main results, we  find that Jager and Leek’s method can sometimes produce biased estimates of the   false discovery risk. However, our improved method produces even lower estimates of the false discovery risk. When we applied this method to p-values from medical journals, we obtained an estimate of 13% that closely matches Jager and Leek’s original results. Thus, although Ioannidis (2014) raised some valid objections, our results provide further evidence that false discovery rates in medical research are much lower than Ioannidis (2005) predicted.


Jager and Leek (2014) proposed a selection model that could be fitted to the observed distribution of significant p-values. This model assumed a flat distribution for p-values from the H0 population and a beta distribution for p-values from the H1 population. A single beta distribution can only approximate the actual distribution of p-values; a better solution is to use a mixture of several beta-distributions or, alternatively, convert the p-values into z-scores and model the z-scores with several truncated normal distributions (similar to the suggestion by Cox, 2014). Since reported p-values often come from two-sided tests, the resulting z-scores need to be converted into absolute z-scores that can be modeled as a mixture of truncated folded normal distributions (Bartoˇs and Schimmack, 2021). The weights of the mixture components can then be used to compute the average power of studies that produced a significant result. As this estimate is limited to the set of studies that were significant, we refer to it as the average power after selection for statistical significance. As power determines the outcomes of replication studies, the average power after selection for statistical significance is an estimate of the expected replication rate.

Although an estimate of the expected replication rate is valuable in its own right, it does not provide an estimate of the false discovery risk because it is  based  on  the  population of studies after selection for statistical significance. To estimate the expected discovery rate, z-curve models the selection process operating on the significance level and assumes that studies produce  a statistically significant result proportionally to their power. For example, studies  with  50% power produce one non-significant result for every significant result and studies with 20% power produce four statistically non-significant results for every significant result. It is therefore possible to estimate the average power before selection of statistical significance based on the weights of the mixture components that are obtained by  fitting the model to only significant results. As  power determines the percentage of significant results, we refer to average power before selection for statistical significance as the expected discovery rate.Extensive simulation studies have demonstrated that z-curve produces good large-sample estimates  of  the  expected  discovery  rate  with  exact  p-values  (Bartoˇs  and  Schimmack,  2021). Moreover, these simulation studies showed that z-curve produces  robust  confidence  intervals with good coverage. As the false discovery risk is a simple transformation of the EDR, these confidence intervals also provide confidence intervals for estimates of the false discovery risk. To use z-curve for p-values from medical abstracts, we extended z-curve’s expectation-maximization (EM) algorithm (Dempster and others, 1977) to incorporate rounding and censoring similarly to Jager and Leek’s model. To demonstrate that z-curve can obtain valid estimates of the false discovery risk for medical journals, we conducted a simulation study that compared Jager and Leek’s method with z-curve.

Simulation Study

We extended the simulation performed by Jager and Leek in several ways.  Instead of simulating  H1 p-values directly from a beta distribution, we used power estimates from individual studies based on meta-analyses (Lamberink and others, 2018) and simulated p-values of two-sided z-tests with corresponding power (excluding all power estimates based meta-analyses with non-significant results). This allows us to assess the performance of the methods under heterogeneity of power to detect H1 corresponding to the actual literature. To simulate H0 p-values, we used a uniform distribution.

We manipulated the true false discovery rate from 0 to 1 with a step size of 0.01 and simulated 10,000 observed significant p-values. Similarly to Jager and Leek, we performed four simulation scenarios with an increasing percentage of imprecisely reported p-values. Scenario A used exact p-values, scenario B rounded p-values to three decimal places (with p-values lower than 0.001 censored at 0.001), scenario C rounds 20% p-values to two decimal places (with p-values rounded to 0 censored at 0.01), and scenario D first rounds 20% p-values to two decimal places and further censors 20% p-values at on of the closest ceilings (0.05, 0.01, or 0.001).

Figure 1 displays the true (x-axis) vs. estimated (y-axis) false discovery rate (FDR) for Jager and Leek’s method and the false discovery risk for z-curve across the different scenarios (panels). We see that when precise p-values are reported (panel A in the upper left corner), z-curve can handle the heterogeneity in power very well across the whole range of false discovery rates and produces accurate estimates of false discovery risks. Higher estimate than the actual false discovery rates are expected because the false discovery risk is an estimate of the maximum false discovery rate. Discrepancies are especially expected when power of true hypothesis tests is low. For the simulated scenarios, the discrepancies are less than 20 percentage points and decrease as  the true false discovery rate increases. Even though Jager and Leek’s method aims to estimate the true false discovery rates, it produces higher estimates than z-curve. This is problematic because the method produces inflated estimates of the true false discovery rate. Even if the estimates were interpreted as maximum estimates, the method is less sensitive to the actual variation in the false discovery rate than the z-curve method.

Panel B shows that the z-curve method produces similar results when p-values are rounded to three decimals. The Jager and Leek’s method however experiences estimation issues, especially in the lower spectrum of the true false discovery rate since the current implementation only allows to deal with rounding to two decimal places (we also tried specifying the p-values as a rounded input; however, the optimizing routine failed with several errors).

Panel C shows a surprisingly similar performance of the two methods when 20% of p-values are rounded to two decimals, except for very high levels of true false discovery rates, where Jager and Leek’s method starts to underestimate the false discovery rate. Despite the similar performance, the results have to be interpreted as estimates of the false discovery risk (maximum false discovery rate) because both methods overestimate the true false discovery rate for low false discovery rates.

Panel D shows that both methods have problems when 20% of p-values are at the closest ceiling of .05, .01, or .001 without providing clear information about the exact p-value. Z-curve does a little bit better than Jager and Leek’s method. Underestimation of true false discovery rates over 40% is not a serious problem because any actual false discovery rate over 40% is unacceptably high. One solution to the underestimation problem is to exclude p-values that are reported in this way from analyses.

Root mean square error and bias of the false discovery rate estimates for each scenario summa- rized in Table 1 show that z-curve produces estimates with considerably lower root mean square error. The results for bias show that both methods tend to produce higher estimates than the true false discovery rate. For z-curve this is expected because it aims to estimate the maximum false discovery rate. It would only be a problem if estimates of the false discovery risk were lower than the actual false discovery rate. This is only the case in Scenario D, but as shown previously, underestimation only occurs when the true false discovery rate is high.

To summarize, our simulation confirms that Jager and Leek’s method provides meaningful estimates of the false discovery risk and that the method is likely to overestimate the true false discovery rate. Thus, it is likely that the reported estimate of 14% for top medical journals overestimates the actual false discovery rate. Our results also show that z-curve improves over the original method and that the modifications can handle rounding and imprecise reporting when the false discovery rates are below 40%.

Application to Medical Journals

Commentators raised more concerns about Jager and Leek’s mining of p-values than about their estimation method. To address these concerns, we extended Jager and Leek’s data mining approach in the following ways; (1) we extracted p-values only from abstracts labeled as “randomized controlled trial” or “clinical trial” as suggested by  Goodman (2014); Ioannidis (2014); Gelman  and O’Rourke (2014), (2) we improved the regex script for extracting p-values to cover more possible notations as suggested by Ioannidis (2014), (3) we extracted confidence intervals from abstracts not reporting p-values as suggested by Ioannidis (2014); Benjamini and Hechtlinger (2014). We further scraped p-values from abstracts in “PLoS Medicine” to compare the false discovery rate estimates to a less-selective journal as suggested by Goodman (2014). Finally, we randomly subset the scraped p-values to include only a single p-value per abstract in all analyses, thus breaking the correlation between the estimates as suggested by Goodman (2014). Although there are additional limitations inherent to the chosen approach, these improvements, along with our improved estimation method, make it possible to test the prediction by several commentators that the false discovery rate is well above 14%.

We executed the scraping protocol on July 2021 and scraped abstracts published since 2000 (see Table 2 for a summary of the scraped data). Interactive visualization of the individual abstracts and scraped values can be accessed at https://tinyurl.com/zcurve-FDR.

Figure 2 visualizes the estimated false discovery rates based on z-curve and Jager and Leek’s method based on scraped abstracts from clinical trials and randomized controlled  trials  and  further divided by  journal and whether the article was published before (and including) 2010   (left) or after 2010 (right). We see that, in line with the simulation results, Jager and Leek’s  method produces slightly higher false discovery rate estimates. Furthermore, z-curve produced considerably wider bootstrapped confidence intervals, suggesting that the confidence interval reported by Jager and Leek (± 1 percentage point) was too narrow.

A comparison of the false discovery estimates based on data before (and including) 2010 and after 2010 shows that confidence intervals overlap, suggesting that false discovery rates have not changed. Separate analyses based on clinical trials and randomized controlled trials also showed no significant differences (see Figure 3). Therefore, to reduce the uncertainty about the false discovery rate, we estimate the false discovery rate for each journal irrespective of publication year. The resulting false discovery rate estimates based z-curve and Jager and Leek’s method are summarized in Table  3. We  find that all false discovery rate estimates fall within a .05     to .30 interval. Finally, further aggregating data across the journals provides a false discovery rate estimate of 0.13, 95% [0.08, 0.21] based on z-curve and 0.19, 95% [0.17, 0.20] based on Jager and Leek’s method. This finding suggests that Jager and Leek’s extraction method slightly underestimate the false discovery rate, whereas their model overestimated the false discovery rate.

Additional Z-Curve Results

So far, we used the expected discovery rate only to estimate the false discovery risk, but the expected discovery rate provides valuable information in itself. Ioannidis’s predictions of the false discovery rate were based on scenarios that assumed that less than 10% of all hypothesis are true hypothesis. The same assumption was made to recommend lowering α from .05 to .005 (Benjamin and others, 2018). If all true hypotheses were tested with 100% power, the discovery rate would match the percentage of true hypotheses plus the false-positive results; 10% + 90% × .05 =  14.5%. Because the actual power is less than 100%, the discovery rate would be even less, but    the estimated expected discovery rate for top medical journals is 30% with a confidence interval ranging from 20% to 41%. Thus, our results suggest that previous speculations about discovery rates were overly pessimistic.

The expected discovery rate also provides valuable information about the extent of selection bias in medical journals. While the expected discovery rate is only 30%, the observed discovery rate (i.e., the percentage of significant results in abstracts) is more than double (69.7%). This discrepancy is visible in Figure 4. The histogram of observed non-significant z-scores does not match the predicted distribution (blue curve). This evidence of selection bias implies that reported effect sizes are inflated by selection bias. Thus, follow-up studies need to adjust effect sizes when planning the sample sizes via power analyses.

Z-curve also provides information about the replicability of significant results in medical abstracts. The expected replication rate is 65% with a confidence interval  ranging from 61% to    69%. This result suggests that sample sizes should be increased to meet the recommended level    of 80% power. Furthermore, this estimate may be overly optimistic because comparisons of actual replication rates and z-curve predictions show lower success rates for actual replication studies (Bartoˇs and Schimmack, 2021). One reason could be that exact replication studies are impossible and changes in population will result in lower power due to selection bias and regression to the mean. In the worst case, the actual replication rate might be as low as the expected discovery rate. Thus, our results predict that the success rate of actual replication studies in medicine will be somewhere between 30% and 65%.

Finally, z-curve can be used to adjust the significance level α retrospectively to maintain a false discovery risk of less than 5% Goodman. To do so, it is only necessary to compute the expected discovery rate for different levels of α. With α = .01, the expected discovery rate decreases to 20% and the false discovery risk decreases to 4%. Adjusting α to the recommended level of .005 reduced the expected discovery rate to 17% and the false discovery risk to 2%. Based on these results, it is possible to use α = .01 as a criterion to reject the null-hypothesis while maintaining a false positive risk of 5%.


Like many other human activities, science relies on trust. Over the past decade, it has become clear that some aspects of modern science undermine trust. The biggest problem remains the prioritization of new discoveries that meet the traditional threshold of statistical significance. The selection for significance has many undesirable consequences. Although medicine has responded to this problem by demanding preregistration of clinical trials, our results suggest that selection for significance remains a pervasive problem in medical research. As a result, the observed discovery rate and reported effect sizes provide misleading information about the robustness of published results. To maintain trust in medical research, it is important to take selection bias into account. Concerns about the replicability of published results have led to the emergence of meta- science as an active field of research over the past decade. Unlike meta-physics, meta-science is an empirical enterprise that uses data to investigate science. Data can range from survey studies of research practices to actual replication studies. Jager and Leek made a valuable contribution to meta-science by developing a method to estimate the false discovery rate based on published p-values using a statistical model that takes selection bias into account. Their work stimulated discussion, but their key finding that false discovery rates in medicine are not at an alarmingly high rate was ignored. We followed up on Jager and Leek’s seminal contribution with a different estimation model and an improved extraction method to harvest results from medical abstracts. Despite these methodological improvements, our results firmly replicated Jager and Leek’s key finding that false discovery rates in top medical journals are between 10% and 20%.

We also extended the meta-scientific investigation of medical research in several ways. First,  we demonstrated that the false discovery risk can be reduced to less than 5% by lowering the criterion for statistical significance to .01. This recommendation is similar to other proposals     to lower α to .005, but our proposal is based on empirical data. Moreover, the α level can be modified for different fields of studies or it can be changed in the future in response to changes  in research practices. Thus, rather than recommending one fixed α, we recommend to justify α (Lakens and others, 2018). Fields with low discovery rates should use a lower α than fields with high discovery rates to maintain a false discovery risk below 5%.

We also demonstrated that medical journals have substantial selection bias. Whereas the percentage of significant results in abstracts is over 60%, the expected discovery rate is only 30%. This test for selection bias is important because it would be unnecessary to use selection models if selection bias were negligible. Evidence of substantial selection bias may also help to change publication practices in order to reduce selection bias. For example, journals could be evaluated on the basis of the amount of selection bias just like they are being evaluated in terms of impact factors.

Finally, we provided evidence that the average power of studies with significant results is 65%. As power increases for studies with lower p-values, this estimate implies that power for studies  that are significant at p < .01 to produce a p-value below .05 in a replication study would be  even higher. Based on these findings, we  would predict that at least 50% of results that achieved   p < .01 can be successfully replicated. This is comparable to cognitive psychology, where 50% of significant results at p < .05 could be successfully replicated (Open Science Collaboration, 2015).

Limitations and Future Directions

Even though we were able to address several of the criticisms of Jager and Leek’s seminal article, we were unable to address all of them. The question is whether the remaining concerns are sufficient to invalidate our results. We think this is rather unlikely because our results are in line with findings in other fields. The main remaining concern is that mining p-values and confidence intervals from abstracts creates a biased sample of results. The only way to address this concern is to read the actual articles and to pick the focal hypothesis test for the z-curve analysis. Unfortunately, nobody seems to have taken on this daunting task for medical journals. However, social psychologists have hand-coded a large, representative sample of test-statistics (Motyl and others, 2017). The coding used the actual test statistics rather than p-values. Thus, exact p-values were computed and no rounding or truncation problems are present in these data. A z-curve analysis of these data estimated an expected discovery rate of 19%, 95% CI = 6% to 36% (Schimmack, 2020). Given the low replication rate of social psychology, it is not surprising that the expected discovery rate is lower than for medical studies (Open Science Collaboration, 2015). However, even a low expected discovery rate of 19% limits the false discovery risk at 22%, which is not much higher than the false discovery risk in medicine and does not justify the claim that most published results are false. To provide more conclusive evidence for medicine, we strongly encourage hand-coding of medical journals and high powered replication studies. Based on the present results, we predict false positive rates well below 50%.


The zcurve R package is available from https://github.com/FBartos/zcurve/tree/censored.

Supplementary Material

Supplementary Materials including data and R scripts for reproducing the simulations, data scraping, and analyses are available from https://osf.io/y3gae/.


Conflict of Interest: None declared.


Bartos,  Frantisek  and  Schimmack,  Ulrich.  (2021).   Z-curve  2.0:  Estimating  replication rates and discovery rates. Meta-Psychology.

Benjamin, Daniel J, Berger, James O, Johannesson, Magnus, Nosek, Brian A, Wagenmakers,  E-J,  Berk,  Richard,  Bollen,  Kenneth  A,  Brembs,  Bjorn,  Brown, Lawrence, Camerer, Colin and others. (2018). Redefine statistical significance. Nature Human Behaviour 2(1), 6–10.

Benjamini, Yoav and Hechtlinger, Yotam. (2014). Discussion: An estimate of the science- wise false discovery rate and applications to top medical journals by jager and leek. Biostatistics 15(1), 13–16.

Brunner, Jerry and Schimmack, Ulrich. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology 4.

Cohen, Jacob. (1994). The earth is round (p ¡.05). American Psychologist 49(12), 997.

Cox, David R. (2014). Discussion: Comment on a paper by jager and leek. Biostatistics 15(1), 16–18.

Dempster, Arthur P, Laird, Nan M and Rubin, Donald B. (1977). Maximum likelihood  from incomplete data via the em algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39(1), 1–22.

Gelman, Andrew and O’Rourke, Keith. (2014). Difficulties in making inferences about scientific truth from distributions of published p-values. Biostatistics 15(1), 18–23.

Goodman, Steven N. (2014). Discussion: An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics 15(1), 13–16.

Ioannidis, John PA. (2005). Why most  published  research  findings  are  false.  PLoS medicine 2(8), e124.

Ioannidis, John PA.  (2014).  Discussion: Why “an estimate of the science-wise false discovery  rate and application to the top medical literature” is false. Biostatistics 15(1), 28–36.

Jager, Leah R  and  Leek,  Jeffrey T. (2014).  An estimate of the science-wise false discovery  rate and application to the top medical literature. Biostatistics 15(1), 1–12.

Lakens, Daniel, Adolfi, Federico G, Albers, Casper J, Anvari, Farid,  Apps, Matthew AJ, Argamon, Shlomo E, Baguley, Thom, Becker, Raymond B, Benning, Stephen D, Bradford, Daniel E and others. (2018). Justify your alpha. Nature Human Behaviour 2(3), 168–171.

Lamberink, Herm J, Otte, Willem M, Sinke, Michel RT, Lakens, Daniel, Glasziou, Paul P, Tijdink, Joeri K and Vinkers, Christiaan H. (2018). Statistical power of clinical trials increased while effect size remained stable: an empirical analysis of 136,212 clinical trials between 1975 and 2014. Journal of Clinical Epidemiology 102, 123–128.

Motyl, Matt, Demos, Alexander P, Carsel, Timothy S, Hanson, Brittany E, Melton, Zachary J, Mueller, Allison B, Prims, JP, Sun, Jiaqing, Washburn, An- thony N, Wong, Kendal M and others. (2017). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology 113(1), 34.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science 349(6251).

Schimmack, Ulrich.  (2020).  A  meta-psychological  perspective  on  the  decade  of  replication failures in social psychology. Canadian Psychology/Psychologie canadienne.

Soric,  Branko.  (1989).    Statistical  “discoveries”  and  effect-size  estimation.   Journal  of  the American Statistical Association 84(406), 608–610.

How to Build a Monster Model of Well-Being: Part 8

So far, I have built a model that relates the Big Five personality traits to well-being. In this model well-being is defined as the weighted average of satisfaction with life domains, positive affect (happy) and negative affect (sad). I showed that most of the personality effects of the Big Five were mediated by the cheerfulness facet of extraversion and the depressiveness facet of neuroticism. I then showed that that there were no gender differences in well-being because women score higher on depressiveness and cheerfulness. Finally, I showed that middle aged parents of students have lower well-being than students and that these age effects were mediated by lower cheerfulness and lower satisfaction with several life domains. The only exception was romantic satisfaction that was higher among parents than among students
(Part 1, Part 2, Part 3 , Part 4. Part 5, Part 6, Part7). Part 8 examines the relationship between positive illusions and well-being.

Positive Illusions and Well-Being

Philosophers have debated whether positive illusions should be allowed to contribute to individuals’ well-being (Sumner, 1996). Some philosophers demand true happiness, where illusions can produce experiences of happiness, but these experiences do not count towards an individual’s well-being. Other theories, most prominently hedonism, have no problem with illusory happiness. Ideally, we would just live a perfect simulated live (think The Matrix) and not care one bit about the fact that our experiences are not real. A third version allows for illusions to contribute to our well-being, if we choose a sweet lie over a bitter truth.

Psychologists tried to settle these questions empirically. An influential article by Taylor and Brown (1988) declared that positive illusions are good for us (our well-being and mental health) and that realistic perceptions of our lives may be maladaptive and may cause depression. In a world of Covid-19, massive forest fires and flooding, this view rings true. However, positive illusions may also have negative effects that can undermine short-lived benefits of positive illusions.

Diener et al. (1999) list a few studies that seemed to support the view that individuals with positive illusions have higher levels of well-being. However, a key problem in research on positive illusions and well-being is that positive illusions and well-being are often measured with self-ratings. It is therefore unclear whether a positive correlation between these two measures reveals a substantial relationship or simply shared method variance. Relatively few studies have tackled this problem and the results are inconsistent (Dufner et al., 2019). Studies that use informant ratings of well-being are particularly rare and suggest that any effect of positive illusions is at best small (Kim, Schimmack, & Oishi, 2012; Schimmack & Kim, 2020).

The monster-model uses the Mississauga Family Study data that were used by Schimmack and Kim (2020). Thus, no effect of positive illusions on well-being is expected. However, the present model examines a new hypothesis that was not investigated by Schimmack and Kim (2020) because the model focussed on the Big Five and did not include facet measures of cheerfulness and depressiveness. The present study examined whether positive illusions in perceptions of the self are related to cheerfulness and depressiveness. To test this hypothesis, the positive illusion factor of self-ratings was related to cheerfulness, depressiveness as well as to experiences of positive affect, negative affect, and life-satisfaction.

The model is illustrated in Figure 1.

Figure 1 is a bit messy and it may be helpful to read previous posts for the basic model that connects factors (in black) with each other (Black lines). Each factor is based on four indicators (self-ratings, informant ratings by students, informant ratings by mothers, and informant ratings by fathers). Figure 2 shows only the self-ratings as orange boxes marked with sr next to each factor. it is assumed that all of these self-ratings share method variance due to a general evaluative bias factor. This factor is represented as the bigger orange box marked as SR in capital letters. It is assumed that all self-ratings load on this factor (orange arrows). Furthermore, the model assumes a positive effect of cheerfulness on evaluative biases (green arrow) and that positive experiences (happy) are influenced by evaluative biases (another green arrow). Depressiveness is expected to be a negative predictor of the evaluative bias factor (red arrow) and evaluative biases are assumed to have a negative effect on sadness (also a red arrow).

Fitting this model to the data reduced model fit, chi2( 1591) = 2655, CFI = .960, RMSEA = .022. The reason is that the general evaluative factor did not explain all of the residual correlations among self-ratings. To improve model fit, additional correlated residuals were allowed. For example, residual variance in self-ratings of recreation satisfaction and friendship satisfaction were correlated. These residual correlations were freed to maintain good fit. The fit of the final model was close to the fit to a model that allowed all correlated residual to be correlated, chi2(1542) = 2082, CFI = .980, RMSEA = .016.

The first important finding was that all self-ratings showed a significant loading (p < .001) on the evaluative bias factor in the predicted direction. The lowest loading was observed for extraversion, b = .18, se = .04, Z = 4.8. The highest loading was observed for self-ratings of positive affect, b = .62, se = .04, Z = 15.1. The loading for self-ratings of life-satisfaction was b = .51, se = .04, Z = 13.7. These results confirm that evaluative biases make a substantial contribution to self-ratings of well-being.

Reproducing Schimmack and Kim’s results, evaluative biases did not predict life-satisfaction (i.e., the shared variance by self-ratings and informant ratings), b = .03, se = .04, Z = 0.7. Evaluative biases also predicted neither positive affect (happy), b = .02, se = .04, Z = 0.4, nor negative affect (sadness), b = -.04, se = .05, Z = 0.8.

The new findings were that cheerfulness was not a significant predictor of evaluative biases, b = .09, se = .06, Z = 1.5, and that depressiveness was a positive rather than negative predictor of evaluative biases, b = .13, se = .06, Z = 2.7. Thus, there is no evidence that individuals with a depressive personality have a negative bias about their personalities or lives. The positive relationship might be a statistical fluke or it might show some deliberate rating bias to overcorrect for negative biases.

As hinted at in Part 7, the evaluative bias factor was significantly correlated with age, b = .37, se = .05, Z = 7.5. At least in this study, parents provided more favorable ratings of themselves than students. Whether this finding shows a general age trend remains to be examined. However, the finding casts a shadow on studies that rely on self-ratings to study personality development. Maybe some of the positive trends such as increased agreeableness or decreased neuroticism are inflated by these biases. It is therefore important to study personality development with measurement models that control for evaluative biases in personality ratings.


The present results challenge the widely held believe that positive illusions are beneficial for well-being and that the absence of positive illusions is associated with depression. At the same time, the present study did replicate previous findings that measures of positive illusions are correlated with self-ratings of well-being. In my opinion, this finding merely reveals that a positive rating bias also influences self-ratings of well-being. Future research needs to ensure that method bias does not produce spurious correlations between measures of positive illusions and measures of well-being. It is sad but true that thirty years of research have been wasted on studies that did not control for method variance even though method variance in personality ratings has been demonstrated over 100 years ago (Thorndike, 1920) and is one of the most robust and well-replicated findings in personality research (Campbell & Fiske, 1959).

How to Build a Monster Model of Well-Being: Part 7

The first five parts built a model that related personality traits with well-being. Part six added sex (male/female) to the model. It may not come as a surprise that part 7 adds age to the model because sex and age are two commonly measured demographic variables.

Age and Wellbeing

Diener et al.’s (1999) review article pointed out that early views of old age as a period of poor health and misery was not supported by empirical studies. Since then, some studies with national representative samples have found a U-shaped relationship between age and well-being. Accordingly, well-being decreases from young adulthood to middle age and then increases again into old age before well-being declines at the end of life. Thus, there is some evidence for a mid-life crisis (Blanchflower, 2021).

The present dataset cannot examine this U-shaped pattern because data are based on students and their parents, but the U-shaped pattern would predict that students have higher well-being than their middle-aged parents.

McAdams, Lucas, and Donnellan (2012) found that the relationship between age and life-satisfaction was explained by effects of age on life-domains. According to their findings in a British sample, health satisfaction decreased with age, but housing satisfaction increased with age. The average trend across domains mirrored the pattern for life-satisfaction judgments.

Based on these findings, I expected that age was a negative predictor of life-satisfaction and that this negative relationship is mediated by domain satisfaction. To test this prediction I added age as a predictor variable. As for sex, age is an exogeneous variable because age can influence personality and well-being, but personality cannot influence (biological) age. Although age was added as a predictor for all factors in the model, overall model fit decreased, chi2(1478) = 2198, CFI = .973, RMSEA = .019. This can happen when a new variable is also related to the unique variances of indicators. Inspection of the modification indices showed some additional relationships with self-ratings that suggested older respondents have a positive bias in their self-ratings. To allow for this possibility, I allowed all self-ratings to be influenced by age. This modification substantially increased model fit, chi2(1462) = 1970, CFI = .981, RMSEA = .016. I will further examine this positivity bias in the next model. Here I focus on the findings for age and well-being.

As expected, age was a negative predictor of life-satisfaction, b = -.21, se = .04, Z = 5.5. This effect was fully mediated. The direct effect of age on life-satisfaction was close to zero and not significant, b = -.01, se = .04, Z = 0.34. Age also had no direct effect on positive affect (happy), b = .00, se = .00, Z = 0.44, and only a small effect on negative affect (sadness), b = -.03, se = .01, Z = 2.5. Yet, the sign of this relationship shows lower levels of sadness in middle age, which does not explain the lower level of life-satisfaction. In contrast, age was a negative predictor of average domain satisfaction (DSX) and the effect size was close to the effect size for life-satisfaction, b = -.20, se = .05, Z = 4.1. This results replicates McAdams et al.’s (2012) finding that domain satisfaction mediates the effect of age on life-satisfaction.

However, the monster model shows that domain satisfaction is influenced by personality traits. Thus, it is possible that some of the age effects on domain satisfaction are not only influenced by objective domain aspects, but also by top-down effects of personality traits. To examine this, I traced the indirect effects of age on average domain satisfaction.

Age was a notable negative predictor of cheerfulness, b = -.29, se = .04, Z = 7.5. This effect was partially mediated by extraversion, b = -.07, se = 02, Z = 3.5 and agreeableness, b = -.08, se = .02, Z = 4.5, while some of the effect was direct, b = -.14, se = .03, Z = 4.4. There was no statistically significant effect of age on depressiveness, b = .07, se = 04, Z = 1.9.

Age also had direct relationships with some life domains. Age was a positive predictor of romantic satisfaction, b = .36, se = .04, Z = 8.2. Another strong relationship emerged for health satisfaction, b = -.36, se = .04, Z = 8.4. Another negative relationship was observed for work, b = -.26, se = .04, Z = 6.4, reflecting the difference between studying and working. Age was also a negative predictor of housing satisfaction, b = -.10, se = .04, Z = 2.8, recreation satisfaction, b = -.15, se = .05, Z = 3.4, financial satisfaction, b = -.10, se = .05, Z = 2.1, and friendship satisfaction, b = -.09, se = .04, Z = 2.1. In short, age was a negative predictor of satisfaction with al life domains even after controlling for the effects of age on cheerfulness.

The only positive effect of age was an increase in conscientiousness, b = .15, se = .04, Z = 3.7, which is consistent with the personality literature (Roberts, Walton, & Viechtbauer, 2006). However, the indirect positive effect on life-satisfaction is small, b = .04

In conclusion, the present results replicate that well-being decreases from young adulthood to middle age. The effect is mainly explained by a decrease in cheerfulness and decreasing satisfaction with a broad range of life domains. The only exception was a positive effect on romantic satisfaction. These results have to be interpreted in the context of the specific sample. Younger participants were students. It is possible that young adults who already join the workforce have lower well-being than students. The higher romantic satisfaction for parents may also be due to the recruitment of parents who remained married with children. Singles and divorced middle-aged individuals show lower life-satisfaction. The fact that age effects were fully mediated shows that studies of age and well-being can benefit from the inclusion of personality measures and the measurement of domain satisfaction (McAdams et al., 2012).

How to Build a Monster Model of Well-Being: Part 6

The first five parts of this series built a model that related the Big Five personality traits as well as the depressiveness facet of neuroticism and the cheerfulness facet of extraversion to well-being. In this model, well-being is conceptualized as a weighted average of satisfaction with life domains and experiences of happiness and sadness (Part 5).

Part 6 adds sex/gender to the model. Although gender is a complex construct, most individuals identify as either male or female. As sex is frequently assessed as a demographic characteristic, the simple correlations of sex with personality and well-being are fairly well known and were reviewed by Diener et al. (1999).

A somewhat surprising finding is that life-satisfaction judgments show hardly any sex differences. Diener et al. (1999) point out that this finding seems to be inconsistent with findings that women report higher levels of neuroticism (neuroticism is a technical term for a disposition to experience more negative affects and does not imply a mental illness), negative affect, and depression. Accordingly, gender could have a negative effect on well-being that is mediated by neuroticism and depressiveness. To explain the lack of a sex difference in well-being, Diener et al. proposed that women also experience more positive emotions. Another possible mediator is agreeableness. Women consistently score higher in agreeableness and agreeableness is a positive predictor of well-being. Part 5 showed that most of the positive effect of agreeableness was mediated by cheerfulness. Thus, agreeableness may partially explain higher levels of cheerfulness for women. To my knowledge, these mediation hypotheses have never been formally tested in a causal model.

Adding sex to the monster model is relatively straightforward because sex is an exogeneous variable. That is causal paths can originate from sex, but no causal path can be pointed at sex. After all, we know that sex is determined by the genetic lottery at the moment of conception. It is therefore possible to add sex as a cause to all factors in the model. Despite adding all causal pathways, model fit decreased a bit, chi2(1432) = 2068, CFI = .976, RMSEA = .018. The main reason for reduced fit would be that sex predicts some of the unique variances in individual indicators. Inspection of modification indices showed that sex was related to higher student ratings of neuroticism and lower ratings of neuroticism by mothers’ as informants. While freeing these parameters improved model fit, the effect on sex differences in neuroticism were opposite. Assuming (!) that mothers’ underestimate neuroticism, increased sex differences in neuroticism from d = .69, se = .07 to d = .81, se = .07. Assuming that students’ overestimate neuroticism resulted in a smaller sex difference of d = .54, se = .08. Thus, the results suggest that sex differences in neuroticism are moderate to large (d = .5 to .8), but there is uncertainty due to some rating biases in ratings of neuroticism. A model that allowed for both biases had even better fit and produced the compromise effect size estimate of d = .67, se = .08. Overall fit was now only slightly lower than for the model without sex, chi2(1430) = 2024, CFI = .978, RMSEA = .017. Figure 2 shows the theoretically significant direct effects of sex with effect sizes in units of standard deviations (Cohen’s d).

The model not only replicated sex differences in neuroticism. It also replicated sex differences in agreeableness, although the effect size was small, d = .29, se = .08, Z = 3.7. Not expected was the finding that women also scored higher in extraversion, d = .38, se = .07, Z = 5.6, and conscientiousness, d = .36, se = .07, Z = 5.0. The only life domain with a notable sex difference was romantic relationships, d = -.41, se = .08, Z = 5.4. The only other statistically significant difference was found for recreation, d = -.19, se = .08, Z = 2.4. Thus, life domains do not contribute substantially to sex differences in well-being. Even the sex difference for romantic satisfaction is not consistently found in studies of marital satisfaction.

The model indirect results replicated the finding that there are no notable sex differences in life-satisfaction, total effect d = -.07, se = .06, Z = 1.1. Thus, tracing the paths from sex to life-satisfaction provides valuable insights into the paradox that women tend to have higher levels of neuroticism, but not lower life-satisfaction.

Consistent with prior studies, women had higher levels of depressiveness and the effect size was small, d = .24, se = .08, Z = 3.0. The direct effect was not significant, d = .06, se = .08, Z = 0.8. The only positive effect was mediated by neuroticism, d = .42, se = .06, Z = 7.4. Other indirect effects reduced the effect of sex on depressiveness. Namely, women’s higher conscientiousness (in this sample) reduced depressiveness, d = -.14, as did women’s higher agreeableness, d = -.06, se = .02, Z = 2.7, and women’s higher extraversion, d = -.04, se = .02, Z = 2.4. These results show the problem of focusing on neuroticism as a predictor of well-being. While neuroticism shows a moderate to strong sex difference, it is not a strong predictor of well-being. In contrast, depressiveness is a stronger predictor of well-being, but has a relatively small sex difference. This small sex difference partially explains why women can have higher levels of neuroticism without lower levels of well-being. Men and women are nearly equally disposed to suffer from depression. Consistent with this finding, men are actually more likely to commit suicide than women.

Consistent with Diener et al.’s (1999) hypothesis, cheerfulness also showed a positive relationship with sex. The total effect size was larger than for depressiveness, d = .50, se = .07, Z = 7.2. The total effect was partially explained by a direct effect of sex on cheerfulness, d = .20, se = .06, Z = 3.6. Indirect effects were mediated by extraversion, d = .27, se = .05, Z = 5.8, agreeableness d = .11, se = .03, Z = 3.6, and conscientiousness, d = .05, se = .02, Z = 3.2. However, neuroticism reduced the effect size by d = -.12, se = .03, Z = 4.4.

The effects of gender on depressiveness and cheerfulness produced corresponding differences in experiences of NA (sadness) and PA (happiness), without additional direct effects of gender on the sadness or happiness factors. The effect on happiness was a bit stronger, d = .35, se = .08, Z = 4.6 than the effect on sadness, d = .28, se = .07, Z = 4.1.


In conclusion, the results provide empirical support for Diener et al.’s hypothesis that sex differences in well-being are small because women have higher levels of positive affect and negative affect. The relatively large difference in neuroticism is also deceptive because neuroticism is not a direct predictor of well-being and gender differences in depressiveness are weaker than gender differences in neuroticism or anxiety. In the present sample, women also benefited from higher levels of agreeableness and conscientiousness that are linked to higher cheerfulness and lower depressiveness.

The present study also addresses concerns that self-report biases may distort gender differences in measures of affect and well-being (Diener et al., 1999). In the present study, well-being of mothers and fathers was not just measured by their self-reports, but also by students’ reports of their parents’ well-being. I have also asked students in my well-being course whether their mother or father has higher life-satisfaction. The answers show pretty much a 50:50 split. Thus, at least subjective well-being does not appear to differ substantially between men and women. This blog post showed a theoretical model that explains why men and women have similar levels of well-being.

Continue here to Part 7.

How to Build a Monster Model of Well-Being: Part 5

This is Part 5 of the blog series on the monster model of well-being. The first parts developed a model of well-being that related life-satisfaction judgments to affect and domain satisfaction. I then added the Big Five personality traits to the model (Part 4). The model confirmed/replicated the key finding that neuroticism has the strongest relationship with life-satisfaction, b ~ .3. It also showed notable relationships with extraversion, agreeableness, and conscientiousness. The relationship with openness was practically zero. The key novel contribution of the monster model is to trace the effects of the Big Five personality traits on well-being. The results showed that neuroticism, extraversion, and agreeableness had broad effects on various life domains (top-down effects) that mediated the effect on global life-satisfaction (bottom-up effect). In contrast, conscientiousness was only instrumental for a few life domains.

The main goal of Part 5 is to examine the influence of personality traits at the level of personality facets. Various models of personality assume a hierarchy of traits. While there is considerable disagreement about the number of levels and the number of traits on each level, most models share a basic level of traits that correspond to traits in the everyday language (talkative, helpful, reliable, creative) and a higher-order level that represents covariations among basic traits. In the Five factor model, the Big Five traits are five independent higher-order traits. Costa and McCrae’s influential model of the Big Five recognizes six basic-level traits called facets for each of the Big Five traits. Relatively few studies have conducted a comprehensive examination of personality and well-being at the facet level (Schimmack, Oishi, Furr, & Funder, 2004). A key finding was that the depressiveness facet of neuroticism was the only facet with unique variance in the prediction of life-satisfaction. Similarly, the cheerfulness facet of extraversion was the only extraversion facet that predicted unique variance in life-satisfaction. Thus, the Mississauga family study included measures of these two facets in addition to the Big Five items.

In Part 5, I add these two facets to the monster model of well-being. Consistent with Big Five theory, I allowed for causal effects of Extraversion on Cheerfulness and from Neuroticism to Depressiveness. Strict hierarchical models would assume that each facet is related to only one broad factor. However, in reality basic-level traits can be related to multiple higher-order factors, but not much attention has been paid to secondary loadings of the depressiveness and cheerfulness facets on the other Big Five factors. In one study that controlled for evaluative bias, I found that depressiveness had a negative loading on conscientiousness (Schimmack, 2019). This relationship was confirmed in this dataset. However, additional relations improved model fit. Namely, cheerfulness was related to lower neuroticism and higher agreeableness and depressiveness was related to lower extraversion and agreeableness. Some of these relations were weak and might be spurious due to the use of short three-item scales to measure the Big Five.

The monster model combines two previous mediation models that link the Big Five personality traits to well-being. Schimmack, Diener, and Oishi (2002) proposed that affective experiences mediate the effects of extraversion and neuroticism. Schimmack, Oishi, Furr, and Funder (2004) suggested that the Depressiveness and Cheerfulness facets mediate the effects of Extraversion and Neuroticism. The monster model proposes that extraversion’s effect is mediated by trait cheerfulness which influences positive experiences, whereas neuroticism’s effect is mediated by trait depressiveness which in turn influences experiences of sadness.

When this model was fitted to the data, depressiveness and cheerfulness fully mediated the effect of extraversion and neuroticism. However, extraversion became a negative predictor of well-being. While it is possible that the unique aspects of extraversion that are not shared with cheerfulness have a negative effect on well-being, there is little evidence for such a negative relationship in the literature. Another possible explanation for this finding is that cheerfulness and positive affect (happy) share some method variance that inflates the correlation between these two factors. As a result, the indirect effect of extraversion is overestimated. When this shared method variance is fixed to zero and extraversion is allowed to have a direct effect, SEM will use the free parameter to compensate for the overestimation of the indirect path. The ability to model shared method variance is one of the advantages of SEM over mediation tests that rely on manifest variables and assume perfect measurement of constructs. Figure 1 shows the correlation between measures of trait PA (cheerfulness) and experienced PA (happy) as a curved arrow. A similar shared method effect was allowed for depressiveness and experienced sadness (sad), although it turned out be not significant.

Exploratory analysis showed that cheerfulness and depressiveness did not fully mediate all effects on well-being. Extraversion, agreeableness, and conscientiousness had additional direct relationships on some life-domains that contribute to well-being. The final model remained good overall fit and modification indices did not show notable additional relationships for the added constructs, chi2(1387) = 1914, CFI = .980, RMSEA = .017.

The standardized model indirect effects were used to quantify the effect of the facets on well-being and to quantify indirect and direct effects of the Big Five on well-being. The total effect of Depressiveness was b = -.47, Z = 8.8. About one-third of this effect was directly mediated by sadness, b = -.19. Follow-up research needs to examine how much of this relationship might be explained by risk factors for mood disorders as compared to normal levels of depressive moods. Valuable new insights can emerge from integrating the extensive literature on depression and life-satisfaction. The remaining effects were mediated by top-down effects of depressiveness on domain satisfactions (Payne & Schimmack, 2020). The present results show that it is important to control for these top-down effects in studies that examine the bottom-up effects of life domains on life-satisfaction.

The total effect of cheerfulness was as large as the effect of depressiveness, b = .44, Z = 6.6. Contrary to depressiveness, the indirect effect through happiness was weak, b = .02, Z = 0.6 because happy did not make a significant unique contribution to life-satisfaction. Thus, all of the effects were mediated by domain satisfaction.

In sum, the results for depressiveness and cheerfulness are consistent with integrated bottom-up-top-down models that postulate top-down effects of affective dispositions on domain satisfaction and bottom-up effects from domain satisfaction to life-satisfaction. The results are only partially consistent with models that assume affective experiences mediate the effect (Schimmack, Diener, & Oishi, 2002).

The effect of neuroticism on well-being, b = -.36, Z = 10.7, was fully mediated by depressiveness, b = -.28 and cheerfulness, b = -.08. Causality is implied by the assumption that neuroticism is a common cause of specific dispositions for anger, anxiety, depressiveness and other negative affects that is made in hierarchical models of personality traits. If this assumption were false, neuroticism would only be a correlate of well-being and it would be even more critical to focus on depressiveness as the more important personality trait related to well-being. Thus, future research on personality and well-being needs to pay more attention to the depressiveness facet of neuroticism. Too many short neuroticism measures focus exclusively or predominantly on anxiety.

Following Costa and McCrae (1980), extraversion has often been considered a second important personality trait that influences well-being. However, quantitatively the effect of extraversion on well-being is relatively small, especially in studies that control for shared method variance. The effect size for this sample was b = .12, a statistically small effect, and a much smaller effect than for its cheerfulness facets. The weak effect was a combination of a moderate positive effect mediated by cheerfulness, b = .32, and a negative effect that was mediated by direct effects of extraversion on domain satisfactions, b = -.23. These results show how important it is to examine the relationship between extraversion and well-being at the facet level. Whereas cheerfulness explains why extraversion has positive effects on well-being, the relationship of other facets with well-being require further investigation. The present results make it clear that a simple reason for positive relationships between extraversion and well-being is the cheerfulness facet. The finding that individuals with a cheerful disposition evaluate their lives more positively may not be surprising or may even appear to be trivial, but it would be a mistake to omit cheerfulness from a causal theory of well-being. Future research needs to uncover the determinants of individual differences in cheerfulness.

Agreeableness had a moderate effect on well-being, b = .21, Z = 5.8. Importantly, the positive effect of agreeableness was fully mediated by cheerfulness, b = .17 and depressiveness, b = .09, with a small negative direct effect on domain satisfactions, b = -.05, which was due to lower work satisfaction for individuals high in agreeableness. These results replicate Schimmack et al.’s (2004) findings that agreeableness was not a predictor of life-satisfaction, when cheerfulness and depressiveness were added to the model. This finding has important implications for theories of well-being that see a relationship between morality, empathy, and prosociality and well-being. The present results do not support this interpretation of the relationship between agreeableness and well-being. The results also show the importance of taking second order relationships more seriously. Hierarchical models consider agreeableness to be unrelated to cheerfulness and depressiveness, but simple hierarchical models do not fit actual data. Finally, it is important to examine the causal relationship between agreeableness and affective facets. It is possible that cheerfulness influences agreeableness rather than agreeableness influencing cheerfulness. In this case, agreeableness would be a predictor but not a cause of higher well-being. However, it is also possible that an agreeable disposition contributes to a cheerful disposition because agreeableness people may be more easily satisfied with reality. In any case, future studies of agreeableness and related traits and well-being need to take potential relationships with cheerfulness and depressiveness into account.

Conscientiousness also has a moderate effect on well-being, b = .19, Z = 5.9. A large portion of this effect is mediated by the Depressiveness facet of Neuroticism, b = .15. Although a potential link between Conscientiousness and Depressiveness is often omitted from hierarchical models of personality, neuropsychological research is consistent with the idea that conscientiousness may help to regulate negative affective experiences. Thus, this relationship deserves more attention in future research. If causality were reversed, conscientiousness would have only a trivial causal effect on well-being.

In short, adding cheerfulness and depressiveness facets to the model provided several new insights. First of all, the results replicated prior findings that these two facets are strong predictors of well-being. Second, the results showed that Big Five predictors are only weak unique predictors of well-being when their relationship with Cheerfulness and Depressiveness is taken into account. Omitting these important predictors from theories of well-being is a major problem of studies that focus on personality traits at the Big Five level. It also makes theoretical sense that cheerfulness and depressiveness are related to well-being. These traits influence the emotional evaluation of people’s lives. Thus, even when objective life circumstances are the same, a cheerful individual is likely to look at the bright side and see the their lives with rose colored glasses. In contrast, depression is likely to color live evaluations negatively. Longitudinal studies confirm that depressive symptoms, positive affect, and negative affect are influenced by stable traits (Anusic & Schimmack, 2016; Desai et al., 2012). Furthermore, twin studies show that shared genes contribute to the correlation between life-satisfaction judgments and depressive symptoms (Nes et al., 2013). Future research needs to examine the biopsychosocial factors that cause stable variation in dispositional cheerfulness and depressiveness that contribute to individual differences in well-being.

Continue here to Part 6.

The Race Implicit Association Test Is Biased

This is a preprint (not yet submitted to a journal) of a manuscript that examines the validity of the race IAT as a measure of in-group and out-group attitudes for African and White Americans. We show that research on intergroup relationships and attitudes benefits from insights (insights by means of being inside the experience) by African Americans that are often ignored by White psychologists. Data and Syntax are here (https://osf.io/rvfz8/)

The Race Implicit Association Test is Biased: Most African Americans Have Positive Attitudes Towards Their In-Group

Ulrich Schimmack
University of Toronto Mississauga

Alicia Howard
Music Wellbeing


Explicit ratings of attitudes show a preference for the in-group for African Americans and White participants. However, the average score of African Americans on the race Implicit Association Test is close to zero. This finding has been interpreted as evidence that many African Americans have unconsciously internalized negative attitudes towards their group. We conducted a multi-method study of this hypothesis with various implicit measures (Single-Target IAT, Evaluative Priming, Affective Misattribution Procedure) that distinguish between in-group and out-group attitudes. Our main finding is that African Americans have positive attitudes towards their in-group on a latent factor that reflects the valid variance across measures. In addition, the race IAT scores of African Americans are unrelated to in-group and out-group attitudes. Moreover, White American’s race IAT scores are biased and exaggerate in-group preferences. These findings are discussed in terms of the unique aspects of the race IAT that may activate cultural stereotypes. The results have ethical implications for the practice of providing individuals with feedback about their unconscious biases with an invalid measure. It is harmful to African Americans to suggest that they unconsciously dislike African Americans and to exaggerate prejudice of White Americans. Ongoing discrimination may be better explained by explicit prejudice of a minority of White Americans than pervasive, uncontrollable implicit biases of most White Americans.


With 1,277 citations in WebOfScience, Jost, Banaji, and Nosek’s (2004) article “A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo” is easily the most cited article in the journal Political Psychology. The second most cited article has less than half the number of citations (523 citations). The abstract of this influential article states the authors’ main thesis clearly and succinctly. They postulate a general motive to support the existing social order. This motive contributes to internalization of inferiority of disadvantaged groups. Most important for this article is the claim that this internalization of inferiority is “observed most readily at an implicit, nonconscious level of awareness” (p. 881).

The theory is broadly applied to a wide range of stigmatized groups and its validity has to be evaluated for each group individually. Our focus is on the African American community. Jost et al. (2004) assume that system justification theory is applicable to African Americans because they show different evaluations of their in-group on explicit measures and on the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998). On explicit measures, like the feeling thermometer, African Americans show higher in-group favoritism than White Americans (standardized mean differences d = .8 vs. .6). However, IAT scores show greater in-group favoritism for White Americans than for African Americans (d = .9 vs. 0).  IAT scores close to zero for African Americans have been interpreted as evidence that “sizable proportions of members of disadvantaged groups – often 40% to 50% or even more exhibit implicit (or indirect) biases against their own group and in favor of more advantaged groups” (Jost, 2019, p. 277).

This pattern of results is based on large samples and has been replicated in several studies. Thus, we are not questioning the empirical facts. Our concern is that Jost and colleagues misinterpret these results. In the early 2000s, it was common to assume that explicit and implicit group evaluations reflect different constructs (Nosek, Greenwald, & Banaji, 2005). This dual-attitude model allows for different evaluations of the in-group at a conscious and an unconscious level. Evidence for this model rested mostly on the finding that race IAT scores and self-ratings are only weekly correlated, r ~ .2 (Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005). However, these studies did not correct for measurement error. After correcting for measurement error, the correlation increases to r = .8 (Schimmack, 2021a). The race IAT also has little incremental predictive validity over explicit measures (Schimmack, 2021b). This new evidence renders it less likely that explicit and implicit attitudes can diverge. In fact, there exists no evidence that attitudes are hidden from consciousness. Thus, there may be an alternative explanation for African Americans’ scores on the race IAT.

White Psychologists’ Theorizing about African Americans

Before we propose an alternative explanation for African Americans’ neutral scores on the race IAT, we would like to make the observation that Jost et al.’s (2004) claims about African Americans follow a long tradition of psychological research on African Americans by mostly White psychologists. Often this research ignores the lived experience of African Americans, which often leads to false claims (cf. Adams, 2010). For example, since the beginning of psychology, White psychologists assumed that African Americans have low self-esteem and proposed several theories for this seemingly obvious fact. However, in 1986 Rosenberg ironically pointed out that “everything stands solidly in support of this conclusion except the facts.” Since then, decades of research have shown that African Americans have the same or even higher self-esteem than White Americans (Twenge & Crocker, 2002). Just like White theorists’ claims about self-esteem, Jost et al.’s claims about African Americans’ unconscious are removed from African Americans’ own understanding of their culture and identity and disconnected from other findings that are in conflict with the theory’s predictions. The only empirical support for the theory is the neutral score of African Americans on the race IAT.

African American’s Resilience in a Culture of Oppression

We are skeptical about the claim that most African-Americans secretly favor the out-group based on the lived experience of the second author. Alicia Howard is an African-American from a predominantly White, small town in Kentucky. She grew up surrounded by a large family and attended a Black church. Her identity was shaped by role-models from this Black in-group and not by some idealized abstract image of the White out-group. Also, contrary to the famous doll-studies from the 1960s, she had White and Black dolls and got excited when a new Black doll came out. Alicia studied classical music at the historically Black college and university Kentucky State University. Even though her admired composers like Rachmaninov were White, she looked up to Black classical musicians like Andre Watts, Kathleen Battle, Leontyne Price, and Jesse Norman as role models. It is of course possible that her experiences are unique and not representative of African-Americans. However, no one in her family or among her Black friends showed signs that they preferred to be White or liked White people more than Black people. In small towns, the lives of Black and White people are also more similar than in big cities. Therefore, the White out-group was not all that different from the Black in-group. Although there are Black individuals who seem to struggle with their Black identity, there are also White people who suffer from White guilt or assume a Black identity for other reasons. Thus, from an African American perspective, system justification theory does not seem to characterize most African Americans’ attitudes to their in-group.

The Race IAT Could Be Biased

We are not the first to note that the race IAT may not be a pure measure of attitudes (Olson & Fazio, 2004). The nature of the task may activate cultural stereotypes that are normally not activated when African Americans interact with each other. As a result, the mean score of African Americans on the race IAT may be shifted towards a pro-White bias because negative cultural stereotypes persist in US American culture. The same influence of cultural stereotypes would also enhance the pro-White bias for White Americans. Thus, an alternative explanation for the greater in-group bias for White Americans than for African Americans on the race IAT is that attitudes and cultural stereotypes act together for White Americans, whereas they act in opposite directions for African Americans.

One way to test this hypothesis is to examine in-group biases with alternative implicit measures that do not activate stereotypes. The most widely used alternative implicit measures are the Affective Misattribution Procedure (AMP; Payne, Cheng, Govorun, & Stewart, 2005) and the evaluative priming task (EPT, Fazio, Jackson, Dunton, & Williams, 2005). Only recently it has been noted that these implicit measures produce different results (Teige-Mocigemba, Becker, Sherman, Reichardt, & Klauer, 2017). A study in the United States, examined the differences between African American and White respondents on three implicit measures (Figure 1, Bar-Anan & Nosek, 2014).

Known-group differences are much more pronounced for the race IAT than the other two implicit tasks. The authors interpret this finding as evidence that the race IAT has higher validity. That is, under the assumption that (mostly) White participants have a strong preference for their in-group, a positive mean is predicted, and the more positive the mean is, the more valid a measure is. However, alternative explanations are possible. One alternative explanation is that only the race IAT activates cultural stereotypes and produces a high pro-White mean as a result. In contrast, the other tasks are better measures of attitudes and the results show that prejudice is much less pronounced than the race IAT suggests. That is, the race IAT is biased because it activates cultural stereotypes that are not automatically activated with other implicit tasks.

Another limitation of the race IAT is that preferences for the in-group and the out-group are confounded. In contrast, the other two tasks can be scored separately to obtain measures of the strength of preferences for the in-group and the out-group. This is particularly helpful to make sense of the neutral score of African Americans on the race IAT. One explanation for a weaker in-group bias is simply that African Americans are less biased against the out-group than White Americans. Thus, a better test of African Americans’ attitudes towards their own group is to examine how positive or negative African American’s responses are to African American stimuli.

In short, published studies reveal that different implicit tasks produce different results and that the race IAT shows stronger pro-White biases than other tasks. However, it has not been systematically explored whether this finding reveals higher or lower validity of the race IAT. We used Bar-Anan and Nosek’s (2014) data to explore this question.



The data are based on a voluntary online sample. The total sample size is large (N = 23,413).  However, participants completed only some of the tasks that included implicit measures of political orientation and self-esteem. Table 1 shows the number of African American and White participants for six measures.


Race IAT.  The race IAT is the standard Implicit Association Test, although the specific stimuli that represent the African American group and the White American group were different. However, this does not appear to have influenced responses as seen by similar means for African American and White American participants.  The race IAT was scored so that higher values represented a pro-White bias for White participants and a pro-Black bias for Black participants.

Single Target IAT. The single-target IAT (ST-IAT) is a variation of the race IAT. The main difference is that participants only have to classify one racial group along with classifications of positive and negative stimuli. As a result, the ST-IAT reflects only evaluations of one group and provides distinct information about evaluations of the in-group and out-group. It is particularly interesting how Black participants perform on the in-group ST-IAT with Black targets. System justification theory predicts a score close to zero that would reflect an over all neutral attitude and at least 50% of participants who may hold negative views of the in-group. 

Evaluative Priming Task. The Evaluative Priming Task (EPT) was developed by Fazio et al. (1995). In a practice block, participants classified words as “good” or “bad.” In the next three blocks, target stimuli were primed with pictures of African American and White Americans. In-group bias was the response time to same-group primes for negative words minus response times to same-group primes for positive words. Out-group bias was the response time to other-group primes for negative words minus response times to other-group primes for positive words.

Affective Misattribution Procedure. The Affective Misattribution was invented by Payne et al. (2005). Pictures of African Americans or White Americans are quickly followed by a Chinese character and a mask. Participants are instructed to rate the Chinese character as more or less pleasant than the average Chinese character. They were instructed not to let the pictures influence their evaluation of the target stimuli. The in-group score was the percentage of more pleasant responses after an in-group picture. The out-group score was the percentage of more pleasant responses after an out-group picture.

Feeling Thermometer. Self-reports of in-group and out-group attitudes were measured with feeling thermometers. Participants rated how warm or cold they feel toward the in-group and the out-group on an 11-point scale ranging from 0 = coldest feelings to 10 = warmest feelings.

For all measures, participants scores were divided by the standard deviation so that means can be interpreted as standardized effect sizes assuming that a mean of zero reflects a neutral attitude, positive scores reflect positive attitudes, and negative scores reflect negative attitudes.


The data were analyzed using structural equation modeling with MPLUS8.2 (Muthen & Muthen (2017), A multi-group model was specified with African Americans and White Americans as separate groups. The model was developed iteratively using the data. Thus, all results are exploratory and require validation in a separate sample. Due to the small number of Black participants, it was not possible to cross-validate the model with half of the sample. Moreover, tests of group differences have low power and a study with a larger sample of African Americans is needed to test equivalence of parameters. Cherry picking of data, models, and references undermines psychological science. To avoid this problem, we also constructed a model that assumes some implicit measures are biased and inflate in-group attitudes of African Americans. To identify the means of the latent in-group and out-group factors, we chose the single-target IAT because it shows the least positive attitudes of African Americans towards their in-group. We then freed other parameters to maximize model fit. We then freed other parameters to maximize model fit. The data, input syntax, and the full outputs have been posted online (https://osf.io/rvfz8/).

Preferred Model

Overall fit of the final model meets standard fit criteria (RMSEA < .06, CFI > .95), CFI (78) = 133.37, RMSEA = .012, 90%CI = .009 to .016, CFI = .981. However, models with low coverage (many missing data) may overestimate model fit. A follow-up study that administers all tasks to all participants should be conducted to provide a stronger test of the model. Nevertheless, the model is parsimonious and there were no modification indices greater than 20. This suggests that there are no major discrepancies between the model and the data.

Figure 2 shows a measurement of attitudes towards the in-group and out-group. The key unobserved variables in this model are the attitude towards the in-group factor (ig) and the attitude towards the out-group factor (og). Each construct is measured with four indicators, namely scores on the single-target IAT (satig/satog), scores on the evaluative priming task (epig, epog), scores on the affective misattribution procedure (ampig/ampog), and scores on the explicit feeling thermometer ratings (thermoig/thermoog). For ease of interpretation, Figure 2 shows standardized coefficients that range from -1 to 1.

The first finding is that loadings of the measures on the IG factor (.3-.4) and on the outgroup factor (.4) are modest. They suggest that less than 20% of the variance in a single measure is valid variance. However, the model clearly identified latent factors that show individual differences in attitudes towards in-group and out-group for Black and White Americans. The second noteworthy finding is that loadings for African Americans and White Americans were similar. Thus, the multi-method measurement model was able to identify variation in in-group and out-group attitudes for both groups.

A third finding is that for White participants.54^2 = 29% of the variance in race IAT reflects attitudes towards African Americans (i.e., prejudice). This is a bit higher than previous estimates, which were in the 10% to 20% range (Schimmack, 2021). However, the lower limit of the 95%CI overlapped with this range of possible values, .43^2 = 18%.

Most important is the finding that race IAT scores for African Americans were unrelated to the attitudes towards the in-group and out-group factors. Thus, scores on the race IAT do not appear to be valid measures of African Americans’ attitudes. This finding has important implications for Jost et al.’s (2021) reliance on race IAT scores to make inferences about African Americans’ unconscious attitudes towards their in-group. This interpretation assumed that race IAT scores do provide valid information about African American’s attitudes towards the in-group, but no evidence for this assumption was provided. The present results show 20 years later that this fundamental assumption is wrong. The race-IAT does not provide information about African Americans’ attitudes towards the in-group as reflected in other implicit measures.

An additional interesting finding was that in-group and out-group attitudes were unrelated. This suggests that prejudice does not enhance pro-White attitudes for White participants. It also suggests that Black pride does not have to devalue the White outgroup.

Finally, the model shows that three methods show strong method variance. All three methods measured in-group and out-group attitudes within a single experimental block. The main difference is the single-target IAT that is conducted once with one target (Black) and once with the other target (White). Separating the assessment of in-group and out-group attitudes for the other tasks might reduce the amount of systematic measurement error. However, less systematic measurement error does not seem to translate into more valid variance as the single-target IAT was not more valid than the other measures. The results for the commonly used feeling thermometer are particularly noteworthy. While this measure shows some modest validity, the present results also show that this single-item measure has poor psychometric properties. An important goal for future research is to develop more valid measures of attitudes towards in-groups and out-groups. Until then, researchers should use a multi-method approach.  

Figure 3 shows the model for the means. While standardized coefficients are easier to interpret for the measurement model, means are easier to interpret in the units of the measures, which were scaled so that means can be interpreted as Cohen’s d values.

The most important finding is that African Americans’ mean for the in-group factor is positive, d = 1.07, 95%CI = 0.98 to 1.16. Thus, the data provide no support for the claim that most African Americans evaluate their in-group negatively. With a normal distribution centered at 1.07, only 14% of African Americans would have a negative (below 0) attitude towards the in-group. White Americans also show a positive evaluation of the in-group, but to a lesser extent, d = 0.62; 95%CI = 0.58, 0.66. The confidence intervals are tight and clearly do not overlap, and constraining these two coefficients to be equal reduced model fit, chi2(79) = 228.43, Δchi2(1) = 95.06, p = 1.85e-22.  Thus, this model suggests that African Americans have an even more positive attitude towards their in-group than White Americans.

As expected, out-group attitudes are less positive than in-group attitudes for both groups. Also expected was the finding that out-group attitudes of African Americans, d = .42, 95%CI , are more favorable than out-group attitudes of White Americans, d = .20, 95%CI. However, even White Americans’ out-group attitudes are on average positive. This finding is in marked contrast to the common finding with the race IAT that most White Americans show a pronounced pro-White bias, which has often been interpreted as evidence of widespread prejudice. However, this interpretation is problematic for two reasons. First, it confounds in-group and out-group attitudes. Prejudice is defined as White American’s attitude towards African Americans. The race IAT is not a direct measure of prejudice because it measures relative preferences. Of course, in-group favoritism alone can lead to discrimination and racial disparities when one group is dominant, but these consequences can occur without actual prejudice against African Americans. The present results suggest that African American also have an in-group bias. Thus, it is important to distinguish between in-group favoritism, which applies to both groups, from prejudice which applies uniquely to White Americans towards African Americans.

The bigger problem for the race IAT is that White Americans’ scores on the race IAT are systematically biased towards a pro-White score, d = .78, whereas African Americans’ scores are only slightly biased towards a pro-Black score, d = -.19. This finding shows that IAT scores provide misleading information about the amount of in-group favoritism. Thus, support for the system justification theory rests on a measurement artifact.

Alternative Model

It is possible that our modeling decisions exaggerated the positivity of African Americans’ in-group attitudes. To address this concern, we tried to find an alternative model that fits the data with the lowest amount of African American’s in-group bias. This alternative model fit the data as well as our preferred model, CFI (77) = 134.24, RMSEA = .013, 90%CI = .009 to .016, CFI = .980. Thus, the data cannot distinguish between these two models. The covariance structure was identical. Thus, we only present the means structure of the model (Figure 4).

The main difference between the models is that African Americans’ attitudes towards the ingroup are less favorable (d = 1.07 vs. d = .54). The discrepancy is explained by the assumptions that African Americans have a positive bias on the feeling-thermometer and by assuming that African Americans’ responses to White targets on the AMP are negatively biased (ampog = -.72). The most important finding is that African Americans’ in-group attitudes remain positive, d = .54, although they are now slightly less favorable than White Americans’ in-group attitudes, d = .62.  

Proponents of system justification theory might argue that attitudes towards the in-group have to be evaluated in relative terms. Viewed from this perspective, the results still show relatively more in-group favoritism for White Americans, d = .62 – .20 = .42 than African Americans, d = .54 – .40 = .14. However, out-group attitudes contribute more to this difference, d = .40 = .20 = .20, than in-group differences, d = .62 – .54 = .08. Thus, one reason for the difference in relative preferences is that African Americans attitudes towards Whites are more positive than White Americans’ attitudes towards African Americans. It would be a mistake to interpret this difference in evaluations of the out-group as evidence that African Americans have internalized negative stereotypes about their in-group.

The alternative model does not alter the fact that scores on the race IAT are biased and provide misleading information about in-group and out-group attitudes.


After its introduction in 1998, the Implicit Association Test has been quickly accepted as a valid measure of attitudes that individuals are unwilling or unable to report on self-report measures. Mean scores of White Americans were interpreted as evidence that prejudice is much more widespread and severe than self-report measures suggest. Mean scores of African Americans were interpreted as evidence of unconscious self-loathing. The present results suggest that millions of African American and White visitors of the Project Implicit website were given false feedback about their attitudes. For White Americans, the race IAT does appear to reflect individual differences in out-group attitudes (prejudice). However, the scoring of the IAT in terms of deviations from a value of zero is invalid because the mean is biased towards pro-White scores. Even the amount of valid variation is modest and insufficient to provide individualized feedback.

Implications for African American’s In-Group and Out-Group Attitudes

Our investigation started with the surprising suggestions that African Americans are motivated to justify racism and are supposed to have internalized negative stereotypes and attitudes towards their group. This view of African Americans is detached from their history and evidence of high self-esteem among African Americans. The only evidence for this claim was the finding that African Americans do not show a strong in-group preference on the race IAT.

Our results suggest that this finding is due to the low validity of the race IAT as a measure of African Americans’ attitudes. African American’s race IAT scores were unrelated to their in-group attitudes and out-group attitudes as measured by other measures, including the single-target variant of the IAT.

This raises the question in which way the race IAT differs from other measures. We are not the first to suggest that the race IAT activates negative cultural stereotypes (Olson & Fazio, 2004). These stereotypes are known to African Americans and may influence their performance on the IAT, even if African Americans do not endorse these stereotypes and these stereotypes are rarely activated in real life. Thus, the mean close to zero may not reflect the fact that 50% of African Americans have negative attitudes towards their group. Rather, it is possible that the neutral score reflects a balanced influence of positive attitudes and negative stereotypes.

Another noteworthy difference between other implicit tasks and the race IAT is that other tasks rely on pictures of individual members to elicit a valenced response. In contrast, the race IAT focuses on the evaluation of the abstract category “Black.” It is possible that African Americans have more positive attitudes to (pictures of) members of the group than to the concept of being “Black,” which is a fuzzy category at best. Similarly, old people seem to have a negative attitude to the concept of being “old,” but this does not imply that they do not like old people. This has important implications for the predictive validity of the IAT. In everyday life, we encounter individuals and not abstract categories. Thus, even if the race IAT were a valid measure of attitudes towards abstract categories, it would be a weak predictor of actual behaviors.

In sum, the only empirical support for system justification theory was African Americans’ neutral score on the race IAT. We show that the race IAT lacks validity and that African Americans have positive attitudes towards their in-group on all other measures. We also find that they have positive attitudes towards the White outgroup. This has important implications for the assessment of racial attitudes of White participants. If most White participants have negative attitudes towards Black people and these attitudes consistently influence White Americans behaviors, African Americans would experience discrimination from most White Americans. In this case, we would expect negative attitudes towards the out-group. As the data show, this is not the case. This does not mean that discrimination is rare. Rather, it is possible that most acts of discrimination are committed by a relatively small group of White Americans (Campbell & Brauer, 2021).

Implications for White American’s In-Group and Out-Group Attitudes

Banaji and Greenwald’s (2013) popular book was largely responsible for claims that implicit bias is real, widespread, and explains racial discrimination. The book ends with several conclusions. Two conclusions are widely accepted among social psychologists and a majority of US Americans, namely Black disadvantage exists and racial discrimination at least partially contributes to this disadvantage. However, other conclusions were not generally accepted and were not clearly supported by evidence, namely attitudes have both reflective and automatic form, people are often unaware of their automatic attitudes, and implicit bias is pervasive, and implicit racial attitudes contribute to discrimination against Black Americans. The claim that implicit biases are widespread was based entirely on the finding that 75% of US Americans show a clear pro-White bias on the race IAT. The present results suggest that this finding is unique to the race IAT and not found with other implicit measures.

Once more, we are not the first to point out that scoring of the race IAT may have exaggerated the pervasiveness of racial biases among White Americans (Blanton et al., 2006, 2009, 2015; Oswald et al., 2013, 2015). However, so far this criticism has fallen on deaf ears and Project Implicit continues to provide individuals with feedback about their race IAT scores. Textbooks proudly point out that over 20 million people have received this feedback, as if this number says something about the validity of the test (Myers & Twenge, 2019).

When visitors might see a discrepancy between their self-views and the test scores, they are informed that this does not invalidate the test because it measures something that is hidden from self-knowledge. The present results suggest that many visitors of the Project Implicit website were given false feedback about their prejudices because even individuals without any negative attitudes towards African Americans end up with a pro-White bias on the race IAT.

This bias can co-exist with evidence that variation in race IAT scores shows some convergent validity with other explicit and implicit measures of individual differences in attitudes towards African Americans. However, variances and means are two independent statistical constructs, and valid variance does not imply that means are valid. Nosek and Bar-Anan (2014) argued that the race IAT is the most valid measure of attitudes because it shows the largest differences in scores between African Americans and White Americans. However, this argument is only valid, if we assume that random measurement error attenuates the differences on other measures. The present study directly tested this assumption and found no evidence for the assumption. Instead, we found that the larger differences between African Americans and White Americans reflects some systematic mean differences that are unique to the race IAT. As noted earlier, a plausible explanation for this systematic bias is that the race IAT activates stereotypes, whereas other measures are purer measures of attitudes.

We hope that our direct demonstration of bias will finally end the practice of providing visitors of the Project Implicit website with misleading information about the validity of the race IAT and misleading information about individuals’ prejudice. There is simply no evidence that prejudice is hidden from honest self-reflection or that such hidden biases are revealed by the race IAT (Schimmack, 2021).

Implications for Future Research

Although our article focuses on the race IAT, the results also have implications for the use and interpretation of the other measures. One advantage of the other measures is that they provide separate information about in-group and out-group attitudes because they avoid the pitting of one group against the other. However, these measures have other problems. Fast reactions to pictures of African Americans and White Americans reflect only first impressions without context. They are also influenced by affective reactions to other aspects such as gender, age, or attractiveness. Thus, these scores may not reflect other aspects of attitudes that are activated in specific contexts. Moreover, the means will depend heavily on the selection of individual pictures. Thus, a lot more work would need to be done to ensure that the picture sets are representative of the whole group. Finally, our results showed that none of the measures had high loadings on the attitude factors. Thus, a single measure has only modest validity.

Unfortunately, psychologists often do not carefully examine the psychometric properties of their measures. Instead, one measure is often arbitrarily chosen and treated as if it were a perfect measure of a construct. Even worse, a specific measure may be chosen from a set of measures because it showed the desired result (John, Loewenstein, & Prelec, 2012). To avoid these problems, we strongly urge intergroup relationship researchers to use a multi-method approach and to use formal measurement models to analyze their data (Schimmack, 2021). This approach will also produce better estimates of effect sizes that are attenuated by random and systematic measurement error.


Adams, P. E. (2010). Understanding the Different Realities, Experience, and Use of Self-Esteem Between Black and White Adolescent Girls. Journal of Black Psychology, 36(3), 255–276. https://doi.org/10.1177/0095798410361454

Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden biases of good people. New York, NY: Delacorte Press.

Bar-Anan, Y., & Nosek, B. A. (2014). A comparative investigation of seven indirect attitude measures. Behavior Research Methods, 46(3), 668–688. https://doi.org/10.3758/s13428-013-0410-6

Blanton, H., Jaccard, J., Gonzales, P. M., & Christie, C. (2006). Decoding the implicit association test: Implications for criterion prediction. Journal of Experimental Social Psychology, 42(2), 192–212. https://doi.org/10.1016/j.jesp.2005.07.003

Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: Reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567–582.

Blanton, H., Jaccard, J., Strauts, E., Mitchell, G., & Tetlock, P. E. (2015). Toward a meaningful metric of implicit prejudice. Journal of Applied Psychology, 100(5), 1468–1481. https://doi.org/10.1037/a0038379

Campbell, M. R., & Brauer, M. (2021). Is discrimination widespread? Testing assumptions about bias on a university campus. Journal of Experimental Psychology: General, 150(4), 756–777. https://doi.org/10.1037/xge0000983

Fazio, R. H., Jackson, J. R., Dunton, B. C., & Williams, C. J. (1995). Variability in automatic activation as an unobtrusive measure of racial attitudes: A bona fide pipeline? Journal of Personality and Social Psychology, 69(6), 1013–1027. https://doi.org/10.1037/0022-3514.69.6.1013

Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953

Jost, J. T. (2019). A quarter century of system justification theory: Questions, answers, criticisms, and societal applications. British Journal of Social Psychology, 58(2), 263–314. https://doi.org/10.1111/bjso.12297

Jost, J. T., Banaji, M. R., & Nosek, B. A. (2004). A Decade of System Justification Theory: Accumulated Evidence of Conscious and Unconscious Bolstering of the Status Quo. Political Psychology, 25(6), 881–919. https://doi.org/10.1111/j.1467-9221.2004.00402.x

Hofmann, W., Gawronski, B., Geschwendner, T., Le, H., & Schmitt, M. (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369–1385. doi:10.1177/0146167205275613

Muthén, L.K. and Muthén, B.O. (1998-2017). Mplus User’s Guide. Eighth Edition. Los Angeles, CA: Muthén & Muthén

Myers, D. & Twenge, J. (2019). Social psychology (13th edition). McGraw Hill.

Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2005). Understanding and Using the Implicit Association Test: II. Method Variables and Construct Validity. Personality and Social Psychology Bulletin, 31(2), 166–180. https://doi.org/10.1177/0146167204271418

Olson, M. A., & Fazio, R. H. (2004). Reducing the Influence of Extrapersonal Associations on the Implicit Association Test: Personalizing the IAT. Journal of Personality and Social Psychology, 86(5), 653–667. https://doi.org/10.1037/0022-3514.86.5.653

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105(2), 171–192. https://doi.org/10.1037/a0032734

Oswald, F. L., Mitchell, G., Blanton, H., Jaccard, J., & Tetlock, P. E. (2015). Using the IAT to predict ethnic and racial discrimination: Small effect sizes of unknown societal significance. Journal of Personality and Social Psychology, 108(4), 562–571. https://doi.org/10.1037/pspa0000023

Payne, B. K., Cheng, C. M., Govorun, O., & Stewart, B. D. (2005). An inkblot for attitudes: Affect misattribution as implicit measurement. Journal of Personality and Social Psychology, 89(3), 277–293. https://doi.org/10.1037/0022-3514.89.3.277

Rosenberg, M. (1986). Conceiving the self. Malabar, FL: Robert E. Krieger.

Schimmack, U. (2021a). The Implicit Association Test: A Method in Search of a Construct. Perspectives on Psychological Science, 16(2), 396–414. https://doi.org/10.1177/1745691619863798

Schimmack, U. (2021). Invalid Claims About the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm. Perspectives on Psychological Science, 16(2), 435–442. https://doi.org/10.1177/1745691621991860

Teige-Mocigemba, S., Becker, M., Sherman, J. W., Reichardt, R., & Christoph Klauer, K. (2017). The affect misattribution procedure: In search of prejudice effects. Experimental Psychology, 64(3), 215–230. https://doi.org/10.1027/1618-3169/a000364

Twenge, J. M., & Crocker, J. (2002). Race and self-esteem: Meta-analyses comparing Whites, Blacks, Hispanics, Asians, and American Indians and comment on Gray-Little and Hafdahl (2000). Psychological Bulletin, 128(3), 371–408. https://doi.org/10.1037/0033-2909.128.3.371

How to build a Monster Model of Well-being: Part 4

This is part 4 in a mini-series of blogs that illustrate the usefulness of structural equation modeling to test causal models of well-being. The first causal model of well-being was introduced in 1980 by Costa and McCrae. Although hundreds of studies have examined correlates of well-being since then, hardly any progress has been made in theory development. In 1984, Diener (1984) distinguished between top-down and bottom-up theories of well-being, but empirical tests of the different models have not settled this issue. The monster model is a first attempt to develop a causal model of well-being that corrects for measurement error and fits empirical data.

The first part (Part1) introduced the measurement of well-being and the relationship between affect and well-being. The second part added measures of satisfaction with life-domains (Part 2). Part 2 ended with the finding that most of the variance in global life-satisfaction judgments is based on evaluations of important life domains. Satisfaction in important life domains also influences the amount of happiness and sadness individuals experience, whereas positive affect had no direct effect on life-evaluations. In contrast, sadness had a unique negative effect on life-evaluations that was not mediated by life domains.

Part 3 added extraversion to the model. This was a first step towards a test of Costa and McCrae’s assumption that extraversion has a direct effect on positive affect (happiness) and no effect on negative affect (sadness). Without life domains in the model, the results replicated Costa and McCrae’s (1980) results. Yes, personality psychology has replicable findings. However, when domain satisfactions were added to the model, the story changed. Costa and McCrae (1980) assumed that extraversion increases well-being because it has a direct effect on cheerfulness (positive affect) that adds to well-being. However, in the new model, the effect of extraversion on life-satisfaction was mediated by life domains rather than positive affect. The strongest mediation was found for romantic satisfaction. Extraverts tended to have higher romantic satisfaction and romantic satisfaction contributed significantly to overall life-satisfaction. Other domains like recreation and work are also possible mediators, but the sample size was too small to produce more conclusive evidence.

Part 4 is a simple extension of the model in part 3 by adding the other personality dimensions to the model. I start with neuroticism because it is by far the most consistent and strongest predictor of well-being. Costa and McCrae (1980) assumed that neuroticism is a general disposition to experience more negative affect without any relation to positive affect. However, most studies show that neuroticism has a negative relationship with positive aspect as well, although it is not as strong as the relationship with negative affect. Moreover, neuroticism is also related to lower satisfaction in many life domains. Thus, the model simply allowed for neuroticism to be a predictor of both affects and all domain satisfaction. The only assumption made by this model is that the negative effect of neuroticism on life-satisfaction is fully mediated by domain satisfaction and affect.

Figure 1 shows the model and the path coefficients for neuroticism. The first important finding is that neuroticism has a strong direct effect on sadness that is independent of satisfaction with various life domains. This finding suggests that neuroticism may have a direct effect on individuals’ mood rather than interacting with situational factors that are unique to individual life domains. The second finding is that neuroticism has sizeable effects on all life domains ranging from b = -.19 for satisfaction with housing to -31 for satisfaction with friendships.

Following the various paths from neuroticism to life-satisfaction produces a total effect of b = -.38, which confirms the strong negative effect of neuroticism on well-being. About a quarter of this effect is directly mediated by negative affect (sadness), b = -.09. The rest is mediated by the top-down effect of neuroticism on satisfaction with life domains and the bottom-up effect of life domains on global life-evaluations.

McCrae and Costa (1991) expanded their model to include the other Big Five factors. They proposed that agreeableness has a positive influence on well-being that is mediated by romantic satisfaction (adding Liebe) and that conscientiousness has a positive influence on well-being that is mediated by work satisfaction (adding Arbeit). Although this proposal was made three decades ago, it has never been seriously tested because few studies measure domain satisfaction (but see Heller et al., 2004).

To test these hypotheses, I added conscientiousness and agreeableness to the model. Adding both together was necessary because agreeableness and conscientiousness were correlated as reflected in a large modification index when the two factors were assumed to be independent. This does not mean that agreeableness and conscientiousness are correlated factors, an issue that is debated among personality psychologists (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). One problem is that secondary loadings can produce spurious correlations among scale scores that were used for this model. This could be examined by using a more complex item-level model in the future. For now, agreeableness and conscientiousness were allowed to correlate. The results showed no direct effects of conscientiousness on PA, NA, and LS. In contrast, agreeableness was a positive predictor of PA and a negative predictor of NA. Most important are the relationships with domain satisfactions.

Confirming McCrae and Costa’s (1991) prediction, work satisfaction was predicted by conscientiousness, b = .21, z = 3.4. Also confirming McCrae and Costa, romantic satisfaction was predicted by agreeableness, although the effect size was small, b = .13, z = 2.9. Moreover, conscientiousness was an even stronger predictor, b =.28, z = 6.0. This confirms the old saying “marriage is work.” Also not predicted by McCrae and Costa was that conscientiousness is related to higher housing satisfaction, b = .20, z = 3.7, presumably because conscientious individuals take better care of their houses. The other domains were not significantly related to conscientiousness, |b| < .1.

Also not predicted by McCrae and Costa are additional relationships of agreeableness with other domains such as health, b = .18, z = 3.7, housing, a = .17, z = 2.9, recreation, b = .25, z = 4.0, and friendships, b = .35, z = 5.9. The only domains that were not predicted by agreeableness were financial satisfaction, b = .05, z = 0.8, and work satisfaction, b = .07, z = 1.3. Some of these relationships could reflects benefits for social relationships aside from romantic relationships. Thus, the results are broadly consistent with McCrae and Costa’s assumption that agreeableness is beneficial for well-being.

The total effect of agreeableness in this dataset was b = .21, z = 4.34. All of this effect was mediated by indirect paths, but only the path through romantic satisfaction achieved statistical significance due to a lack of power, b = .03, z = 2.6.

The total effect of conscientiousness was b = .18, z = 4.14. Three indirect paths were significant, namely work, b = .06, z = 3.3. romantic satisfaction, b = .06, z = 4.2, and housing satisfaction, b = .04, z = 2.51.

Overall, these results confirm previous findings that agreeableness and conscientiousness are also positive predictors of well-being and shed first evidence on potential mediators of these relationships. These results need to be replicated in datasets from other populations.

When openness was added to the model, a modification index suggested a correlation between extraversion and openness, which has been found in several multi-method studies (Anusic et al., 2009; DeYoung, 2006). Thus, the two factors were allowed to correlate. Openness had no direct effects on positive affect, negative affect, or life-satisfaction. Moreover, there were only two, weak, just significant relationships with domain satisfaction for work, b = .12, z = 2.0, and health, b = .12, z = 2.2. Consistent with meta-analysis, the total effect is negligible, b = .06, z = 1.3. In short, the results are consistent with previous studies and show that openness is not a predictor of higher or lower well-being. To keep the model simple, it is therefore possible to omit openness from the monster model.

Model Comparisons

At this point, we have built a complex, but plausible model that links personality traits to subjective well-being by means of domain satisfaction and affect. However, just because this model is plausible and fits the data, does not ensure that it is the right model. An important step in causal modeling is to consider alternative models and to do model comparisons. Overall fit is less important than relatively better fit among alternative models.

The previous model assumed that domain satisfaction causes higher levels of PA and lower levels of NA. Accordingly, affect is a summary of the affect generated in different life domains. This assumption is consistent with bottom-up models of well-being. However, a plausible alternative model assumes that affect is largely influenced by internal dispositions which in turn color our experiences of different life domains. Accordingly neuroticism may simply be a disposition to be more often in a negative mood and this negative mood colors perception of marital satisfaction, job satisfaction, and so on. Costa and McCrae (1980) proposed that neuroticism and extraversion are global affective dispositions. So, it makes sense to postulate that their influence on domain satisfaction and life satisfaction is mediated by affect. McCrae and Costa (1991) postulated that agreeableness and conscientiousness are not affective dispositions, but rather only instrumental for higher satisfaction in some life domains. Thus, their effects should not be mediated by affect. Consistent with this assumption, conscientiousness showed only significant relationships with some domains, including work satisfaction. However, agreeableness was a positive predictor of all life domains, suggesting that it is also a broad affective disposition. I thus modeled agreeableness as a third global affective disposition (see Figure 2).

The effect sizes for affect on domain satisfaction are shown in Table 1.

A comparison of the fit indices for the top-down and bottom-up models shows that both models meet standard criteria for global model fit (CFI > .95; RMSEA < .06). In addition, the results show no clear superiority of one model over the other. CFI and RMSEA show slightly better fit for the bottom-up model, but the Bayesian Information Criterion favors the more parsimonious top-down model. Thus, the data are unable to distinguish between the two models.

Both model assume that conscientiousness is instrumental for higher well-being in only some domains. The key difference between the models is the assumption of the top-down model that changes in domain satisfaction have no influence on affective experiences. That is, an increase in relationship satisfaction does not produce higher levels of PA or a decrease in job satisfaction does not produce a change in NA. These competing predictions can be tested in longitudinal studies.


To conclude part 4 of the monster model series. As surprising as it may sound, the present results provide one of the first tests of McCrae and Costa’s causal theory of well-being (Costa & McCrae, 1980, McCrae & Costa, 1991). Although the present results are consistent with their proposal that agreeableness and conscientiousness are instrumental for higher well-being because they foster higher romantic and job satisfaction, respectively, the present results also show that this model is too simplistic. For example, conscientiousness may also increase well-being because it contributes to higher romantic satisfaction (marriage is work).

One limitation of the present model is the focus on the Big Five as a measure of personality traits. The Big Five are higher-order personality traits of more specific personality traits that are often called facets. Facet level traits may predict additional variance in well-being that is not captured by the Big Five (Schimmack ,Oishi, Furr, & Funder, 2004). Part 5 will add the strongest facet predictors to the model, namely the Depressiveness facet of Neuroticism and the Cheerfulness facet of Extraversion (see also Payne & Schimmack, 2020).

Continue here to Part 5.

Stay tuned.