All posts by Ulrich Schimmack

About Ulrich Schimmack

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

How to Build a Monster Model of Well-Being: Part 1


Few psychologists would deny that psychological measurement is messy and that observed scores are biased by systematic and random measurement error. It is therefore surprising that equally few psychologists are actually trying to do something about this by using multi-method measurement models to control for measurement error. Multi-method measurement models are usually published by psychometricians who are only interested in measurement, but not in substantive theories. As a result, most published work relies on messy measures with unknown validity. This is also true for research on well-being. Given the lack of emphasis on measurement, it is also very difficult to find competent reviewers of work that uses measurement models because most well-being researchers are not trained in measurement.

To avoid the pain of peer-review, I am presenting the result of a complex model of well-being as a blog post. This model addresses many unanswered questions in the well-being literature. This may sound like a good thing, but it is a problem. Psychology articles are designed to answer one question at a time, hopefully without any concern about effect sizes. Like, does money buy happiness? Yes or no? Or, does neuroticism predict well-being? Of course, it does. This piecemeal work often culminates in a meta-analysis that shows the average correlation across many studies. Although this information quantifies effects, nobody knows what to do with this quantitative information. Psychologists are not used to thinking in terms of complex models with multiple variables that can all influence each other. In fact, they despise these models because they make causal assumptions with correlational data. What would happen if we allowed that? After all, we pride ourselves on the use of experiments, which makes us superior to the social sciences that use correlations. The problem is that nobody has come up with a good experiment to study what makes lives good that could be administered on MTurk for 20 cents per participant.

Ok, rant over.

1. A measurement model of well-being.

The most widely used indicators of well-being are life-satisfaction judgments. They have high face validity because they directly ask respondents to evaluate their lives against a subjectively chosen ideal. However, face validity does not imply that a measure is valid or that all of the variance in it is valid variance. The most widely used evidence to claim that self-ratings of life-satisfaction are valid indicators of well-being is that they show convergent validity with informant ratings. Although everybody might be lying, it is more likely that agreement between raters reflects some valid information about individuals’ lives.

Lucas, Suh, and Diener (1996) demonstrated convergent validity of self-ratings with averaged informant ratings.

Measure                        T1     T2
Life-Satisfaction T1           -
Life-Satisfaction T2           0.77   -
Life-Satisfaction Informant    0.48   0.49

These data cannot be used to create a measurement model because the self-ratings at time 1 and time 2 are likely to share some method variance. This would explain why they are more highly correlated with each other than with averaged informant ratings that reduce random and systematic measurement error by averaging across raters.

The first measurement model of well-being was published in 2013 (Zou, Schimmack, & Diener, 2013). Rather than averaging informants, each rater was treated like an independent method. The data came from the Mississauga Family Study. In this study, students who lived with their biological parents participated in a Round-Robin family study. Thus, there were three targets (students, mothers, fathers). In addition, there were three informants (students, mothers, fathers). These data can be analyzed with a model with planned missing data and four measures (self, student informant, mother informant, father informant), where data are missing for the informant ratings of the same target (informant ratings of student by students).

Measure                               Self   Student   Mother
Life-Satisfaction Self                -
Life-Satisfaction Student Informant   0.38   -
Life-Satisfaction Mother Informant    0.31   0.35      -
Life-Satisfaction Father Informant    0.37   0.38      0.47

The correlations in Table 2 show convergent validity for all four methods. Moreover, there is no evidence that self-ratings are more valid than informant ratings, which would have produced stronger self-informant correlations than informant-informant correlations. All methods seem to be equally valid. The main deviation was found for the correlation between mothers and fathers as informants. This could reflect some shared method variance in parents’ ratings of students’ well-being. However, a single-factor model fits these data reasonably well, CFI = .983, RMSEA = .048.

Nevertheless, adding a method factor (i.e., a correlation between the error variances) for mothers and fathers as informants improved model fit, CFI = 1.000, RMSEA = .000.

The path coefficient of .597 from the factor (unobserved variable, latent variable) to the observed self-rating scores implies that only about one-third of the variance in self-ratings (.597 squared is about .36) is valid variance that is also reflected in informant ratings. As these scale scores are based on the 3-item Satisfaction with Life scale (Diener et al., 1985), with a reliability of about 85%, this means that a large portion of the variance in self-ratings is systematic measurement error. This hasn’t stopped well-being researchers from treating single-item life-satisfaction ratings as perfectly valid measures that do not require a measurement model (e.g., World Happiness Reports).
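The arithmetic behind these numbers can be sketched in a few lines of Python. This is only an illustration, not the model that was actually fitted. It estimates the self-rating loading from the correlations in the table above with the triad formula for a single-factor model (skipping the mother-father pair, which shares method variance), then squares the reported loading of .597 and compares it with the stated reliability of about .85. The function name `triad_loading` and the treatment of all reliable but non-valid variance as systematic error are assumptions made for this sketch.

```python
import math

# Correlations taken from the table above (self-ratings and three informants).
r_self_student = 0.38
r_self_mother = 0.31
r_self_father = 0.37
r_student_mother = 0.35
r_student_father = 0.38


def triad_loading(r_xa, r_xb, r_ab):
    """Loading of x on a single common factor, estimated from one triad.

    In a one-factor model r_xa = l_x * l_a, so l_x**2 = r_xa * r_xb / r_ab.
    """
    return math.sqrt(r_xa * r_xb / r_ab)


# Two triads that avoid the mother-father correlation (shared method variance).
loading_via_mother = triad_loading(r_self_student, r_self_mother, r_student_mother)
loading_via_father = triad_loading(r_self_student, r_self_father, r_student_father)

# Variance decomposition based on the loading reported in the text.
reported_loading = 0.597
reliability = 0.85  # approximate reliability stated in the text
valid_variance = reported_loading ** 2
# Assumption: reliable variance that is not valid is systematic error.
systematic_error = reliability - valid_variance
random_error = 1 - reliability

print(f"triad estimates of the self-rating loading: "
      f"{loading_via_mother:.2f}, {loading_via_father:.2f}")
print(f"valid variance:            {valid_variance:.2f}")
print(f"systematic error variance: {systematic_error:.2f}")
print(f"random error variance:     {random_error:.2f}")
```

Both triad estimates land close to the reported loading of .597, and under the stated assumption roughly half of the total variance in self-ratings would be systematic measurement error.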

In conclusion, the first part of the model answers three questions. First, do life-satisfaction judgments have some validity? The answer is that they very very likely have some validity. Second, is all of the reliable variance in life-satisfaction judgments valid? The answer is that it is very very unlikely that this is the case. The third question is how much of the variance in life-satisfaction judgments is valid variance? Of course, there is no precise answer to this question, but a reasonable answer is that it is more than a quarter and less than fifty percent.

2. Affect and Well-Being

The second question is what predicts variation in life-satisfaction judgments. One theory of well-being is that the only relevant information is how much pleasure versus displeasure we experience in our lives. This theory is known as hedonism and usually attributed to Bentham who famously proposed that we are all slaves to pleasure and pain. In philosophy the question was whether it can be objectively justified to define well-being in this way. The modern consensus is that well-being cannot be reduced to feelings because individuals might want to do other things with their lives. Thus, a subjective theory of well-being at least allows individuals in principle to have high well-being with a lot of pain and little pleasure. However, it seems likely that people care about their feelings and that they influence how they evaluate their lives. This brings us to two empirical questions. First, how much of the variance in well-being is explained by feelings? Second, what else influences well-being independent of feelings?

Many studies have examined the first question using just self-ratings. However, this creates problems. Feelings are also measured with self-ratings, which creates shared method variance, but there could also be systematic method variance that is unique to judgments of feelings and life-satisfaction. To overcome these problems, it is possible to create measurement models for feelings and life-satisfaction judgments and to examine their relationship at the unobserved level that removes measurement error from the observed scores.

To go slowly, I first show the results for a model in which memory-based ratings of happy feelings are used to predict life-satisfaction judgments. Model fit was acceptable, CFI = .989, RMSEA = .036.

The autogenerated model in MPLUS looks pretty crappy, but it does show all of the paths, including the correlated error variances for ratings by the same rater as well as for father and mother informant ratings. The key finding is that the standardized effect size for the effect of happiness on life-satisfaction judgments is .718. In a model with a single predictor, we can square this parameter and see that happiness alone accounts for about 52% of the variance in well-being.

We can now do the same for negative affect. Very similar results are obtained with global ratings of unpleasantness or ratings of sadness (.69 vs. .71). Thus, sadness also explains about 50% of the variance in well-being.

If happiness and sadness were independent, these results would confirm the hedonist theory of well-being, but happiness and sadness are not independent. It is therefore necessary to include happiness and sadness simultaneously as predictors to see how much of the variance in well-being is explained by affect alone and how much variance is left to be explained by something else. This is of course just multiple regression with the notable difference that regression is conducted at the level of unobserved variables that control for measurement error.

The autogenerated model looks fine because it doesn’t include the correlated errors. The actual model included these errors. The fit of the model was acceptable, CFI = .989, RMSEA = .029. The correlation between happiness and sadness was r = -.59, which is similar to results in the only two multi-method studies of this relationship (Diener, Smith, & Fujita, 1995; Zou et al., 2013). The residual (unexplained) variance in well-being was 37%. Thus, affect explains roughly two-thirds of the valid variance in well-being, but one-third is not explained by affect.
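As a rough plausibility check, the explained variance with two correlated predictors can be approximated from the single-predictor results using the textbook formula for R-squared in a two-predictor regression. The sketch below is not the fitted structural equation model. It assumes that the single-predictor estimates behave like correlations with well-being, that the sadness effect is negative, and it uses -.70 as the midpoint of the two reported sadness estimates. The function name is hypothetical.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R-squared for a criterion regressed on two correlated predictors.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


r_happy = 0.718      # happiness -> well-being, from the single-predictor model
r_sad = -0.70        # assumed sign and midpoint of the reported .69 and .71
r_happy_sad = -0.59  # correlation between happiness and sadness

explained = r_squared_two_predictors(r_happy, r_sad, r_happy_sad)
residual = 1 - explained

print(f"explained variance: {explained:.2f}")
print(f"residual variance:  {residual:.2f}")
```

Under these assumptions the approximation gives a residual close to the 37% reported for the full model. It also shows why two predictors that each explain about half of the variance do not add up to 100%: they overlap.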

Thus, the first conclusion that we can draw from these results is that well-being cannot be reduced to the balance of pleasure and displeasure. Apparently, we are not slaves to our passions and have ways to embrace our lives even when we are not experiencing pleasure or dislike our lives even if we do. Most important, the remaining unexplained variance is not just measurement error because the model controlled for it.

This brings us to the second question. What else predicts well-being independent of our feelings? This question will be examined in Part 2 of this series called “How to build a monster model of well-being.”

Continued here. Part 2

Deceptive Open Science: Faking a Registered Report

Psychology, especially social psychology, has a credibility problem. For decades, psychologists used questionable research practices to confirm their predictions. As a result, psychology journals are filled with successful experiments, p < .05, that fail to provide credible evidence for social psychological theories. In theory, p-values below .05 ensure that no more than 5% of published results are false positives. However, if results are selected to be significant, it is theoretically possible that all of the published results are false. The happy production of incredible results in social psychology only attracted attention when social psychologists published crazier and crazier results that culminated in evidence for time-reversed, unconscious detection of erotic stimuli (Bem, 2011). Even social psychologists had a hard time believing this nonsense.
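The selection problem can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions, not an analysis of real studies: every simulated study compares two groups with a true effect of exactly zero, uses a z-test with known variance, and only results with p < .05 get "published". The sample sizes, the seed, and the function names are arbitrary choices for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_studies(n_studies=2000, n_per_group=20, seed=1):
    """Simulate two-group studies in which the true effect is zero."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_studies):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        mean_diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
        # Standard error of a mean difference with known unit variance.
        z = mean_diff / math.sqrt(2 / n_per_group)
        p_values.append(two_sided_p(z))
    return p_values


p_values = simulate_null_studies()
published = [p for p in p_values if p < 0.05]  # selection for significance
share_significant = len(published) / len(p_values)

print(f"share of all studies with p < .05: {share_significant:.3f}")
print(f"published studies: {len(published)}, all of them false positives")
```

About 5% of all simulated studies reach significance, as the nominal error rate promises. Among the published subset, however, every result is a false positive, because the true effect was zero in every study.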

Over the past decade, some social psychologists have tried to improve (or establish) sound scientific practices in social psychology with new initiatives like pre-registration of studies, open data sharing, and using larger samples to reduce sampling error so that results could be significant without inflated effect sizes.

One format that aims to improve the credibility of published findings is a registered report. Registered reports are studies where researchers submit a study design without data to a journal. The editor and reviewers vet the proposed study. If the study is approved, it receives a conditional acceptance, data are collected, and results are reported. One advantage of registered reports is that potential flaws are detected before data are collected, ensuring that studies that are carried out are better designed than studies that did not receive formal peer-review. Another advantage is that results will be published even if the results do not support researchers’ hypotheses.

The format of registered reports is fairly new and few articles have used this format. Unfortunately, there are already signs that journals and authors are abusing this new format and publishing articles as registered reports that do not have the desirable properties of registered reports. One example is a registered report published in Frontiers in Psychology.

In 2014, Frontiers in Psychology published an article with the title “Registered Report: Measuring Unconscious Deception Detection by Skin Temperature” (van ’t Veer, Stel, van Beest, & Gallucci, 2014). This article was not a registered report. Rather, it reported the results of a so-called pilot study and then outlined the design of a follow-up study. In 2015, the authors published an article with the title “Unconscious deception detection measured by finger skin temperature and indirect veracity judgments— results of a registered report” (van ’t Veer, Gallucci, Stel, & van Beest, 2015).

The publication of two articles is inconsistent with the structure of a registered report, where a data-analysis plan is submitted, approved, and then published with the results after data collection, resulting in a single article. More important is the question of how similar or different the pilot study in the first article is to the new study in the actual registered report.

The pilot study had a sample size of 108 participants. The actual registered study had a sample size of 155 participants. Although slightly bigger, this is not a notable difference in sample size. More important, the pilot study was not merely a study that examined the validity of outcome measures or the effectiveness of the experimental manipulation with manipulation checks. Rather, both studies used the full design to test the same hypotheses. In other words, the registered study was mainly a replication of a previous study.

The main hypothesis (H1) was specified to be “confirmed if the average skin temperature of participants while watching a liar is lower than when watching a truth-teller. This should translate into an interaction between veracity and time, and possibly a main effect of veracity” (p. 7).

The so-called pilot study showed no main effect of veracity, the predicted veracity x time interaction, b = 0.0007, F(1,73218.7) = 387.14, p < 0.001, and a three-way interaction between veracity x time x awareness, b = 0.0002, F(1,73218.7) = 6.69, p = 0.010.

The registered replication study also showed no main effect of veracity. It also failed to replicate the two-way interaction with time (p = .598). Thus, H1 was not confirmed. However, in the discussion section the authors write:

“The observed patterns of temperature change over time only partly confirmed our main hypothesis (H1), and the current findings pertaining to this hypothesis are therefore inconclusive” (p. 7).

This makes no sense. The results did not confirm the hypothesis, the results are inconsistent with those of a nearly equally powered previous study, and the main conclusion that can be drawn from these results is that the studies were underpowered to reliably detect the predicted effect.

The discussion section goes on to interpret the pattern of a non-significant 3-way interaction (p = .06) that was not predicted a priori.

“We found that finger skin temperature consistently decreased while observing a liar. When participants were observing a truth-teller, however, their finger skin temperature decreased more than it did for liars in the phase where participants did not have the goal to detect deception. In contrast, participant’s finger skin temperature stayed higher when observing a truth-teller compared to a liar when participants did have the goal to detect deception” (p. 7).

The discussion leaves out that the so-called pilot study produced a different pattern.

“From Figure 1, it becomes apparent that in the not aware condition (i.e., in the first block of videos) participants were warming up over time while watching both a video of a liar or a truth-teller. Interestingly and surprisingly, in the second block of videos (what we have called the aware phase) finger temperature dropped when watching a video of a truth-teller, more so than when watching a liar.”

Much later in the discussion the authors do make a comparison to the pilot study.

“When comparing the current results to the results obtained in the pilot study, the temperature pattern observed in the second phase of the pilot study seems to resemble the current pattern observed in the not forewarned phase. Although speculative, a perceivable cause of this could be that in the pilot study the forewarning was not manipulated as strong as in the current study, leaving participants still in a relatively ignorant state about what was to come and whether the experimental context was one of deception detection. Being able to expect and prepare for what is to come arguably has some advantages, although it should be noted that not anticipating threats is comparable to an everyday life situation in which people assume they will not be lied to.” (p. 9).

The registered report has been cited only six times. Another study of liars cites the 2015 article with the claim that the new findings “converge and expand on those of van ’t Veer, Gallucci, Stel, and van Beest (2015), who found that participants experienced decreases in finger skin temperature while observing lies relative to truths” (ten Brinke, Lee, & Carney, p. 574). This statement is an inaccurate statement about the actual results (a lie?), which showed no main effect of lying in the so-called pilot study and the registered study. Apparently, social psychologists still favor compelling stories over truth.

In conclusion, there is a danger that social psychologists once again undermine methodological reforms that force them to put their fancy theories to hard empirical tests and to bury them when they fail. Social psychology could clearly benefit from a reality check that puts many zombie theories out of their misery.

That being said, van’t Veer et al.’s registered report did produce some useful insights.
“In our experiment we found liars to be liked and trusted less.” (p. 9).

Comparing Top-Down and Bottom-Up Models of Subjective Well-Being

Over the past decade it has become apparent that psychological science is not yet a real science. Too often a single article with inconclusive results is widely cited as evidence. One reason for this is the limited amount of journal space in traditional, tree-killing journals that made it difficult to publish replication studies. This is changing as it is becoming easier to share scientific research in online only journals, pre-prints, or blog posts.

In an influential article, Diener (1984) proposed a distinction between bottom-up models and top-down models of subjective well-being. Although the terms top-down and bottom-up are used to distinguish a variety of models, the terms have been used to compare two alternative models of the relationship between life-satisfaction and domain satisfaction judgments.

The top-down model assumes that life-satisfaction has a global halo effect on satisfaction with specific life domains. In contrast, the bottom-up model considers life-satisfaction judgments to be summary judgments of satisfaction with important life domains (Schimmack, Diener, & Oishi, 2002).

An influential article in Psychological Bulletin used meta-analytic correlations to test these two models against each other (Heller, Watson, & Ilies, 2004). The article has been cited over 300 times, but a review of these articles did not reveal a single replication study.

To compare top-down and bottom-up models, Heller et al. (2004) included personality measures of the Big Five. The bottom-up model assumes that personality traits influence specific life domains (e.g., neuroticism influences health satisfaction, extraversion influences leisure satisfaction), and that domain satisfaction mediates the influence of personality on life-satisfaction (Figure 1).

The alternative top-down model assumes that personality traits influence life-satisfaction and that life-satisfaction mediates effects of personality on life domains (Figure 2).

There are two major differences between these models. First, the bottom-up model allows for life-circumstances to influence life-satisfaction. For example, actual income could influence income satisfaction, which is supported by studies that show moderate to strong correlations between income and financial satisfaction. In Figure 1, income and other life-circumstances are represented by the residual variances of domain satisfaction that are not due to personality traits. These residual variances are often omitted from figures because of the common misperception that residual variances are error variances, which is only true in measurement models where residual variance may reflect only random error. This is not the case here. While domain satisfaction residuals contain some random and systematic measurement error, they also reflect actual life-circumstances and person x situation interaction effects that produce real variation in domain satisfaction. Figure 1 shows that these residuals have an influence on life-satisfaction because there is an arrow from the residuals to the domain satisfactions (e.g., income to income satisfaction) and from the domain to life-satisfaction (income-satisfaction to life-satisfaction). It would be possible to add manifest measures of these life-circumstances to the model, but if no measures are available, the variance is reflected in latent variables that capture the influence of unmeasured causes on domain satisfaction. Figure 2 shows that there is no path from the domain-satisfaction residuals to life-satisfaction. Thus, Heller et al.’s top-down model implies that actual life-circumstances like income have no influence on life-satisfaction or that these effects are so small that they can be safely ignored.

The second difference between the models is that Figure 2 implies that personality traits that influence life-satisfaction also influence all life domains. The rationale is that the halo from life extends to all aspects of life. In contrast, the bottom-up model in Figure 1 does not make this assumption. It is possible that all personality traits influence all life domains (cf. Heller et al., 2004), but it is also possible that some personality traits are only beneficial for some domains. For example, McCrae and Costa (1991) speculated that agreeableness is beneficial for love (family satisfaction), whereas conscientiousness is beneficial for work (job & income).

It is noteworthy that Heller et al. (2004) found better fit for the bottom-up model than the top-down model, CFI = 1.00 vs. .96, RMSEA = .01 vs. .08, but favored the top-down model because it is more parsimonious. This decision is questionable because CFI and RMSEA do take parsimony into account and still favor the bottom-up model. The preference of the top-down model may also be explained by review articles that suggested objective life-circumstances are relatively unimportant for life-satisfaction (Diener, 1984; Diener, Suh, Lucas, & Smith, 1999). However, over the past two decades, new research has demonstrated that life-events are more important than researchers originally assumed (Diener, Lucas, & Scollon, 2006; Schimmack, Schupp, & Wagner, 2008).

My literature review of studies that cited Heller et al. (2004) retrieved one important article that deserves attention (Lachmann et al., 2018).

The article reports four highly similar studies that examined the correlations among the Big Five dimensions, 5 or 6 domain satisfaction judgments, and global life-satisfaction judgments. Study 1 had a sample size of 29,418 participants. Study 2 replicated Study 1 with 4,390 participants. The main limitation of Studies 1 and 2 was the use of a very brief personality questionnaire (i.e., the 10-item BFI-10). Studies 3 and 4 had smaller sample sizes (Ns = 496, 488), but used the longer NEO-FFI to measure personality. Studies 3 and 4 were based on German and Chinese student samples, respectively. This made it possible to examine cultural similarities and differences. The main finding of this study was that domain satisfaction judgments explained additional variance in life-satisfaction judgments that was not explained by personality. This finding favors bottom-up models. The main limitation of this article was that the authors did not fit bottom-up and top-down models to the data, but they reported the correlation matrices and standard deviations, which makes it possible to fit these models to their data. Here I report the results of these analyses.

Studies 1 and 2

One major problem in mono-method studies is that self-ratings of personality and life-satisfaction share rater biases (Diener, 1984; Schimmack et al., 2008). Anusic et al. (2009) found a way to model this rater bias with a method factor that they called halo factor. It is possible to fit a halo factor to self-ratings of the Big Five because the actual personality traits are fairly independent. Thus, any correlations among self-ratings of the Big Five reflect mostly halo bias, although some additional relationships may exist. The same halo factor can also influence ratings of domain satisfaction and life-satisfaction. Thus, one solution to control for method variance in self-ratings is to let all measures (personality, domain satisfaction, life-satisfaction) load on a single factor and to use the residuals to model the actual structural relationships (Schimmack et al., 2008). This is the approach I used here. Figure 3 illustrates this measurement approach.

H represents the halo factor that produces shared method variance among all measures. N’, A’, he’, fa’ and ls’ represent the remaining variance in these measures after controlling for the shared method variance. The actual structural model is then created by relating these residuals to each other. Figure 3 illustrates this with a sparse bottom-up model that assumes only N and A influence life-satisfaction and that these effects are fully mediated by health satisfaction and family satisfaction, respectively. Figure 3 is only used to illustrate the measurement model. The actual model was more complex, but difficult to represent graphically.
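The logic of the halo factor can be sketched numerically. If two measures load on the halo factor and their residuals are unrelated, the model-implied correlation between them is simply the product of their standardized loadings, and the residual variance is one minus the squared loading. The loadings below are hypothetical values chosen for illustration. They are not estimates from the data, and the function names are not taken from any software.

```python
# Hypothetical standardized loadings on the halo factor (illustrative only).
halo_loadings = {
    "N": -0.4,  # neuroticism is the undesirable pole, so it loads negatively
    "A": 0.4,
    "he": 0.4,
    "fa": 0.4,
    "ls": 0.5,
}


def implied_halo_correlation(a, b, loadings=halo_loadings):
    """Correlation between two measures that is due to the halo factor alone."""
    return loadings[a] * loadings[b]


def residual_variance(a, loadings=halo_loadings):
    """Variance left in a standardized measure after removing halo variance."""
    return 1 - loadings[a] ** 2


r_n_a = implied_halo_correlation("N", "A")
ls_residual = residual_variance("ls")

print(f"halo-implied correlation between N and A: {r_n_a:.2f}")
print(f"residual variance in ls after removing halo: {ls_residual:.2f}")
```

Any observed correlation beyond the halo-implied value has to be explained by relationships among the residuals, which is where the structural model operates.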

To ease model identification, I first constrained the unstandardized loading of all Big Five dimensions on the halo factor to 1. I allowed the remaining loadings to vary freely. The bottom-up model allowed for N, E, A, and C to influence all domains and all domains could influence life-satisfaction. This is identical to Heller’s model. Also, I allowed domain satisfaction residuals to correlate because life domains can overlap (e.g., income satisfaction can be correlated with job satisfaction).

Inspection of the modification indices suggested that BFI-10 Extraversion is related to BFI-10 Neuroticism. Therefore, I allowed for an extra relationship between the residuals of E and N (N’ with E’). The bottom-up model had good fit, CFI = .988, RMSEA = .042, and had better fit than the top-down model, CFI = .947, RMSEA = .054. Even the Bayesian Information Criterion, which rewards parsimony the most, favored the more complex bottom-up model, BIC = 1,122,010 vs. 1,123,937. This result replicates Heller et al.’s (2004) finding, although they favored the top-down model, and the results confirm Lachmann et al.’s conclusions based on the same data.
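The comparison rule behind these numbers is simple: higher CFI is better, while lower RMSEA and lower BIC are better. The sketch below applies that rule to the Study 1 values reported above. The dictionary layout and the helper name `preferred_model` are illustrative choices, not part of any SEM software.

```python
# Fit indices for Study 1 as reported in the text.
study1_fit = {
    "bottom_up": {"CFI": 0.988, "RMSEA": 0.042, "BIC": 1_122_010},
    "top_down": {"CFI": 0.947, "RMSEA": 0.054, "BIC": 1_123_937},
}

# Direction of each index: True means larger values indicate better fit.
HIGHER_IS_BETTER = {"CFI": True, "RMSEA": False, "BIC": False}


def preferred_model(fit, index):
    """Return the name of the model that a given fit index favors."""
    pick = max if HIGHER_IS_BETTER[index] else min
    return pick(fit, key=lambda model: fit[model][index])


preferences = {index: preferred_model(study1_fit, index) for index in HIGHER_IS_BETTER}
bic_difference = study1_fit["top_down"]["BIC"] - study1_fit["bottom_up"]["BIC"]

for index, model in preferences.items():
    print(f"{index}: favors {model}")
print(f"BIC difference (top-down minus bottom-up): {bic_difference}")
```

All three indices point in the same direction for Study 1, which is why the parsimony argument for the top-down model does not hold up here.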

Given the large sample size, it is better to focus on effect sizes. The parameter estimates for the bottom-up model showed notable (effect size r > .1) effects of neuroticism on health satisfaction (-.28), job satisfaction (-.20), income satisfaction (-.13), housing satisfaction (-.24), and leisure satisfaction (-.12). That is, all domains were negatively related to neuroticism. Extraversion had positive relationships with health satisfaction (.21), job satisfaction (.15), housing satisfaction (.23), and leisure satisfaction (.14). The effect for income satisfaction (.08) was close to .10. Thus, extraversion also showed broad relationships with most if not all domains. Conscientiousness showed notable effects for job satisfaction (.16), income satisfaction (.12), but not for housing satisfaction (.01) or leisure satisfaction (.05). The effect for health was close to .10 (.08). Agreeableness did not show any notable relationships. The strongest effect was for health satisfaction (.07).

Table 1 shows the unique effects of the domains on life-satisfaction. All of these effects are significant well beyond the conservative 5-sigma criterion used in particle physics (zs > 11). The reason the effect sizes are small is that life is a complex object with many domains. Thus, any single domain can only explain a small portion of the variance in life-satisfaction. In addition, life domains overlap and regression coefficients reveal only the unique variance that is explained by a single domain. Together, these five domains explained 56% of the variance in life-satisfaction, which is close to the amount of reliable variance in a single-item measure of life-satisfaction (Schimmack & Oishi, 2005).
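To put the 5-sigma criterion in perspective, the tail probabilities can be computed from the standard normal distribution. The sketch below uses the one-sided convention that is common in particle physics, which is an assumption about how the criterion is applied here. The function name is hypothetical.

```python
import math


def one_sided_p(z):
    """Upper-tail probability of a standard normal variable at z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


p_5_sigma = one_sided_p(5)
p_z_11 = one_sided_p(11)

print(f"one-sided p at z = 5:  {p_5_sigma:.2e}")
print(f"one-sided p at z = 11: {p_z_11:.2e}")
```

A z-score of 11 corresponds to a probability that is many orders of magnitude smaller than the 5-sigma threshold, so sampling error is not a plausible explanation for these effects.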

A bottom-up model assumes that personality effects on life-satisfaction are mediated by life-domains. I used the model indirect function to get estimates of the indirect effects of personality traits on life-satisfaction. The results are consistent with the common finding that neuroticism and extraversion are the strongest predictors of life-satisfaction (Heller et al., 2004). After controlling for shared method variance, the remaining traits explain relatively small amounts of variance in life-satisfaction judgments (Schimmack et al., 2008; Schimmack & Kim, 2020). Even if these traits had some strong effects on some domains, the effect on life-satisfaction would be muted by the small effects of single domains on life-satisfaction. Extraversion and neuroticism have stronger effects on life-satisfaction because they influence a broad range of domains. The effect size for neuroticism in this study may be attenuated by the lower reliability of the 2-item measure in the BFI-10.
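An indirect effect in a mediation model is the product of the path from the trait to the domain and the path from the domain to life-satisfaction, and the total indirect effect is the sum of these products across domains. The sketch below uses the neuroticism paths reported above for Study 1. The domain-to-life-satisfaction paths are hypothetical placeholders, because the table with the actual values is not reproduced in the text, so the resulting number is an illustration of the computation and not a result.

```python
# Neuroticism -> domain satisfaction paths reported in the text (Study 1).
neuroticism_to_domain = {
    "health": -0.28,
    "job": -0.20,
    "income": -0.13,
    "housing": -0.24,
    "leisure": -0.12,
}

# HYPOTHETICAL domain -> life-satisfaction paths (placeholders, not estimates).
domain_to_life_satisfaction = {
    "health": 0.20,
    "job": 0.10,
    "income": 0.15,
    "housing": 0.15,
    "leisure": 0.25,
}


def indirect_effects(a_paths, b_paths):
    """Product-of-paths indirect effect for each mediator."""
    return {domain: a_paths[domain] * b_paths[domain] for domain in a_paths}


specific = indirect_effects(neuroticism_to_domain, domain_to_life_satisfaction)
total_indirect = sum(specific.values())

for domain, effect in specific.items():
    print(f"neuroticism -> {domain} -> life-satisfaction: {effect:.3f}")
print(f"total indirect effect (hypothetical b paths): {total_indirect:.3f}")
```

The example shows why breadth matters: each specific indirect effect is small, but a trait that affects many domains accumulates a larger total effect.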

Study 2 is an exact replication of Study 1 with a new sample. The bottom-up model fit the data better than the top-down model, CFI = .993 vs. .975, RMSEA = .035 vs. .041, BIC = 170,443 vs. 170,471.

Neuroticism was a notable (effect size r > .1) predictor of health satisfaction (-.15), job satisfaction (-.12) and leisure satisfaction (-.14). This time, the effects for income satisfaction (-.08) and housing satisfaction (-.08) were slightly below the cutoff. Extraversion only reached the cutoff value for leisure (.11). Conscientiousness once again predicted job satisfaction (.12) and income satisfaction (.14). Once again, agreeableness was not a notable predictor of any life domain.

Table 3 shows that the contribution of domains to life-satisfaction was also similar across studies. Once more job was the weakest predictor and leisure was the strongest predictor. Despite the smaller sample size, all effects are significant with the 5-sigma criterion (zs > 8).

Neuroticism was again the strongest predictor and agreeableness and conscientiousness were again weak predictors of life-satisfaction.
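The logic behind these indirect effects can be made concrete with a small calculation. In a model with several parallel mediators, each indirect path is the product of the trait-to-domain coefficient and the domain-to-life-satisfaction coefficient, and the total indirect effect is the sum of these products. The sketch below uses the neuroticism coefficients reported above for Study 2. The domain weights are hypothetical placeholders chosen for illustration only; they are not the values from Table 3.

```python
# Neuroticism -> domain satisfaction paths reported in the text for Study 2.
neuroticism_to_domain = {
    "health": -0.15,
    "job": -0.12,
    "leisure": -0.14,
    "income": -0.08,
    "housing": -0.08,
}

# HYPOTHETICAL domain -> life-satisfaction weights (placeholders, not Table 3).
domain_to_life_satisfaction = {
    "health": 0.15,
    "job": 0.10,
    "leisure": 0.25,
    "income": 0.20,
    "housing": 0.20,
}


def total_indirect_effect(trait_paths, domain_weights):
    """Sum of the products a_d * b_d over all mediating domains d."""
    return sum(trait_paths[d] * domain_weights[d] for d in trait_paths)


indirect = total_indirect_effect(neuroticism_to_domain, domain_to_life_satisfaction)
print(f"Total indirect effect of neuroticism: {indirect:.3f}")
```

Because every single product is small, a trait only reaches a sizable total indirect effect if it influences many domains, which is the pattern described for neuroticism.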

The most notable difference from the first study is that extraversion was a weak predictor in this study, which was also reflected in the weak relationships with domain satisfaction. A large investigation of possible moderators shows that this finding is not due to the reliance on a German sample, although extraversion effects in Germany may be a little bit weaker than in North American samples. Given the use of a two-item Extraversion measure, these results should not be overinterpreted.

The main finding from Studies 1 and 2 is that the bottom-up model fits the data better than the top-down model and that one reason for this better fit is that unique variance in life domains contributes to overall judgments of life-satisfaction. Figure 4 summarizes the main findings.

Studies 3 and 4

Samples 3 and 4 were analyzed simultaneously with a multiple group model to compare cultural differences between German and Chinese samples. Studies 3 and 4 also added family satisfaction as a life domain and used the longer NEO-FFI to measure personality.

The bottom-up model fitted the data better than the top-down model according to CFI = .973 vs. .933 and RMSEA = .045 vs. .063, but not according to BIC = 37,357 vs. 37,345. The reason is that BIC is sensitive to sample sizes and treats small effects in small samples as consistent with null-effects. In fact, the sample-size-adjusted BIC preferred the bottom-up model, 37,087 vs. 37,138.
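To see how the two criteria can disagree, it helps to look at their penalty terms. BIC adds k * ln(N) to -2 times the log-likelihood, whereas the sample-size-adjusted BIC (Sclove, 1987) replaces N with (N + 2) / 24, which shrinks the penalty per parameter. The sketch below is only an illustration of this mechanism; the log-likelihoods, parameter counts, and sample size are hypothetical and are not the values from Studies 3 and 4.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: -2 lnL + k * ln(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def adjusted_bic(log_likelihood, n_params, n_obs):
    """Sample-size-adjusted BIC: -2 lnL + k * ln((N + 2) / 24)."""
    return -2.0 * log_likelihood + n_params * math.log((n_obs + 2) / 24.0)


# HYPOTHETICAL inputs for illustration only.
n_obs = 500
smaller_model = {"log_likelihood": -18500.0, "n_params": 60}
larger_model = {"log_likelihood": -18480.0, "n_params": 68}  # fits better, 8 more parameters

for name, m in (("smaller", smaller_model), ("larger", larger_model)):
    print(
        f"{name}: BIC = {bic(n_obs=n_obs, **m):.1f}, "
        f"adjusted BIC = {adjusted_bic(n_obs=n_obs, **m):.1f}"
    )
```

With these made-up numbers, the larger model has the higher (worse) BIC but the lower (better) adjusted BIC, the same kind of reversal reported above.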

The personality effects on domain satisfaction were once again weak, except for neuroticism. The only notable difference between cultures was a stronger relationship between conscientiousness and health satisfaction in the Chinese sample (.16 vs. .01).

All domains except job satisfaction contributed to overall life-satisfaction. Results were similar across the two cultural samples.

The reason for the different results for job satisfaction could be that these were student samples and they rated satisfaction with their courses.

The indirect effects of personality are again weak, but neuroticism is the strongest predictor because it influences most if not all domains.


Heller et al. (2004) argued for top-down models of subjective well-being. The main assumption of these models is that well-being is strongly influenced by personality traits and that actual life-circumstances have no influence on SWB. Even Heller et al.’s (2004) data did not favor this model, but it fitted trait-theories of well-being that were popular in the 1980s and 1990s. Although well-being researchers have moved on from trait or set-point theories (Diener et al., 2006), old articles are still cited although they are inconsistent with new data as well as the old data. It is time to abandon top-down models that never made sense and never fit actual data and start building on models that are actually consistent with the data. Any scientific model of well-being has to acknowledge that well-being is less stable than personality traits (Anusic & Schimmack, 2016) and more strongly influenced by life events and environmental factors than personality traits (Diener et al., 2006).

The neglect of bottom-up models has had several negative consequences for personality and well-being research. Most importantly, researchers have neglected to study the influence of personality on specific life domains. The present results suggest that only neuroticism appears to have a pervasive influence on many life domains, whereas other personality traits may only affect some life domains. It is also possible that personality traits interact with environmental factors in the prediction of domain satisfactions. A focus on global life-satisfaction judgments is problematic because life-satisfaction is a broad aggregate of many factors. As a result, effect sizes for any specific factor are bound to be weak. Exploration of domain satisfaction in smaller studies may elicit new factors that have been overlooked in studies with global life-satisfaction judgments.

Another problem for well-being researchers is the assumption that global life-satisfaction judgments are unbiased measures of subjective well-being. That is, the prevailing assumption is that respondents weigh relevant information according to their subjective importance. However, this may not be the case. While I have found some evidence that accessibility is related to importance (Schimmack et al., 2002), many other factors can influence what information respondents use to answer vague, global life-satisfaction judgments. For example, while the present study found that health satisfaction contributed unique information, another study failed to find effects of health satisfaction (Payne & Schimmack, in press). More studies with the bottom-up model are needed to examine the contribution of domains to life-satisfaction judgments and to examine potential moderators of these relationships.

In conclusion, although top-down models are not supported by empirical evidence, well-being and personality researchers have been slow to adopt bottom-up models that recognize the importance of life circumstances for well-being. Although personality is important, well-being is not a stable trait and well-being is more than a disposition to look on the bright side of life. A complete understanding of well-being requires an integrated model of environmental and personality factors. The bottom-up model is a first step in that direction.

ManyLabs5: More Evidence that Social Psychology is Incredible

The word incredible has two meanings. One meaning is that something wonderful, spectacular, and remarkable occurred. The other meaning is that something is difficult to believe.

For several decades, experimental social psychologists reported spectacular findings about human behavior and cognition that culminated in the discovery of time-reversed, subliminal, erotic priming (Bem, 2011). The last decade has revealed that many of these incredible findings cannot be replicated (Schimmack, 2020). The reason is that social psychologists used a number of statistical tricks to inflate their effect sizes in order to produce statistically significant results (John et al., 2012). This has produced a crisis of confidence about published results in social psychology journals. What evidence can textbook writers and lecturers trust?

A shocking finding was that only 25% of published results in social psychology could be replicated and the percentage for classic experiments with random assignment to groups was even lower (OSC, 2015). Eminent social psychologists have responded in two ways. They either ignored these embarrassing results and pretended that everything is fine or they argued that the replication studies were poorly designed and carried out, maybe even with the intention to produce replication failures. Neither response is satisfactory. It is telling that eminent social psychologists have resisted calls to self-replicate their famous findings (cf. Schimmack, 2020).

Meanwhile, authors of the reproducibility project have responded to criticism by replicating their replication studies. Moreover, they improved statistical power to produce significant results by collaborating across labs. The results of this replication of replications project have just been published under the title “Many Labs 5” (ML5).

The project focused on 10 original studies that failed to replicate in the OSC-Reproducibility Project, but for which there were some concerns about the replication studies. The success rate for ML5 was 20% (2 out of 10). However, none of the studies would have produced a significant result with the original sample size. These results reinforce the claim that experimental social psychology suffers from a replication crisis that casts a shadow of doubt over decades of research and the empirical foundations of social psychology.

One important question for the future of social psychology is whether any published findings provide credible evidence and how credible findings can be separated from incredible findings without the need for costly actual replications. One attempt to find credible evidence is the use of prediction markets. The idea is that the wisdom of crowds makes it possible to identify credible findings. For the 10 studies in ML5, the average estimated success rate was about 30%, which is relatively close to the actual success rate of 20%. Thus, market participants were well calibrated to the low replicability of social psychological findings. However, they were not able to predict which of the 10 studies would replicate. The reason is that even the studies that replicated had very small effect sizes that were not significantly different from those of studies that did not replicate. Thus, none of the 10 studies was particularly credible or instilled confidence among market participants.

An alternative approach to predict replicability relies on statistical information about the strength of evidence against the null-hypothesis in original studies (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020; Schimmack, 2012). Results with p-values that are just significant (.01 < p < .05) provide weak evidence against the null-hypothesis when questionable research practices are used because it is relatively easy to get these results. In contrast, very small p-values are difficult to obtain with QRPs. Thus, studies with high test-statistics should be more likely to replicate.

After examining the results from the OSC-reproducibility project for social psychology, I proposed that we should distrust all findings with a z-score less than 4 (Schimmack, 2015). Importantly, this rule is specific to social psychology and other disciplines (e.g., cognitive psychology) or sciences may require different rules (Schimmack, 2015).
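For readers who think in p-values rather than z-scores, the two can be converted into each other with the standard normal distribution. The following is a minimal sketch using Python's standard library; the function names are my own, and the conversion assumes two-sided tests.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_sided_p(z):
    """Two-sided p-value for an absolute z-score."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def z_from_two_sided_p(p):
    """Absolute z-score that corresponds to a two-sided p-value."""
    return _STD_NORMAL.inv_cdf(1.0 - p / 2.0)


# Just-significant results sit between roughly 2 and 2.6 sigma.
for p in (0.05, 0.01):
    print(f"p = {p:.2f}  ->  z = {z_from_two_sided_p(p):.2f}")

# The 4-sigma rule and the 5-sigma criterion expressed as p-values.
for z in (2.0, 4.0, 5.0):
    print(f"z = {z:.1f}  ->  p = {two_sided_p(z):.2e}")
```

The conversion shows why a 4-sigma threshold is so much stricter than p < .05: the corresponding two-sided p-value is below .0001.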

How does the 4-sigma rule fare with the ML5 results? Only 1 out of the 10 studies has a z-score greater than 4. This study deserves some special mention because it was published by Jens Forster, who left academia under suspicions of research fraud (Retraction Watch). Fraud does not require actual data that need to be massaged to produce significant results and no statistical method can correct for research fraud. Thus, it is reasonable to apply the 4-sigma rule to the remaining studies. Consistent with the 4-sigma rule, none of the 9 remaining studies would have produced a significant result with the original sample size. Thus, the ML5 results provide support for applying this rule to social psychology.

The problem for social psychology is that most test-statistics that are published are below this criterion. Figure 1 (from Schimmack, 2020) shows the distribution of published test-statistics in a representative sample of studies from social psychology collected by Motyl and colleagues.

The graph shows clear evidence of QRPs because journals hardly ever report a non-significant result, despite low power (Expected discovery rate 19%) to produce significant results, which has been the case since the beginning of experimental social psychology (Cohen, 1962; Sterling, 1959). Moreover, we see that most published test-statistics are between 2 and 4 sigma. The results from OSC-RPP and ML5 suggest that most of these results are difficult to replicate even with larger samples. Moreover, these results suggest that the replicability estimates provided by z-curve (43%) are overly optimistic because the model does not account for fraud and other extremely questionable practices that can produce significant results without actual effects.

In conclusion, experimental social psychology is the poster-child of pseudo-science, where researchers ape real sciences to sell incredible stories with false evidence. Social psychologists have shied away from this reality, just like Trump is trying to hold on to his lie that he won the 2020 election. It is time to throw out this junk science and to usher in a new era of credible, honest, and responsible social psychology that addresses real world problems with real scientific evidence, and to hold charlatans accountable for their actions and denial. It is problematic that textbooks still peddle research and theories that rest on incredible evidence that was obtained with questionable research practices.

Invalid Claims about the Validity of Implicit Association Tests

Draft. Response to Commentaries by Vianello and Bar-Anan and Kurdi, Ratliff, and Cunningham

Invalid Claims about the Validity of Implicit Association Tests by Prisoners of the Implicit Social-Cognition Paradigm

Greenwald and colleagues (1998) introduced Implicit Association Tests (IATs) as a new method to measure individual differences in implicit cognitions. Twenty years later, IATs are widely used for this purpose, but their construct validity has not been established. Even their creators are no longer sure what IATs measure. Whereas Banaji and Greenwald (2013) confidently described IATs as “a method that gives the clearest window now available into a region of the mind that is inaccessible to question-asking methods” (p. xiii), they now claim that IATs merely measure “the strengths of associations among concepts” (Cvencek et al., 2020, p. 3). This is akin to saying that an old-fashioned thermometer measures the expansion of mercury, which is true, but does not really speak to the purpose of thermometers, which is to measure temperature.

Fortunately, we do not need Greenwald or Banaji to define the constructs that IATs are supposed to measure. Twenty years of research with IATs makes it clear what researchers believe they are measuring with IATs. A self-esteem IAT is supposed to measure implicit self-esteem (Greenwald & Farnham, 2000). A race-IAT is supposed to measure implicit prejudice (Cunningham, Preacher, & Banaji, 2001), and a suicide IAT is supposed to measure implicit suicidal tendencies that can predict suicidal behaviors above and beyond self-reports (Kurdi, Ratliff, & Cunningham, 2020). The empirical question is how good IATs are at measuring these constructs. I concluded that most IATs are poor measures of their intended constructs (Schimmack, 2019a). This conclusion elicited one implicit and two explicit responses.

Implicit Response

The implicit response is to simply ignore criticism and to make invalid claims about the construct validity of IATs (Greenwald & Lai, 2020). For example, a 2020 article co-authored by Nosek, Greenwald, and Banaji claims “available evidence for validity of IAT measures of self-esteem is limited (Bosson et al., 2000; Greenwald & Farnham, 2000), with some of the strongest evidence coming from empirical tests of the balance-congruity principle” (Cvencek et al., 2020, p. 7). This statement is as valid as Trump’s claim that an honest count of votes would make him the winner of the 2020 election. Over the past two decades several articles have concluded that self-esteem IATs lack validity (Buhrmester, Blanton, & Swann, 2011; Falk et al., 2015; Walker & Schimmack, 2008). It is unscientific to omit these references from a literature review.

The balance-congruity principle is also not a strong test of the claim that the self-esteem IAT is a valid measure of individual differences in implicit self-esteem. In contrast, lack of convergent validity with informant ratings and even other implicit measures of self-esteem provides strong evidence that self-esteem IATs are invalid (Bosson et al., 2000; Falk et al., 2015).

Finally, supporting evidence is surprisingly weak. For example, Greenwald and Farnham’s (2000) highly cited article tested predictive validity of the self-esteem IAT with responses to experimentally manipulated successes and failures (N = 94). They did not even report statistical results. Instead, they suggested that even non-significant results should be counted as evidence for the validity of the self-esteem IAT. “Although p values for these two effects straddled the p = .05 level that is often treated as a boundary between noteworthy and ignorable results, any inclination to dismiss these findings should be tempered by noting that these two effects agreed with prediction in both direction and shape.” (p. 1032). Twenty years later this finding has not been replicated, and psychologists have learned to distrust p-values that are marginally or just significant (Benjamin et al., 2018; Schimmack, 2012, 2020). In conclusion, conflict of interest and motivated biases undermine the objectivity of Greenwald and colleagues in evaluations of IATs.

Explicit Response I

Vianello and Bar-Anan (2020) criticize my structural equation models of their data and present a new model that appears to show incremental predictive validity for implicit racial bias and implicit political orientation. I thought it would be possible to resolve some of the disagreement in a direct and open communication with the authors because the disagreement is about modeling of the same data. I was surprised when the authors declined this offer because Bar-Anan co-authored an article that praised the virtues of open scientific communication (Nosek & Bar-Anan, 2012). Readers therefore have to reconcile conflicting viewpoints for themselves. To ensure full transparency, I published syntax, outputs, and a detailed discussion of the different modeling assumptions in Supplementary Materials on the Open Science Framework server (

In brief, a comparison of the models shows that my model is more parsimonious and has better fit than their model. As the model is more parsimonious, better fit cannot be attributed to overfitting of the data. Rather, the model is more consistent with the actual data, which in most sciences is considered a good reason to favor a model. Vianello and Bar-Anan’s model also shows unexpected results that show some problems with the interpretation of their method factors. For example, the race IAT has only a weak positive loading on the IAT method factor and the political orientation IAT even has a moderate negative loading. It is not clear how a method can have negative loadings on a method factor, and Vianello and Bar-Anan provided no explanation for this surprising finding.

The two models also produce different results regarding incremental predictive validity (Table 1). My model shows no incremental predictive validity for implicit factors. It is also surprising that Vianello and Bar-Anan found incremental predictive validity for voting behaviors, because the explicit and implicit factors correlated r = .9. This high correlation leaves little room for variance in implicit political orientation that is distinct from political orientation measured with self-ratings. In conclusion, Vianello and Bar-Anan failed to challenge my conclusion that implicit and explicit measures measure mostly the same constructs and that low correlations between explicit and implicit measures reflect measurement error rather than some hidden implicit processes.

Explicit Response II

The second response is a confusing, 7,000-word article that is short on facts, filled with false claims, and requires more fact-checking than a Trump interview (Kurdi, Ratliff, & Cunningham, 2020).

False Fact I

The authors begin with the surprising statement that my findings are “not at all incompatible with the way that many social cognition researchers have thought about the construct of implicit evaluation and the validity of the IAT” (p. 6). This statement is misleading. For three decades, social cognition researchers have pursued the idea that many social-cognitive processes that guide behavior occur outside of awareness. For example, Nosek, Hawkins, and Frazier (2011) claim “most human cognition occurs outside conscious awareness or conscious control” (p. 152), and go on to claim that IATs “measure something different from self-report” (p. 153). And just this year, Greenwald and Lai (2020) claim that “in the last 20 years, research on implicit social cognition has established that social judgments and behavior are guided by attitudes and stereotypes of which the actor may lack awareness” (p. 419).

Social psychologists have also been successful in making the term implicit bias a common term in public discussions of social behavior. The second author, Kathy Ratliff, is director of Project Implicit which “has a mission to develop and deliver methods for investigating and applying phenomena of implicit social cognition, including especially phenomena of implicit bias based on age, race, gender or other factors” (Kurdi et al., 2020, p. 10). It is not clear what this statement means if we do not make a distinction between traditional research on prejudice with self-report measures and the agenda of Project Implicit to study implicit biases with IATs.

In addition, all three authors have published recent articles that allude to IATs as measures of implicit cognitions. In a highly cited American Psychologist article, Kurdi and co-authors (2019) claim “in addition to dozens of studies that have established construct validity… investigators have asked to what extent, and under what conditions, individual differences in implicit attitudes, stereotypes, and identity are associated with variation in behavior toward individuals as a function of their social group membership.” (p. 570). Just last year, the second author co-authored an article with the claim that “Black participants’ implicit attitudes reflected no ingroup/outgroup preference … Black participants’ explicit attitudes reflected an ingroup preference” (Jiang, Vitiello, Axt, Campbell & Ratliff, 2019). In 2007, third-author Cunningham wrote the “distinction between automatic and controlled processes now lies at the heart of several of the most influential models of evaluative processing” (Cunningham & Zelazo, 2007, p. 97). And last year, Cunningham co-authored a review paper with the claim “a variety of tasks have been used to reflect implicit psychopathology associations, with the IAT (Greenwald et al. 1998) used most widely” (Teachman, Clerkin, Cunningham, Dreyer-Oren, & Werntz, 2019). Finally, many users of IATs assume that they are measuring implicit constructs that are distinct from constructs that are measured with self-ratings. It is therefore a problem for the construct validity of IATs, if they lack discriminant validity. At least, Kurdi et al. (2020) fail to explain why anybody should use IATs if they merely measure, with more measurement error, the same constructs that can be measured with cheaper self-ratings.

False Fact II

A more serious false claim is that I found “high correlations between relatively indirect (automatic) measures of mental content, as indexed by the IAT, and relatively direct (controlled) measures of mental content, as indexed by a variety of self-report scales.” (p. 2).  Table 2 shows some of the correlations among implicit and explicit measures in Vianello and Bar-Anan’s (2020) data. Only one of these correlations meets the standard criterion of a high correlation, r = .5 (Cohen, 1988). The other correlations are small to moderate. These correlations show at best moderate convergent validity and no evidence of discriminant validity (i.e., higher implicit-implicit than implicit-explicit correlations). Similar results have been reported since the first IATs were created (Bosson et al., 2000). For 20 years, IAT researchers have ignored these low correlations and made grand claims about the validity of IATs. Kurdi et al. (2020) are doubling down on this misinformation by falsely describing these correlations as high.

False Fact III

The third false claim is that “plenty of evidence in favor of dissociations between direct and indirect measures exists” (p. 7). To support this claim, Kurdi et al. cite a meta-analysis of incremental predictive validity (Kurdi et al., 2019). There are three problems with this evidence. First, the meta-analysis only corrects for random measurement error and not systematic measurement error. To the extent that systematic measurement error is present, incremental validity will shrink because explicit and implicit factors are very highly correlated when both sources of error are controlled (Schimmack, 2019). Second, Kurdi et al. (2020) fail to mention effect sizes. The meta-analysis suggests that a perfectly reliable IAT would explain about 2% unique variance. However, IATs have only modest reliability. Thus, manifest IAT scores would explain even less unique variance. Even this estimate has to be interpreted with caution because the meta-analysis did not correct for publication bias and included some questionable studies. For example, Phelps et al. (2000) report a correlation of r = .58 with 12 participants between scores on the race IAT and differences in amygdala activation in response to Black and White faces. Assuming 20% valid variance in the IAT scores (Schimmack, 2019), the validity-corrected correlation would be r = 1.30. In other words, a correlation of r = .58 is impossible given the low validity of race-IAT scores. It is well-known that correlations in fMRI studies with small samples are not credible (Vul, Harris, Winkielman, & Pashler, 2009). Moreover, brain activity is not a social behavior. It is therefore unclear why studies like this were included in Kurdi et al.’s (2019) meta-analysis.
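The corrected correlation for the Phelps et al. (2000) example follows from the usual correction for attenuation: the observed correlation is divided by the square root of the proportion of valid variance. The sketch below reproduces this arithmetic. It assumes that the criterion (amygdala activation) is measured without error, which is a simplification on my part; any error in the criterion would make the corrected value even larger.

```python
import math


def validity_corrected_r(observed_r, valid_variance_x, valid_variance_y=1.0):
    """Correct an observed correlation for imperfect validity of the measures.

    valid_variance_x and valid_variance_y are the proportions of valid
    variance in each measure (1.0 means a perfectly valid measure).
    """
    return observed_r / math.sqrt(valid_variance_x * valid_variance_y)


# Values from the text: r = .58 and 20% valid variance in race-IAT scores.
corrected = validity_corrected_r(observed_r=0.58, valid_variance_x=0.20)
print(f"validity-corrected r = {corrected:.2f}")  # 1.30, outside the possible range
```

A corrected correlation above 1 signals that the observed correlation is larger than the measure's validity allows.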

Kurdi et al. (2020) also used suicides as an important outcome that can be predicted with suicide and death IATs. They cited two articles to support this claim. Fact checking shows that one article reported a statistically significant result, p = .013 (Barnes et al., 2017), whereas the other one did not, ps > .50 (Glenn et al., 2019). I conducted a meta-analysis of all studies that reported incremental predictive validity of suicide or death IATs. The criterion was suicide attempts in the next 3 to 6 months (Table 3). I found 8 studies, but 6 of them came from a single lab (Matthew K. Nock). Nock was also the first one to report a significant result in an extremely underpowered study that included only two suicide attempts (Nock & Banaji, 2007). Five of the eight studies showed a statistically significant result (63%), but the average observed power to achieve significance was only 42%. This discrepancy suggests the presence of publication bias (Schimmack, 2012). Moreover, significant results are all clustered around .05 and none of the p-values meets the stricter criterion of .005 that has been suggested by Nosek and others to claim a discovery (Benjamin et al., 2018). Thus, there is no conclusive evidence to suggest that suicide IATs have incremental predictive validity in the prediction of suicides. This is not surprising because most of the studies were underpowered and unlikely to detect small effects. Moreover, effect sizes are bound to be small because the convergent validity between suicide and death IATs is low, r = .21 (Chiurliza et al., 2018), suggesting that most of the variance in these IATs is measurement error.
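The comparison of success rate and observed power can be sketched in a few lines of code. The first function computes the observed power implied by a two-sided p-value, treating the observed z-score as if it were the true effect. The second computes how likely it is to see at least k significant results in n studies if every study had the same power. This is a simplified illustration of the general logic, not a reproduction of the analysis in Table 3; the equal-power assumption and the function names are mine.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Observed power of a two-sided z-test, taking the observed z as the true effect."""
    z_obs = _STD_NORMAL.inv_cdf(1.0 - p_value / 2.0)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    upper_tail = 1.0 - _STD_NORMAL.cdf(z_crit - z_obs)
    lower_tail = _STD_NORMAL.cdf(-z_crit - z_obs)  # significant in the wrong direction
    return upper_tail + lower_tail


def prob_at_least(k, n, power):
    """Binomial probability of k or more significant results in n studies."""
    return sum(
        math.comb(n, i) * power**i * (1.0 - power) ** (n - i)
        for i in range(k, n + 1)
    )


# Observed power for the one significant result cited by Kurdi et al. (p = .013).
print(f"observed power for p = .013: {observed_power(0.013):.2f}")

# Chance of 5 or more significant results in 8 studies with 42% power each.
p_excess = prob_at_least(k=5, n=8, power=0.42)
print(f"P(at least 5 of 8 significant | power = .42) = {p_excess:.3f}")
```

With only eight studies this comparison is coarse, so it is best read together with the clustering of p-values just below .05.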

In conclusion, 20 years of research with IATs has produced no credible and replicable evidence that IATs have incremental predictive validity over explicit measures. Even if there is some statistically significant incremental predictive validity, the amount of explained variance may lack practical significance (Kurdi et al., 2019).

False Fact IV

Kurdi et al. (2020) object to my claim that “most researchers regard the IAT as a valid measure of enduring attitudes that vary across individuals” (p. 6 of proof). They claim that “most attitude researchers view attitudes as emerging properties from an interaction of persons and situations” (p. 4). It is instructive to compare this surprising claim with Cunningham and Zelazo’s (2007) definition of attitudes as “relatively stable ideas about whether something is good or bad” (p. 97). Kurdi and Banaji (2017) write “differences in implicit attitudes … may arise because of multiple components, including relatively stable components [italics added]” (p. 286). Rae and Greenwald (2017) state that it is a “widespread assumption … that implicit attitudes are characteristics of people, almost certainly more so than a property of situations” (p. 297). Greenwald and Lai (2020) state that test-retest reliability “places an upper limit on correlational tests of construct validity” (p. 425). This statement only makes sense if we assume that the construct to be measured is stable over the retest interval.

It is also not clear how it would be ethical to provide individuals with feedback about their IAT scores on the Project Implicit website, if IAT scores were merely a product of the specific situation at the moment they are taking the test. Finally, the suicide IAT could not be a useful predictor of suicide, if it did not measure some stable dispositions related to suicidal behaviors. In conclusion, Kurdi et al.’s (2020) definition of attitudes is inconsistent with the common definition of attitudes as relatively enduring evaluations.

That being said, the more important question is whether IATs measure stable attitudes or momentary situational effects. Ironically, some of the best evidence comes from third-author Cunningham. Cunningham et al. (2001) measured prejudice four times over a three-month period with multiple measures, including the race IAT. Cunningham et al. (2001) modeled the data with a single trait factor that explained all of the covariation among different measures of racial attitudes. Thus, Cunningham et al. (2001) provided first evidence that most of the valid variance in race-IAT scores is perfectly stable over a three-month period, and that person by situation interactions had no effect on racial attitudes.

There have been few longitudinal studies with IATs since Cunningham’s (2001) seminal study. However, last year an article examined stability over a six-year interval (Onyeador et al., 2019). Racial attitudes of over 3,000 medical students were measured in the first year of med-school, the fourth year of med school, and the second year of residency. Table 4 shows the correlations for the explicit feeling-thermometer and the IAT scores. The first observation is that the t1-t3 correlation for the IAT scores is not smaller than the t1-t2 or the t2-t3 correlations. This pattern shows that a single trait factor can capture the shared variance among the repeated IAT measures. The second observation is that the bold correlations between explicit ratings and IAT scores on the same occasion are only slightly higher than the correlations for different measurement occasions. This finding shows that there is very little occasion-specific variance in racial attitudes. The third observation is that IAT correlations over time are higher than the corresponding FT-IAT correlations over time. This finding points to IAT-specific method variance that is revealed in studies with multiple implicit measures (Cunningham et al., 2001; Schimmack, 2019). These findings extend Cunningham et al.’s (2001) findings to a six-year period and show that most of the valid variance in race-IAT scores is stable over long periods of time. In conclusion, Kurdi et al.’s (2020) claims about person by situation effects are not supported by evidence.
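Why does the size of the t1-t3 correlation matter? A pure trait model and a model of gradual change make different predictions about it. If scores reflect a stable trait plus occasion-specific noise, all retest correlations are the same no matter how far apart the occasions are. If scores change step by step, so that each occasion depends only on the previous one (a first-order autoregressive process), the t1-t3 correlation is the product of the two adjacent correlations and therefore smaller than either. The sketch below contrasts the two predictions. The input values are hypothetical and are not taken from Table 4.

```python
def trait_model_correlations(stable_share):
    """Implied retest correlations if the only shared variance is a stable trait.

    stable_share is the proportion of variance in each score that is due
    to the trait (assumed equal across occasions).
    """
    return {"t1-t2": stable_share, "t2-t3": stable_share, "t1-t3": stable_share}


def autoregressive_correlations(stability_12, stability_23):
    """Implied correlations if each occasion depends only on the previous one."""
    return {
        "t1-t2": stability_12,
        "t2-t3": stability_23,
        "t1-t3": stability_12 * stability_23,  # decays with distance in time
    }


# HYPOTHETICAL values for illustration only.
trait = trait_model_correlations(stable_share=0.45)
change = autoregressive_correlations(stability_12=0.45, stability_23=0.45)

for label in ("t1-t2", "t2-t3", "t1-t3"):
    print(f"{label}: trait model = {trait[label]:.2f}, autoregressive = {change[label]:.2f}")
```

An observed t1-t3 correlation that is as large as the adjacent correlations therefore matches the trait pattern rather than the pattern of gradual change.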


Like presidential debates, the commentaries and my response presented radically different views of reality. In one world, IATs are valid and useful tools that have led to countless new insights into human behavior. In the other world, IATs are noisy measures that add nothing to the information we already get from cheaper self-reports. Readers not well-versed in the literature are likely to be confused rather than informed by these conflicting accounts. While we may expect such vehement disagreement in politics, we may not expect it among scientists.

A common view of scientists is that they are able to resolve disagreement by carefully looking at data and drawing logical conclusions from empirical facts. However, this model of scientists is naïve and wrong. A major source of disagreement among psychologists is that psychology lacks an overarching paradigm; that is, a set of fundamentally shared assumptions and facts. Psychology does not have one paradigm, but many paradigms. The IAT was developed within the implicit social-cognition paradigm that gained influence in the 1990s (Bargh et al., 1996; Greenwald & Banaji, 1995; Nosek et al., 2011). Over the past decade, it has become apparent that the empirical foundations of this paradigm are shaky (Doyen et al., 2012; Kahneman, 2012; Schimmack, 2020). It took a long time to see the problems because paradigms are like prisons that make it impossible to see the world from the outside. A key force that prevents researchers within a paradigm from noticing problems is publication bias. Publication bias ensures that studies that are consistent with a paradigm are published, cited, and highlighted in review articles, providing false evidence in support of a paradigm (Greenwald & Lai, 2020; Kurdi et al., 2020).

Over the past decade, it has become apparent how pervasive these biases have been, especially in social psychology (Schimmack, 2020). The responses to my critique of IATs merely confirm how powerful paradigms and conflicts of interest can be. It is therefore necessary to allocate more resources to validation projects by independent researchers. Also, validation studies should be pre-registered and properly powered, and results need to be published whether they show validity or not. Conducting validation studies of widely used measures could be an important role for the emerging field of meta-psychology that is not focused on new discoveries, but rather on evaluating paradigmatic research from an outsider, meta-perspective (Carlsson et al., 2017). Viewed from this perspective, many IATs that are in use lack credible evidence of construct validity.


* Included in Suicide IAT meta-analysis

Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden biases of good people. New York, NY: Delacorte Press.

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230–244.

*Barnes, S. M., Bahraini, N. H., Forster, J. E., Stearns-Yoder, K. A., Hostetter, T. A., . . . Nock, M. K. (2016). Moving beyond self-report: Implicit associations about death/life prospectively predict suicidal behavior among veterans. Suicide and Life-Threatening Behavior, 47, 67–77. doi:10.1111/sltb.12265

Benjamin, D. J. et al. (2018). Redefine statistical significance. Nature Human Behaviour, 2, 6-10.

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643. doi:10.1037/0022-3514.79.4.631

Buhrmester, M. D., Blanton, H., & Swann, W. B., Jr. (2011). Implicit self-esteem: Nature, measurement, and a new way forward. Journal of Personality and Social Psychology, 100(2), 365–385.

Chiurliza, B., Hagan, C. R., Rogers, M. L., Podlogar, M. C., Hom, M. A., Stanley, I. H., & Joiner, T. E. (2018). Implicit measures of suicide risk in a military sample. Assessment, 25(5), 667–676.

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.

Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12(2), 163–170.

Cunningham, W. A., & Zelazo, P. D. (2007). Attitudes and evaluations: A social cognitive neuroscience perspective. Trends in Cognitive Sciences, 11, 97–104. doi:10.1016/j.tics.2006.12.005

Cvencek, D., Meltzoff, A. N., Maddox, C. D., et al. (2020). Meta-analytic use of balanced identity theory to validate the Implicit Association Test. Personality and Social Psychology Bulletin. 1-16. doi:10.1177/0146167220916631

Doyen, S., Klein, O., Pichon, C. L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLOS ONE, 7(1), e29081.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences. Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

*Glenn, C. R., Millner, A. J., Esposito, E. C., Porter, A. C., & Nock, M. K. (2019). Implicit identification with death predicts suicidal thoughts and behaviors in adolescents. Journal of Clinical Child & Adolescent Psychology, 48, 263–272.  doi:10.1080/15374416.2018.1528548

Greenwald, A. G., & Banaji, M. R. (1995). Implicit social cognition: Attitudes, self-esteem, and stereotypes. Psychological Review, 102(1), 4–27.

Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79, 1022–1038. doi:10.1037/0022-3514.79.6.1022

Greenwald, A. G., & Lai, C. K. (2020). Implicit social cognition. Annual Review of Psychology, 71, 419–445.

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

*Harrison, D. P., Stritzke, W. G. K., Fay, N., & Hudaib, A.-R. (2018). Suicide risk assessment: Trust an implicit probe or listen to the patient? Psychological Assessment, 30(10), 1317–1329.

Jiang, C., Vitiello, C., Axt, J. R., Campbell, J. T., & Ratliff, K. A. (2019). An examination of ingroup preferences among people with multiple socially stigmatized identities. Self and Identity. Advance online publication.

Kahneman, D. (2012). Open Letter.!/suppinfoFile/Kahneman%20Letter.pdf

Kurdi, B., & Banaji, M. R. (2017). Reports of the death of the individual difference approach to implicit social cognition may be greatly exaggerated: A commentary on Payne, Vuletich, and Lundberg. Psychological Inquiry, 28, 281–287. doi:10.1080/1047840X.2017.1373555

Kurdi, B., Ratliff, K. A., & Cunningham, W. A. (2020). Can the Implicit Association Test serve as a valid measure of automatic cognition? A response to Schimmack (2020). Perspectives on Psychological Science. doi:10.1177/1745691620904080

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., Tomezsko, D., Greenwald, A. G., & Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74(5), 569–586.

*Millner, A. J., Augenstein, T. M., Visser, K. H., Gallagher, K., Vergara, G. A., D’Angelo, E. J., & Nock, M. K. (2019). Implicit cognitions as a behavioral marker of suicide attempts in adolescents. Archives of Suicide Research, 23(1), 47–63.

*Nock, M. K., & Banaji, M. R. (2007). Prediction of suicide ideation and attempts among adolescents using a brief performance-based test. Journal of Consulting and Clinical Psychology, 75(5), 707–715.

*Nock, M. K., Park, J. M., Finn, C. T., Deliberto, T. L., Dour, H. J., & Banaji, M. R. (2010). Measuring the Suicidal Mind: Implicit Cognition Predicts Suicidal Behavior. Psychological Science, 21(4), 511–517.

Nosek, B. A., & Bar-Anan, Y. (2012). Scientific utopia: I. Opening scientific communication. Psychological Inquiry, 23(3), 217–243.

Nosek, B. A., Hawkins, C. B., & Frazier, R. S. (2011). Implicit social cognition: From measures to mechanisms. Trends in Cognitive Sciences, 15(4), 152–159.

Onyeador, I. N., Wittlin, N. M., Burke, S. E., Dovidio, J. F., Perry, S. P., Hardeman, R. R., … van Ryn, M. (2020). The Value of Interracial Contact for Reducing Anti-Black Bias Among Non-Black Physicians: A Cognitive Habits and Growth Evaluation (CHANGE) Study Report. Psychological Science, 31(1), 18–30.

Phelps, E. A., Cannistraci, C. J., & Cunningham, W. A. (2003). Intact performance on an indirect measure of race bias following amygdala damage. Neuropsychologia, 41(2), 203–208. doi:10.1016/s0028-3932(02)00150-1

Rae, J. R., & Greenwald, A. G. (2017). Persons or situations? Individual differences explain variance in aggregated implicit race attitudes. Psychological Inquiry, 28, 297–300. doi:10.1080/1047840X.2017.1373548

*Randall, J. R., Rowe, B. H., Dong, K. A., Nock, M. K., & Colman, I. (2013). Assessment of self-harm risk using implicit thoughts. Psychological Assessment, 25(3), 714–721.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566.

Schimmack, U. (2019b). The Implicit Association Test: A method in search of a construct. Perspectives on Psychological Science. doi:10.1177/1745691619863798

Schimmack, U. (2020a). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication.

Schimmack, U. (2020b). The validation crisis. Meta-Psychology.

Teachman, B. A., Clerkin, E. M., Cunningham, W. A., Dreyer-Oren, S., & Werntz, A. (2019). Implicit cognition and psychopathology: Looking back and looking forward. Annual Review of Clinical Psychology, 15, 123–148.

*Tello, N., Harika-Germaneau, G., Serra, W., Jaafari, N., & Chatard, A. (2020). Forecasting a Fatal Decision: Direct Replication of the Predictive Validity of the Suicide–Implicit Association Test. Psychological Science, 31(1), 65–74.

Vianello, M., & Bar-Anan, Y. (2020). Can the Implicit Association Test measure automatic judgment? The validation continues. Perspectives on Psychological Science, 15, ♦♦♦. doi:10.1177/1745691619897960

Vul, E., Harris, C., Winkielman, P., & Pashler, H. (2009). Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspectives on Psychological Science, 4(3), 274–290.

Walker, S. S., & Schimmack, U. (2008). Validity of a Happiness Implicit Association Test as a measure of subjective wellbeing. Journal of Research in Personality, 42, 490–497. doi:10.1016/j.jrp.2007.07.005

The Structure of Affective Dispositions

Extraversion and Neuroticism are some of the oldest constructs in personality psychology. They were the key dimensions in Eysenck’s theory of personality that was prominent in the 1970s. Although Eysenck’s theory of Neuroticism and Extraversion failed to be supported, extraversion and neuroticism remained prominent dimensions in the Big Five model of personality that emerged in the 1980s.

In an influential article, Costa and McCrae (1980) reconceptualized Extraversion and Neuroticism as two broad affective dispositions that influence positive and negative affective experiences. They found that extraversion predicted positive affect and neuroticism predicted negative affect on Bradburn’s affect measure.

A key assumption of this model is that the disposition to experience negative affects and the disposition to experience positive affects are largely independent traits. This model underlies the development of the popular Positive Affect and Negative Affect Schedule (PANAS) that is widely used to measure affective experiences over longer time periods (Watson, Tellegen, & Clark, 1988).

The independence model of PA and NA has created a heated controversy in the emotion literature (see JPSP special issue 1999). Most of the debate focussed on the structure of momentary affective experiences. However, some articles also questioned the independence of positive and negative affective dispositions. Specifically, Diener, Smith, and Fujita (1995) used a multi-method approach to measure a variety of positive and negative affective traits. They did find separate factors for PA and NA, but these factors were strongly negatively correlated (see Zou, Schimmack, & Gere, 2013, for a conceptual replication). Findings like these suggest that the independence model is too simplistic.

There are several ways to reconcile the negative relationship between positive and negative affects with the independence model. First, it is possible that the relationship between PA and NA depends on the selection of specific affects. Whereas happiness and sadness/depression may be negatively correlated, excitement and anxiety may be independent. If the correlation varies for specific affects, it is necessary to use proper statistical methods like CFA to examine the relationship between PA and NA without the influence of variance due to specific emotions. An alternative approach would be to directly measure the valence of emotions. Few studies have used this approach to remove emotion-specific variance from studies of PA and NA.

Another possibility is that PA and NA are not as strongly aligned with Extraversion and Neuroticism as Costa and McCrae’s (1980) model suggests. In fact, Costa and McCrae also developed a model in which positive affect was merely one of several traits, called facets, that are related to extraversion. According to the facet model, extraversion is a broader trait that encompasses affective and non-affective dispositions. For example, extraversion is also related to behaviours in social situations (sociability, assertiveness) and situations with uncertainty (risk taking). One implication of this model is that Extraversion and Neuroticism could be independent, while PA and NA can be negatively correlated.

The relationship between Extraversion and Neuroticism has been examined in hundreds of studies that measured the Big Five. Simple correlations between Extraversion and Neuroticism scales typically show small to moderate negative correlations. This finding contradicts the assumption that E and N are independent, but this finding has often been ignored. For example, structural models that allow for correlations between the Big Five maintain that E and N are independent (DeYoung, 2015).

One explanation for the negative correlation between E and N is response styles. Extraversion items tend to be desirable, whereas Neuroticism items tend to be undesirable. Thus, socially desirable responding can produce spurious correlations between E and N measures. In support of this hypothesis, the correlation weakens and sometimes disappears in multi-rater studies (Anusic et al., 2009; Biesanz & West, 2004; DeYoung, 2006). However, the correlation between E and N also depends on the item content. Scales that focus on sociability and anxiety tend to find weaker correlations than scales that measure E and N with a broader range of facets like the NEO-PI. Once more, this means that scale content moderates the results and that a proper analysis of the relationship between the higher-order factors E and N requires a hierarchical CFA model to remove facet-specific variance from the correlation.

The aim of this blog post is to examine the structure of Extraversion and Neuroticism facets with a hierarchical CFA. A CFA model can reveal aspects of the data that a traditional EFA cannot reveal. Most importantly, it can reveal relationships between facets that are independent of the higher-order factors E and N. These residual correlations are important aspects of the relationship between traits that have been neglected in theoretical models based on EFA because EFA does not allow for these relationships.


Over the past decade, Condon and Revelle have assembled an impressive data set from over 50,000 participants who provided self-ratings of their personality for subsets of over 600 personality items that cover a broad range of personality traits at the facet level. Modern statistical methods make it possible to analyze these data with random missing data to examine the structure of all 600 personality items. The authors generously made their data openly available. I used the datasets that represent data collected between 2013 and 2014 and between 2014 and 2015. I did not use all of the data to allow cross-validation of the results with a new sample.

Even modern computers would take too long to analyze the structure of over 600 items. For the present purpose, I focussed on items that have been shown to be valid indicators of extraversion and neuroticism facets (Schimmack, 2020a, 2020b). The actual items and their primary factor loadings are shown below in Tables 1-3.


Preliminary analyses showed problems with model identification because some Neuroticism and Extraversion scales were strongly negatively related. Specifically, E-Boldness was strongly negatively related to N-Self-Consciousness, E-Happiness was strongly negatively correlated with N-Depression, and E-Excitement Seeking was strongly negatively related to N-Fear. These findings already show that E and N are not independent domains that fit a simple structure, that is, a structure in which E-facets are unrelated to N-facets. To accommodate these preliminary findings, I created three bipolar facet factors. I then fitted a measurement model for 4 N-facets, 6 E-facets, and the three bipolar facets. The measurement model allowed for secondary loadings and correlated item residuals based on modification indices. All primary factors were allowed to correlate freely. In addition, the model included a method factor for acquiescence bias with fixed loadings depending on the scoring of items. As this model was data-driven, the results are exploratory and require cross-validation in a new sample.

The fit of the final model met standard criteria for overall model fit (CFI > .95, RMSEA < .06), CFI = .951, RMSEA = .006. However, standard fit indices have to be interpreted with caution for models with many missing values (Zhang & Savalei, 2019). More importantly, modification indices suggested no major further improvements to the measurement model by allowing additional secondary loadings. It is also known that minor misspecifications of the measurement model have relatively little influence on the theoretically important correlations among the primary factors. Thus, results are likely to be robust across different specifications of the measurement model. Table 1 shows the items and their primary factor loadings for the Neuroticism facets.

Table 2 shows the items and their primary loadings for the Extraversion facets.

Table 3 shows the items for the three bipolar Extraversion-Neuroticism factors.

Table 4 shows the correlations among the 13 Extraversion and Neuroticism facets.

All correlations among the N-facets are positive and above r = .3. All of the correlations among the E-facets are positive and only two are below .30. Two of the bipolar facets, namely Happiness-Depression and Boldness-Self-Consciousness, are negatively correlated with neuroticism facets (all r < -.3). Most of the correlations with the extraversion facets are positive and above .3, but some are smaller and one is practically zero (r = -.03). Surprisingly, the Excitement Seeking versus Fear facet has very few notable relationships with neuroticism and extraversion facets. This suggests that this dimension is not central to either domain.

The correlations between extraversion facets and neuroticism facets tend to be mostly negative, but most of them are fairly small. This suggests that Extraversion and Neuroticism are largely independent and relatively weakly correlated factors. The only notable exceptions were negative correlations of anxiety with novelty seeking and liveliness.

The aim of any structural theory of personality is to explain the complex pattern of relationships in Table 4. Although a model with two factors is viable, a model with a simple structure that assigns each facet to only one factor does not fit these data. To account for the actual structure, it is necessary to allow some facets to be related to both extraversion and neuroticism and to allow for additional relationships between some facets. Allowing for these additional relationships produced a model that fit the data nearly as well as the measurement model: CFI = .938 vs. .951, RMSEA = .007 vs. .006. The results are depicted in Figure 1.

Figure 1 does not show the results for the excitement-seeking vs. fear factor, which was only weakly related to E and N, but strongly related to Novelty-Seeking and Boldness. To accommodate this factor, the model included direct paths from Anxiety (-.55), Anger (.33), Happiness-Depression (-.58), Novelty Seeking (.45), and Liveliness (.24). The strong relationship with Happiness-Depression is particularly noteworthy. It could mean that depression or related negative affects like boredom motivate people to seek excitement. However, these results are preliminary and require further investigation.

The key finding is that Extraversion and Neuroticism emerge as slightly negatively correlated factors. The negative correlation in this study could be partially due to evaluative biases in self-ratings. Thus, the results are consistent with the conceptualization of Extraversion and Neuroticism as largely independent higher-order factors of personality. However, this does not mean that affective dispositions are largely independent.

The happiness factor lacked discriminant validity from the depression factor, showing a strong negative relationship between these two affective traits. Moreover, the happiness-depression factor was related to anger and anxiety because it was related to neuroticism. Thus, high levels of neuroticism not only increase NA, they also lower happiness.

The results also explain the independence of PA and NA in the PANAS scales. The PANAS scales were not developed to measure basic affects like happiness, sadness, fear and anxiety. Instead, they were created to measure affect with two independent traits. While the NA dimension closely corresponds to neuroticism, the PA dimension corresponds more closely to positive activation or positive energy than to happiness. The PANAS-PA construct of Positive Activation is more closely aligned with the liveliness factor. As shown in Figure 1, liveliness loads on Extraversion and is fairly independent of negative affects. It is only related to anxiety and anger through the small correlation between E and N. For depression, it has an additional relationship because liveliness and depression load on Extraversion. It is therefore important to make a clear conceptual distinction between Positive Affect (Happiness) and Positive Activation (Liveliness).

Figure 1 also shows a number of correlated residuals that were needed to achieve model fit. These correlated residuals are by no means arbitrary. Activity is related to being lively, presumably because energy is required to be active. Amusement is related to sociability, presumably because humor helps to establish and maintain positive relationships. Boldness is related to assertiveness because both traits require a dominant role in social relationships and groups. Anxiety is negatively related to boldness because bold behaviours are risky. Moody is related to anger and depression, presumably because mood swings can be produced by either anger or depressive episodes. Although these relationships are meaningful they are often ignored because EFA fails to show these relationships and fails to show that models without these relationships do not fit the data. The present results show that theoretical progress requires developing models that explain these relationships. In this regard, the present results merely show that these relationships exist without explaining them.

It is also noteworthy that the correlated residuals do not show a simple pattern that is postulated by some theories. Most notably, DeYoung (2015) proposed that facets are linked to Big Five factors by means of aspects. Aspects are supposed to represent shared variance among some facets that is not explained by the Big Five traits. A simple way to examine the presence of aspects is to find groups of facets that share correlated residuals. Contrary to the aspect model, most facets have either one or no correlated residuals. Sociability has two correlated residuals. It is related to amusement and dependence, but amusement and dependence are not related. Thus, there is no aspect linking these three facets. Moody is related to anger and depression, but anger and depression are unrelated. Again, this implies that there is no aspect linking these three facets. Boldness is linked to three facets. It is positively related to assertiveness, but negatively related to anxiety and dependence, but anxiety and dependence are unrelated, and assertiveness is only related to boldness. This means that there is no evidence for DeYoung’s Extraversion and Neuroticism aspects. These results are by no means inconsistent with previous findings. The aspect model was developed with EFA and EFA may separate a facet from other facets and create the illusion of aspects. This is the first test of the aspects model with CFA and it shows no support for the model.
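The search for aspects described above can be sketched in a few lines of Python. This is only an illustration: the list of links is transcribed by hand from the correlated residuals named in the text (not read from Figure 1), and the function name is hypothetical. A candidate aspect is treated here as any triple of facets in which every pair shares a correlated residual.

```python
from itertools import combinations

# Correlated residuals as named in the prose (hand-transcribed assumption,
# not the full set of parameters from the fitted model).
RESIDUAL_LINKS = [
    ("activity", "liveliness"),
    ("amusement", "sociability"),
    ("sociability", "dependence"),
    ("boldness", "assertiveness"),
    ("boldness", "anxiety"),
    ("boldness", "dependence"),
    ("moody", "anger"),
    ("moody", "depression"),
]


def find_fully_linked_triples(links):
    """Return all triples of facets in which every pair shares a correlated residual."""
    linked = {frozenset(pair) for pair in links}
    facets = sorted({facet for pair in links for facet in pair})
    return [
        triple
        for triple in combinations(facets, 3)
        if all(frozenset(pair) in linked for pair in combinations(triple, 2))
    ]


triples = find_fully_linked_triples(RESIDUAL_LINKS)
print(triples)  # an empty list means no group of three facets qualifies
```

With the links listed in the text, no triple is fully linked, which is the pattern the paragraph above describes for Sociability, Moody, and Boldness.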


In conclusion, the present study examined the structure of affective traits using hierarchical CFA. The results broadly confirm the Big Five model of personality. Neuroticism represents shared variance among several negative affective traits like anxiety, anger, depression, and self-conscious emotions. Extraversion is a broader trait that includes affective and non-affective traits. The core affective traits are happiness and positive energy (liveliness). Extraversion and Neuroticism are only slightly negatively correlated and this correlation could be inflated by rating biases. Thus, it is reasonable to conceptualize them as largely independent higher-order traits. However, at the facet level, the structure is more complex and does not fit a simple structure. Some E-facets and N-facets are highly negatively correlated and could be conceptualized as opposite ends of a single trait, namely Happiness-Depression, Boldness-Self-Consciousness, and Excitement-Seeking vs. Fear. It is therefore questionable to classify Happiness, Boldness, and Excitement-Seeking under Extraversion and Depression, Self-Consciousness, and Fear under Neuroticism. These traits are related to both Extraversion and Neuroticism. The present results do not provide explanations for the structure of affective traits. The main contribution is to provide a description of the structure that actually represents the structure in the data. In contrast, many prominent models are overly simplistic, focus on subsets of facets, and do not fit the data. The present results integrate these models into one general model that can stimulate future research.

No Evidence for a Higher-Order Plasticity Factor

Social psychologists have discovered confirmation bias as a powerful human trait. Rather than looking carefully for all relevant information, humans tend to prefer to look for information that confirms their existing beliefs. One major advantage of the scientific method is that it provides objective information that forces individuals to update false beliefs. As a result, evidence that disconfirms or falsifies existing beliefs is a powerful driver of science.

Unfortunately, psychologists do not use the scientific method properly. Instead of collecting data that can falsify existing theories to advance psychological theories, they have a confirmation bias in the application of the scientific evidence. Rather than updating theories in the light of disconfirming evidence, they tend to ignore disconfirming evidence. As a result, psychological science has made little progress over the past decades.

DeYoung (2015) proposed a theory of personality with a higher-order factor of plasticity. He emphasized that his cybernetic Big Five theory affords a wealth of testable hypotheses. Here I subject one of these testable hypotheses to an empirical test by fitting a hierarchical model to 15 primary traits that have been linked to extraversion and openness. The plasticity hypothesis predicts that most of the correlations among these 15 traits are positive and that a model with higher-order factors of extraversion and openness shows a positive correlation between the two factors. To foreshadow the results, the correlation between E and O was close to zero and slightly negative.


In the 1980s, personality psychologists agreed on a hierarchical structure of personality with many correlated primary traits that are encoded in everyday language (e.g., helpful, talkative, curious, efficient, etc.) and five largely independent higher-order factors. These higher-order factors are known as the Big Five. The five factors reflect sensitivity to negative information (Neuroticism), positive energy / approach motivation (Extraversion), a focus on ideas (Openness), pro-social behaviours (Agreeableness), and a rational, deliberate way of decision making (Conscientiousness).

In 1997, Digman proposed that the Big Five factors are not independent, but are systematically related to each other by two even higher-order factors. He suggested that one factor, called alpha, produces negative correlations of neuroticism with agreeableness and conscientiousness and a positive correlation between agreeableness and conscientiousness. A positive correlation between extraversion and openness was attributed to a second factor called beta.

DeYoung (2006) changed the names of the two factors. Alpha was called stability and beta was called plasticity. To test the theory of stability and plasticity, DeYoung analyzed multi-rater data that avoid the problem of spurious correlations among Big Five scales in self-ratings (Anusic et al., 2009). The key finding was that the shared variance among raters in E-scores and the shared variance among raters in O-scores on the Big Five Inventory correlated r = .24 (.59 x .40). In contrast, the corresponding correlation for the Mini-Marker scales was considerably lower, r = .09 (.69 x .13).
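The values in parentheses can be read as the loadings of E and O on the plasticity factor, so that the implied E-O correlation is simply their product. A minimal sketch of this arithmetic follows; the function name is hypothetical and only the four loadings come from the text.

```python
def implied_correlation(loading_e, loading_o):
    """Correlation between two indicators implied by a single common factor:
    the product of their standardized loadings on that factor."""
    return loading_e * loading_o


# Loadings reported in the text for the two questionnaires.
r_bfi = implied_correlation(0.59, 0.40)          # Big Five Inventory
r_mini_marker = implied_correlation(0.69, 0.13)  # Mini-Markers

print(round(r_bfi, 2), round(r_mini_marker, 2))

# Squaring the correlation gives the proportion of variance E and O share.
print(round(r_bfi ** 2, 3), round(r_mini_marker ** 2, 3))
```

Squaring the two correlations makes the point in a different metric: even the larger of the two implies that E and O share only a small proportion of their variance.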

This article has been cited 389 times in Web of Science. In contrast, another article by Biesanz and West that used the same methodology and found no support for the plasticity factor has been cited only 100 times (Biesanz and West, 2004). This bias in citations shows the prevalence of confirmation bias in psychology. Given the weak correlation of r = .09 with the Mini-Markers and Biesanz and West’s failure to find a plasticity factor at all, the evidence for a plasticity factor is weak at best.

Moreover, the inconsistency of results points to a methodological problem in all existing tests of a plasticity factor. The problem is that all tests relied on scale scores to test theories about factors. This is a problem because scale scores are biased by the items that were used to measure a construct. As a result, additional relationships between item-specific content can produce spurious correlations that are inconsistent across measures with different item content. A simple solution to this problem is to conduct a hierarchical factor analysis. In a hierarchical factor analysis, the Big Five are represented as the shared variance among items that are used as indicators of a Big Five factor. As far as I know, this approach has not been used to examine the correlation between the extraversion factor and the openness factor.

What are Extraversion and Openness?

Another problem for empirical tests of Plasticity is that extraversion and openness are poorly defined constructs. Most of the time, personality psychologists are satisfied with operational definitions. That is, extraversion is whatever an extraversion scale measures and openness is whatever an openness scale measures. This is a problem when the correlation between E-scales and O-scales varies across different scales.

To avoid this problem, it is necessary to define and measure extraversion and openness in a more stringent way. Short of a classic definition of these constructs in terms of defining features, it is possible to define these constructs by listing prototypical exemplars. For example, core primary traits of extraversion are sociability and positive energy (lively, energetic). Thus, an extraversion factor can be defined as a factor with high loadings of sociability and positive energy. Some theories of Extraversion have established longer lists of primary factors that are related to extraversion. The NEO-PI lists six primary factors that are often called facets. A competing model called the HEXACO model lists four primary factors. After accounting for overlap, this provides a list of 7 primary factors that can be used to define extraversion. According to this definition, extraversion is the shared variance among these 7 factors. The same logic applies to the definition of openness to experience. After taking overlap into account, the NEO-PI and HEXACO models suggest that openness can be defined as the shared variance among 8 primary factors.

It is noteworthy that this definition of extraversion also implies a way to test this particular theory of extraversion. If one of these 7 factors does not load on a common extraversion factor, the theory is falsified. This does not mean that extraversion does not exist. For example, if only one factor does not load on the extraversion factor, the extraversion theory can be modified to exclude this factor from the definition of extraversion.

Only after an empirically validated model of Extraversion and Openness has been established is it possible to test the Plasticity theory. A straightforward prediction of this theory is that all primary factors of Extraversion share variance with all primary factors of Openness. Once more, rejecting this theory does not automatically imply that there is no Plasticity factor. Additional relationships between specific facets could influence the pattern of correlations. However, this would mean that Plasticity alone cannot explain the relationship between E-factors and O-factors and that a simplistic Plasticity theory is insufficient.


One problem for empirical tests at the facet level is that the measurement of many facets requires a lot of items and that factor analyses at the item level require many participants. One solution to this problem is to ask participants to complete only a subset of all items and to use advanced statistical methods to analyze data with planned missing values. It has also become easier to collect data from large samples using online surveys.

Over the past decade, Condon and Revelle have collected data from tens of thousands of participants for over 600 personality items that were selected to represent several personality questionnaires including the NEO-PI and HEXACO scales. The authors generously made their data openly available. I used the datasets that represent data collected between 2013 and 2014 and between 2014 and 2015. I did not use all of the data to allow cross-validation of the results with a new sample.

Measurement Model

The measurement model presented here is strictly exploratory. To speed up data exploration, I computed the covariances among items and analyzed the covariance matrix with a sample size of 1,000, which was the minimum number of cases for all item pairs. Each primary factor was represented by 10 items. However, the items were not validated by strict psychometric tests and CFA results showed items with correlated residuals, low primary loadings, or high secondary loadings. These items were eliminated and only the four or five items with the best psychometric properties were retained.
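The pairwise-covariance step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used for the analyses: the function name `pairwise_cov`, the item labels, and the toy responses are all hypothetical. It computes each covariance from the cases that answered both items and records the number of cases per pair, so that the smallest pairwise N can serve as a conservative sample size.

```python
from itertools import combinations

def pairwise_cov(rows):
    """Covariance for every item pair, using only cases that answered both items.

    rows: list of dicts mapping item name -> numeric response; items that a
    participant was not asked are absent (or None).
    Returns (cov, n), both dicts keyed by (item_a, item_b).
    """
    items = sorted({k for r in rows for k, v in r.items() if v is not None})
    cov, n = {}, {}
    for a, b in combinations(items, 2):
        pairs = [(r[a], r[b]) for r in rows
                 if r.get(a) is not None and r.get(b) is not None]
        n[(a, b)] = len(pairs)
        if len(pairs) < 2:
            cov[(a, b)] = None  # covariance is not estimable for this pair
            continue
        mean_a = sum(x for x, _ in pairs) / len(pairs)
        mean_b = sum(y for _, y in pairs) / len(pairs)
        cov[(a, b)] = sum((x - mean_a) * (y - mean_b)
                          for x, y in pairs) / (len(pairs) - 1)
    return cov, n

# Toy data (hypothetical): each participant answers only a subset of items.
rows = [
    {"e1": 1, "e2": 2},
    {"e1": 2, "e2": 4, "o1": 5},
    {"e1": 3, "e2": 6, "o1": 3},
    {"e2": 1, "o1": 4},
    {"e1": 4, "o1": 1},
]
cov, n = pairwise_cov(rows)
min_n = min(n.values())  # conservative sample size for the covariance matrix
print(cov, min_n)
```

Using the minimum pairwise N understates the information in the data for most item pairs, which is acceptable for fast exploration but is one reason to refit the final model to the raw data.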

I was able to identify 15 of the 16 theoretically postulated primary factors. The only factor that created problems was the social self-esteem factor of Extraversion in the HEXACO model. Thus, the measurement model had 7 extraversion factors and 8 openness factors.

After completing the preliminary analyses, I fitted a proper model based on the raw data with planned missing data, which is the default option in MPLUS. The main disadvantage of this method is that it is computationally intensive. It took 5 hours for this model to converge.

All items had primary loadings on the theoretically assigned factor and loadings on an acquiescence factor depending on the direct or reverse scoring of the item. In addition, some items had secondary loadings that were all below .4 and all but 2 were below .3 (see complete results on OSF). Overall model fit of this model was excellent for the RMSEA and acceptable for the CFI, RMSEA = .006, CFI = .936. However, overall model fit for data with many missing values can be misleading because fit is determined by a different formula (Zhang & Savalei, 2019). More important is that inspection of modification indices showed no major modifications to improve model fit. Model fit is most important to provide a comparison standard for models of the correlations among the primary factors.

Table 1 shows the items and their primary loadings for Extraversion factors. Point estimates are reported because sampling error is less than .02 and any deviations from the point estimate by two standard errors have no practical significance. The information in this table can be used to select items or to write new items with similar item content for future studies.

Table 2 shows the correlations among the 7 primary E-factors.

The key finding in Table 2 is that all correlations are positive. This finding confirms the assumption that all primary factors share variance with each other and that the correlations among the primary factors can be modeled with a higher-order factor. This also makes it possible to define Extraversion as the factor that represents the shared variance among these primary factors.

The next finding is that all primary factors have distinct variance as all correlations are significantly different from 1. However, the correlation between sociability and boldness is very high, suggesting that these two factors have little unique variance and could be merged into a single factor. Other pairs of strongly related primary factors are boldness and assertiveness and liveliness and activity level. All other correlations are below .70.

Table 3 shows the primary loadings for the openness items on the primary openness factors.

Table 4 shows the correlations among the primary O-factors.

The key finding is that all correlations are positive. This justifies the definition of Openness as a higher-order factor that represents the shared variance among these 8 factors.

The second observation is that only one pair of primary factors shows a correlation greater than .70. Namely, inquisitive and reflective are correlated r = .77. Although it was possible to find a distinction between these factors, they were both derived from items belonging to the Inquisitive scale of the HEXACO model and the intellect scale of the NEO-PI. Thus, it would also be possible to reduce the number of factors to 7.

Table 5 shows the correlations between the 7 primary E-factors and the 8 primary O-factors. If plasticity is a higher-order factor that produces shared variance between E-factors and O-factors, most of these correlations should be positive, although their magnitude should be lower than the E-E and O-O correlations in Tables 2 and 4.
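The logic of this prediction can be made concrete with the tracing rule for standardized higher-order models. The sketch below is illustrative only: the loadings are hypothetical values chosen for the example, not estimates from the data. Two primary factors under the same higher-order factor correlate by the product of their loadings. Two primary factors under different higher-order factors correlate by that product multiplied by the correlation between the higher-order factors, which a Plasticity factor would produce.

```python
def implied_correlation(load_a, load_b, factor_corr=1.0):
    """Model-implied correlation between two primary factors.

    load_a, load_b: standardized loadings on their higher-order factors.
    factor_corr: correlation between those higher-order factors
                 (1.0 when both primary factors share the same one).
    """
    return load_a * load_b * factor_corr

# Hypothetical standardized loadings (not estimates from the data).
sociability_on_e = 0.8
liveliness_on_e = 0.7
inquisitive_on_o = 0.7

# If Plasticity loads .5 on both E and O, the implied E-O correlation is .25.
plasticity_on_e, plasticity_on_o = 0.5, 0.5
e_o_corr = plasticity_on_e * plasticity_on_o

within_e = implied_correlation(sociability_on_e, liveliness_on_e)
cross = implied_correlation(sociability_on_e, inquisitive_on_o, e_o_corr)
no_plasticity = implied_correlation(sociability_on_e, inquisitive_on_o, 0.0)

print(round(within_e, 2), round(cross, 2), no_plasticity)
```

Under these assumptions the cross-domain correlation is positive but much smaller than the within-domain correlation, and it drops to zero when E and O are uncorrelated. This is why the sign of the E-O cross-correlations is the informative quantity.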

– Drumroll –

Table 5 shows the results. In support of a Plasticity factor, 24 of the 56 correlations are positive and above .10, whereas only 12 correlations are below -.10. However, the pattern of correlations suggests that some O-factors are not positively related to E-factors. Specifically, fantasy and progressive attitudes tend to be negatively related to extraversion factors. In comparison, novelty seeking shows very strong and consistent positive relationships with all E-factors, suggesting that novelty seeking is related to Openness and Extraversion. To a lesser extent, this also appears to be the case for Imagination.
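A tally of this kind is easy to reproduce from any block of cross-correlations. The sketch below assumes the E-by-O correlations are stored as a list of rows. The function name `tally` and the small 2 x 3 matrix are hypothetical and only illustrate the counting rule with the .10 cutoff.

```python
def tally(cross_corr, threshold=0.10):
    """Count correlations above +threshold, below -threshold, and in between."""
    values = [r for row in cross_corr for r in row]
    return {
        "total": len(values),
        "positive": sum(r > threshold for r in values),
        "negative": sum(r < -threshold for r in values),
        "negligible": sum(-threshold <= r <= threshold for r in values),
    }

# Hypothetical 2 x 3 block of E-factor by O-factor correlations.
cross_corr = [
    [0.25, 0.05, -0.15],
    [0.40, -0.02, 0.12],
]
counts = tally(cross_corr)
print(counts)
```

A count alone does not show which factors drive the negative correlations, so it is a summary to read alongside the full table rather than a replacement for it.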

A Higher-Order Factor Model

To further explore the pattern of correlations in Table 5, I fitted a higher-order model with an E-factor and an O-factor. Such a simple model did not fit the data. I therefore modified the model to achieve fit that closely approximated the fit of the measurement model, while retaining interpretability. Figure 1 shows the final model. The model had an RMSEA of .007 vs. .006 for the measurement model and a CFI of .917 vs. .936 for the measurement model. Modification indices suggested no notable improvements by adding secondary loadings of primary factors on E and O or further correlated residuals among the primary factors.

The most notable finding was that the correlation between Extraversion and Openness was close to zero and negative. This finding contradicts the prediction of the plasticity model that E and O are positively correlated due to the shared influence of a common factor. For the most part, primary factors had only loadings on their theoretically predicted factor. The main exception was novelty seeking, which is based on items for the NEO-PI adventurous scale. The novelty factor actually loaded more strongly on extraversion than on openness. However, even in the NEO-PI model, this factor was a hybrid of E and O, but with a stronger loading on openness. The hybrid nature of this factor does not necessarily require a change of the definition of Extraversion and Openness. It is still possible to define Extraversion as a factor that influences, among other things, novelty seeking and to define Openness as a factor that influences, among other things, novelty seeking. The remaining secondary loadings are weaker and do not require a change of the definition to accommodate them.

In conclusion, the key finding is that extraversion can be defined as the shared variance among eight basic traits and openness can be defined as the shared variance among eight basic traits, with one overlapping trait. When extraversion and openness are defined in this way, they emerged as largely independent factors. This finding is inconsistent with the plasticity model that postulates a positive correlation between extraversion and openness.

The present results are not inconsistent with previous findings. As noted in the Introduction, previous studies produced inconsistent results and the inconsistency could be attributed to the use of scales with different item content.


The present findings have relatively few implications for the measurement of personality and for the use of personality questionnaires to predict behavior. In hierarchical models, all of the variance of higher-order traits is captured by lower-order traits that also contain unique variance. Therefore, higher-order traits can never predict behavior better than lower-order traits. Aggregating E and O to create a plasticity scale only destroys valid variation in E and O that is not shared and makes it impossible to say whether explained variance was due to E, O, or the shared variance between E and O.

The results are more important for theories of personality, especially theories about the nature and causes of personality traits, such as DeYoung’s cybernetic theory of the Big Five (DeYoung, 2015). This theory entails the assumption that E and O are related by plasticity.

“Although the Big Five traits were initially assumed to be independent and, thus, the highest level of the hierarchy, they are, in fact, regularly intercorrelated such that there exist two higher order traits, or metatraits, which we have labeled Stability and Plasticity (DeYoung, 2006; DeYoung, Peterson, & Higgins, 2002; Digman, 1997; see Section 5 for explanation of these labels). Although Stability and Plasticity are positively correlated in ratings by single informants, this correlation appears to result from rater bias, as they are typically uncorrelated in multi-informant studies (Anusic, Schimmack, Pinkus, & Lockwood, 2009; Chang, Connelly, & Geeza, 2012; DeYoung, 2006; McCrae et al., 2008). The metatraits, therefore, appear to be the highest level of the personality hierarchy, with no ‘general factor of personality’ above them (Revelle & Wilt, 2013).” (p. 36).

DeYoung tried to characterize the glue that binds Extraversion and Openness together as “a cybernetic function to explore, create new goals, interpretations, and strategies” (cf. Table 1, p. 42). The theory also postulates that dopaminergic systems in the brain are shared between extraversion and openness traits to provide a neurobiological explanation for the plasticity factor. He also suggests that plasticity is related to externalizing problems like delinquency and hyperactivity. However, this has never been shown by demonstrating that all or at least most facets of extraversion and openness are related to these outcomes. Although future research is needed to examine this question, the present finding that E and O facets are largely independent renders it unlikely that this would be the case.

Novelty Seeking versus Plasticity

Many claims about plasticity may be valid for the adventurousness facet of the NEO-PI that corresponds to the Novelty Seeking factor in the present model. Novelty seeking is related to exploration, making new goals, and engagement in risky activities. It would not be surprising to see that it is also related to externalizing rather than internalizing problems. Novelty seeking is also related to all extraversion and openness facets. Thus, in many ways, novelty seeking has many of the characteristics attributed to plasticity. The key difference between novelty seeking and plasticity is that novelty seeking is a lower-order (facet) trait whereas plasticity is supposed to be a higher-order trait. The difference between the two is that a higher-order trait is assumed to produce shared variance among all E and O factors, whereas a lower-order trait can be related to all E and O factors without influencing the relationship between them. That is, in Figure 1 the causal arrows from E and O to novelty seeking would be reversed and the plasticity factor would produce correlations among all E and O factors. Given the lack of a correlation in the model without these factors, it is clear that there is no higher-order Plasticity factor.

This has important implications for theories of personality. It is unlikely that all E and O factors share a single dopaminergic system. Rather, the focus might be directed at the lower-order trait of Novelty Seeking.


DeYoung (2015) emphasized that his cybernetic Big Five theory affords a wealth of testable hypotheses. Testable hypotheses are useful because they make it possible to falsify false predictions and to modify and improve theories of personality. One obvious prediction of the theory is that the plasticity factor produces positive correlations among primary traits related to extraversion and openness to experience. This follows directly from the notion that higher-order factors represent shared variance among lower-order factors. Here I presented the first test of this prediction and found no support for it. While a single failure is not sufficient to abandon a theory, it should be noted that the cybernetic Big Five theory has not been subjected to many tests and that the results were inconsistent. Given the lack of strong support for the theory in the first place, the present results need to be taken seriously. I also provided a simple way to revise the theory by moving the plasticity factor of exploration from the higher-order level to the facet level and equating plasticity with novelty seeking.

Conflict of Interest Statement: I was the author of an article that introduced a model that also included a plasticity factor (Anusic et al., 2009). We included the plasticity factor mainly under pressure from DeYoung as a reviewer of the paper, while our focus was on the evaluative bias or halo factor. I never really believed in a beta-factor and I am very pleased with the present results. I hope that proponents of the plasticity model analyze the open data to examine whether they are influenced by unconscious biases.

The Structure of Neuroticism

The construct of neuroticism is older than psychological science. It has its roots in Freud’s theories of mental illnesses. Thanks to the influence of psychoanalysis on the thinking of psychologists, the first personality questionnaires included measures of neuroticism or anxiety, which were considered to be highly related or even identical constructs.

Eysenck’s research on personality first focussed on Neuroticism and Extraversion as the key dimensions of personality traits. He then added psychoticism as a third dimension.

In the 1980s, personality psychologists agreed on a model with five major dimensions that included neuroticism and extraversion as prominent dimensions. Psychoticism was divided into agreeableness and conscientiousness and a fifth dimension openness was added to the model.

Today, the Big Five model dominates personality psychology and many personality questionnaires focus on the measurement of the Big Five.

Despite the long history of research on neuroticism, the actual meaning of the term and the construct that is being measured by neuroticism scales is still unclear. Some researchers see neuroticism as a general disposition to experience a broad range of negative emotions. In the emotion literature, anxiety, anger, and sadness are often considered to be basic negative emotions, and the prominent NEO-PI questionnaire considers neuroticism to be a general disposition to experience these three basic emotions more intensely and frequently.

Neuroticism has also been linked to more variability in mood states, higher levels of self-consciousness and lower self-esteem.

According to this view of neuroticism, it is important to distinguish between neuroticism as a more general disposition to experience negative feelings and anxiety, which is only one of several negative feelings.

A simple model of neuroticism would assume that a general disposition to respond more strongly to negative emotions produces correlations among more specific dispositions to experience more anxiety, anger, sadness, and self-conscious emotions like embarrassment. This model implies a hierarchical structure with neuroticism as a higher-order factor of more specific negative dispositions.

In the early 2000s, Ashton and Lee published an alternative model of personality with six factors called the HEXACO model. The key difference between the Big Five model and the HEXACO model is the conceptualization of pro- and anti-social traits. While these traits are considered to be related to a single higher-order factor of agreeableness in the Big Five model, the HEXACO model distinguishes between agreeableness and honesty-humility as two distinct traits. However, this is not the only difference between the two models. Another important difference is the conceptualization of affective dispositions. The HEXACO model does not have a factor corresponding to neuroticism. Instead it has an emotionality factor. The only common trait to neuroticism and emotionality is anxiety, which is measured with similar items in Big Five questionnaires and in HEXACO questionnaires. The other three traits linked to emotionality are unique to the HEXACO model.

The four primary factors (also called facets) that are used to identify and measure emotionality are anxiety, fear, dependence, and sentimentality. Fear is distinguished from anxiety by a focus on immediate and often physical danger. In contrast, anxiety and worry tend to be elicited by thoughts about uncertain events in the future. Dependence is defined by a need for social comfort in difficult times. Sentimentality is a disposition to respond strongly to negative events that happen to other people, including fictional characters.

In a recent target article, Ashton and Lee argued that it is time to replace the Big Five model with the superior HEXACO model. A change from neuroticism to emotionality would be a dramatic shift given the prominence of neuroticism in the history of personality psychology. Here, I examine empirically how Emotionality is related to Neuroticism and whether personality psychologists should adopt the HEXACO framework to understand individual differences in affective dispositions.


A key problem in research on the structure of personality is that researchers often rely on questionnaires that were developed with a specific structure in mind. As a result, the structure is pre-determined by the selection of items and constructs. To overcome this problem, it is necessary to sample a broad and ideally representative sample of primary traits. The next problem is that the motivation and attention span of participants limit the number of items that a personality questionnaire can include. These problems have been resolved by Revelle and colleagues’ survey that asks participants to complete only a subset of over 600 items. Modern statistical methods can analyze datasets with planned missing data. Thus, it is possible to examine the structure of hundreds of personality items. Condon and Revelle (2018) also made these data openly available. I am very grateful for their initiative that provides an unprecedented opportunity to examine the structure of personality.

The items were picked to represent primary factors (facets) of the HEXACO questionnaire and the NEO-PI questionnaire. In addition, the questionnaire covered neuroticism items from the EPQ and other questionnaires. The items are based on the IPIP item pool. Each primary factor is represented by 10 items. I picked the items that represent the four HEXACO Emotionality factors, anxiety, fear, dependency, sentimentality, and four of the NEO-PI Neuroticism factors, anxiety, anger, depression, and self-consciousness. The anxiety factor overlaps and is represented by mostly overlapping items. Thus, this item selection resulted in 70 items that were intended to measure 7 primary factors. I added four additional items that represented variable moods (moodiness) that were included in the BFAS and EPQ, which might form an independent factor.


The data were analyzed with confirmatory factor analysis (CFA), using the MPLUS software. CFA has several advantages over traditional factor analytic methods that have been employed by proponents of the HEXACO and the Big Five models. The main advantage is that it is possible to model hierarchical structures that represent the Big Five or HEXACO factors as higher-order factors of primary factors. A second advantage is that CFA provides information about model fit whereas traditional EFA produces solutions without evaluating model fit.

Measurement Model

A first step in establishing a measurement model was to select items with high primary loadings, low secondary loadings, and low correlated residuals. The aim was to represent each primary factor with the best four items. While four items may not be enough to create a good scale, four items are sufficient to establish a measurement model of primary factors. Limiting the number of items to four is also advantageous because computing time increases with additional items and models with missing data can take a long time to converge.

Aside from primary loadings, the model included an acquiescence factor based on the coding of items. Directly coded items had unstandardized loadings of 1 and reverse coded items had unstandardized loadings of -1. There were no secondary loadings or correlated residuals.
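The consequence of fixing the acquiescence loadings at 1 and -1 can be illustrated with a minimal sketch. The function names and the acquiescence variance of .05 are hypothetical and are not taken from the model output. With fixed loadings, the only free parameter is the variance of the acquiescence factor. Its contribution to the covariance of two items is that variance multiplied by the product of their loadings, so it is positive for items keyed in the same direction and negative for items keyed in opposite directions.

```python
def acquiescence_loading(reverse_coded):
    """Fixed unstandardized loading: +1 for direct items, -1 for reverse items."""
    return -1.0 if reverse_coded else 1.0

def acquiescence_covariance(reverse_i, reverse_j, acq_variance):
    """Covariance between two items that is due to the acquiescence factor."""
    return (acquiescence_loading(reverse_i)
            * acquiescence_loading(reverse_j)
            * acq_variance)

# Hypothetical acquiescence variance, chosen only for illustration.
acq_variance = 0.05

same_keyed = acquiescence_covariance(False, False, acq_variance)
both_reversed = acquiescence_covariance(True, True, acq_variance)
mixed_keyed = acquiescence_covariance(False, True, acq_variance)

print(same_keyed, both_reversed, mixed_keyed)
```

This shows why a mix of direct and reverse coded items is needed to separate acquiescence from content: if all items were keyed in the same direction, the acquiescence contribution would be indistinguishable from shared trait variance.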

The model met standard criteria of model fit such as a CFI > .95 and RMSEA < .05, CFI = .954, RMSEA = .007. However, models with missing data should not be evaluated based on these fit indices because fit is determined by a different formula (Zhang & Savalei, 2019). More importantly, modification indices showed no notable changes in model fit if fixed parameters were freed. Table 1 shows the items and their primary factor loadings.

Table 2 shows the correlations among the primary factors.

The first four factors are assumed to belong to the HEXACO-Emotionality factor. As expected, fear, anxiety, and dependence are moderately to highly positively correlated. Contrary to expectations, sentimentality showed low correlations especially with fear.

Factors 4 to 8 are assumed to be related to Big Five neuroticism. As expected, all of these factors are moderately to highly correlated.

In addition, the dependence factor from the HEXACO model also shows moderate to high correlations with all Big Five neuroticism factors. The fear factor also shows positive relations with the neuroticism factors, especially for self-consciousness.

With the exception of Sentimentality, all of the factors tend to be positively correlated, suggesting that they are related to a common higher-order factor.

Overall, this pattern of results provides little support for the notion that HEXACO-Emotionality is a distinct higher-order factor from Big-Five neuroticism.


The first model assumed that all factors are related to each other by means of a single higher-order factor. In addition, the model allowed for correlated residuals among the four HEXACO factors. This makes it possible to examine whether these four factors share additional variance with each other that is not explained by a general Negative Emotionality factor.

Model fit decreased compared to the measurement model, which serves as a comparison standard for theoretical models, CFI = .916 vs. .954, RMSEA = .009 vs. .007.

All primary factors except sentimentality had substantial loadings on the Negative Emotionality factor. Table 3 shows the residual correlations for the four HEXACO factors.

All correlations are positive suggesting that the HEXACO Emotionality factor captures some shared variance among these four factors that is not explained by the Negative Emotionality factor. However, two of the correlations are very low indicating that there is little shared variance between sentimentality and fear or dependence and anxiety.


The second model represented the relationship among the four HEXACO factors with a common Emotionality factor instead of correlated residuals. Model fit decreased, CFI = .914 vs. .916, RMSEA = .010 vs. .009. Loadings on the Emotionality factor ranged from .27 to .46. Fear, anxiety, and dependence had higher loadings on the Negative Emotionality factor than on the Emotionality factor.

The main conclusion from these results is that it would be problematic to replace the Big Five model with the HEXACO model because the Emotionality factor in the HEXACO model fails to capture the nature of the broader Neuroticism factor in the Big Five model. In fact, there is little evidence for a specific Emotionality factor in this dataset.


The discrepancy between the measurement model and Model 1 suggests that there are additional relationships between some primary factors that are not explained by the general Negative Emotionality factor. Examining modification indices suggested several changes to the model. Model 3 shows the final results. This model fit the data nearly as well as the measurement model, CFI = .949 vs. .954, RMSEA = .007 vs. .007. Inspection of the Modification Indices showed no further ways to improve the model by freeing correlated residuals among primary factors. In one case, three correlated residuals were consistent and were modeled as a factor. Figure 1 shows the results.

First, the model shows notable and statistically significant effects of neuroticism on all primary factors except sentimentality. Second, the correlated residuals show an interesting pattern where primary factors can be arranged in a chain. That is, depression is related to moody, moody is related to anger, anger is related to anxiety, anxiety is related to fear, fear is related to self-consciousness and dependence, self-consciousness is related to dependence, and finally, dependence is related to sentimentality. This suggests the possibility that a second broader dimension might be underlying the structure of negative emotionality. Research on emotions suggests that this dimension could be activation (fear is high, depression is low) or potency (anger is high, dependence is low). This is an important avenue for future research. The key finding in Figure 1 is that the traditional Neuroticism dimension is an important broad higher-order factor that accounts for the correlations among 7 of the 8 primary factors. These results favor the Big Five model over the HEXACO model.

A Big-5 Model of the Hexaco-100 Items

In the 1980s, personality psychologists celebrated the emergence of a five-factor model as a unifying framework for personality traits. Since then, the so-called Big-5 have dominated thinking and measurement of personality.

Two decades later, Ashton and Lee proposed an alternative model with six factors. This model has come to be known as the HEXACO model.

A recent special issue in the European Journal of Personality discussed the pros and cons of these two models. The special issue did not produce a satisfactory resolution between proponents of the two models.

In theory, it should be possible to resolve this dispute with empirical data, especially given the similarities between the two models. Four of the factors are more or less similar between the two models. One factor is Neuroticism with anxiety/worry as a key marker of this higher-order trait. A second factor is Extraversion with sociability and positive energy as markers. A third factor is Openness with artistic interests as a common marker. A fourth factor is conscientiousness with orderliness and planful actions as markers. The key difference between the two models concerns pro-social and anti-social traits. In the Big Five model, a single higher-order trait of agreeableness is assumed to produce shared variance among all of these traits (e.g., morality, kindness, modesty). The HEXACO model assumes that there are two higher-order traits. One is also called agreeableness and the other one is called honesty and humility.

As Ashton and Lee (2005) noted, the critical empirical question is how the Big Five model accounts for the traits related to the honesty-humility factor in the HEXACO model. Although the question is straightforward, empirical tests of it are not. The problem is that personality researchers often rely on observed correlations between scales and that correlations among scales depend on the item-content of scales. For example, Ashton and Lee (2005) reported that the Big-Five Mini-Marker scale of Agreeableness correlated only r = .26 with their Honesty-Humility scale. This finding is not particularly informative because correlations between scales are not equivalent to correlations between the factors that the scales are supposed to reflect. It is also not clear whether a correlation of r = .26 should be interpreted as evidence that Honesty-Humility is a separate higher-order factor at the same level as the other Big Five traits. To answer this question, it would be necessary to provide a clear definition of a higher-order factor. For example, higher-order factors should account for shared variance among several primary factors that have only low secondary loadings on other factors.

Confirmatory factor analysis (CFA) addresses some of the problems of correlational studies with scale scores. One main advantage of CFA is that models do not depend on the item selection. It is therefore possible to fit a theoretical structure to questionnaires that were developed for a different model. I therefore used CFA to see whether it is possible to fit the Big Five model to the HEXACO-100 questionnaire that was explicitly designed to measure 4 primary factors (facets) for each of the six HEXACO higher-order traits. Each primary factor was represented by four items. This leads to 4 x 4 x 6 = 96 items. After consultation with Michael Ashton, I did not include the additional four altruism items.

Measurement Model

The Big-Five or HEXACO models are higher-order models that are supposed to explain the pattern of correlations among the primary factors. In order to test these models, it is necessary to first establish a measurement model for the primary factors. The starting point for the measurement model was a model with a simple structure where each item only has a primary loading on its designated factor. For example, the anxiety item “I sometimes can’t help worrying about little things” loaded only on the anxiety factor. All 24 primary factors were allowed to correlate freely with each other.

It is well-known that few data fit a simple structure for two reasons. First, the direction of items can influence responses. This can be modeled with an acquiescence factor that codes whether an item is a direct or a reverse coded item. Second, it is difficult to write items that reflect only variation in the intended primary trait. Thus, many items are likely to have small, but statistically significant, secondary loadings on other factors. These secondary loadings need to be modeled to achieve acceptable model fit, even if they have little practical significance. Another problem is that two items of the same factor may share additional variance because they share similar wordings or item content. For example, the item “I clean my office or home quite frequently” and the reverse coded item “People often joke with me about the messiness of my room or desk” share specific content. This shared variance between items needs to be modeled with correlated residuals to achieve acceptable model fit.

Researchers can use Modification Indices to identify secondary loadings and correlated residuals that have a strong influence on model fit. Freeing the identified parameters improves model fit and can produce a measurement model with acceptable model fit. Moreover, MI can also provide information that there are no more fixed parameters that have a strong negative effect on model fit.

After modifying the simple-structure model accordingly, I established a measurement model that had acceptable fit, RMSEA = .021, CFI = .936. Although the CFI did not reach the threshold of .950, the MI did not show any further improvements that could be made. Freeing further secondary loadings resulted in secondary loadings less than .1. Thus, I stopped at this point.

16 primary factors had primary factor loadings of .4 or higher for all items. The remaining 8 primary factors had 3 primary factor loadings of .4 or higher. Only 4 items had secondary loadings greater than .3. Thus, the measurement model confirmed the intended structure of the questionnaire.

Importantly, the measurement model was created without imposing any structure on the correlations among higher-order factors. Thus, the freeing of secondary loadings and correlated residuals did not bias the results in favor of the Big Five or HEXACO model. Rather, the fit of the measurement model can be used to evaluate the fit of theoretical models about the structure of personality.

A simplistic model that is often presented in textbooks would imply that only traits related to the same higher-order factor are correlated with each other and that all other correlations are close to zero. Table 1 shows the correlations for the HEXACO-Agreeableness (A-Gent = gentle, A-Forg = forgiving, A-Pati = patient, & A-Flex = flexible) and the HEXACO-honesty-humility (H-Gree = greed-avoidance, H-Fair = fairness, H-Mode = modest, & H-Sinc = sincere) factors.

In support of the Big Five model, all correlations are positive. This suggests that all primary factors are related to a single higher-order factor. In support of the HEXACO model, correlations among A-factors and correlations among H-factors tend to be higher than correlations of A-factors with H-factors. Three notable exceptions are highlighted in red and all of them involve modesty. Modesty is more strongly related to A-Gent and A-Flex than to H-Mode.

Table 2 shows the correlations of the A and H factors with the four neuroticism factors (N-Fear = fear, N-Anxi = anxiety, N-Depe = dependence, N-Sent = sentimental). Notable correlations greater than .2 are highlighted. For the most part, the results show that neuroticism and pro-social traits are unrelated. However, there are some specific relations among factors. Notably, all four HEXACO-A factors are negatively related to anxiety. This shows some dissociation between A and H factors. In addition, fear is positively related to fairness and negatively related to sincerity. Sentimentality is positively related to fairness and modesty. Neither the Big Five model nor the HEXACO model has explanations for these relationships.

Table 3 shows the correlations with the Extraversion factors (E-soci = sociable, E-socb = bold, E-live = lively, E-Sses = self-esteem). There are few notable relationships between A and H factors on the one hand and E factors on the other hand. This supports the assumption of both models that pro-social traits are unrelated to extraversion traits, including being sociable.

Table 4 shows the results for the Openness factors. Once more there are few notable relationships. This is consistent with the idea that pro-social traits are fairly independent of Openness.

Table 5 shows the results for conscientiousness factors (C-Orga = organized, C-Dili = diligent, C-Perf = Perfectionistic, & C-Prud = prudent). Most of the correlations are again small, indicating that pro-sociality is independent of conscientiousness. The most notable exceptions are positive correlations of the conscientiousness factors with fairness. This suggests that fairness is related to conscientiousness.

Table 6 shows the remaining correlations among the N, E, O, and C factors.

The green triangles show correlations among the primary factors belonging to the same higher-order factor. The strong correlations confirm the selection of primary factors to be included in the HEXACO-100. Most of the remaining correlations are below .2. The grey fields show correlations greater than .2. The most notable correlations are for diligence (C-Dili), which is correlated with all E-factors. This suggests a notable secondary loading of diligence on the higher-order factor E. Another noteworthy finding is a strong correlation between self-esteem (E-Sses) and anxiety (N-Anxi). This is to be expected because self-esteem is known to have strong relationships with neuroticism. It is surprising, however, that self-esteem is not related to the other primary factors of neuroticism. One problem in interpreting these results is that the other neuroticism facets are unique to the HEXACO-100.

In conclusion, inspection of the correlations among the 24 primary factors shows clear evidence for 5 mostly independent factors that correspond to the Big Five factors. In addition, the correlations among the pro-social factors show a distinction between the four HEXACO-A factors and the four HEXACO-H factors. Thus, it is possible to represent the structure with 6 factors that correspond to the HEXACO model, but the higher-order A and H factors would not be independent.

A Big Five Model of the HEXACO-100

I fitted a model with five higher-order factors to examine the ability of the Big Five model to explain the structure of the HEXACO-100. Importantly, I did not alter the measurement model of the primary factors. It is clear from the previous results that a simple-structure would not fit the data. I therefore allowed for secondary loadings of primary factors on the higher-order factors. In addition, I allowed for residual correlations among primary factors. Furthermore, when several primary factors showed consistent correlated residuals, I modeled them as factors. In this way, the HEXACO-A and HEXACO-H factors could be modeled as factors that account for correlated residuals among pro-social factors. Finally, I added a halo factor to the model. The halo factor has been identified in many Big Five questionnaires and reflects the influence of item-desirability on responses.

Model fit was slightly lower than the fit of the measurement model, RMSEA = .021 vs. .021, CFI = .927 vs. .936. However, inspection of MI did not suggest additional plausible ways to improve the model. Figure 1 shows the primary loadings on the Big Five factors and the two HEXACO factors, HEXACO-Agreeableness (HA) and HEXACO-Honesty-Humility (HH).

The first notable observation is that primary factors have loadings above .5 for four of the Big Five factors. For the Agreeableness factor, all loadings were statistically significant and above .2, but four loadings were below .5. This shows that agreeableness explains less variance in some primary factors than the other Big Five factors. Thus, one question is whether the magnitude of loadings on the Big Five factors should be a criterion for model selection.

The second noteworthy observation is that the model clearly identified HEXACO-A and HEXACO-H as distinct factors. That is, the residuals of the corresponding primary factors were all positively correlated. All loadings were above .2, but several of the loadings were also below .5. Moreover, for the HEXACO-A factors the loadings on the Big5-A factor were stronger than the loadings on the HEXACO-A factor. Modesty (H-Mode) also loaded more highly on Big5-A than HH. The results for HEXACO-A are not particularly troubling because the HEXACO model does not consider this factor to be particularly different from Big5-A. Thus, the main question is whether the additional shared variance among HEXACO-H factors warrants the creation of a model with six factors. That is, does Honesty-Humility have the same status as the Big Five factors?

Alternative Model 1

The HEXACO model postulates six factors. Comparisons of the Big Five and HEXACO model tend to imply that the HEXACO factors are just as independent as the Big Five factors. However, the data show that HEXACO-A factors and HEXACO-H factors are not as independent of each other as other factors. To fit a six-factor model to the data, it would be possible to allow for a correlation between HEXACO-A and HEXACO-H. To make this model fit as well as the Big-Five model, an additional secondary loading of modesty (H-Mode) on HEXACO-A was needed, RMSEA = .022, CFI = .926. This secondary loading was low, r = .25, and is not displayed in Figure 2.

The most notable finding is a substantial correlation between HEXACO-A and HEXACO-H of r = .49. Although there are no clear criteria for practical independence, this correlation is strong and suggests that there is an important common factor that produces a positive correlation between these two factors. This makes this model rather unappealing. The main advantage of the Big Five model would be that it captures the highest level of independent factors in a hierarchy of personality traits.

Alternative Model 2

An alternative solution to represent the correlations among HEXACO-A and HEXACO-H factors is to treat HEXACO-A and HEXACO-H as independent factors and to allow for secondary loadings of HEXACO-H factors on HEXACO-A or vice versa. Based on the claim that the H-factor adds something new to the structure, I modelled secondary loadings of the primary H-factors on HEXACO-A. Fit was the same as for the first alternative model, RMSEA = .022, CFI = .927. Figure 3 shows substantial secondary loadings for three of the four H-factors, and for modesty the loading on the HEXACO-A factor is even stronger than the loading on the HEXACO-H factor.

The following table shows the loading pattern along with all secondary loadings greater than .1. Notable secondary loadings greater than .3 are highlighted in pink. Aside from the loading of some H-factors on A, there are some notable loadings of two C-factors on E. This finding is consistent with other results that high achievement motivation is related to E and C.

The last column provides information about correlated residuals (CR). Primary factors with the same letter have a correlated residual. For example, there is a strong negative relationship between anxiety (N-Anxi) and self-esteem (E-Sses) that was apparent in the correlations among the primary factors in Table 6. This relationship could not be modeled as a negative secondary loading on neuroticism because the other neuroticism factors showed much weaker relationships with self-esteem.


In sum, the choice between the Big5 model and the HEXACO model is a relatively minor stylistic choice. The Big Five model is a broad model that predicts variance in a wide variety of primary personality factors that are often called facets. There is no evidence that the Big Five model fails to capture variation in the primary factors that are used to measure the Honesty-Humility factor of the HEXACO model. All four H-factors are related to a general agreeableness factor. Thus, it is reasonable to maintain the Big Five model as a model of the highest level in a hierarchy of personality traits and to consider the H-factor a factor that explains additional relationships among pro-social traits. However, an alternative model with Honesty-Humility as a sixth factor is also consistent with the data. This model only appears different from the Big Five model if secondary loadings are ignored. However, all H-factors had secondary loadings on agreeableness. Thus, agreeableness remains a broader trait that links all pro-social traits, while Honesty-Humility explains additional relationships among a subset of these factors. If Honesty-Humility is indeed a distinct global factor, it should be possible to find primary factors that are uniquely related to this factor without notable secondary loadings on Agreeableness. If such traits exist, they would strengthen the support for the HEXACO model. On the other hand, if all traits that are related to Honesty-Humility also load on Agreeableness, it seems more appropriate to treat Honesty-Humility as a lower-level factor in the hierarchy of traits. In conclusion, these structural models did not settle the issue, but they clarify it. Agreeableness factors and Honesty-Humility factors form distinct, but related, clusters of primary traits. This empirical finding can be represented with a Five-Factor model with Honesty-Humility as shared variance among some pro-social traits, or it can be represented with six factors and secondary loadings.


A major source of confusion in research on the structure of personality is the failure to distinguish between factors and scales. Many proponents of the HEXACO model point out that the HEXACO scales, especially the Honesty-Humility scale, explain variance in criterion variables that is not explained by Big-Five scales. It has also been observed that the advantage of the HEXACO scales depends on the Big-Five scales that are used. The reason for these findings is that scales are imperfect measures of their intended factors. They also contain information about the primary factors that were used to measure the higher-order factors. The advantage of the HEXACO-100 is that it measures 24 primary factors. There is nothing special about the Honesty-Humility factor. As Figure 1 shows, the Honesty-Humility factor explains only a portion of the variance in its designated primary factors, namely .67^2 = 45% of the variance in greed-avoidance, .55^2 = 30% of the variance in fairness, .32^2 = 10% of the variance in modesty, and .41^2 = 17% of the variance in sincerity. Averaging these scales to form an Honesty-Humility scale destroys some of this variance and inevitably lowers the ability to predict some criterion variable that is strongly related to one of these primary factors. There is also no reason why Big Five questionnaires should not include some primary factors of Honesty-Humility, and the NEO-PI-3 does include modesty and fairness.
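The percentages in the preceding paragraph follow from squaring the standardized loadings. As a minimal sketch of that arithmetic, the following Python snippet uses the four loadings quoted from Figure 1; the variable names are labels chosen for this illustration only.

```python
# Standardized loadings of the four H primary factors on the
# Honesty-Humility factor, as quoted in the text (Figure 1).
hh_loadings = {
    "greed-avoidance": 0.67,
    "fairness": 0.55,
    "modesty": 0.32,
    "sincerity": 0.41,
}

# Squaring a standardized loading gives the proportion of variance in the
# primary factor that is explained by the higher-order factor.
explained = {name: loading ** 2 for name, loading in hh_loadings.items()}

for name, share in explained.items():
    print(f"{name}: {share:.0%} explained by Honesty-Humility, "
          f"{1 - share:.0%} not explained by it")
```

The second number on each line is the variance that an averaged Honesty-Humility scale cannot attribute to the shared factor, which is why predictions from the individual primary factors can outperform the scale.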

Personality psychologists need to distinguish more clearly between factors and scales. Correlations with the NEO-PI-3 agreeableness scale will differ from those with the HEXACO-A scale or the BFI2-agreeableness scale. Scale correlations are biased by the choice of items, unless items are carefully selected to maximize correlation with the latent factor. For research purposes, researchers should use latent variable models that can decompose an observed correlation into the influence of the higher-order factor and the influence of specific factors.

Personality researchers should also carefully think about the primary factors they may want to include in their studies. For example, even researchers who favor a HEXACO model may include additional measures of anger and depression to explore the contribution of affective dispositions to outcome measures. Similarly, Big Five researchers may want to supplement their Big Five questionnaires with measures of primary traits related to honesty and morality if the Big-Five measure does not capture them. A focus on the higher-order factors is only justified in studies that require short measures with a few items.


My main contribution to the search for a structural model of personality is to examine this question with a statistical tool that makes it possible to test structural models of factors. The advantage of this method is that it is possible to separate structural models of factors from the items that are used to measure factors. While scales of the same factor can differ sometimes dramatically, structural models of factors are independent of the specific items that are used to measure a factor as long as some items reflect variance in the factor. Using this approach, I showed that the Big Five and HEXACO model only differ in the way they represent covariation among some primary factors. It is incorrect to claim that Big Five models fail to represent variation in honesty or humility. It is also incorrect to assume that all pro-social traits are independent after their shared variance in agreeableness is removed. Future research needs to examine more carefully the structural relationships among primary traits that are not explained by higher-order factors. This research question has been neglected because exploratory factor analysis is unable to examine this question. I therefore urge personality researchers to adopt confirmatory factor analysis to advance research on personality structure.

A Meta-Psychological Investigation of Intelligence Research with Z-Curve.2.0

A recent article by Nuijten, van Assen, Augusteijn, Crompvoets, and Wicherts reported the results of a meta-meta-analysis of intelligence research. The authors extracted 2442 effect sizes from 131 meta-analyses. The authors made these data openly available to allow “readers to pursue other categorizations and analyses” (p. 6). In this blog post, I report the results of an analysis of their data with z-curve.2.0 (Bartos & Schimmack, 2020; Brunner & Schimmack, 2020). Z-curve is a powerful statistical tool that can (a) examine the presence of publication bias and/or the use of questionable research practices, (b) provide unbiased estimates of statistical power before and after selection for significance when QRPs are present, and (c) estimate the maximum number of false positive results.

Questionable Research Practices

The term questionable research practices refers to a number of statistical practices that inflate the number of significant results in a literature (John et al., 2012). Nuijten et al. relied on the correlation between sample size and effect size to examine the presence of publication bias. Publication bias produces a negative correlation between sample size and effect size because larger effects are needed to get significance in studies with smaller samples. The method has several well-known limitations. Most important, a negative correlation is also expected if researchers use larger samples when they anticipate smaller effects, either in the form of a formal a priori power analysis or based on informal information about sample sizes in previous studies. For example, it is well-known that effect sizes in molecular genetics studies are tiny and that sample sizes are huge. Thus, a negative correlation is expected even without publication bias.
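To illustrate why a priori power analysis alone produces a negative relationship between sample size and effect size, here is a minimal Python sketch. It assumes that effect sizes are Pearson correlations and uses the Fisher z approximation for the required sample size; the function name, the list of anticipated effects, and the 80% power target are illustrative assumptions, not values taken from Nuijten et al.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r (two-sided test).

    Uses the Fisher z approximation; an assumption made for this sketch.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


# Hypothetical effect sizes that researchers might anticipate.
anticipated_effects = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]
planned_samples = [required_n(r) for r in anticipated_effects]

for r, n in zip(anticipated_effects, planned_samples):
    print(f"anticipated r = {r:.2f} -> planned N = {n}")
```

Every study in this hypothetical set is planned honestly and has the same power, yet smaller effects go with larger samples. A negative correlation between sample size and effect size is therefore not diagnostic of publication bias.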

Z-curve.2.0 avoids this problem by using a different approach to detect the presence of publication bias. The approach compares the observed discovery rate (i.e., the percentage of significant results) to the expected discovery rate (i.e., the average power of studies before selection for significance). To estimate the EDR, z-curve.2.0 fits a finite mixture model to the significant results and estimates average power based on the weights of a finite number of non-centrality parameters.
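The observed half of this comparison is simple to compute. The sketch below shows it in Python with invented z-scores; it does not implement the finite mixture model that produces the EDR, and the helper names and the decision rule (ODR above the upper limit of the EDR confidence interval) are assumptions made for illustration.

```python
def observed_discovery_rate(z_scores, z_crit=1.96):
    """Share of results that are significant at the given threshold."""
    significant = sum(1 for z in z_scores if abs(z) > z_crit)
    return significant / len(z_scores)


def suggests_selection(odr, edr_ci_upper):
    """Flag possible selection when the ODR exceeds the EDR's upper CI limit."""
    return odr > edr_ci_upper


# Invented z-scores, for illustration only.
z_scores = [0.4, 1.1, 2.3, 2.8, 1.7, 3.5, 0.9, 2.1, 4.2, 1.5]
odr = observed_discovery_rate(z_scores)

# Hypothetical upper limit of an EDR confidence interval.
print(odr, suggests_selection(odr, edr_ci_upper=0.70))
```

With these invented numbers, half of the results are significant and the ODR stays inside the hypothetical interval, so there would be no sign of selection for significance.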

I converted the reported information about sample size, effect size, and sampling error into t-values. Extremely large t-values greater than 20 were fixed to a value of 20. The t-values were then converted into absolute z-scores.

Figure 1 shows a histogram of the z-scores in the critical range from 0 to 6. All z-scores greater than 6 are assumed to have a power of 1 with a significance threshold of .05 (z = 1.96).
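The cutoff at 6 can be motivated with the standard power formula for a two-sided z-test. The following Python sketch assumes that the z-statistic is normally distributed with unit variance around a non-centrality parameter; the function name is chosen for this illustration.

```python
from statistics import NormalDist


def power_two_sided(ncp, alpha=0.05):
    """Power of a two-sided z-test when the z-statistic has mean `ncp`."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    phi = NormalDist().cdf
    # Probability of landing beyond either critical value.
    return (1 - phi(z_crit - ncp)) + phi(-z_crit - ncp)


for ncp in (0, 1.96, 2.8, 4, 6):
    print(f"non-centrality {ncp}: power = {power_two_sided(ncp):.5f}")
```

A non-centrality of 0 gives the 5% error rate, a value at the significance threshold gives roughly 50% power, and a value of 6 gives power that is practically indistinguishable from 1.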

The critical comparison of the observed discovery rate (52%) and the expected discovery rate (58%) shows no evidence of QRPs. In fact, the EDR is even higher than the ODR, but the confidence interval is wide and includes the ODR. When there is no evidence that QRPs are present, it is better to use all observed z-scores, including the credible non-significant results, to fit the finite mixture model. Figure 2 shows the results. The blue line moved to 0, indicating that all values were used for estimation.

Visual inspection shows a close match between the observed distribution of z-scores (blue line) and the predicted distribution by the finite mixture model (grey line). The observed discovery rate now closely matches the expected discovery rate of 52%. Thus, there is no evidence of publication bias in the meta-meta-analysis of effect sizes in intelligence research.

Interestingly, there is also no evidence that researchers used mild QRPs to move marginally significant results just below .05 on the other side of the significance criterion to produce just significant results. There are two possible explanations for this. On the one hand, intelligence researchers may be more honest than other psychologists. On the other hand, it is possible that meta-analyses are not representative of the focal hypothesis tests that led to publication of original research articles. A meta-analysis of focal hypothesis tests in original articles is needed to answer this question.

In conclusion, this superior analysis of the presence of bias in the intelligence literature showed no evidence of bias. In contrast, Nuijten et al. (2020) found a significant correlation between effect sizes and sample sizes, which they call a small-study effect. The problem with this finding is that it can reveal either careful planning of sample sizes (good practices) or the use of QRPs (bad practices). Thus, their analysis does not tell us whether there is bias in the data. Z-curve.2.0 resolves this ambiguity and shows that there is no evidence of selection for significance in these data.

Statistical Power

Nuijten et al. used Cohen’s classic approach to investigate power (Cohen, 1962). Based on this approach, they concluded “we found an overall median power of 11.9% to detect a small effect, 54.5% for a medium effect, and 93.9% for a large effect (corresponding to a Pearson’s r of 0.1, 0.3, and 0.5 or a Cohen’s d of 0.2, 0.5, and 0.8, respectively).”

This result merely reflects the sample sizes of the different studies. Studies with small sample sizes have low power to detect a small effect size. As most studies had small sample sizes, the average power to detect small effects is low. However, this does not tell us anything about the actual power of studies to obtain significant results for two reasons. First, effect sizes in a meta-meta-analysis are extremely heterogeneous. Thus, not all studies are chasing small effect sizes. As a result, the power of studies is likely to be higher than the average power to detect small effect sizes. Second, the previous results showed that (a) sample sizes correlate with effect sizes and (b) there is no evidence of QRPs. This means that researchers are a priori deciding to use smaller samples to search for larger effects and larger samples to search for smaller effects. This means that formal or informal a priori power analyses ensure that small samples can have as much or more power than large samples. It is therefore not informative to conduct power analysis only based on information about sample size. Z-curve.2.0 avoids this problem and provides estimates of the actual mean power of studies. Moreover, it provides two estimates of power for two different populations of studies. One population consists of all studies that are conducted by intelligence researchers without selecting for significance. This estimate is the expected discovery rate. Z-curve also provides an estimate for the population of studies that produced a significant result. This population is of interest because only significant results can be used to claim a discovery with an error rate of 5%. When there is heterogeneity in power, the mean power after selection for significance is higher than the average power before selection for significance (Brunner & Schimmack, 2020). When researchers attempt to replicate a significant result to verify that it was not a false positive result, mean power after selection for significance provides the average probability that an exact replication study will be significant. This information is valuable to evaluate the outcome of actual replication studies (cf. Schimmack, 2020).
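The claim that selection for significance raises mean power when power is heterogeneous can be checked with a small worked example. This Python sketch uses a hypothetical set of two equally common study types with 20% and 80% power; the function names and the power values are invented for illustration. Because a study enters the significant set with probability equal to its power, the mean after selection is a power-weighted mean.

```python
def mean_power_before_selection(powers):
    """Plain average of true power across all studies that were run."""
    return sum(powers) / len(powers)


def mean_power_after_selection(powers):
    """Average power among significant results.

    Each study is selected with probability equal to its power,
    so power acts as its own weight.
    """
    return sum(p * p for p in powers) / sum(powers)


heterogeneous = [0.2, 0.8]  # hypothetical: two equally common study types
homogeneous = [0.5, 0.5]    # same mean power, no heterogeneity

print(mean_power_before_selection(heterogeneous),
      mean_power_after_selection(heterogeneous))
print(mean_power_before_selection(homogeneous),
      mean_power_after_selection(homogeneous))
```

In the heterogeneous case, mean power rises from 50% before selection to 68% after selection, because high-powered studies are overrepresented among the significant results. In the homogeneous case the two values are identical.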

Given the lack of publication bias, there are two ways to determine mean power before selection for significance. We can simply compute the percentage of significant results (the observed discovery rate) or we can use the expected discovery rate. Figure 2 shows that both values are 52%. Thus, the average power of studies conducted by intelligence researchers is 52%. This is well below the recommended level of 80%.

The picture is a bit better for studies with a significant result. Here the average power, called the expected replication rate, is 71%, and the 95% confidence interval approaches 80%. Thus, we would expect that more than 50% of significant results in intelligence research can be replicated with a significant result in the replication study. This estimate is higher than for social psychology, where the expected replication rate is only 43%.

False Positive Psychology

The past decade has seen a number of stunning replication failures in social psychology (cf. Schimmack, 2020). This has led to a concern that most discoveries in psychology, if not in all sciences, are false positive results that were obtained with questionable research practices (Ioannidis, 2005; Simmons et al., 2011). So far, however, these concerns are based on speculations and hypothetical scenarios rather than actual data. Z-curve.2.0 makes it possible to examine this question empirically. Although it is impossible to say how many published results are in fact false positive results, it is possible to estimate the maximum number of false-positive results based on the discovery rate (Soric, 1989). As the observed and expected discovery rates are identical, we can use the value of 52% as our estimate of the discovery rate. This implies that no more than 5% of the significant results are false positive results. Thus, the empirical evidence shows that most published results in intelligence research are not false positives.
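The 5% figure can be reproduced from the discovery rate with Soric's (1989) upper bound on the false discovery rate, max FDR = (1/DR - 1) x alpha/(1 - alpha). A minimal Python sketch of this calculation follows; the function name is chosen for this illustration.

```python
def soric_max_fdr(discovery_rate, alpha=0.05):
    """Soric's upper bound on the share of significant results
    that can be false positives, given the discovery rate."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))


# Discovery rate of 52%, as reported in the text.
max_fdr = soric_max_fdr(0.52)
print(f"maximum false discovery rate: {max_fdr:.1%}")
```

The bound shrinks as the discovery rate rises. A literature in which only a small share of tests is significant could, by the same formula, contain a much larger share of false positives.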

Moreover, this finding implies that most non-significant results are false negatives or type-II errors. That is, the null-hypothesis is also false for non-significant results. This is not surprising because many intelligence studies are correlational and the nil-hypothesis that there is absolutely no relationship between two naturally occurring variables has a low a priori probability. This also means that intelligence researchers would benefit from specifying some minimal effect size for hypothesis testing or from focusing on effect size estimation rather than hypothesis testing.


Nuijten et al. conclude that intelligence research is plagued by QRPs. “Based on our findings, we conclude that intelligence research from 1915 to 2013 shows signs that publication bias may have caused overestimated effects”. This conclusion ignores that small-study effects are ambiguous. The superior z-curve analysis shows no evidence of publication bias. As a result, there is also no evidence that reported effect sizes are inflated.

The z-curve.2.0 analysis leads to a different conclusion. There is no evidence of publication bias, significant results have a probability of 70% to be replicated in exact replication studies, and even if exact replication studies are impossible, the discovery rate of 50% implies that we should expect the majority of replication attempts with the same sample sizes to be successful (Bartos & Schimmack, 2020). In replication studies with larger samples even more results should replicate. Finally, most of the non-significant results are false negative results because there are few true null-hypotheses in correlational research. A modest increase in sample sizes could easily achieve the 80% power that is typically recommended.

A larger concern is the credibility of conclusions based on meta-meta-analyses. The problem is that meta-analyses focus on general main effects that are consistent across studies. In contrast, original studies may focus on unique patterns in the data that cannot be subjected to meta-analysis because direct replications of these specific patterns are lacking. Future research therefore needs to code the focal hypothesis tests in intelligence articles to examine the credibility of intelligence research.

Another concern is the reliance on alpha = .05 as a significance criterion. Large genomic studies have a multiple comparison problem where 10,000 analyses can easily produce hundreds of significant results with alpha = .05. This problem is well-known and genetics studies now use much lower alpha levels to test for significance. A proper power analysis of these studies needs to use the actual alpha level rather than the standard level of .05. Z-curve is a flexible tool that can be used with different alpha levels. Therefore, I highly recommend z-curve for future meta-scientific investigations of intelligence research and other disciplines.
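The multiple comparison problem is easy to quantify. The Python sketch below computes the expected number of false positives when all 10,000 tests are true nulls, and the z-score threshold implied by a stricter alpha. The Bonferroni-style alpha of .05 / 10,000 is an assumption chosen only for illustration; it is not the threshold used in any particular genetics study.

```python
from statistics import NormalDist

n_tests = 10_000  # number of analyses, as in the example in the text


def expected_false_positives(n, alpha):
    """Expected count of significant results if every null hypothesis is true."""
    return n * alpha


def z_threshold(alpha):
    """Two-sided critical z-value for a given alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


standard_alpha = 0.05
strict_alpha = 0.05 / n_tests  # illustrative Bonferroni-style correction

for alpha in (standard_alpha, strict_alpha):
    print(f"alpha = {alpha:g}: "
          f"expected false positives = {expected_false_positives(n_tests, alpha):g}, "
          f"critical z = {z_threshold(alpha):.2f}")
```

With alpha = .05, about 500 of 10,000 null tests would be significant by chance alone. The stricter alpha raises the critical z-value well above 1.96, so power has to be evaluated against that higher threshold.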


Bartoš, F., & Schimmack, U. (2020). z-curve.2.0: Estimating replication and discovery rates. Under review.

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology. MP.2018.874.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the Prevalence of Questionable Research Practices With Incentives for Truth Telling. Psychological Science, 23(5), 524–532.

Schimmack, U. (2020). A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne. Advance online publication.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366.

Sorić, B. (1989). Statistical “discoveries” and effect-size estimation. Journal of the American Statistical Association, 84(406), 608–610.