Dr. Ulrich Schimmack’s Blog about Replicability

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication” (Cohen, 1994).

DEFINITION OF REPLICABILITY: In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study with the same sample size and significance criterion.

BLOGS BY YEAR: 2019, 2018, 2017, 2016, 2015, 2014

Featured Blog of the Month (January, 2019): 
Why Ioannidis’s Claim “Most published research findings are false” is false

TOP TEN BLOGS


  1. 2018 Replicability Rankings of 117 Psychology Journals (2010-2018)

Rankings of 117 Psychology Journals according to the average replicability of a published significant result. Also includes a detailed analysis of time trends in replicability from 2010 to 2018.

2.  Introduction to Z-Curve with R-Code

This post presented the first replicability ranking and explains the methodology used to estimate the typical power of significant results published in a journal.  It explains how observed power is estimated from the distribution of test statistics converted into absolute z-scores.  The method has since been developed further to estimate power for a wider range of z-scores by allowing for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.
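As a rough illustration of the underlying logic (this is not the z-curve algorithm itself, and the p-values below are made up), the following R sketch converts two-tailed p-values into absolute z-scores and computes the power implied by each z-score against the z = 1.96 significance criterion:

```r
# Toy example: convert published two-tailed p-values into absolute z-scores and
# compute the power implied by each z-score (alpha = .05, two-tailed criterion).
p_values <- c(.001, .01, .02, .04, .049)         # hypothetical significant results
z_scores <- qnorm(1 - p_values / 2)              # absolute z-score for each p-value
crit     <- qnorm(.975)                          # significance criterion, z = 1.96
observed_power <- 1 - pnorm(crit - z_scores)     # P(z > 1.96) if the observed z were the true mean
round(data.frame(p = p_values, z = z_scores, power = observed_power), 3)
```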


3. An Introduction to the R-Index

 

The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
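A minimal R sketch of the formula as stated above, using hypothetical observed-power values for a set of studies that all reported significant results:

```r
# Hypothetical observed power estimates for five published studies,
# all of which reported significant results (success rate = 1.00).
observed_power <- c(.55, .60, .48, .72, .65)
success_rate   <- 1.00                         # proportion of reported results that were significant
median_power   <- median(observed_power)
inflation      <- success_rate - median_power  # excess success relative to median observed power
r_index        <- median_power - inflation     # R-Index = Observed Median Power - Inflation
c(median_power = median_power, inflation = inflation, r_index = r_index)
```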


4.  The Test of Insufficient Variance (TIVA)

 

The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although the power to detect bias increases with the number of studies. After converting test results into z-scores, the z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
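A small R sketch of the test as described, with hypothetical z-scores standing in for converted test results:

```r
# Hypothetical z-scores from k independent significant results (all just above 1.96).
z <- c(1.98, 2.05, 2.21, 1.97, 2.10, 2.30, 2.02, 2.15, 2.08)
k <- length(z)
obs_var <- var(z)                       # observed variance of the z-scores
chi_sq  <- (k - 1) * obs_var / 1        # expected variance under honest reporting is 1
p_left  <- pchisq(chi_sq, df = k - 1)   # left-tailed test: is the variance too small?
c(observed_variance = obs_var, chi_square = chi_sq, p_value = p_left)
```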

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)

Reconstruction of a Train Wreck: How Priming Research Went off the Rails

This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”  The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?

Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been made to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.
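For readers who prefer code to the GPower figure, the same kind of power calculation can be reproduced with base R’s power.t.test; the numbers below are only an illustration, not the example used in the post:

```r
# Power of a two-sample t-test with n = 50 per group, a medium effect (d = .5), alpha = .05.
result <- power.t.test(n = 50, delta = .5, sd = 1, sig.level = .05,
                       type = "two.sample", alternative = "two.sided")
result$power       # probability of obtaining p < .05 if d = .5 is true
1 - result$power   # type-II error rate (beta): risk of a non-significant result
```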


8.  The Problem with Bayesian Null-Hypothesis Testing

 

Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
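The dependence of Bayes-Factors on the specification of the alternative hypothesis can be illustrated with a simple numerical sketch in R. The prior scales and the observed z-score below are hypothetical (one-sample logic) and are not the exact specifications used in the default Bayesian t-tests discussed above; the point is only that the same data can favour H0 under a wide prior and H1 under a prior concentrated on small effects.

```r
# Marginal likelihood of an observed z-score under H0 (d = 0) and under H1 with a
# normal prior on the standardized effect size, d ~ N(0, tau^2). With n observations,
# z is approximately N(d * sqrt(n), 1), so under H1 the marginal of z is N(0, 1 + n * tau^2).
bf01 <- function(z, n, tau) {
  m0 <- dnorm(z, mean = 0, sd = 1)                    # likelihood of z under the null
  m1 <- dnorm(z, mean = 0, sd = sqrt(1 + n * tau^2))  # marginal likelihood of z under H1
  m0 / m1                                             # Bayes-Factor in favour of H0
}
# A just-significant result (z = 2) from n = 100, i.e., an observed effect of about d = .2:
bf01(z = 2, n = 100, tau = 1.0)   # ~1.4: a wide prior expecting large effects leans toward H0
bf01(z = 2, n = 100, tau = 0.2)   # ~0.45: a prior expecting small effects leans toward H1
```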

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

Open Communication about the invalidity of the race IAT

In the old days, most scientific communication occurred behind closed doors, where reviewers provided anonymous peer reviews that determined the fate of manuscripts. Rejected manuscripts could not contribute to scientific communication because nobody knew about them.

All of this has changed with the birth of open science. Now authors can share manuscripts on pre-print servers, and researchers can discuss the merits of these manuscripts on social media. The benefit of this open scientific communication is that more people can join in and contribute to the communication.

Yoav Bar-Anan co-authored an article with Brian Nosek titled “Scientific Utopia: I. Opening Scientific Communication.” In this spirit of openness, I would like to have an open scientific communication with Yoav and his co-author Michelangelo Vianello about their 2018 article “A Multi-Method Multi-Trait Test of the Dual-Attitude Perspective.”

I criticized their model in an in-press article in Perspectives on Psychological Science (Schimmack, 2019). In a commentary, Yoav and Michelangelo argue that their model is “compatible with the logic of an MTMM investigation” (Campbell & Fiske, 1959). They argue that it is important to have multiple traits to identify method variance in a matrix with multiple measures of multiple traits. They propose that I lost the ability to identify method variance by examining one attitude (i.e., race, self-esteem, political orientation) at a time. They also point out that I did not include all measures, that I included the Modern Racism Scale as an indicator of political orientation, and that I did not provide a reason for these choices. While this is true, Yoav and Michelangelo had access to the data and could have tested whether these choices made any difference. They do not. This is obvious for the Modern Racism Scale, which can be eliminated from the measurement model without any changes in the overall model.

To cut to the chase, the main source of disagreement is the modelling of method variance in the multi-trait-multi-method data set. The issue is clear when we examine the original model published in Bar-Anan and Vianello (2018).

In this model, method variance in IATs and related tasks like the Brief IAT is modelled with the INDIRECT METHOD factor. The model assumes that all of the method variance that is present in implicit measures is shared across attitude domains and across all implicit measures. The only way for this model to allow for different amounts of method variance in different implicit measures is by assigning different loadings to the various methods. Moreover, the loadings provide information about the nature of the shared variance and the amount of method variance in the various methods. Although this is valuable and important information, the authors never discuss this information and its implications.

Many of these loadings are very small. For example, the loadings of the race IAT and the brief race IAT are .11 and .02, respectively. In other words, the correlation between these two measures is inflated by only .11 * .02 = .0022 points. This means that the correlation of r = .52 between these two measures is r = .5178 after removing the influence of method variance.

It makes absolutely no sense to accuse me of separating the models, when there is no evidence of implicit method variance that is shared across attitudes. The remaining parameter estimates are not affected if a factor with low loadings is removed from a model.

Here I show that examining one attitude at a time produces exactly the same results as the full model. I focus on the most controversial IAT: the race IAT. After all, there is general agreement that there is little evidence of discriminant validity for political orientation (r = .91 in the Figure above), and there is little evidence for any validity of the self-esteem IAT based on several other investigations of this topic with a multi-method approach (Bosson et al., 2000; Falk et al., 2015).

Model 1 is based on Yoav and Michelangelo’s model that assumes that there is practically no method variance in IAT-variants. Thus, we can fit a simple dual-attitude model to the data. In this model, contact is regressed onto implicit and explicit attitude factors to see the unique contribution of the two factors without making causal assumptions. The model has acceptable fit, CFI = .952, RMSEA = .013.

The correlation between the two factors is .66, while it is r = .69 in the full model in Figure 1. The loading of the race IAT on the implicit factor is .66, while it is .62 in the full model in Figure 1. Thus, as expected based on the low loadings on the IMPLICIT METHOD factor, the results are no different when the model is fitted only to the measure of racial attitudes.

Model 2 makes the assumption that IAT-variants share method variance. Adding the method factor to the model improved model fit, CFI = .973, RMSEA = .010. As the models are nested, it is also possible to compare model fit with a chi-square test. With a difference of five degrees of freedom, chi-square decreased from 167.19 to 112.32. Thus, the model comparison favours the model with a method factor.
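The original models were fitted in MPLUS (outputs on OSF). For readers who want to see the structure of the comparison, here is a rough lavaan analogue with hypothetical indicator names; it is a sketch of the nested-model logic, not a reproduction of the published analysis.

```r
library(lavaan)

# Hypothetical variable names: iat, biat, amp = implicit measures; therm, like = explicit
# measures; contact = criterion. The real analyses used the Bar-Anan and Vianello (2018) data.

# Model 1: dual-attitude model without a method factor for the IAT variants.
m1 <- '
  implicit =~ iat + biat + amp
  explicit =~ therm + like
  contact  ~ implicit + explicit
'

# Model 2: the same model plus a method factor shared by the IAT variants.
# With only two IAT indicators in this sketch, both loadings are fixed to 1,
# and the method factor is kept orthogonal to the attitude factors.
m2 <- '
  implicit =~ iat + biat + amp
  explicit =~ therm + like
  method   =~ 1*iat + 1*biat
  method   ~~ 0*implicit
  method   ~~ 0*explicit
  contact  ~ implicit + explicit
'

# fit1 <- sem(m1, data = dat)    # dat = data frame with the observed measures
# fit2 <- sem(m2, data = dat)
# fitMeasures(fit1, c("cfi", "rmsea"))
# fitMeasures(fit2, c("cfi", "rmsea"))
# anova(fit1, fit2)              # chi-square difference test for the nested models
```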

The main difference between the models is that the evidence is less supportive of a dual-attitude model and that the amount of valid variance in the race IAT decreases from .66^2 = 43% to .47^2 = 22%.

In sum, the 2018 article made strong claims about the race IAT. These claims were based on a model that implied that there is no systematic measurement error in IAT scores. I showed that this assumption is false and that a model with a method factor for IATs and IAT-variants fits the data better than a model without such a factor. It also makes no theoretical sense to postulate that there is no systematic method variance in IATs, when several previous studies have demonstrated that attitudes are only one source of variance in IAT scores (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007).

How is it possible that the race IAT and other IATs are widely used in psychological research and on public websites to provide individuals with false feedback about their hidden attitudes without any evidence of its validity as an individual difference measure of hidden attitudes that influence behaviour outside of awareness?

The answer is that most of these studies assumed that the IAT is valid rather than testing its validity. Another reason is that psychological research is focused on providing evidence that confirms theories rather than subjecting theories to empirical tests that they may fail. Finally, psychologists ignore effect sizes. As a result, the finding that IAT scores have incremental predictive validity of less than 4% variance in a criterion is celebrated as evidence for the validity of IATs, but even this small estimate is based on underpowered studies and may shrink in replication studies (cf. Kurdi et al., 2019).

It is understandable that proponents of the IAT respond with defiant defensiveness to my critique of the IAT. However, I am not the first to question the validity of the IAT, but these criticisms were ignored. At least Banaji and Greenwald recognized in 2013 that they do “not have the luxury of believing that what appears true and valid now will always appear so” (p. xv). It is time to face the facts. It may be painful to accept that the IAT is not what it was promised to be 21 years ago, but that is what the current evidence suggests. There is nothing wrong with my models and their interpretation, and it is time to tell visitors of the Project Implicit website that they should not attach any meaning to their IAT scores. A more productive way to counter my criticism of the IAT would be to conduct a proper validation study with multiple methods and validation criteria that are predicted to be uniquely related to IAT scores in a preregistered study.

References

Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643.

Falk, C. F., Heine, S. J., Takemura, K., Zhang, C. X., & Hsu, C. (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences. Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

Klauer, K. C., Voss, A., Schmitz, F., & Teige-Mocigemba, S. (2007). Process components of the Implicit Association Test: A diffusion-model analysis. Journal of Personality and Social Psychology, 93, 353–368.

Kurdi, B., Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, A., Kaushik, N., . . . Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586.

The Diminishing Utility of Replication Studies In Social Psychology

Dorothy Bishop writes on her blog:

“As was evident from my questions after the talk, I was less enthused by the idea of doing a large, replication of Darryl Bem’s studies on extra-sensory perception. Zoltán Kekecs and his team have put in a huge amount of work to ensure that this study meets the highest standards of rigour, and it is a model of collaborative planning, ensuring input into the research questions and design from those with very different prior beliefs. I just wondered what the point was. If you want to put in all that time, money and effort, wouldn’t it be better to investigate a hypothesis about something that doesn’t contradict the laws of physics?”


I think she makes a valid and important point. Bem’s (2011) article highlighted everything that was wrong with the research practices in social psychology. Other articles in JPSP are equally incredible, but this was ignored because naive readers found the claims more plausible (e.g., blood glucose is the energy for will power). We know now that none of these published results provide empirical evidence because the results were obtained with questionable research practices (Schimmack, 2014; Schimmack, 2018). It is also clear that these were not isolated incidents, but that hiding results that do not support a theory was (and still is) a common practice in social psychology (John et al., 2012; Schimmack, 2019).

A large attempt at estimating the replicability of social psychology revealed that only 25% of published significant results could be replicated (OSC). The rate for between-subject experiments was even lower. Thus, the a-priori probability (base rate) that a randomly drawn study from social psychology will produce a significant result in a replication attempt is well below 50%. In other words, a replication failure is the more likely outcome.

The low success rate of these replication studies was a shock. However, it is sometimes falsely implied that the low replicability of results in social psychology was not recognized earlier because nobody conducted replication studies. This is simply wrong. In fact, social psychology is one of the disciplines in psychology that required researchers to conduct multiple studies that showed the same effect to ensure that a result was not a false positive result. Bem had to present 9 studies with significant results to publish his crazy claims about extrasensory perception (Schimmack, 2012). Most of the studies that failed to replicate in the OSC replication project were taken from multiple-study articles that reported several successful demonstrations of an effect. Thus, the problem in social psychology was not that nobody conducted replication studies. The problem was that social psychologists only reported replication studies that were successful.

A proper analysis of the problem also suggests a different solution. If we pretend that nobody did replication studies, it may seem useful to start doing replication studies. However, if social psychologists conducted replication studies but did not report replication failures, the solution is simply to demand that social psychologists report all of their results honestly. This demand is so obvious that undergraduate students are surprised when I tell them that this is not how social psychologists conduct their research.

In sum, it has become apparent that questionable research practices undermine the credibility of the empirical results in social psychology journals, and that the majority of published results cannot be replicated. Thus, social psychology lacks a solid empirical foundation.

What Next?

Information theory implies that little is gained by conducting actual replication studies in social psychology because a failure to replicate the original result is both likely and uninformative. In fact, social psychologists have responded to replication failures by claiming that these studies were poorly conducted and do not invalidate the original claims. Thus, replication studies are costly and have not advanced theory development in social psychology. More replication studies are unlikely to change this.

A better solution to the replication crisis in social psychology is to characterize research in social psychology from Festinger’s classic small-sample, between-subject study in 1957 to research in 2017 as exploratory, hypothesis-generating research. As Bem suggested to his colleagues, this was a period of adventure and exploration where it was ok to “err on the side of discovery” (i.e., publish false positive results, like Bem’s precognition for erotica). Lots of interesting discoveries were made during this period; it is just not clear which of these findings can be replicated and what they tell us about social behavior.

Thus, new studies in social psychology should not try to replicate old studies. For example, nobody should try to replicate Devine’s subliminal priming study with racial primes with computers and software from the 1980s (Devine, 1989). Instead, prominent theoretical predictions should be tested with the best research methods that are currently available. Thus, the way forward is not to do more replication studies, but rather to use open science (a.k.a. honest science) that uses experiments to subject theories to empirical tests that may also falsify a theory (e.g., subliminal racial stimuli have no influence on behavior). The main shift that is required is to get away from research that can only confirm theories and to allow for empirical data to falsify theories.

This was exactly the intent of Danny Kahneman’s letter, when he challenged social priming researchers to respond to criticism of their work by going into their labs and to demonstrate that these effects can be replicated across many labs.

Kahneman makes it clear that the onus of replication is on the original researchers who want others to believe their claims. The response to this letter speaks volumes. Not only did social psychologists fail to provide new and credible evidence that their results can be replicated, they also demonstrated defiant denial in the face of replication failures by others. The defiant denial by prominent social psychologists (e.g., Baumeister, 2019) make it clear that they will not be convinced by empirical evidence, while others who can look at the evidence objectively do not need more evidence to realize that the social psychological literature is a train-wreck (Schimmack, 2017; Kahneman, 2017). Thus, I suggest that young social psychologists search the train wreck for survivors, but do not waste their time and resources on replication studies that are likely to fail.

A simple guide through the wreckage of social psychology is to distrust any significant result with a p-value greater than .01 (Schimmack, 2019). Prediction markets also suggest that readers are able to distinguish credible and incredible results (Atlantic). Thus, I recommend building on studies that are credible and steering clear of sexy findings that are unlikely to replicate. As Danny Kahneman pointed out, young social psychologists who work in questionable areas face a dilemma. Either they have to replicate the questionable methods that were used to get the original results, which is increasingly considered unethical, or they end up with results that are not very informative. On the positive side, the replication crisis implies that there are many important topics in social psychology that still need to be studied properly with the scientific method. Addressing these important questions may be the best way to rescue social psychology.

Confirmation Bias is Everywhere: Serotonin and the Meta-Trait of Stability

Most psychologists have at least a vague understanding of the scientific method. Somewhere they probably heard about Popper and the idea that empirical data can be used to test theories. As all theories are false, these tests should at some point lead to an empirical outcome that is inconsistent with a theory. This outcome is not a failure. It is an expected outcome of good science. It also does not mean that the theory was bad. Rather it was a temporary theory that is now modified or replaced by a better theory. And so, science makes progress….

However, psychologists do not use the scientific method popperly. Null-hypothesis significance testing adds some confusion here. After all, psychologists publish over 90% successful rejections of the nil-hypothesis. Doesn’t that show they are good Popperians? The answer is no, because the nil-hypothesis is not predicted by a theory. The nil-hypothesis is only useful to reject it in order to claim that there is a predicted relationship between two variables. Thus, psychology journals are filled with over 90% reports of findings that confirm theoretical predictions. While this may look like a major success, it actually reveals a major problem: psychologists never publish results that disconfirm a theoretical prediction. As a result, there is never a need to develop better theories. Thus, a root evil that prevents psychology from being a real science is verificationism.

The need to provide evidence for, rather than against, a theory led to the use of questionable research practices. Questionable research practices are used to report results that confirm theoretical predictions. For example, researchers may simply not report results of studies that did not reject the nil-hypothesis. Other practices can help to produce significant results by inflating the risk of a false positive result. The use of QRPs explains why psychology journals have been publishing over 90% results that confirm theoretical predictions for 60 years (Sterling, 1959). Only recently, it has become more acceptable to report studies that failed to support a theoretical prediction and question the validity of a theory. However, these studies are still a small minority. Thus, psychological science suffers from confirmation bias.

Structural Equation Modelling

Multivariate, correlational studies are different from univariate experiments. In a univariate experiment, a result is either significant or not. Thus, only tampering with the evidence can produce confirmation bias. In multivariate statistics, data are analyzed with complex statistical tools that give researchers flexibility in their data analysis. Thus, it is not necessary to alter the data to produce confirmatory results. Sometimes it is sufficient to analyze the data in a way that confirms a theoretical prediction without showing that alternative models fit the data equally well or better.

It is also easier to combat confirmation bias in multivariate research by fitting alternative models to the same data. Model comparison also avoids the problem of significance testing, where non-significant results are considered inconclusive, while significant results are used to confirm and cement a theory. In SEM, statistical inferences work the other way around. A model with good fit (non-significant chi-square or acceptable fit) is a possible model that can explain the data, while a model with significant deviation from the data is rejected. The reason is that the significance test (or model fit) is used to test an actual theoretical model rather than the nil-hypothesis. This forces researchers to specify an actual set of predictions and subject them to an empirical test. Thus, SEM is ideally suited to test theories popperly.

Confirmation Bias in SEM Research

Although SEM is ideally suited to test competing theories against each other, psychology journals are not used to model comparisons and tend to publish SEM research in the same flawed confirmatory way as other research is conducted and reported. For example, an article in Psychological Science this year published an investigation of the structure of personality and the hypothesis that several personality traits are linked to a bio-marker (Wright et al., 2019).

Their preferred model assumes that the Big Five traits neuroticism, agreeableness, and conscientiousness are not independent, but systematically linked by a higher-order trait called alpha or stability (Digman, 1997; DeYoung, 2007). In their model, the stability factor is linked to a serotonergic (5-HT) marker, the prolactin response. This model implies that all three traits are related to the biomarker because there are indirect paths from all three traits to the biomarker that are “mediated” by the stability factor (for technical reasons the path goes from stability to the biomarker, but theoretically we would expect the relationship to go the other way, from a neurological mechanism to behaviour).

Thanks to the new world of open science, the authors shared actual MPLUS outputs of their models on OSF ( https://osf.io/h5nbu/ ). All the outputs also included the covariance matrix among the predictor variables, which made it possible to fit alternative models to the data.

Alternative Models

Another source of confirmation bias in psychology is that literature reviews fail to mention evidence that contradicts the theory that authors try to confirm. This is pervasive and by no means a specific criticism of the authors. Contrary to the claims in the article, the existence of a meta-trait of stability is actually controversial. Digman (1997) reported some SEM results that were false and could not be reproduced (cf. Anusic et al., 2009). Moreover, alpha could not be identified when the Big Five were modelled as latent factors (Anusic et al., 2009). This led me to propose that meta-traits may be an artifact of using impure Big Five scales as indicators of the Big Five. For example, if some agreeableness items have negative secondary loadings on neuroticism, the agreeableness scale is contaminated with valid variance in neuroticism. Thus, we would observe a negative correlation between neuroticism and agreeableness even across raters (e.g., self-ratings of neuroticism and informant ratings of agreeableness). Here I fitted a model with secondary loadings and independent Big Five factors to the data. I also examined the prediction that the biomarker is related to all three Big Five traits. The alternative model had acceptable fit, CFI = .976, RMSEA = .056.
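In lavaan-style notation, the alternative model described above might be sketched roughly as follows; the indicator names are placeholders, and the actual analysis was based on the MPLUS outputs shared on OSF.

```r
library(lavaan)

# Sketch of the alternative model: independent Big Five factors (only N, A, and C shown),
# a secondary loading of one agreeableness indicator on neuroticism to absorb the
# contamination of impure scale scores, and the biomarker regressed on all three traits.
alt_model <- '
  neuroticism       =~ n1 + n2 + n3 + a1    # a1 has a secondary loading on neuroticism
  agreeableness     =~ a1 + a2 + a3
  conscientiousness =~ c1 + c2 + c3

  # Big Five factors modelled as independent (no stability / alpha meta-trait).
  neuroticism   ~~ 0*agreeableness
  neuroticism   ~~ 0*conscientiousness
  agreeableness ~~ 0*conscientiousness

  # Prolactin-response biomarker regressed on the three traits separately.
  biomarker ~ neuroticism + agreeableness + conscientiousness
'
# fit <- sem(alt_model, sample.cov = covmat, sample.nobs = n)  # covariance input from the OSF outputs
# fitMeasures(fit, c("cfi", "rmsea"))
```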

The main finding of this model is that the biomarker shows a significant relationship only with conscientiousness; the relationship with agreeableness trended in the right direction but was not significant (p = .089), and the relationship with neuroticism was even weaker (p = .474). Aside from the question of significance, we also have to take effect sizes into account. Given the parameter estimates, the biomarker would produce very small correlations among the Big Five traits (e.g., r(A,C) = .19 * .10 = .019). Thus, even if these relationships were significant, they would not provide compelling evidence that a source of shared variance among the three traits has been identified.

The next model shows that the authors’ model ignored the stronger relationship between conscientiousness and the biomarker. When this relationship is added to the model, there is no significant relationship between the stability factor and the biomarker.

Thus, the main original finding of this study is that a serotonin-related biomarker was significantly related to conscientiousness, but not significantly related to neuroticism. This finding is inconsistent with theories that link neuroticism to serotonin and with evidence that serotonin reuptake inhibitors reduce neuroticism (at least in depressed patients). However, such results are difficult to publish because a single study with a non-significant result does not provide sufficient evidence to falsify a theory. Then again, fitting data to a theory only leads to confirmation bias.

The good news is that the authors were able to publish the results of an impressive study and that their data are openly available and can provide credible information for meta-analytic evaluations of structural models of personality, while the results of this study alone are inconclusive and compatible with many different theories of personality.

One way to take more advantage of these data would be to share the covariance matrix of items to model personality structure with a proper measurement model of the Big Five traits and to avoid the problem of contaminated scale scores, which is the best practice for the use of structural equation models. These models provide no evidence for Digman’s meta-traits (Schimmack, 2019a, Schimmack, 2019b).

In conclusion, the main point of this post is that (a) SEM can be used to test and falsify models, (b) SEM can be used to realize that data are consistent with multiple models and that better data are needed to find the better model, (c) studies of Big Five factors require a measurement model with Big Five factors and cannot rely on messy scale scores as indicators of the Big Five, and (d) personality psychologists need better training in the use of SEM.

32 Personality Types

Personality psychology is dominated by dimensional models of personality (Funder, 2019). There is a good reason for this. Most personality characteristics vary along a continuum like height rather than being categorical like eye color. Thus, a system of personality types requires some arbitrary decisions about a cutoff point. For example, a taxonomy of body types could do a median split on height and weight to assign people to the tall-heavy or the tall-light type.

However, a couple of moderately influential articles have suggested that there are three personality types (Asendorpf et al., 2001; Robins et al., 1996).

The notion that there are only three personality types is puzzling. The dominant framework in personality psychology is the Big Five model, which conceptualizes personality traits as five independent continuous dimensions. If we were to create personality types by splitting each dimension at the median, it would create 32 personality types, where individuals are either above or below the median on neuroticism, extraversion, openness, agreeableness, and conscientiousness. If these five dimensions were perfectly independent of each other, individuals would be equally likely to be assigned to any one of the 32 types. There is no obvious way to reduce these 32 types to just 3.

Figure 1. Lowercase letters = below the median, capital letters = above the median.
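The combinatorics are easy to verify: five median splits yield 2^5 = 32 patterns. A short R sketch enumerating them with the notation from Figure 1 (lowercase = below the median, uppercase = above):

```r
# Enumerate the 32 Big Five types implied by median splits on five independent dimensions.
traits <- c("n", "e", "o", "a", "c")
grid   <- expand.grid(rep(list(c(FALSE, TRUE)), 5))    # all 2^5 combinations of below/above
types  <- apply(grid, 1, function(above) {
  paste0(ifelse(above, toupper(traits), traits), collapse = "")
})
length(types)   # 32
head(types)     # "neoac", "Neoac", "nEoac", "NEoac", ...
```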

So, how did Robins et al. (1996) come to the conclusion that there are only three personality types? The data were Q-sorts. A Q-sort is similar to personality ratings on a series of attributes. The main difference is that the sorting task imposes a constraint on the scores that can be given to an individual. As a result, all individuals have the same overall mean across items. That is, nobody can be above average on all attributes. This kind of data is known as ipsative data. An alternative way to obtain ipsative data would be to subtract the overall mean of ratings from individual ratings. Although the distinction between ipsative and non-ipsative data is technically important, it has no implications for the broader understanding of Robins et al.’s work. The study could also have used ratings.

Robins et al. then performed a factor analysis. However, this factor analysis is different from a typical factor analysis that relies on correlations among items. Rather, the data matrix is transposed and the factor analysis is run on participants. With N = 300, there are three hundred variables and factor analysis is used to reduce this set of variables to a smaller set of factors, while minimizing the loss of information.
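A minimal R sketch of this transposed (“Q-type”) analysis with simulated data; the numbers and the use of principal components are purely illustrative, since Robins et al. used Q-sort data and their own factor-extraction procedure.

```r
# Simulate Q-sort-like ratings: 300 participants (rows) rated on 100 attributes (columns).
set.seed(123)
ratings <- matrix(rnorm(300 * 100), nrow = 300, ncol = 100)

# Q-type analysis: transpose the matrix so that participants become the "variables"
# and extract a small number of components describing prototypical profiles.
q_data <- t(scale(ratings))           # attributes in rows, participants in columns
pca    <- prcomp(q_data)              # components = profiles shared across participants
summary(pca)$importance[2, 1:3]       # proportion of variance explained by three profiles

# Each participant is then characterized by loadings on (correlations with) each profile,
# which can be positive or negative -- the point made in the text below.
```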

Everybody knows that the number of factors in a factor analysis is arbitrary and that a smaller number of factors implies a loss of information.

“Empirical research on personality typologies has been hampered by the lack of clear criteria for determining the number of types in a given sample. Thus, the costs and benefits of having a large number of types must be weighed against those of having relatively few types” (Robins et al., 1996).

The authors did not report eigenvalues or other indicators of how much variance their three-factor solution explained.

The three types are described in terms of the most and least descriptive items. Type 1 can be identified by high conscientiousness (“He is determined in what he does”), high extraversion (“He is energetic and full of life”), low neuroticism (reversed: “When he is under stress, he gives up and backs off”), high agreeableness (“He is open and straightforward”), and high openness (“He has a way with words”). In short, Type 1 is everybody’s dream child; a little Superman in the making.

Type 2 is characterized by high neuroticism (“He gets nervous in uncertain situations”), introversion (reversed: “He tries to be the center of attention”), low openness (reversed: “He has a way with words”), but high agreeableness (“He is considerate and thoughtful of other people”). Conscientiousness does not define this type one way or the other.

Type 3 is characterized by low neuroticism (reversed: “He is calm and relaxed; easy going”), high extraversion (“He tries to be the center of attention”), low conscientiousness (reversed: “He plans things ahead; he thinks before he does something”), and low agreeableness (“He is stubborn”).

The main problem with this approach is that these personality profiles are not types. Take Profile 1, for example. While some participants’ profiles correlate highly positively with Profile 1, other participants’ profiles correlate highly negatively with it. What personality type are they? We might say that they are the opposite of Superman, but that would imply that we need another personality type for the Anti-Supermans. The problem does not end here. As there are three profiles, each individual is identified by their correlations with all three profiles. Thus, we end up with eight different types, depending on whether the correlations with the three profiles are positive or negative.

In short, profiles are not types. Thus, the claim that there are only three personality types is fundamentally flawed because the authors confused profiles with types. Even the claim that there are only 8 types would rest on the arbitrary choice of extracting only three factors. Four factors would have produced 16 types and five factors would have produced 32 types, just as the Big Five model predicted.

Asendorpf et al. (2001) also found three profiles that they considered to be similar to those found by Robins et al. (1996). Moreover, they examined profiles in a sample of adults with a Big Five questionnaire (i.e., the NEO-FFI). Importantly, Asendorpf et al. (2001) use better terminology and refer to profiles as prototypes rather than types.

The notion of a prototype is that there are no clear defining features that determine class membership. For example, on average mammals are heavier than birds. So we can distinguish birds and mammals by their prototypical weight (how close their weight is to the average weight of a bird or mammal) rather than on the basis of a defining feature (lays eggs, has a uterus). Figure 2 shows the prototypical Big Five profile for the three groups of participants, when participants were assigned to three groups.

The problem is once more that the grouping into three groups is arbitrary. Clearly there are individuals with high scores on agreeableness and on openness, but this variation in personality was not used to create the three groups. Based on this figure, groupings are based on low N and high C, high N and low E, and low C. It is not clear what we should do with individuals who do not match any of these prototypical profiles. What type are individuals who are high in N and high in C?

In sum, a closer inspection of studies of personality types suggests that these studies failed to address the question. Searching for prototypical item-profiles is not the same thing as searching for personality types. In addition, the question may not be a good question. If personality attributes vary mostly quantitatively and if the number of personality traits is large, the number of personality types is infinite. Every individual is unique.

Are Some Personality Types More Common Than Others?

As noted above, the number of personality types that are theoretically possible is determined by the number of attributes and the levels of each attribute. If we describe personality with the Big Five and limit the levels to being above or below the median, we have 32 theoretical patterns. However, this does not mean that we actually observe all patterns. Maybe some types never occur or are at least rare. The absence of some personality types could provide some interesting insights into the structure of personality. For example, high conscientiousness might suppress neuroticism, and we would see very few individuals who are high in C and high in N (Digman, 1997). However, when C is low, we could see equal numbers of individuals with high N and low N, because high conscientiousness inhibits high N, while low conscientiousness does not lead to high N. It is impossible to examine such patterns with bivariate correlations (Feger, 1988).

A simple way to examine this question is to count the frequencies of the personality types (Anusic & Schimmack, unpublished manuscript that was killed in peer-review). Here, I present the results of this analysis based on Sam Gosling’s large internet survey with millions of visitors who completed the BFI (John, Naumann, & Soto, 2008).
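A sketch of this counting approach with simulated (independent, normally distributed) Big Five scores; in the actual analysis, the BFI scale scores were presumably median-split and the resulting patterns tabulated in the same way.

```r
# Simulate Big Five scale scores for 10,000 respondents and count the 32 median-split types.
set.seed(1)
b5 <- data.frame(N = rnorm(1e4), E = rnorm(1e4), O = rnorm(1e4),
                 A = rnorm(1e4), C = rnorm(1e4))
meds <- apply(b5, 2, median)                       # sample medians for each dimension
letters_lo <- c("n", "e", "o", "a", "c")
letters_hi <- c("N", "E", "O", "A", "C")
type <- apply(b5, 1, function(x) {
  paste0(ifelse(x > meds, letters_hi, letters_lo), collapse = "")
})
# With independent simulated dimensions, all 32 types are roughly equally frequent;
# deviations from this benchmark are what the real data are examined for.
sort(table(type), decreasing = TRUE)[1:5]
```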

Figure 3 simply shows the relative frequencies of the 32 personality types.

Figure 4 shows the results only for US residents. The results are very similar to those for the total sample.

The most notable finding is that the types nEOAC and Neoac are more frequent than all other types. These types are evaluatively positive or negative. However, it is important to realize that these types are not real personality types. Other research has demonstrated that the evaluative dimension in self-ratings of personality is mostly a rating or a perception bias (Anusic et al., 2009). Thus, individuals with a nEOAC profile do not have a better personality. Whether they simply rate themselves (other-deception) or actually see themselves (self-deception) as better than they are is currently unknown.

The next two types with above average frequency are nEoAC and NeOac. A simple explanation for this pattern is that openness is not highly evaluative and so some people will not inflate their openness scores, while they are still responding in a desirable way on the other four traits.

The third complementary pair are the neoAC and the NEOac types. This pattern can also be explained with rating biases because some people do not consider openness and extraversion desirable; so they will only show bias on neuroticism, agreeableness and conscientiousness. These people were called “Saints” by Paulhus and John (1998).

In short, one plausible explanation of the results is that all 32 personality types that can be created by combining high and low scores on the Big Five exist. Some types are more frequent than others, but at least some of this variation is explained by rating biases rather than by actual differences in personality.

Conclusion

The main contribution of this new look at personality types is to clarify some confusion about the notion of personality types. Previous researchers used the term types for prototypical personality profiles. This is unfortunate because it led to the misleading impression that there are only three personality types. You are either resilient, over-controlled, or under-controlled. In fact, even three profiles create more than three types. Moreover, the profiles are based on exploratory factor analyses of personality ratings and it is not clear why there are only three profiles. Big Five theory would predict five profiles where each profile is defined by items belonging to one of the Big Five factors. It is not clear why profile analyses yielded only three factors. One explanation could be that the item set did not capture some personality dimensions. For example, Robins et al.’s (1996) Q-sort did not seem to include many openness items.

Based on Big Five theory, one would expect 32 personality types that are about equally frequent. An analysis of a large data set showed that all 32 types exist, which is consistent with the idea that the Big Five are fairly independent dimensions that can occur in any combination. However, some types were more frequent than others. The most frequent combinations were the highly desirable (nEOAC) and highly undesirable (Neoac) patterns. This finding is consistent with previous evidence that personality ratings are influenced by a general evaluative bias (Anusic et al., 2009). Additional types with higher frequencies can be attributed to variations in desirability. Openness and extraversion are not as desirable, on average, as low neuroticism and high agreeableness and conscientiousness. Thus, the patterns nEoAC and neoAC may also reflect desirability rather than actual personality structure. Multi-method studies or items low in evaluative content would be needed to examine this question.

Implications

Personality psychologists are frustrated that they have discovered the Big Five factors and created a scientific model of personality, but in applied settings the Myers-Briggs Type Indicator (MBTI) dominates personality assessment (Funder, 2019).

One possible reason is that the MBTI provides simple information about personality by classifying individuals into 16 types. These 16 types are defined by being high or low on four dimensions.

There is no reason why personality psychologists could not provide simplified feedback about personality using a median split on the Big Five and assigning individuals to the 32 types that can be created from the Big Five factors. For example, I would be the NEOac type. Instead of using lowercase and capital letters, one could also use letters for both poles of each dimension: neurotic (N) vs. stable (S), extraverted (E) vs. introverted (I), variable (V) vs. regular (R), agreeable (A) vs. dominant (D), and conscientious (C) vs. laid back (L). This would make me an NEVDL type. My son would be an SIRAC.

I see no reason why individuals would prefer Myers-Briggs types over Big Five types, given that the Big Five types are based on a well-established scientific theory. I believe the main problem in giving individuals feedback with Big Five scores is that many people do not think in terms of dimensions.

The main problem might be that we are assigning individuals to types even when their scores are close to the median and their classification is arbitrary. For example, I am not very high on E or low on C, and it is not clear whether I am really an NEVDL or an NIVDC type. One possibility would be to use only scores that are one standard deviation above or below the mean or median. This would make me an N-VD- type.

To conclude, research on personality types has not made much progress for a good reason. The number of personality types depends on the number of attributes that are being considered, and it is not really an empirical question which types exist. With fairly independent dimensions, all types exist, and the number of types increases exponentially with the number of attributes. The Big Five are widely considered the optimal trade-off between accuracy and complexity. Thus, they provide an appealing basis for the creation of personality types and a viable alternative to the Myers-Briggs Type Indicator.

If you want to know what type you are, you can take the BFI online ( https://www.outofservice.com/bigfive/ ). It provides feedback about your personality in terms of percentiles. To create your personality type, you only have to convert the percentiles into letters.

Negative Emotionality: P < 50 = S, P > 50 = N
Extraversion: P < 50 = I, P > 50 = E
Open-Mindedness: P < 50 = R, P > 50 = V
Agreeableness: P < 50 = D, P > 50 = A
Conscientiousness: P < 50 = L, P > 50 = C
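A small R helper illustrating the coding scheme in the table above (the function and its name are only an illustration; the cutoffs mirror the table):

```r
# Convert Big Five percentile feedback into a five-letter type using the table above.
big5_type <- function(neg_emotionality, extraversion, open_mindedness,
                      agreeableness, conscientiousness) {
  paste0(ifelse(neg_emotionality  > 50, "N", "S"),
         ifelse(extraversion      > 50, "E", "I"),
         ifelse(open_mindedness   > 50, "V", "R"),
         ifelse(agreeableness     > 50, "A", "D"),
         ifelse(conscientiousness > 50, "C", "L"))
}
big5_type(80, 30, 65, 70, 20)   # returns "NIVAL"
```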

However, keep in mind that your ratings and those of the comparison group are influenced by desirability.

If you are an NIRDL, you may have a bias to rate yourself as less desirable than you actually are.

If you are an SEVAC, you may have a tendency to overrate your desirability.

Testing Hierarchical Models of Personality with Confirmatory Factor Analysis

Naive and more sophisticated conceptions of science assume that empirical data are used to test theories and that theories are abandoned when data do not support them. Psychological journals give the impression that psychologists are doing exactly that. Journals are filled with statistical hypothesis tests. However, hypothesis tests are not theory tests, because only results that confirm a theoretical prediction by rejecting the null-hypothesis (p < .05) get published (Sterling, 1959). As a result, psychology journals are filled with theories that have never been properly tested. Chances are that some of these theories are false.

To move psychology towards being a science, it is time to subject theories to empirical tests and to replace theories that do not fit the data with theories that do. I have already argued elsewhere that higher-order models of personality are a bad idea with little empirical support (Schimmack, 2019a). Colin DeYoung responded to this criticism of his work (DeYoung, 2019). In this blog post, I present a new approach to testing structural theories of personality with confirmatory factor analysis (CFA). The advantage of CFA is that it is a flexible statistical method that can formalize a variety of competing theories. Another advantage of CFA is that it is possible to capture and remove measurement error. Finally, CFA provides fit indices that make it possible to compare models and to select models that fit the data better. Although CFA celebrates its 50th birthday this year, psychologists have yet to appreciate its potential for testing personality theories (Joreskog, 1969).

What are Higher-Order Factors?

The notion of a factor has a clear meaning in psychology. A factor is a common cause that explains, at least in a statistical sense, why several variables are correlated with each other. That is, a factor represents the shared variance among several variables that is assumed to be caused by a common cause rather than by direct causation among the variables.

In traditional factor analysis, factors explain correlations among observed variables such as personality ratings. The notion of higher-order factors implies that first-order factors that explain correlations among items are correlated (i.e., not independent) and that these correlations among factors are explained by another set of factors, which are called higher-order factors.

In empirical tests of higher-order factors it has been overlooked that the Big Five factors are already higher-order factors in a hierarchy of personality traits that explain correlations among more specific personality traits like sociability, curiosity, anxiety, or impulsiveness. Instead ALL tests of higher-order models have relied on items or scales that measure the Big Five. This makes it very difficult to study the higher-order structure of personality because results will vary depending on the selection of items that are used to create Big Five scales.

A much better way to test higher-order models is to fit a hierarchical CFA model to data that represent multiple basic personality traits. A straightforward prediction of a higher-order model is that all or at least most facets that belong to a common higher order factor should be correlated with each other.

For example, Digman (1997) and DeYoung (2006) suggested that extraversion and openness are positively correlated because they are influenced by a common factor, called beta or plasticity. As extraversion is conceived as a common cause of sociability, assertiveness, and cheerfulness and openness is conceived as a common cause of being imaginative, artistic, and reflective, the model makes the straightforward prediction that sociability, assertiveness, and cheerfulness are positively correlated with being imaginative, artistic, and reflective.
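To make the structure of such a higher-order model concrete, here is a rough lavaan sketch using observed facet scores with placeholder names; it is only an illustration of the plasticity (beta) hypothesis, not a model fitted in this post.

```r
library(lavaan)

# Higher-order model: Big Five factors defined by facet scores, and a plasticity (beta)
# factor explaining the correlation between extraversion and openness. With only two
# first-order factors, both higher-order loadings are fixed to 1 for identification.
hier_model <- '
  extraversion =~ sociability + assertiveness + cheerfulness
  openness     =~ imagination + artistic + reflective
  plasticity   =~ 1*extraversion + 1*openness
'
# fit <- cfa(hier_model, data = facet_scores)
# This model implies that every extraversion facet correlates positively with every
# openness facet, which is the prediction examined below.
```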

Evaluative Bias

One problem in testing structural models of personality is that personality ratings are imperfect indicators of personality. Some of the measurement error in personality ratings is random, but other sources of variance are systematic. Two sources have been reliably identified, namely acquiescence and evaluative bias (Anusic et al., 2009; Biderman et al., 2019). DeYoung (2006) also found evidence for evaluative bias in a multi-rater study. Thus, there is agreement between DeYoung and me that some of the correlations among personality ratings do not reflect the structure of personality, but rather systematic measurement error. It is necessary to control for these method factors when studying the structure of personality traits and to examine the correlation among Big Five traits because method factors distort these correlations in mono-method studies. In two previous posts, I found no evidence of higher-order factors when I fitted hierarchical models to the 30 facets of the NEO-PI-R and another instrument with 24 facets (Schimmack, 2019b, 2019c). Here I take another look at this question by examining more closely the pattern of correlations among personality facets before and after controlling for method variance.

Data

From 2010 to 2012 I posted a personality questionnaire with 303 items on the web. Visitors were provided with feedback about their personality on the Big Five dimensions and specific personality facets. Earlier I presented a hierarchical model of these data with three items per facet (Schimmack, 2019). Subsequently, I examined the loadings of the remaining items on these facets. Here I present results for 179 items with notable loadings on one of the facets (Item.Loadings.303.xlsx; when you open the file in Excel, selected items are highlighted in green). The use of more items per facet makes the measurement model of the facets more stable and ensures more stable facet correlations that are more likely to replicate across studies with different item sets. The covariance matrix for all 303 items is posted on OSF (web303.N808.cov.dat) so that the results presented below can be reproduced.

Results

Measurement Model

I first constructed a measurement model. The aim was not to test a structural model, but to find a measurement model that can be used to test structural models of personality. Using CFA for exploration seems to contradict its purpose, but reading the original article by Joreskog shows that this approach is entirely consistent with the way he envisioned CFA to be used. It is unclear to me who invented the idea that a CFA must follow an EFA. This makes little sense because EFA may not fit some data if there are hierarchical relationships or correlated residuals. So, CFA modelling has to start with a simple theoretical model that may then need to be modified to fit the data, which leads to a new model to be tested with new data.

To develop a measurement model with reasonable fit to the data, I started with a simple model in which items had fixed primary loadings and no secondary loadings, while all factors were allowed to correlate with each other. This is a simple-structure model. It is well known that this model does not fit real data. I then modified the model based on modification indices that suggested (a) secondary loadings, (b) relaxing the constraint on a primary loading, or (c) correlated item residuals. This way, a model with reasonable fit to the data was obtained, CFI = .775, RMSEA = .040, SRMR = .042 (M0.Measurement.Model.inp on OSF). Although the CFI was below the standard criterion of .95, model fit was considered acceptable because the only remaining sources of misfit would be additional small secondary loadings (< .2) or correlated residuals that have little influence on the magnitude of the facet correlations.
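For readers who want to try the same strategy, a rough lavaan sketch of the procedure looks like this; the item and facet names are placeholders, and the actual model is available on OSF as M0.Measurement.Model.inp.

```r
library(lavaan)

# Simple-structure starting model: every item loads only on its intended facet
# (placeholder item and facet names, not the actual 179 items).
start_model <- '
  anxiety     =~ anx1 + anx2 + anx3
  sociability =~ soc1 + soc2 + soc3
  orderliness =~ ord1 + ord2 + ord3
'
# fit <- cfa(start_model, sample.cov = item_cov, sample.nobs = 808)  # e.g., web303.N808.cov.dat
# fitMeasures(fit, c("cfi", "rmsea", "srmr"))
#
# Modification indices point to the largest omitted parameters, i.e., candidate
# secondary loadings ("=~" rows) and correlated item residuals ("~~" rows):
# mi <- modindices(fit)
# head(mi[order(mi$mi, decreasing = TRUE), ], 10)
#
# The model is then respecified with the suggested parameters and refitted,
# which is the iterative procedure described above.
```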

Facet Correlations

Below I present the correlations among the facets. The full correlation matrix is broken down into sections that are theoretically meaningful. The first five tables show the correlations among facets that share the same Big Five factor.

There are three main neuroticism facets: anxiety, anger/hostility, and depression. A fourth facet was originally intended to be an openness-to-emotions facet, but it correlated more highly with neuroticism (Schimmack, 2009c). All four facets show positive correlations with each other, and most of these correlations are substantial, except for the correlation between the strong-emotions and depression facets.

Results for extraversion show that all five facets are positively correlated with each other. All correlations are greater than .3, but none of the correlations are so high as to suggest that they are not distinct facets.

Openness facets are also positively correlated, but some correlations are below .2, and one correlation is only .16, namely the correlation between openness to activities and art.

The correlations among agreeableness facets are more variable and the correlation between modesty and trust is slightly negative, r = -.05. The core facet appears to be caring which shows high correlations with morality and forgiveness.

All correlations among conscientiousness facets are above .2. Self-discipline shows high correlations with competence beliefs and achievement striving.

Overall, these results are consistent with the Big Five model.

The next tables examine correlations among sets of facets belonging to two different Big Five traits. According to Digman and DeYoung's alpha-beta model, extraversion and openness should be correlated. Consistent with this prediction, the average correlation is r = .16. For ease of interpretation, all correlations above .10 are highlighted in grey, showing that most correlations are consistent with predictions. However, the value facet of openness shows lower correlations with extraversion facets, and the excitement-seeking facet of extraversion is more strongly related to openness facets than the other extraversion facets are.

The alpha-beta model also predicts negative correlations between neuroticism and agreeableness facets. Once more, the average correlation is consistent with this prediction, r = -.15. However, there is also variation in the correlations. In particular, the anger facet is more strongly negatively correlated with agreeableness facets than the other neuroticism facets are.

As predicted by the alpha-beta model, neuroticism facets are negatively correlated with conscientiousness facets, average r = -.21. However, there is variation in these correlations. Anxiety is less strongly negatively correlated with conscientiousness facets than the other neuroticism facets are. Maybe anxiety sometimes has effects similar to conscientiousness by motivating people to inhibit approach-motivated, impulsive behaviors. In this context, it is noteworthy that I found no strong loading of impulsivity on neuroticism (Schimmack, 2019c).

The last pair are agreeableness and conscientiousness facets, which are predicted to be positively correlated. The average correlation is consistent with this prediction, r = .15.

However, there is notable variation in these correlations. A2-Morality is more strongly positively correlated with conscientiousness facets than the other agreeableness facets are; in particular, trust and modesty show weak correlations with conscientiousness.

The alpha-beta model also makes predictions about other pairs of Big Five facets. As alpha and beta are conceptualized as independent factors, these correlations should be weaker than those in the previous tables and close to zero. However, this is not the case.

First, the average correlation between neuroticism and extraversion is negative and nearly as strong as the correlation between neuroticism and agreeableness, r = -.14. In particular, depression is strongly negatively related to extraversion facets.

The average correlation between extraversion and agreeableness facets is only r = .07. However, there is notable variability. Caring is more strongly related to extraversion than the other agreeableness facets are, especially to warmth and cheerfulness. Cheerfulness also tends to be more strongly correlated with agreeableness facets than the other extraversion facets are.

Extraversion and conscientiousness facets are also positively correlated, r = .15. Variation is caused by stronger correlations for the competence and self-discipline facets of conscientiousness and the activity facet of extraversion.

Openness facets are also positively correlated with agreeableness facets, r = .10. There is a trend for the O1-Imagination facet of openness to be more consistently correlated with agreeableness facets than other openness facets.

Finally, openness facets are also positively correlated with conscientiousness facets, r = .09. Most of this average correlation can be attributed to stronger positive correlations of the O4-Ideas facet with conscientiousness facets.

In sum, the Big Five facets from different Big Five factors are not independent. Not surprisingly, a model with five independent Big Five factors reduced model fit from CFI = .775, RMSEA = .040 to CFI = .729, RMSEA = .043. I then fitted a model that allowed for the Big Five factors to be correlated without imposing any structure on these correlations. This model improved fit over the model with independent dimensions, CFI = .734, RMSEA = .043.

The pattern of correlations is consistent with a general evaluative factor rather than a model with independent alpha and beta factors.

Not surprisingly, fitting the alpha-beta model to the data reduced model fit, CFI = .730, RMSEA = .043. In comparison, a model with a single evaluative bias factor had better fit, CFI = .732, RMSEA = .043.
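The logic of these structural comparisons is sketched below in lavaan. To keep the example self-contained, it treats facet scores as observed columns of a hypothetical data frame `facets` with placeholder facet names; in the actual analysis the facets are latent factors defined by items, so the fit values above cannot be reproduced this way.

```r
# Sketch of the structural comparison: an alpha-beta model versus a single
# general evaluative (halo) factor above the Big Five. Names are placeholders.
library(lavaan)

big5 <- '
  N =~ anxiety + anger + depression
  E =~ warmth + activity + cheerfulness
  O =~ ideas + aesthetics + feelings
  A =~ trust + caring + modesty
  C =~ order + selfdiscipline + deliberation
'

m.alpha.beta <- paste(big5, '
  alpha =~ N + A + C       # Digman/DeYoung alpha
  beta  =~ 1*E + 1*O       # beta; loadings fixed for identification with two indicators
  alpha ~~ 0*beta          # alpha and beta treated as independent
')

m.halo <- paste(big5, '
  halo =~ N + E + O + A + C   # one general evaluative factor
')

fit.ab   <- cfa(m.alpha.beta, data = facets)
fit.halo <- cfa(m.halo,       data = facets)
sapply(list(alpha.beta = fit.ab, halo = fit.halo),
       fitMeasures, fit.measures = c("cfi", "rmsea"))
```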

In conclusion, the results confirm previous findings that a general evaluative dimension produces correlations among the Big Five factors. DeYoung's (2006) multi-method study and several other multi-method studies demonstrated that this dimension is mostly rater bias because it shows no convergent validity across raters.

Facet Correlations with Method Factors

To remove the evaluative bias from correlations among facets, it is necessary to model evaluative bias at the item level. That is, all items load on an evaluative bias factor. This way the shared variance among indicators of a facet reflects only facet variance and no evaluative variance. I also included an acquiescence factor, although acquiescence has a negligible influence on facet correlations.

It is not possible to let all facets correlate freely when method factors are included in the model because such a model is not identified. To allow for a maximum of theoretically important facet correlations, I freed parameters for facets that belong to the same Big Five factor, facets that are predicted to be correlated by the alpha-beta model, and additional correlations that were suggested by modification indices. Loadings on the evaluative bias factor were constrained to 1 unless modification indices suggested that items had stronger or weaker loadings on the evaluative bias factor. This model fitted the data as well as the original measurement model, CFI = .778 vs. .775, RMSEA = .040 vs. .040. Moreover, modification indices did not suggest any further correlations that could be freed to improve model fit.
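The sketch below shows the item-level method-factor specification in lavaan with a deliberately tiny, hypothetical item set. In the full model every item loads on the evaluative bias factor (loadings fixed to 1 unless modification indices suggested otherwise) and on an acquiescence factor; the acquiescence factor is omitted here because it would not be separately identified in such a small, uniformly keyed example.

```r
# Sketch of the evaluative bias (halo) factor modelled at the item level.
# Item and facet names are placeholders; items are assumed to be keyed so that
# higher scores are more desirable (undesirable items would get negative or
# freed halo loadings in a real application).
library(lavaan)

m.halo.items <- '
  # substantive facets
  anxiety =~ item1 + item2 + item3
  warmth  =~ item4 + item5 + item6

  # evaluative bias: every item loads on the halo factor with a loading of 1
  halo =~ 1*item1 + 1*item2 + 1*item3 + 1*item4 + 1*item5 + 1*item6

  # the method factor is uncorrelated with the substantive facets
  halo ~~ 0*anxiety
  halo ~~ 0*warmth
'
fit.halo.items <- cfa(m.halo.items, sample.cov = cov.raw, sample.nobs = 808)
fitMeasures(fit.halo.items, c("cfi", "rmsea", "srmr"))
```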

The main effect of controlling for evaluative bias is that all facet correlations were reduced. However, it is particularly noteworthy to examine the correlations that are predicted by the alpha-beta model.

The average correlation for extraversion and openness facets is r = .07. This average is partially driven by stronger correlations of the excitement-seeking facet with openness facets than of the other extraversion facets. There are only four other correlations above .10, and 9 of the 25 correlations are negative. Thus, there is little support for a notable general factor that produces positive correlations between extraversion and openness facets.

The average correlation for neuroticism and agreeableness is r = -.06. However, the negative correlations mostly reflect strong negative correlations of the anger facet of neuroticism with agreeableness facets. In addition, there is a strong positive correlation between anxiety and morality, r = .20. This finding suggests that anxiety may also serve to inhibit immoral behavior.

The average correlation for neuroticism and conscientiousness is r = -.07. While there are strong negative correlations (e.g., r = -.30 for anger and deliberation), there is also a strong positive correlation (r = .22 for anxiety and self-discipline). Thus, the relationship between neuroticism and conscientiousness facets is complex.

The average correlation for agreeableness and conscientiousness facets is r = .01. Moreover, none of the correlations exceeded r = .10. This finding suggests that agreeableness and conscientiousness are independent Big Five factors, which contradicts the prediction by the alpha-beta model.

The finding also raises questions about the small but negative correlations of neuroticism with agreeableness (r = -.06) and conscientiousness (r = -.07). If these correlations reflected the influence of a common factor alpha that influences all three traits, one would expect a positive relationship between agreeableness and conscientiousness. Thus, these relationships may have another origin, or there is some additional negative relationship between agreeableness and conscientiousness that cancels out a potential influence of alpha.

Removing method variance also did not eliminate relationships between facets that are not predicted to be correlated by the alpha-beta model. The average correlation between neuroticism and extraversion facets is r = -.05, which is small in absolute size, but not notably smaller than the correlations that the alpha-beta model does predict (|r| = .01 to .07).

Moreover, some of these correlations are substantial. For example, excitement seeking is negatively related to anxiety (r = -.24) and warmth is negatively related to depression (r = -.22). Any structural model of personality structure needs to take these findings into account.

A Closer Examination of Extraversion and Openness

There are many ways to model the correlations among extraversion and openness facets. Here I demonstrate that the correlation between extraversion and openness depends on the modelling of secondary loadings and correlated residuals. The first model allowed for extraversion and openness to be correlated. It also allowed for all openness facets to load on extraversion and for all extraversion facets to load on openness. Residual correlations were fixed to zero. This model is essentially an EFA model.

Model fit was as good as for the baseline model, CFI = .779 vs. .778, RMSEA = .039 vs. .040. The pattern of secondary loadings showed two notable positive loadings: excitement seeking loaded on openness, and openness to activities loaded on extraversion. In this model the correlation between extraversion and openness was .08, SE = .17. Thus, the positive correlation in the model without secondary loadings was caused by not modelling this pattern of secondary loadings.

However, it is also possible to fit a model that produces a strong correlation between E and O. To do so, the loadings of excitement seeking (on openness) and openness to activities (on extraversion) can be set to zero. This pushes other secondary loadings to become negative, which is compensated by a positive correlation between extraversion and openness. This model has the same overall fit as the previous model, both CFI = .779 and RMSEA = .039, but the correlation between extraversion and openness jumps to r = .70, and the free secondary loadings are all negative.
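The sketch below illustrates this sensitivity in lavaan, again with hypothetical facet scores as observed columns of a data frame `facets` (e1-e4, excitement, o1-o4, and activities are placeholder names). It compares a model without secondary loadings to a model in which all secondary loadings are free except the two that are fixed to zero; the estimated E-O correlation can differ substantially between such models even when overall fit is comparable. This illustrates the logic only, not the exact item-level models reported above.

```r
# Sketch: the estimated E-O correlation depends on which secondary loadings are fixed.
library(lavaan)

# Model A: simple structure, no secondary loadings
m.simple <- '
  E =~ e1 + e2 + e3 + e4 + excitement
  O =~ o1 + o2 + o3 + o4 + activities
'

# Model B: all secondary loadings free except excitement seeking (on O) and
# openness to activities (on E), which are fixed to zero
m.fixed.zero <- '
  E =~ e1 + e2 + e3 + e4 + excitement + o1 + o2 + o3 + o4 + 0*activities
  O =~ o1 + o2 + o3 + o4 + activities + e1 + e2 + e3 + e4 + 0*excitement
'

fitA <- cfa(m.simple,     data = facets, std.lv = TRUE)
fitB <- cfa(m.fixed.zero, data = facets, std.lv = TRUE)

# compare the estimated E-O correlation and the overall fit of the two models
subset(standardizedSolution(fitA), lhs == "E" & op == "~~" & rhs == "O")
subset(standardizedSolution(fitB), lhs == "E" & op == "~~" & rhs == "O")
sapply(list(simple = fitA, fixed.zero = fitB), fitMeasures,
       fit.measures = c("cfi", "rmsea"))
```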

The main point of this analysis is to show the importance of facet correlations for structural theories of personality traits. In all previous studies, including my own, the higher-order structure was examined using Big Five scales. However, the correlation between an Extraversion Scale and an Openness Scale provides insufficient information about the relationship between the Extraversion Factor and the Openness Factor because scales always confound information about secondary loadings, residual correlations, and factor correlations.

The goal for future research is to find ways to test competing structural models. For example, the second model suggests that any interventions that increase extraversion would decrease openness to ideas, while the first model does not make this prediction.

Conclusion

Personality psychologists have developed and tested structural models of personality traits for nearly a century. In the 1980s, the Big Five factors were identified. The Big Five have proven relatively robust in subsequent replication attempts and also emerged in this investigation. However, there has been little progress in developing and testing hierarchical models of personality that explain what the Big Five are and how they are related to more specific personality traits called facets. There have also been attempts to find even broader personality dimensions. An influential article by Digman (1997) proposed that a factor called alpha produces correlations among neuroticism, agreeableness, and conscientiousness, while a beta factor links extraversion and openness. As demonstrated before, Digman's results could not be reproduced, and his analyses ignored evaluative bias in personality ratings (Anusic et al., 2009). Here, I show that empirical tests of higher-order models need to use a hierarchical CFA model because secondary loadings create spurious correlations among Big Five scales that distort the pattern of correlations among the Big Five factors. Based on the present results, there is no evidence for Digman's alpha and beta factors.

Fact-Checking Roy Baumeister

Roy Baumeister wrote a book chapter with the title "Self-Control, Ego Depletion, and Social Psychology's Replication Crisis" (preprint). I think this chapter makes a valuable contribution to the history of psychology and provides valuable insights into the minds of social psychologists.

I fact-checked the chapter and comment on 31 misleading or false statements.

https://replicationindex.files.wordpress.com/2019/09/ego-depletion-and-replication-crisis.docx

Comments are welcome.

Peer-Review is Censorship not Quality Control

The establishment in psychological science prides itself on its publications in peer-reviewed journals. However, it has been known for a long time that peer-review, especially at fancy journals with high rejection rates, is not based on the quality of an empirical contribution. Peer-review is mainly based on subjective criteria. Rather than making authors wait for three months, editors should just accept or reject papers and state clearly that the reason for their decision is their subjective preference.

I resigned from the editorial board of JPSP-PPID after I received one of these action letters from JPSP.

The key finding of our study, based on 450 triads (student, mother, father) who reported on their personality and well-being in a round-robin design (several years of data collection, $100,000 in research funding), was that positive illusions about one's personality (self-enhancement) did not predict informant ratings of well-being. Surely, we can debate the implications of this finding, but it is rather interesting that positive illusions about the self do not seem to make individuals happier in ways that others can perceive. Or so I thought. Not interesting at all, because apparently self-ratings of well-being are perfectly valid indicators of well-being. So, if informants don't see that fools are happier, they are still happier, just in a way that others do not see. At least, that was the opinion of the editor, and as the editor had the power to decide what gets published in JPSP, the manuscript was rejected.

I am mainly posting the editorial letter here because I think the review process should be transparent and open. After all, these decisions influence what gets published, when, and where. If we pride ourselves on the quality of the review process, we shouldn't have a problem demonstrating this quality by making decision letters public. Here everybody can judge for themselves how good the quality of the peer-review process at JPSP is. That is called open science.

Manuscript No. PSP-P-2019-1535

An Integrated Social and Personality Psychological Model of Positive Illusions, Personality, and Wellbeing

Journal of Personality and Social Psychology: Personality Processes and Individual Differences

Dear Dr. Schimmack,

I have now received the reviewers’ comments on your manuscript. I appreciated the chance to read this paper. I read it myself prior to sending it out and again prior to reading the reviews. As you will see below, the reviewers and I found the topic and dataset to be interesting. However, based on their analysis and my own independent evaluation, I am sorry to say that I cannot accept your manuscript for publication in Journal of Personality and Social Psychology:  Personality Processes and Individual Differences.

The bottom line is that the strongest theoretical contribution the model would appear to produce is not currently justified in the paper and the empirical evidence presented regarding that contribution does not rise to the occasion to support publication in JPSP.

I’ll start by mentioning that both Reviewers commented on the lack of clarity of the presentation, and unfortunately, I agree. Reviewer 1 commented overall, “There were passages with a lack of clarity.” Just focusing on the Introduction, I read it multiple times to try to understand what you see as the central theoretical contribution. I found that entire literatures were overlooked (well-being) or mischaracterized (social psychological perspectives on well-being), and illogical arguments were advanced (p. 4). Terms were introduced without definition (e.g., hedonic balance) but later included in the statistical model, and – when later comparing the terms in the model to the literature reviewed – I found a lack of discussion of or justification for the paths that were actually tested and that seem to be at the heart of what you see as the main contribution to the literature (as indicated by the first paragraphs of the Discussion section). Echoing Reviewer 1’s more general point, Reviewer 2 commented specifically on the model, stating: “The authors should describe the statistical model in much more detail to make the statistical analyses easier to follow, more transparent, and replicable.” Again, I agree.

I bring that up at the outset of the letter because you will see me refer to lack of clarity here and there, below. But now I’d like to set aside writing to focus on the theoretical and empirical contribution, which are the heart of the matter.

Through a careful reading of the terms in the model, analyses, and the first paragraphs of the Discussion section, here is what I understand are the claims about the central theoretical and empirical contributions (by “central” I mean the contributions that would make this work cross the threshold for publication in JPSP:PPID): you believe (subjective) well-being has a “truth” to it in the way personality traits might, and a bias to it. You think the “truth” (public view) estimate of well-being is a more important outcome than the “bias” estimate. As such, you draw the conclusion that you have overturned a seminal paper and the field of social psychology’s perspective on well-being because the measure you care about, the “truth” estimate of well-being, does not correlate with self-ratings of positive illusions. (This conclusion appears to be drawn despite the fact that positive illusions about the self and self-reported well-being are indeed correlated, which replicates prior positive illusions literature.)

Given that the argument turns on this interesting idea about truth and bias estimates of well-being, I’ll focus on well-being. There is a huge literature on well-being. Since Schwarz and Strack (1999), to take that arbitrary year as a starting point, there have been more than 11,000 empirical articles with “wellbeing” (or well-being or well being) in the title, according to PsychInfo. The vast majority of them, I submit, take the subjective evaluation of one’s own life as a perfectly valid and perhaps the best way to assess one’s own evaluation of one’s life. So if you are staking the conclusion of your paper on the claim that in fact others’ agreement with a person about whether that person’s life is good is the best representation of one’s well-being, and researchers in the field should dismiss the part about the evaluation that is unique to the evaluator, then that needs to be heartily justified in the paper’s Introduction. The onus is on the authors to do that and I do not believe it is there.

Instead, from what I can tell you appear to be relying on an assumption that, because well-being is consistent with statistical properties of personality in that “wellbeing judgments show agreement between self-ratings and informant ratings (Schneider & Schimmack, 2009; Zou, Schimmack, & Gere, 2013) and are much more stable than the heuristics-and-bias perspective suggests (Schimmack & Oishi, 2005)” (p. 7), therefore the conceptual problem is the same as for measures of personality. It is not. It is of course well-established on theoretical grounds why personality traits are useful to assess from multiple perspectives. But for the question of well-being, this is literally about my subjective feeling about my life; on what grounds do others’ perspectives take a higher priority than the self’s? I agree that it is an interesting question to know if others can see my well-being the way that I do, but this so-called “truth” estimate speaks to quite a different research question than what most of the well-being research field would consider to be an important question. If you think it is important or even more important than the way it has been traditionally done (which I surmise you might, based on what appears to be the dismissal of 30 years of research on positive illusions and well-being in the Discussion), it is up to you to (a) define and measure well-being as it relates to the contemporary psychological literature, (b) explain why this subjective assessment should not be taken at face value but instead needs multi-rater reports to make accurate or meaningful inferences, then (c) explain why each of your predictors would map on to each of these two estimates (i.e., truth and bias) and (d) why those paths matter for the broader literature.

I do see where you talked about positive illusions and “positive beliefs” (which I think you equate with wellbeing but it was unclear) side by side in the introduction (e.g., p. 4), but not where you (1) recognized (a) positive illusions about personality and (b) wellbeing estimates as distinct constructs and (2) justified why one would be associated with the other.

If you make those arguments – situated in the contemporary literature on well-being, and reviewers for a future submission agree with the logic and potential theoretical contribution – the next hurdle of course is the empirical contribution. Assuming the models are correct (see both reviewers’ comments on this), this paper would make empirical contributions in its conceptual replications of prior findings and a few other interesting observations. But the biggest theoretical contribution you appear to want to claim is that “Overall, these results challenge Taylor and Brown’s seminal claim that mental health and wellbeing are rooted in positive illusions.” Yet, (a) you do present evidence that the link between positive illusions about the self and well-being as assessed by the self are correlated, as has been done previously in that literature, and (b) this conclusion appears to be drawn based on null effects using a measure that is not established (i.e., “truth”). (And please see Reviewer 1’s concerns about the cross-sectional nature of the findings as well as the fact that measures use few items.)

Overall, this dataset is rich and the idea of considering convergence and bias in well-being estimates is interesting. To produce a paper that will have a strong impact, I suggest you take a close look at your modeling approach (Reviewer 2), take a close look at your conceptual model itself (not the results) and map it on to the points in the literature that it most closely addresses (e.g., novel questions about separating well-being into truth and bias), and consider what additional evidence might bolster that theoretical or methodological contribution.

Additionally, Reviewer 1 commented on the framing of the paper, on antagonistic language, and on editorializing, and I agree on all fronts. The frame is much too broad, sets up a false dichotomy between social and personality psychology, and the evidence does not rise to the occasion to either (a) take down the paper the Introduction sets up as the foil (i.e., Taylor and Brown, 1988) or (b) allow personality psychologists to “win” the false competition between social and personality psychology about whether positive illusions contribute to well-being.

Other comments:

– Please justify the use of these two sets of life evaluations but not hedonic balance as indicators of well-being, based on contemporary literature and evidence on well-being and how these should relate to one another. (I note that, incidentally, Schwarz and Strack include happiness judgments in their review of well-being.)

– In the Method section, what was the timescale of the “hedonic balance” assessment? Was it “right now”? The past 24 hours? Two weeks?

– Both reviewers were experts in SEM methods and personality; please do take a close look at their methodological comments, which were quite thoughtful and helpful as I considered my decision.

– I had similar questions as Reviewer 1 regarding the fact that student gender was lumped together relative to mother and father reports, where gender is naturally separated. I agree that there is low statistical power to address this empirically but just wanted to let you know that this thought independently came up for two of us.

In closing, I would like to thank the reviewers for their constructive comments, and I look forward to reading more about this research in the future.

For your guidance, I have appended the reviewers’ comments, and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely,

Sara Algoe, Ph.D.

Associate Editor

Journal of Personality and Social Psychology:  Personality Processes and Individual Differences